Design and develop multi-agent benchmark tasks focused on complex data analysis workflows.
Create or curate realistic datasets in formats such as CSV and JSON, including logs, reports, and financial or operational data.
Build tasks that require cross-referencing across multiple data sources.
Develop tasks that include anomaly detection, contradiction identification, statistical analysis, and interpretation.
Define task decomposition strategies across specialized sub-agents for financial, technical, and operational analysis.
Develop verification logic to validate precise analytical outputs rather than generic summaries.
Implement evaluation pipelines using Python and SQL.
Create reproducible environments using Docker.
Analyze task performance and refine tasks for clarity, difficulty, and scoring accuracy.

Requirements

5+ years of experience in data analysis or analytics-heavy roles.
Strong proficiency in Python (including pandas and NumPy) and SQL.
Experience working with real-world, messy data in formats such as CSV and JSON, as well as logs and reports.
Ability to design analytical problems with clear, verifiable answers.
Solid understanding of statistics, including distributions, correlations, and outliers.
Familiarity with AI benchmarks or evaluation environments such as SWE-bench.
Hands-on experience with Docker, including Dockerfiles, image builds, and debugging.
Experience in financial analysis, operations analytics, or risk analysis is preferred.
Exposure to data pipelines or ETL workflows is preferred.
Experience with data quality validation or anomaly detection systems is preferred.
Familiarity with AI/ML data workflows or evaluation frameworks is preferred.



Gramian Consultancy Group

AI Evaluation Engineer (Data Analysis & Multi-Agent Systems)
