AI Evaluation Engineer (Data Analysis & Multi-Agent Systems)

2 weeks, 6 days ago

Description

  • Design and develop multi-agent benchmark tasks focused on complex data analysis workflows.
  • Create or curate realistic datasets, including CSV and JSON files, logs, reports, and financial or operational data.
  • Build tasks that require cross-referencing across multiple data sources.
  • Develop tasks that include anomaly detection, contradiction identification, statistical analysis, and interpretation.
  • Define task decomposition strategies across specialized sub-agents for financial, technical, and operational analysis.
  • Develop verification logic to validate precise analytical outputs rather than generic summaries.
  • Implement evaluation pipelines using Python and SQL.
  • Create reproducible environments using Docker.
  • Analyze task performance and refine tasks for clarity, difficulty, and scoring accuracy.
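
As a rough illustration of the kind of verification logic described above, the sketch below scores a submission field by field against a known answer key instead of accepting a free-text summary. Everything in it is an assumption made for illustration: the function names `check_field` and `verify`, the field names, the tolerance, and the sample values are hypothetical and do not come from the employer's actual evaluation pipeline.

```python
import math
from numbers import Real


def check_field(got, want, rel_tol=1e-6):
    """Return True if one submitted value matches the expected value."""
    # Numeric answers: compare within a relative tolerance (bools are excluded).
    if isinstance(want, Real) and not isinstance(want, bool):
        return (
            isinstance(got, Real)
            and not isinstance(got, bool)
            and math.isclose(got, want, rel_tol=rel_tol)
        )
    # Collections of identifiers: order does not matter, membership does.
    if isinstance(want, (list, set, tuple)):
        return isinstance(got, (list, set, tuple)) and set(got) == set(want)
    # Everything else (for example labels): exact match only.
    return got == want


def verify(submitted, expected, rel_tol=1e-6):
    """Check every expected field; missing fields count as failures."""
    return {
        key: check_field(submitted.get(key), want, rel_tol)
        for key, want in expected.items()
    }


# Hypothetical answer key and agent submission.
expected = {
    "total_revenue": 1250000.50,
    "anomalous_invoice_ids": ["INV-104", "INV-233"],
    "largest_cost_center": "logistics",
}
submitted = {
    "total_revenue": 1250000.5,
    "anomalous_invoice_ids": ["INV-233", "INV-104"],
    "largest_cost_center": "Logistics",
}

results = verify(submitted, expected)
score = sum(results.values()) / len(results)
print(results, round(score, 2))
```

In this sketch the label check is deliberately strict, so "Logistics" does not match "logistics". Whether a real task should normalize case is a scoring decision the task author would need to state explicitly.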

Requirements

  • 5+ years of experience in data analysis or analytics-heavy roles.
  • Strong proficiency in SQL and in Python, including pandas and NumPy.
  • Experience working with messy, real-world data such as CSV and JSON files, logs, and reports.
  • Ability to design analytical problems with clear, verifiable answers.
  • Solid understanding of statistics, including distributions, correlations, and outliers.
  • Familiarity with AI benchmarks or evaluation environments such as SWE-bench.
  • Hands-on experience with Docker, including Dockerfiles, image builds, and debugging.
  • Experience in financial analysis, operations analytics, or risk analysis is preferred.
  • Exposure to data pipelines or ETL workflows is preferred.
  • Experience with data quality validation or anomaly detection systems is preferred.
  • Familiarity with AI/ML data workflows or evaluation frameworks is preferred.
