AI Evaluation Engineer (Data Analysis & Multi-Agent Systems)

4 hours, 42 minutes ago

Description

  • Design and develop multi-agent benchmark tasks focused on complex data analysis workflows.
  • Create or curate realistic datasets, including CSV and JSON files, logs, reports, and financial or operational data.
  • Build tasks that require cross-referencing across multiple data sources.
  • Develop tasks that include anomaly detection, contradiction identification, statistical analysis, and interpretation.
  • Define task decomposition strategies across specialized sub-agents for financial, technical, and operational analysis.
  • Develop verification logic to validate precise analytical outputs rather than generic summaries.
  • Implement evaluation pipelines using Python and SQL.
  • Create reproducible environments using Docker.
  • Analyze task performance and refine tasks for clarity, difficulty, and scoring accuracy.

Requirements

  • 5+ years of experience in data analysis or analytics-heavy roles.
  • Strong proficiency in SQL and in Python, including pandas and NumPy.
  • Experience working with messy real-world data, including CSV and JSON files, logs, and reports.
  • Ability to design analytical problems with clear, verifiable answers.
  • Solid understanding of statistics, including distributions, correlations, and outliers.
  • Familiarity with AI benchmarks or evaluation environments such as SWE-bench.
  • Hands-on experience with Docker, including Dockerfiles, image builds, and debugging.
  • Experience in financial analysis, operations analytics, or risk analysis is preferred.
  • Exposure to data pipelines or ETL workflows is preferred.
  • Experience with data quality validation or anomaly detection systems is preferred.
  • Familiarity with AI/ML data workflows or evaluation frameworks is preferred.
