Description

  • Design and build multi-agent benchmark tasks based on real-world code changes such as bug fixes, migrations, and refactors.
  • Work with the Harbor evaluation framework to run and validate tasks in containerized environments.
  • Write clear and precise task instructions, including file paths, function signatures, expected behavior, and constraints.
  • Develop Python-based verification scripts to validate the correctness of code changes.
  • Define task decomposition strategies across multiple specialized agents.
  • Analyze and navigate large open-source codebases to extract realistic task scenarios.
  • Run, debug, and refine tasks in Docker environments to ensure reproducibility.
  • Improve task quality, clarity, and difficulty based on evaluation results.

Requirements

  • 5+ years of software development experience, with strong proficiency in Python and JavaScript.
  • Strong experience working with large codebases built on frameworks such as Django, Flask, FastAPI, or Node.js.
  • Familiarity with Git workflows, including pull requests, diffs, commits, and cherry-picking.
  • Experience writing tests or validation scripts using pytest, unittest, or similar tools.
  • Ability to write clear and precise technical specifications.
  • Familiarity with AI coding benchmarks or evaluation frameworks such as SWE-bench or similar.
  • Hands-on experience with Docker, including Dockerfiles, image builds, and debugging.
  • Experience contributing to or maintaining open-source projects is preferred.
  • Experience with code migrations or large-scale refactoring is preferred.
  • Familiarity with CI/CD pipelines and automated testing workflows is preferred.
  • Exposure to LLM-based coding tools or evaluation frameworks is preferred.
  • Availability for 8 hours per day with 4 hours of overlap with PST.
  • Ability to work as a contractor for a 4+ week assignment.
  • Location in Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, or Vietnam.
