AI Evaluation Engineer (Knowledge & Research)

7 hours, 31 minutes ago

Description

  • Build multi-agent benchmark tasks that require reading, analyzing, and synthesizing large document collections.
  • Curate real-world research corpora, including academic papers, case studies, and technical reports, and design questions that require comprehensive analysis.
  • Write structured ground-truth oracles in JSON with specific, verifiable answers tied to the source material.
  • Design LLM judge prompts that evaluate agent output field by field against the oracle.
  • Create decomposition guides that split research across multiple parallel sub-agents and then synthesize the results.
  • Develop datasets and evaluation frameworks for benchmarking next-generation AI systems.
  • Translate research content into measurable evaluation tasks with high precision and clear scoring criteria.
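To make the oracle-and-judge workflow above concrete, here is a minimal sketch of a JSON ground-truth oracle and a field-by-field comparison against an agent's answer. All field names and values are hypothetical, invented for illustration; a real LLM judge would apply semantic matching per field rather than exact equality.

```python
import json

# Hypothetical ground-truth oracle: specific, verifiable answers
# tied to the source material (field names are illustrative only).
oracle = {
    "primary_finding": "Treatment A outperformed placebo (p < 0.01)",
    "sample_size": 412,
    "study_design": "randomized controlled trial",
}

# A hypothetical agent answer to be scored against the oracle.
agent_answer = {
    "primary_finding": "Treatment A outperformed placebo (p < 0.01)",
    "sample_size": 412,
    "study_design": "cohort study",
}

def score(oracle: dict, answer: dict) -> dict:
    """Compare the answer to the oracle field by field.

    Returns per-field pass/fail plus an overall fraction correct.
    """
    results = {key: answer.get(key) == value for key, value in oracle.items()}
    results["score"] = sum(results.values()) / len(oracle)
    return results

print(json.dumps(score(oracle, agent_answer), indent=2))
```

Exact-match scoring is only a baseline; in practice, a judge prompt would grade each field with tolerance for paraphrase while still requiring the specific, verifiable facts the oracle encodes.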

Requirements

  • 5+ years of research experience (academic or industry) in a scientific, technical, or analytical domain.
  • Strong ability to read, analyze, and extract structured information from unstructured documents.
  • Experience designing or working with structured data formats such as JSON, schemas, and validation.
  • Proficiency in Python scripting for data processing, validation, or evaluation scripts.
  • Experience with AI evaluation, coding benchmarks, or structured reasoning tasks such as SWE-bench, Terminal-bench, or similar.
  • Experience working with Docker, including building images and debugging containers.
  • Strong attention to detail when defining exact, verifiable outputs.
  • Ability to design complex, multi-step problem-solving workflows.
  • High analytical thinking and structured problem decomposition skills.
  • Availability for 8 hours per day with 4 hours of overlap with PST.
  • Availability for a contractor assignment of 5+ weeks.
  • Location in one of the supported countries: Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, or Vietnam.


Similar Roles

AI Evaluation & Annotation Specialist (Entry-Mid Level) - Italian (Global)

Volga Partners · 51-250 employees · Internet Software & Services

AI Evaluation & Annotation Specialists at an AI-focused company will review, annotate, and assess LLM outputs to improve accuracy and consistency in production workflows.

LLM · Machine Learning
1 hour, 12 minutes ago

Senior Consultant (MBB & Top-Tier Firms) - Freelance AI Project

Mindrift.ai: Be the “I” in AI · Internet Software & Services

Mindrift, powered by Toloka, is hiring experienced top-tier strategy consultants to help design realistic management consulting learning environments and evaluation frameworks for AI systems.

1 hour, 15 minutes ago

Optical Engineer - Freelance AI Trainer

Mindrift.ai: Be the “I” in AI · Internet Software & Services

Mindrift is seeking optical engineers and physics specialists for project-based AI work focused on testing, evaluating, and improving AI systems through the creation of original, research-style optics and physics problems.

2 hours ago

Optical Engineer - Freelance AI Trainer

Mindrift.ai: Be the “I” in AI · Internet Software & Services

Mindrift is seeking optical and physics specialists for project-based AI work focused on creating and validating challenging physics problems for leading tech companies.

2 hours, 14 minutes ago
