Omilia

Omilia is a global leader in Conversational AI, offering AI-based self-service solutions for enhanced customer care fulfillment and success.

Industry: IT Services
Company size: 251-1K employees
Founded: 2002
Funding: $20M raised

Description

  • Own the end-to-end data architecture for the training environment, including dataset design, schema definition, and data flow from production to training systems.
  • Define and govern data selection and sampling strategies for production conversations, including diversity optimization, confidence-based filtering, edge-case prioritization, and deduplication.
  • Build and maintain the data catalog and dataset discovery infrastructure so ML teams can find, understand, and use training data efficiently.
  • Define annotation pipeline requirements for intent labeling, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation.
  • Design and maintain the closed-loop data flywheel that moves conversations from production through curation, annotation, model retraining, evaluation, and redeployment.
  • Own data pipelines and infrastructure across Snowflake, AWS S3, Airflow, and AWS SageMaker-integrated ML workflows.
  • Work directly with LLM, NLU, Speech, and Agentic teams to translate model data needs into dataset specifications and pipeline configurations.
  • Define data quality frameworks and targeted corpora extraction methods to improve model outcomes from low-confidence, no-match, and other failure-case data.
  • Evaluate and manage external data annotation vendors and ensure annotation workflows produce consistent, high-quality labels at scale.
  • Maintain documentation, dataset lineage, architecture RFCs, and best practices for the broader ML organization.

Requirements

  • 5+ years of experience in data architecture, data engineering, or LLM/ML data infrastructure with ownership of production data systems supporting model development.
  • Strong understanding of what makes training data high-quality, diverse, and useful for LLM and NLU model development.
  • Deep experience with data modeling, schema design, and data pipeline architecture.
  • Strong proficiency with Snowflake, AWS S3, and ETL/ELT orchestration tools such as Airflow, dbt, or similar.
  • Experience defining annotation requirements and managing data labeling workflows such as intent labeling, entity tagging, or dialog classification.
  • Experience with data cataloging, metadata management, and dataset discovery at scale.
  • Strong SQL and Python skills for data pipeline development and data quality analysis.
  • Experience with data quality frameworks, including deduplication, sampling strategies, and diversity optimization.
  • Master’s degree or PhD in Computer Science, Data Engineering, Information Systems, or a related field.
  • Preferred experience with LLM training data preparation, including instruction tuning, preference data, RLHF/DPO annotation, or synthetic data generation.
  • Preferred experience with data anonymization and PII/PCI redaction in ML data pipelines.
  • Preferred familiarity with AWS SageMaker integration, active learning, and data selection strategies.
  • Preferred knowledge of voice/audio data handling, storage, and processing at scale.
  • Experience with conversational AI data such as dialog transcripts, ASR outputs, and NLU annotations is a strong advantage.
  • Experience with data governance in regulated industries such as financial services or healthcare is a plus.
  • Familiarity with NER/NLU-based data processing approaches such as spaCy, HuggingFace, or custom entity recognition is desirable.

Benefits

  • Fixed compensation.
  • Long-term employment with vacation days.
  • Professional development support, including courses and training.
  • Opportunity to work on cutting-edge technology products with global impact.
  • Collaborative, fun-to-work-with colleagues.
  • Apple gear provided.
  • Commitment to equal opportunity employment and a diverse, inclusive workplace.
