Omilia

Omilia is a global leader in Conversational AI, offering AI-based self-service solutions for enhanced customer care fulfillment and success.

IT Services
251-1K
Founded 2002
$20M raised

Description

  • Own the end-to-end data architecture for the training environment, including dataset design, schema definition, and data flow from production to training systems.
  • Define and govern data selection and sampling strategies for production conversations, including diversity optimization, confidence-based filtering, edge-case prioritization, and deduplication.
  • Build and maintain the data catalog and dataset discovery infrastructure so ML teams can find, understand, and use training data efficiently.
  • Define annotation pipeline requirements for intent labeling, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation.
  • Design and maintain the closed-loop data flywheel that moves conversations from production through curation, annotation, model retraining, evaluation, and redeployment.
  • Own data pipelines and infrastructure across Snowflake, AWS S3, Airflow, and AWS SageMaker-integrated ML workflows.
  • Work directly with LLM, NLU, Speech, and Agentic teams to translate model data needs into dataset specifications and pipeline configurations.
  • Define data quality frameworks and targeted corpora extraction methods to improve model outcomes from low-confidence, no-match, and other failure-case data.
  • Evaluate and manage external data annotation vendors and ensure annotation workflows produce consistent, high-quality labels at scale.
  • Maintain documentation, dataset lineage, architecture RFCs, and best practices for the broader ML organization.

Requirements

  • 5+ years of experience in data architecture, data engineering, or LLM/ML data infrastructure with ownership of production data systems supporting model development.
  • Strong understanding of what makes training data high-quality, diverse, and useful for LLM and NLU model development.
  • Deep experience with data modeling, schema design, and data pipeline architecture.
  • Strong proficiency with Snowflake, AWS S3, and ETL/ELT orchestration tools such as Airflow, dbt, or similar.
  • Experience defining annotation requirements and managing data labeling workflows such as intent labeling, entity tagging, or dialog classification.
  • Experience with data cataloging, metadata management, and dataset discovery at scale.
  • Strong SQL and Python skills for data pipeline development and data quality analysis.
  • Experience with data quality frameworks, including deduplication, sampling strategies, and diversity optimization.
  • Master’s degree or PhD in Computer Science, Data Engineering, Information Systems, or a related field.
  • Preferred experience with LLM training data preparation, including instruction tuning, preference data, RLHF/DPO annotation, or synthetic data generation.
  • Preferred experience with data anonymization and PII/PCI redaction in ML data pipelines.
  • Preferred familiarity with AWS SageMaker integration, active learning, and data selection strategies.
  • Preferred knowledge of voice/audio data handling, storage, and processing at scale.
  • Experience with conversational AI data such as dialog transcripts, ASR outputs, and NLU annotations is a strong advantage.
  • Experience with data governance in regulated industries such as financial services or healthcare is a plus.
  • Familiarity with NER/NLU-based data processing using tools such as spaCy or Hugging Face, or custom entity recognition, is desirable.

Benefits

  • Fixed compensation.
  • Long-term employment with vacation days.
  • Professional development support, including courses and training.
  • Opportunity to work on cutting-edge technology products with global impact.
  • Collaborative, fun-to-work-with colleagues.
  • Apple gear provided.
  • Equal opportunity employer committed to a diverse and inclusive workplace.
