Staff Machine Learning Engineer, Offline Infrastructure

1 hour, 54 minutes ago
Full-time
Lead
Software Development
Unity

Unity is the top platform for real-time 3D content creation, empowering creators across industries to bring their ideas to life with interactive 2D and 3D content.

Internet Software & Services
5K-10K
Founded 2004

Description

  • Design and operate large-scale data pipelines that generate datasets for machine learning training and experimentation.
  • Develop infrastructure that supports distributed training workflows using tools such as PyTorch, Ray Data, and Ray Train.
  • Integrate ML pipelines with workflow orchestration systems such as Flyte, Airflow, or similar tools.
  • Improve reproducibility and observability through dataset validation, monitoring, and automated testing.
  • Optimize performance and resource utilization across distributed compute systems used for data processing and model training.
  • Partner closely with ML engineers to support large-scale experimentation and model iteration.
  • Lead architectural improvements to keep offline ML pipelines scalable, reliable, and cost-efficient.
  • Shape how model datasets are prepared, validated, and delivered to distributed training systems.

Requirements

  • Strong experience building large-scale ML pipelines.
  • Experience with distributed computing frameworks such as Ray, Spark, or Flink; familiarity with the Ray ecosystem (Ray Data, Ray Train) is preferred.
  • Experience building infrastructure for training data generation, dataset preparation, or ML feature pipelines.
  • Deep experience designing and operating production-grade data pipelines.
  • Strong programming skills in Python and experience working with large-scale distributed workloads.
  • Experience with modern data infrastructure, including data lakes, warehouses, orchestration systems, and streaming platforms.
  • Strong systems thinking with the ability to reason about performance, scalability, reliability, and cost tradeoffs in distributed systems.
  • Proven ability to lead technical direction and influence architectural decisions across teams without formal authority.
  • Sufficient knowledge of English for professional verbal and written communication with global colleagues and partners.
  • Relocation support is not available for this position.

Benefits

  • Gross salary of $209,700 to $283,800 USD.
  • Comprehensive health, life, and disability insurance.
  • Commuter subsidy.
  • Employee stock ownership.
  • Competitive retirement or pension plans.
  • Generous vacation and personal days.
  • Support for new parents through leave and family-care programs.
  • Mental health and wellbeing programs and support.
  • Training and development programs.
  • Volunteering and donation matching program.
