Senior Machine Learning Engineer, ML Infrastructure - Offline

Full-time
Senior
Software Development
Unity

Unity is the top platform for real-time 3D content creation, empowering creators across industries to bring their ideas to life with interactive 2D and 3D content.

Internet Software & Services
5K-10K employees
Founded 2004

Description

  • Design and operate large-scale data pipelines that generate datasets for machine learning training and experimentation.
  • Develop infrastructure that supports distributed training workflows using technologies such as PyTorch, Ray Data, and Ray Train.
  • Integrate ML pipelines with workflow orchestration systems such as Flyte, Airflow, or similar tools to enable reliable multi-stage training workflows.
  • Improve reproducibility and observability of ML pipelines through dataset validation, monitoring, and automated testing.
  • Optimize performance and resource utilization across distributed compute systems used for data processing and model training.
  • Partner closely with ML engineers to enable efficient large-scale experimentation and model iteration.
  • Lead architectural improvements to ensure offline ML pipelines remain scalable, reliable, and cost-efficient.
  • Shape how training datasets are prepared, validated, and delivered to distributed training systems.
  • Support production ML systems by maintaining reliable infrastructure for training workflows.
  • Work across ML engineering and platform teams to handle growing data volumes and increasingly complex training workloads.

Requirements

  • Experience working with distributed computing frameworks such as Ray, Spark, or Flink, including familiarity with the Ray ecosystem (Ray Data and Ray Train).
  • Experience building and optimizing large-scale distributed ML training pipelines using techniques such as PyTorch compilation, quantization, CUDA, or GPU kernel optimization.
  • Experience building infrastructure for training data generation, dataset preparation, or ML feature pipelines.
  • Deep experience designing and operating production-grade data pipelines.
  • Strong programming skills in Python and experience working with large-scale distributed workloads.
  • Experience with modern data infrastructure such as data lakes, warehouses, orchestration systems, and streaming platforms.
  • Strong systems thinking and the ability to reason about performance, scalability, reliability, and cost tradeoffs in distributed systems.
  • Proven ability to lead technical direction and influence architectural decisions across teams without formal authority.
  • Experience with workflow orchestration systems such as Flyte, Airflow, or similar tools.
  • Strong English communication skills for frequent written and verbal collaboration with global colleagues and partners.
  • Relocation support is not available for this position.
  • Work visa or immigration sponsorship is not available for this position.

Benefits

  • Comprehensive health, life, and disability insurance.
  • Commute subsidy.
  • Employee stock ownership.
  • Competitive retirement or pension plans.
  • Generous vacation and personal days.
  • Support for new parents through leave and family-care programs.
  • Mental health and wellbeing programs and support.
  • Training and development programs.
  • Office food and snacks.
  • Employee Resource Groups.
  • Global Employee Assistance Program.
  • Volunteering and donation matching program.

Similar Roles

Staff Systems Engineer - Perception

Apptronik, 51-250 employees, Aerospace & Defense

Apptronik is hiring a Staff Systems Engineer to define and own the perception system architecture, requirements, validation, and safety strategy for its Apollo humanoid robot as it moves toward large-scale deployment in human environments.

MATLAB Python

Principal Technical Director, AI-Enabled Spectrum Dominance

Voyager Life Sciences Tools & Services

Voyager Technologies is seeking a Principal Technical Director, AI-Enabled Spectrum Dominance to lead advanced RF, DSP, and agentic AI efforts for space-based and terrestrial electromagnetic warfare and sensing programs.

Machine Learning Microservices SOC

Full-Stack AI Engineer

Pavago IT Services

Pavago is hiring a remote Full-Stack AI Engineer to build and scale production AI-powered applications, connecting AI models, backend systems, data pipelines, and frontend experiences for real business use cases.

Apache Airflow AWS Azure CI/CD Dagster Docker FastAPI Flask GCP HIPAA Hugging Face JavaScript Kubeflow Kubernetes Microservices MLflow Next.js Node.js Prefect Python PyTorch React SageMaker Serverless SQL TensorFlow TypeScript Vertex AI Vue.js

Sr. Machine Learning Engineer, Applied Science

Pinterest, 5K-10K employees, Internet Software & Services

Pinterest Labs is hiring a research engineer or scientist to develop and ship internal text-to-image generative models for Pinterest Canvas, working closely with a small visual modeling team and production-focused multimodal systems.

Computer Vision LLM Machine Learning SQL