Staff Machine Learning Engineer, ML Infrastructure - Online

2 hours, 9 minutes ago
Full-time
Senior
Software Development
Unity

Unity is the top platform for real-time 3D content creation, empowering creators across industries to bring their ideas to life with interactive 2D and 3D content.

Internet Software & Services
5K-10K
Founded 2004

Description

  • Design and operate large-scale online inference infrastructure for low-latency, high-reliability production model serving.
  • Build and improve model serving systems using frameworks and platforms such as PyTorch, Triton Inference Server, Kubernetes, GKE, and Ray.
  • Optimize inference performance through batching, model compilation, GPU/CPU utilization improvements, request scheduling, and runtime tuning.
  • Develop infrastructure for model deployment, canary testing, A/B experimentation, traffic splitting, rollback, and production validation.
  • Improve observability for online ML systems through monitoring of latency, throughput, error rates, cost, saturation, and model health.
  • Build self-healing and autoscaling capabilities to support dynamic traffic patterns and production reliability needs.
  • Partner with ML engineers to enable faster model iteration while maintaining safety, scalability, and cost efficiency.
  • Improve the reliability and reproducibility of model serving workflows, including packaging, artifact validation, compatibility testing, and deployment automation.
  • Lead architectural improvements to make the online ML platform more robust, user-friendly, scalable, and cost-efficient.

Requirements

  • Strong experience building and operating production-grade online ML inference systems.
  • Experience with model serving frameworks such as NVIDIA Triton Inference Server, TorchServe, Ray Serve, TensorFlow Serving, or similar systems.
  • Experience optimizing inference workloads using techniques such as dynamic batching, model compilation, quantization, GPU acceleration, GPU kernel optimization, caching, or runtime tuning.
  • Strong experience with distributed systems, Kubernetes, autoscaling, service reliability, and production observability.
  • Strong programming skills in Python, with practical experience working on production ML systems and high-scale services.
  • Experience with PyTorch and modern model deployment workflows, including model packaging, validation, and serving lifecycle management.
  • Experience designing infrastructure for safe model rollout, canary testing, A/B experimentation, and automated rollback.
  • Strong systems thinking with the ability to reason about latency, throughput, reliability, scalability, and cost tradeoffs in online systems.
  • Proven ability to lead technical direction and influence architectural decisions across teams without formal authority.
  • Must have sufficient English proficiency for frequent professional verbal and written communication with colleagues and partners worldwide.
  • Relocation support is not available for this position.
  • Work visa or immigration sponsorship is not available for this position.

Benefits

  • Comprehensive health, life, and disability insurance.
  • Commute subsidy.
  • Employee stock ownership.
  • Competitive retirement or pension plans.
  • Generous vacation and personal days.
  • Support for new parents through leave and family-care programs.
  • Mental health and wellbeing programs and support.
  • Training and development programs.
  • Volunteering and donation matching program.
  • Employee Resource Groups.
  • Global Employee Assistance Program.
  • Office food and snacks.

Similar Roles

Machine Learning Engineer - Artist-First AI Music Lab

Spotify Media

Spotify’s Music Mission team is hiring a Machine Learning Engineer to help build production AI music experiences that center artists and deepen fan connections.

AWS Azure GCP Generative AI Java LLM Machine Learning Python Scala
3 hours, 47 minutes ago

Sr. Machine Learning Engineer

Mitek Systems 251-1K Communications Equipment

Mitek is hiring a remote Sr. Machine Learning Engineer to lead computer vision and image-based ML work for its identity verification and fraud prevention platform.

AWS CI/CD Computer Vision Docker DynamoDB Machine Learning Matplotlib MongoDB OpenCV Pandas Pillow Python PyTorch SageMaker Scikit-learn TensorFlow
6 hours, 37 minutes ago

Senior Machine Learning Engineer, AI Platform

Mozilla 251-1K Internet Software & Services

Mozilla is hiring a Machine Learning Engineer to build and operate the AI platform that powers model training, deployment, and inference for its products at global scale.

CI/CD Docker Kubernetes Machine Learning Python
6 hours, 41 minutes ago

Sr. Software Engineer III (6519)

MetroStar 251-1K IT Services

MetroStar is hiring a Sr. Software Engineer III to support federal-government technology work by operationalizing AI and data pipelines, deploying Python-native ML systems, and advising on secure identity management architecture.

Angular AWS DevSecOps Go Java JavaScript Machine Learning Microservices Next.js Python React TypeScript
6 hours, 41 minutes ago
