Senior Machine Learning Engineer, ML Infrastructure - Online

23 hours, 37 minutes ago
Full-time
Senior
Software Development
Unity

Unity is the top platform for real-time 3D content creation, empowering creators across industries to bring their ideas to life with interactive 2D and 3D content.

Internet Software & Services
5K-10K
Founded 2004

Description

  • Design and operate large-scale online inference infrastructure that serves production ML models with low latency and high reliability.
  • Build and improve model serving systems using frameworks such as PyTorch, Triton Inference Server, Kubernetes, GKE, Ray, or similar distributed serving technologies.
  • Optimize inference performance through batching, model compilation, GPU/CPU utilization improvements, request scheduling, and runtime tuning.
  • Develop infrastructure for model deployment, canary testing, A/B experimentation, traffic splitting, rollback, and production validation.
  • Improve observability for online ML systems through latency, throughput, error-rate, cost, saturation, and model-health monitoring.
  • Build self-healing and autoscaling capabilities to support dynamic experiment traffic and production reliability requirements.
  • Partner closely with ML engineers to support faster model iteration while maintaining production safety, scalability, and cost efficiency.
  • Improve the reliability and reproducibility of model serving workflows, including packaging, artifact validation, compatibility testing, and deployment automation.
  • Lead architectural improvements that make the online ML platform more robust, user-friendly, scalable, and cost-efficient.

Requirements

  • Strong experience building and operating production-grade online ML inference systems.
  • Experience with model serving frameworks such as NVIDIA Triton Inference Server, TorchServe, Ray Serve, TensorFlow Serving, or similar systems.
  • Experience optimizing inference workloads using dynamic batching, model compilation, quantization, GPU acceleration, GPU kernel optimization, caching, or runtime tuning.
  • Strong experience with distributed systems, Kubernetes, autoscaling, service reliability, and production observability.
  • Strong programming skills in Python, with practical experience working on production ML systems and high-scale services.
  • Experience with PyTorch and modern model deployment workflows, including packaging, validation, and serving lifecycle management.
  • Experience designing infrastructure for safe model rollout, canary testing, A/B experimentation, and automated rollback.
  • Strong systems thinking with the ability to reason about latency, throughput, reliability, scalability, and cost tradeoffs in online systems.
  • Proven ability to lead technical direction and influence architectural decisions across teams without formal authority.
  • Relocation support is not available for this position.
  • Work visa or immigration sponsorship is not available for this position.

Benefits

  • Comprehensive health, life, and disability insurance.
  • Commute subsidy.
  • Employee stock ownership.
  • Competitive retirement or pension plans.
  • Generous vacation and personal days.
  • Support for new parents through leave and family-care programs.
  • Mental health and wellbeing programs and support.
  • Training and development programs.
  • Office food and snacks.
  • Employee Resource Groups.
  • Global Employee Assistance Program.
  • Volunteering and donation matching program.
