Nebius

Nebius enables B2B companies to build local hyperscale cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute costs. It offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Internet Software & Services

Information Technology

51-250 (120)

180 open positions

ML Infrastructure Engineer

2 weeks, 5 days ago

Europe, United States

Senior

Infrastructure Engineer

DevOps and Infrastructure

AWS, Docker, GCP, Kubernetes, Machine Learning, Python, PyTorch

Description

Work with hardware and development teams to profile and analyze GPU performance at the system and kernel level.
Evaluate and compare GPU performance across different platforms, architectures, and software stacks such as CUDA and ROCm.
Debug and optimize ML workloads to run efficiently on GPU hardware by identifying and resolving performance bottlenecks.
Perform acceptance testing for new GPU clusters to verify performance, stability, and compatibility for AI workloads.
Run experiments across diverse GPU system configurations to assess the impact of interconnect strategies and system-level optimizations.
Develop tools and dashboards to visualize performance metrics, bottlenecks, and trends.
Contribute to internal tooling, frameworks, and best practices for GPU benchmarking and optimization.

Requirements

A strong understanding of the theoretical foundations of machine learning.
Deep understanding of performance aspects of large neural network training and inference, including data, tensor, context, and expert parallelism, offloading, custom kernels, hardware features, attention optimizations, and dynamic batching.
Deep experience with modern deep learning frameworks such as PyTorch, JAX, Megatron-LM, and TensorRT-LLM.
Good understanding of the GPU stack, including CUDA, NCCL, drivers, and relevant libraries.
Familiarity with containerized environments such as Docker and Kubernetes.
Strong communication skills and the ability to work independently.
Familiarity with modern LLM inference frameworks such as vLLM, SGLang, and TensorRT (preferred).
Experience with Python and performance profiling tools such as Nsight, nvprof, and perf (preferred).
Familiarity with cloud ML platforms such as AWS, GCP, and Azure ML (preferred).
Contributions to open-source ML benchmarking tools (preferred).
Authorization to work in the country of application, with proof of employment eligibility required at hire.

Benefits

Competitive compensation.
Career growth and learning opportunities.
Flexibility and work-life balance.
Collaborative and innovative culture.
Opportunity to work on impactful AI projects.
International environment and talented teams.
