ML Infrastructure Engineer

3 hours, 29 minutes ago
Senior
DevOps and Infrastructure
Nebius

Nebius enables B2B companies to build local hyperscale cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute costs. The company offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Internet Software & Services
51-250

Description

  • Work with hardware and development teams to profile and analyze GPU performance at the system and kernel level.
  • Evaluate and compare GPU performance across different platforms, architectures, and software stacks such as CUDA and ROCm.
  • Debug and optimize ML workloads to run efficiently on GPU hardware by identifying and resolving performance bottlenecks.
  • Perform acceptance testing for new GPU clusters to verify performance, stability, and compatibility for AI workloads.
  • Run experiments across diverse GPU system configurations to assess the impact of interconnect strategies and system-level optimizations.
  • Develop tools and dashboards to visualize performance metrics, bottlenecks, and trends.
  • Contribute to internal tooling, frameworks, and best practices for GPU benchmarking and optimization.

Requirements

  • A strong understanding of the theoretical foundations of machine learning.
  • Deep understanding of performance aspects of large neural network training and inference, including data, tensor, context, and expert parallelism, offloading, custom kernels, hardware features, attention optimizations, and dynamic batching.
  • Deep experience with modern deep learning frameworks such as PyTorch, JAX, Megatron-LM, and TensorRT-LLM.
  • Good understanding of the GPU stack, including CUDA, NCCL, drivers, and relevant libraries.
  • Familiarity with containerized environments such as Docker and Kubernetes.
  • Strong communication skills and the ability to work independently.
  • Preferred: familiarity with modern LLM inference frameworks such as vLLM, SGLang, and TensorRT.
  • Preferred: experience with Python and performance profiling tools such as Nsight, nvprof, and perf.
  • Preferred: familiarity with cloud ML platforms such as AWS, GCP, and Azure ML.
  • Preferred: contributions to open-source ML benchmarking tools.
  • Authorization to work in the country of application, with proof of employment eligibility required at hire.

Benefits

  • Competitive compensation.
  • Career growth and learning opportunities.
  • Flexibility and work-life balance.
  • Collaborative and innovative culture.
  • Opportunity to work on impactful AI projects.
  • International environment and talented teams.


