Nebius

Nebius enables B2B companies to build local hyperscale cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute costs. The company offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Industry: Internet Software & Services
Company size: 51-250 employees

Description

  • Work with hardware and development teams to profile and analyze GPU performance at the system and kernel level.
  • Evaluate and compare GPU performance across different platforms, architectures, and software stacks such as CUDA and ROCm.
  • Debug and optimize ML workloads to run efficiently on GPU hardware by identifying and resolving performance bottlenecks.
  • Perform acceptance testing for new GPU clusters to verify performance, stability, and compatibility for AI workloads.
  • Run experiments across diverse GPU system configurations to assess the impact of interconnect strategies and system-level optimizations.
  • Develop tools and dashboards to visualize performance metrics, bottlenecks, and trends.
  • Contribute to internal tooling, frameworks, and best practices for GPU benchmarking and optimization.

Requirements

  • A strong understanding of the theoretical foundations of machine learning.
  • Deep understanding of performance aspects of large neural network training and inference, including data, tensor, context, and expert parallelism, offloading, custom kernels, hardware features, attention optimizations, and dynamic batching.
  • Extensive experience with modern deep learning frameworks such as PyTorch, JAX, Megatron-LM, and TensorRT-LLM.
  • Good understanding of the GPU stack, including CUDA, NCCL, drivers, and relevant libraries.
  • Familiarity with containerized environments such as Docker and Kubernetes.
  • Strong communication skills and the ability to work independently.
  • Familiarity with modern LLM inference frameworks such as vLLM, SGLang, and TensorRT-LLM (preferred).
  • Experience with Python and performance profiling tools such as Nsight, nvprof, and perf (preferred).
  • Familiarity with cloud ML platforms such as AWS, GCP, and Azure ML (preferred).
  • Contributions to open-source ML benchmarking tools (preferred).
  • Authorization to work in the country of application, with proof of employment eligibility required at hire.

Benefits

  • Competitive compensation.
  • Career growth and learning opportunities.
  • Flexibility and work-life balance.
  • Collaborative and innovative culture.
  • Opportunity to work on impactful AI projects.
  • International environment and talented teams.
