Nebius

Nebius enables B2B companies to build local hyperscale cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute costs. It offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Internet Software & Services
51-250

Description

  • Architect and implement scalable HPC clusters for AI, simulation, and distributed training using orchestration frameworks and schedulers such as Kubernetes and Slurm.
  • Design and integrate GPU-accelerated compute infrastructure built on NVIDIA Hopper and Blackwell architectures, NVLink/NVSwitch, and InfiniBand/RoCE interconnects.
  • Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management of GPU and high-speed networking components.
  • Design and validate cloud HPC environments with low-latency networking, high bandwidth, multi-GPU scaling, and efficient workload scheduling.
  • Lead reference architectures for AI/ML model training, data pipelines, and MLOps integrations using observability and CI/CD tooling.
  • Collaborate with hardware vendors and cloud providers to evaluate and optimize emerging HPC and GPU technologies.
  • Benchmark system performance, identify bottlenecks, and tune resource utilization across compute, network, and storage tiers.
  • Provide expert technical guidance to customers, internal teams, and partners through HPC architecture patterns, operational excellence reviews, and customer engagements.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field; Ph.D. is a plus.
  • 3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
  • Expertise in Linux systems, Kubernetes, container runtimes such as containerd, CRI-O, and Docker, and related CI/CD practices.
  • Strong understanding of HPC networking protocols and RDMA stacks, including InfiniBand and NVLink/NVSwitch.
  • Deep understanding of storage and I/O optimization for large datasets, including Ceph, Lustre, NFS, and GPUDirect Storage.
  • Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
  • Strong scripting skills in Python or Bash for automation and tool integration.
  • Excellent communication and documentation skills, including the ability to lead design reviews and customer engagements.
  • Proficiency with the NVIDIA GPU ecosystem, including GPU Operator, MIG, DCGM, NCCL, Nsight, and CUDA stack management, is a strong plus.
  • Experience designing or managing AI/ML pipelines with MLflow, Kubeflow, NeMo, or similar frameworks is a plus.
  • Experience with HPC workload schedulers such as Slurm, LSF, and PBS in cloud-native environments is preferred.
  • Background in designing multi-tenant GPU infrastructures or AI training farms is a plus.
  • Exposure to distributed ML frameworks such as PyTorch DDP, DeepSpeed, and Megatron is a plus.
  • Knowledge of HPC observability tools such as Prometheus, DCGM Exporter, Grafana, and NVIDIA NGC monitoring tools is a plus.
  • Contribution to open-source HPC, CUDA, or Kubernetes projects is a strong plus.
  • Remote work eligibility in the United States or Canada.

Benefits

  • Competitive salary of 225k–315k OTE, based on experience, skills, and location, plus equity.
  • 100% company-paid medical, dental, and vision coverage for employees and families.
  • Up to 4% company match on the 401(k) plan with immediate vesting.
  • 20 weeks of paid parental leave for primary caregivers and 12 weeks for secondary caregivers.
  • Up to $85 per month in remote work reimbursement for mobile and internet costs.
  • Company-paid short-term disability, long-term disability, and life insurance coverage.
  • Flexible working arrangements, including remote work.
  • Opportunities for professional growth within Nebius.
