Nebius

Nebius enables B2B companies to build local hyperscale cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute cost. It offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Internet Software & Services
51-250

Description

  • Architect and implement scalable HPC clusters for AI, simulation, and distributed training using orchestration frameworks and schedulers such as Kubernetes and Slurm.
  • Design and integrate GPU-accelerated infrastructure using NVIDIA Hopper and Blackwell architectures, NVLink/NVSwitch, and InfiniBand/RoCE interconnects.
  • Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management.
  • Design and validate cloud HPC environments with low-latency networking, multi-GPU scaling, and efficient workload scheduling.
  • Lead reference architectures for AI/ML model training, data pipelines, and MLOps integrations.
  • Collaborate with hardware vendors and cloud providers to evaluate and optimize HPC and GPU technologies.
  • Benchmark system performance, identify bottlenecks, and tune utilization across compute, network, and storage layers.
  • Provide technical guidance on HPC architecture to customers, internal teams, and partners, including during operational reviews and customer engagements.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field; Ph.D. is a plus.
  • 3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
  • Expertise in Linux systems, Kubernetes, container runtimes such as containerd, CRI-O, and Docker, and related CI/CD practices.
  • Strong understanding of HPC networking protocols and RDMA stacks, including InfiniBand and NVLink/NVSwitch.
  • Deep understanding of storage and I/O optimization for large datasets, including Ceph, Lustre, NFS, and GPUDirect Storage.
  • Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
  • Strong scripting skills in Python or Bash for automation and tool integration.
  • Excellent communication and documentation skills, with the ability to lead design reviews and customer engagements.
  • Proficiency with the NVIDIA GPU ecosystem, including GPU Operator, MIG, DCGM, NCCL, Nsight, and CUDA stack management, is an added bonus.
  • Experience designing or managing AI/ML pipelines with tools such as MLflow, Kubeflow, or NeMo is a plus.
  • Experience with cloud-native HPC offerings such as Slurm, LSF, and PBS is preferred.
  • Background in designing multi-tenant GPU infrastructures or AI training farms is a plus.
  • Exposure to distributed ML frameworks such as PyTorch DDP, DeepSpeed, and Megatron is preferred.
  • Knowledge of HPC observability tools such as Prometheus, DCGM Exporter, Grafana, and NVIDIA NGC monitoring tools is a plus.
  • Contribution to open-source HPC, CUDA, or Kubernetes projects is a strong plus.

Benefits

  • 100% company-paid medical, dental, and vision coverage for employees and families.
  • Up to 4% company match in the 401(k) plan with immediate vesting.
  • 20 weeks of paid parental leave for primary caregivers and 12 weeks for secondary caregivers.
  • Up to $85 per month in remote work reimbursement for mobile and internet expenses.
  • Company-paid short-term disability, long-term disability, and life insurance coverage.
  • Competitive salary of $225k–$315k OTE, plus equity based on experience, skills, and location.
  • Flexible working arrangements, including remote work from the United States or Canada.
  • Opportunities for professional growth within Nebius.
