Nebius

Nebius enables B2B companies to build local hyperscale cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute costs. It offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Internet Software & Services

Information Technology

51-250 (120)

182 open positions

HPC Specialist Solutions Architect

1 day, 17 hours ago

United States, Canada

Full-time

Senior

Infrastructure Engineer

DevOps and Infrastructure

Ansible, Bash, CI/CD, CRI-O, Docker, GitOps, Grafana, Helm, Kubeflow, Kubernetes, Linux, Machine Learning, MLflow, Prometheus, Python, PyTorch, Terraform

Description

Architect and implement scalable HPC clusters for AI, simulation, and distributed training using orchestration frameworks and schedulers such as Kubernetes and Slurm.
Design and integrate GPU-accelerated infrastructure using NVIDIA Hopper and Blackwell architectures, NVLink/NVSwitch, and InfiniBand/RoCE interconnects.
Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management.
Design and validate cloud HPC environments with low-latency networking, multi-GPU scaling, and efficient workload scheduling.
Lead reference architectures for AI/ML model training, data pipelines, and MLOps integrations.
Collaborate with hardware vendors and cloud providers to evaluate and optimize HPC and GPU technologies.
Benchmark system performance, identify bottlenecks, and tune utilization across compute, network, and storage layers.
Provide technical guidance on HPC architecture to customers, internal teams, and partners, including during operational reviews and customer engagements.

Requirements

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field; Ph.D. is a plus.
3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
Expertise in Linux systems, Kubernetes, container runtimes such as containerd, CRI-O, and Docker, and related CI/CD practices.
Strong understanding of HPC networking protocols and RDMA stacks, including InfiniBand and NVLink/NVSwitch.
Deep understanding of storage and I/O optimization for large datasets, including Ceph, Lustre, NFS, and GPUDirect Storage.
Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
Strong scripting skills in Python or Bash for automation and tool integration.
Excellent communication and documentation skills, with the ability to lead design reviews and customer engagements.
Proficiency with the NVIDIA GPU ecosystem, including GPU Operator, MIG, DCGM, NCCL, Nsight, and CUDA stack management, is an added bonus.
Experience designing or managing AI/ML pipelines with tools such as MLflow, Kubeflow, or NeMo is a plus.
Experience with cloud-native HPC offerings such as Slurm, LSF, and PBS is preferred.
Background in designing multi-tenant GPU infrastructures or AI training farms is a plus.
Exposure to distributed ML frameworks such as PyTorch DDP, DeepSpeed, and Megatron is preferred.
Knowledge of HPC observability tools such as Prometheus, DCGM Exporter, Grafana, and NVIDIA NGC monitoring tools is a plus.
Contribution to open-source HPC, CUDA, or Kubernetes projects is a strong plus.

Benefits

100% company-paid medical, dental, and vision coverage for employees and families.
Up to 4% company match in the 401(k) plan with immediate vesting.
20 weeks of paid parental leave for primary caregivers and 12 weeks for secondary caregivers.
Up to $85 per month in remote work reimbursement for mobile and internet expenses.
Company-paid short-term and long-term disability and life insurance coverage.
Competitive salary of $225k–$315k OTE plus equity, based on experience, skills, and location.
Flexible working arrangements, including remote work from the United States or Canada.
Opportunities for professional growth within Nebius.

Similar Roles

Specialist Solutions Architect - Cloud Infrastructure & Security

Databricks, 1K-5K, IT Services

Databricks is seeking a Specialist Solutions Architect focused on Cloud Infrastructure and Security to help customers design, deploy, and secure Databricks environments across public cloud platforms.

United States, Full-time, Senior, Infrastructure Engineer, Security Engineer

$264k-$363k

Apache Spark, AWS, Azure, Databricks, Encryption, GCP, Hadoop, Java, Kafka, Network Security, OAuth, Python, SAML, Scala, SQL, Terraform

1 day, 2 hours ago

Staff Machine Learning Engineer, AI Researcher

Cribl, 251-1K, IT Services

Cribl is hiring a remote-first machine learning engineer to help build AI-enabled security and observability products that solve real customer problems.

United States, Full-time, Senior, Machine Learning Engineer

$230k-$275k

Computer Vision, Feature Engineering, Kubeflow, Machine Learning, MLflow, MLOps, NLP, Python, PyTorch, Reinforcement Learning, TensorFlow

1 day, 3 hours ago

Staff Machine Learning Engineer - Platform (Core AI Automation)

Coinbase, 1K-5K, Capital Markets

Coinbase is hiring a Machine Learning Engineer for its Core Automation Team to build AI infrastructure and automation that improve customer support, compliance operations, and AI-powered customer interactions on its onchain platform.

United States, Full-time, Lead, Machine Learning Engineer

$218k-$256k

Apache Airflow, Apache Spark, Blockchain, Computer Vision, Databricks, Deep Learning, Flink, Generative AI, Kafka, LLM, Machine Learning, NLP, Python, Snowflake

1 day, 3 hours ago

Software Engineer - ML Platform

Veriff, 51-250, IT Services

Veriff’s ML Platform team is hiring a software or ML engineer to build the systems that support machine learning development, experimentation, observability, and scalable model deployment.

Spain, Full-time, Mid Level, Machine Learning Engineer, Software Engineer

Apache Spark, dbt, Grafana, Kubeflow, MLflow, MLOps, Prometheus, Python, Snowflake, SQL

1 day, 3 hours ago
