GPU Cluster Architect

2 months ago
Full-time
Senior
DevOps and Infrastructure
Nebius

Nebius

Nebius enables B2B companies to build local hyperscale cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute costs. It offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Internet Software & Services
51-250 employees

Description

  • Architect scalable GPU cluster topologies, including compute nodes, interconnects, storage, and control planes.
  • Make end-to-end architectural decisions for AI infrastructure across compute, networking, storage, cooling, and power.
  • Analyze AI/ML workloads such as LLM training and inference to guide tradeoffs in latency, bandwidth, and GPU density.
  • Validate low-latency, high-throughput interconnect designs at POD and data center scale.
  • Work with storage teams to optimize performance for training datasets, checkpointing, and related workflows.
  • Monitor infrastructure signals and use them to identify and address design issues.
  • Partner with site reliability, networking, storage, and data center engineering teams to operationalize and scale the architecture.

Requirements

  • 5+ years of experience designing clusters.
  • Deep understanding of modern GPU architectures, including NVIDIA and AMD.
  • Experience with HPC interconnects such as InfiniBand and RoCE.
  • Solid background in systems architecture, networking, and hardware reliability.
  • Experience scripting automation and telemetry pipelines using Python, Go, or similar languages.
  • Experience with InfiniBand HDR/NDR and RoCEv2 is preferred.
  • Remote work eligibility from the USA.

Benefits

  • Competitive base salary of $150k-$180k plus quarterly performance bonuses.
  • 100% company-paid medical, dental, and vision coverage for employees and families.
  • 401(k) plan with up to 4% company match and immediate vesting.
  • Paid parental leave: 20 weeks for primary caregivers and 12 weeks for secondary caregivers.
  • Company-paid short-term disability, long-term disability, and life insurance coverage.
  • Remote work reimbursement of up to $85 per month for mobile and internet.
  • Hybrid or flexible working arrangements.
  • Opportunities for professional growth within Nebius.

