Senior HPC Cluster Engineer

7 hours, 5 minutes ago
Full-time
Senior
DevOps and Infrastructure
Nebius

Nebius enables B2B companies to build local hyperscaling cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute costs. It offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Internet Software & Services
51-250

Description

  • Tune the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC environments.
  • Analyze and troubleshoot root causes of GPU and InfiniBand issues and propose corrective actions.
  • Integrate new hardware into the existing infrastructure, including support for new GPU hardware through Kubernetes, QEMU, and KVM.
  • Enhance automation systems for proactive monitoring, detection, and resolution of issues in GPU and InfiniBand environments.
  • Configure and manage GPU devices and InfiniBand fabrics to ensure efficient and reliable operation.
  • Work with hardware virtualization and device emulation technologies in multi-GPU HPC environments.
  • Support the development and optimization of core cloud platform components.
  • Improve infrastructure performance and security across clustered systems.

Requirements

  • 5+ years of professional experience in system-level software development focused on performance optimization and low-level programming.
  • 3+ years of hands-on experience with Linux systems, including administration, troubleshooting, and performance tuning.
  • In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing systems.
  • Strong proficiency in one or more performance-oriented programming languages such as C/C++, Go, or Python.
  • Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking (preferred).
  • Proven track record of analyzing and optimizing the performance of HPC workloads such as simulations, data analysis, or AI/ML workloads (preferred).
  • Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication (preferred).
  • Background in Software-Defined Networking and experience with HPC cluster networking (preferred).
  • Understanding of QEMU/KVM virtualization and managing virtualized environments (preferred).
  • Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems (preferred).
  • Familiarity with collective communication libraries like MPI and NCCL for distributed computing (preferred).

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.
