Nebius

Nebius enables B2B companies to build local hyperscale cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute costs. It offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Industry: Internet Software & Services
Company size: 51-250 employees

Description

  • Tune the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments.
  • Analyze and troubleshoot the root cause of GPU and InfiniBand issues and propose corrective actions.
  • Integrate new hardware into the existing infrastructure, including support for new GPU hardware through Kubernetes, QEMU, and KVM.
  • Enhance automation systems for proactive monitoring, detection, and resolution of issues in GPU and InfiniBand environments.
  • Configure and manage GPU devices and InfiniBand fabrics to ensure efficient and reliable operation.
  • Work with hardware virtualization and device emulation technologies to maintain high performance and security in multi-GPU environments.
  • Improve core cloud infrastructure components that support GPU computing, networking, and the KVM/QEMU stack.

Requirements

  • 5+ years of professional experience in system-level software development focused on performance optimization and low-level programming.
  • 3+ years of hands-on experience with Linux systems, including administration, troubleshooting, and performance tuning.
  • In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/kernel, and high-performance computing systems.
  • Strong proficiency in one or more performance-oriented programming languages such as C, C++, Go, or Python.
  • Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking (preferred).
  • Proven track record of analyzing and optimizing the performance of HPC workloads such as simulations, data analysis, or AI/ML workloads (preferred).
  • Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication (preferred).
  • Background in Software-Defined Networking and HPC cluster networking (preferred).
  • Understanding of QEMU/KVM virtualization and managing virtualized environments (preferred).
  • Experience with deep learning frameworks such as PyTorch and TensorFlow and their integration with HPC systems (preferred).
  • Familiarity with collective communication libraries like MPI and NCCL for distributed computing (preferred).
  • Coding interviews are part of the hiring process.

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.
