Senior Site Reliability Engineer (SRE, Compute Node Team)

2 hours, 6 minutes ago
Full-time
Senior
DevOps and Infrastructure
Nebius

Nebius enables B2B companies to build local hyperscale cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute costs. It offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Internet Software & Services
51-250 employees

Description

  • Ensure the reliability, availability, and performance of compute nodes running virtual machines.
  • Analyze and debug Linux systems across user space and kernel space, including trade-offs and boundaries at each layer.
  • Troubleshoot complex production issues involving CPU, memory, NUMA, cgroups, and scheduling.
  • Work hands-on with virtualization and containerization technologies, primarily QEMU/KVM and Linux-native tools.
  • Design and evolve observability for the node layer, including metrics, logs, traces, alerts, SLIs, and SLOs.
  • Lead incident response, root-cause analysis, and postmortems to drive long-term reliability improvements.
  • Collaborate with platform, kernel/hypervisor, GPU, and infrastructure teams to improve system design and operability.

Requirements

  • Strong Linux expertise, including a deep understanding of both user space and kernel space.
  • Knowledge of kernel subsystems such as the scheduler, memory management, filesystems, cgroups, and namespaces.
  • Hands-on experience with QEMU/KVM and an understanding of the VM lifecycle, performance characteristics, and failure modes.
  • Practical experience with containers, namespaces, and cgroups, with a strong understanding of resource isolation and control.
  • Strong debugging skills and a structured, hypothesis-driven approach to incident analysis.
  • Clear understanding of the SRE role in system design and operations.
  • Experience building and operating observability stacks, not just consuming them.
  • Ability to turn system behavior into actionable reliability signals.
  • Experience with Kubernetes internals or node-level components is a plus.
  • Experience with low-level Linux debugging tools such as perf, eBPF, ftrace, strace, and kernel crash dumps is preferred.
  • Familiarity with large-scale compute or bare-metal platforms is preferred.
  • Contributions to open-source infrastructure or system software are preferred.
  • Experience debugging hardware and driver-level issues, including GPUs, NVLink, and InfiniBand, is preferred.

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.

