Nebius

Nebius enables B2B companies to build local hyperscale cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute cost. It offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Internet Software & Services

Information Technology

51-250 (120)

66 open positions

Senior Site Reliability Engineer (SRE, Compute Node Team)

2 hours, 6 minutes ago

Europe, Netherlands

Full-time

Senior

Site Reliability Engineer (SRE)

DevOps and Infrastructure

Kubernetes, Linux, System Design

Description

Ensure the reliability, availability, and performance of compute nodes running virtual machines.
Analyze and debug Linux systems across user space and kernel space, including trade-offs and boundaries at each layer.
Troubleshoot complex production issues involving CPU, memory, NUMA, cgroups, and scheduling.
Work hands-on with virtualization and containerization technologies, primarily QEMU/KVM and Linux-native tools.
Design and evolve observability for the node layer, including metrics, logs, traces, alerts, SLIs, and SLOs.
Lead incident response, root-cause analysis, and postmortems to drive long-term reliability improvements.
Collaborate with platform, kernel/hypervisor, GPU, and infrastructure teams to improve system design and operability.

Requirements

Strong Linux expertise, including deep understanding of Linux user space and kernel space.
Knowledge of kernel subsystems such as scheduler, memory management, filesystems, cgroups, and namespaces.
Hands-on experience with QEMU/KVM and understanding of VM lifecycle, performance characteristics, and failure modes.
Practical experience with containers, namespaces, and cgroups, with strong understanding of resource isolation and control.
Strong debugging skills and a structured, hypothesis-driven approach to incident analysis.
Clear understanding of the SRE role in system design and operations.
Experience building and operating observability stacks, not just consuming them.
Ability to turn system behavior into actionable reliability signals.
Experience with Kubernetes internals or node-level components is a plus.
Experience with low-level Linux debugging tools such as perf, eBPF, ftrace, strace, and kernel crash dumps is preferred.
Familiarity with large-scale compute or bare-metal platforms is preferred.
Contributions to open-source infrastructure or system software are preferred.
Experience debugging hardware and driver-level issues, including GPUs, NVLink, and InfiniBand, is preferred.

Benefits

Competitive salary and comprehensive benefits package.
Opportunities for professional growth within Nebius.
Flexible working arrangements.
A dynamic and collaborative work environment that values initiative and innovation.

Similar Roles

Director, Software Engineering (Site Reliability Engineering)

Affirm 1K-5K Diversified Financial Services

Affirm is seeking a senior Reliability Engineering leader to build and scale resilience, incident response, and risk management programs across its global engineering organization.

Canada, Full-time, Executive, Director of Engineering, Site Reliability Engineer (SRE)

$238k-$298k

36 minutes ago

Site Reliability Engineer

Orcrist Technologies Internet Software & Services

Orcrist is hiring a Site Reliability Engineer to deploy and operate its Kubernetes-based data intelligence platform in on-prem, hybrid, and agency-controlled environments for defense, law-enforcement, and enterprise customers.

Germany, Full-time, Senior, Site Reliability Engineer (SRE)

Ansible, Argo CD, Elasticsearch, Flux, GitOps, Grafana, Helm, Kubernetes, Prometheus, SAML, SIEM, Splunk, Terraform

1 hour, 21 minutes ago

Site Reliability Engineer-SkillBridge Intern

Zscaler 1K-5K Internet Software & Services

Zscaler is hiring a Site Reliability Engineer SkillBridge Intern to support its Zero Trust Exchange team in a remote role based in San Jose or Bellevue, helping operate and improve the cloud security platform behind its global cybersecurity services.

United States, Internship, Senior, Site Reliability Engineer (SRE)

Ansible, AWS, DNS, HTTP, Kubernetes, Python, SQL, Terraform

1 hour, 36 minutes ago

Senior Site Reliability Engineer I

instacart.careers 1K-5K Internet Software & Services

Instacart is hiring a Senior Site Reliability Engineer I to help maintain and improve the reliability, performance, and scalability of its grocery delivery platform and supporting services.

United States, Full-time, Senior, Site Reliability Engineer (SRE)

$155k-$196k

AWS, Azure, Docker, GCP, Go, Kubernetes, Ruby

2 hours, 51 minutes ago
