Network Site Reliability Engineer (NetSRE)

2 hours, 11 minutes ago
Full-time
Senior
DevOps and Infrastructure
Nebius

Nebius enables B2B companies to build local hyperscale cloud platforms with cost-effective GPUs, InfiniBand networking, and 50% lower compute costs. It offers managed Kubernetes and a launch-ready business model for innovative cloud solutions.

Internet Software & Services
51-250

Description

  • Define and own reliability goals for network services and critical paths, including SLIs, SLOs, availability targets, and error budgets where appropriate.
  • Drive reliability improvements across the network, including services, site readiness, inter-site connectivity, and operational standards.
  • Own incident response for your areas, lead investigations and postmortems, and turn failures into durable fixes.
  • Build and improve observability through actionable metrics, logs, traces, and alerting, so that debugging is faster during and after incidents.
  • Design safer change workflows for network operations, including automation, CI/CD, testing, staging, canarying, rollbacks, and auditability.
  • Work closely with network engineers and platform teams to embed operability into designs and keep operations practical and efficient.

Requirements

  • Strong production Linux fundamentals and a structured approach to debugging complex systems.
  • Solid understanding of networking basics and how real networks fail, including control plane vs. data plane, latency, loss, and failure domains.
  • Hands-on experience operating high-availability systems and improving them over time.
  • Ability to write and maintain software and automation; Go is common and Python is also welcome.
  • Experience with modern infrastructure tooling such as IaC, CI/CD, and container platforms, with comfort automating operational workflows.
  • Experience with high-throughput traffic processing, such as load balancers, tunneling/decap, NAT64, or similar datapath-heavy systems, is a plus.
  • Low-level networking performance and debugging experience, such as eBPF/XDP, DPDK, perf/ftrace, or kernel networking internals, is a plus.
  • Experience building network-safe delivery pipelines, such as testing labs, staged rollouts, automated verification, and drift detection, is a plus.
  • Background in large-scale network observability and telemetry, such as routing or flow telemetry and regression detection at scale, is a plus.

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.
