Senior Site Reliability Engineer (Resilience) - Platform Resilience

3 days, 14 hours ago
Full-time
Senior
Software Development
Elastic

Elastic is a leading platform for search-powered solutions, providing real-time insights and making data usable for developers and enterprises worldwide.

Internet Software & Services
1K-5K
Founded 2010

Description

  • Design, build, scale and mature the multi-cloud platform for hosting internal and external services (e.g., Elastic Cloud Hosted, Serverless).
  • Lead technical initiatives to automate system engineering efforts that guarantee reliability of global Elastic infrastructure.
  • Develop and maintain software, tooling, and automations to support platform growth and meet scaling demands.
  • Develop and extend internal infrastructure tools so products across Elastic can be deployed rapidly and reliably.
  • Respond to major incidents, drive prioritized problem management, and implement preventative actions to avoid repeated customer impact.
  • Participate in a follow-the-sun on-call rotation and ensure operational readiness and incident response across time zones.
  • Champion collaboration, operational excellence, and inclusive communication; mentor and uplift teammates and strengthen partner relationships.

Requirements

  • Proven experience in Site Reliability Engineering or operations with a customer-first approach to solving operational problems.
  • Background in software engineering to collaborate with engineering teams and implement engineering solutions.
  • Experience with public cloud platforms and managed Kubernetes services (advantageous).
  • Experience operating a SaaS product in public cloud and using Infrastructure-as-Code tools such as Crossplane or Terraform (preferred).
  • Experience building or operating Kubernetes-at-scale infrastructure, ideally across multiple cloud providers (preferred).
  • Proficiency writing non-trivial programs in Golang or other programming languages (preferred).
  • Experience with containerized services (e.g., Docker).
  • Experience leading and improving alerting, major incident management, and metrics systems (e.g., Elastic Stack, Graphite, Prometheus, Influx).
  • System administration experience on Linux for distributed systems at scale.
  • Experience working in distributed or remote teams, thriving in self-organizing environments, and coaching or mentoring teammates.

Benefits

  • Typical starting base salary range: $154,800 to $195,600 USD.
  • Eligible to participate in Elastic's stock program.
  • Company-matched 401(k) with dollar-for-dollar matching up to 6% of eligible earnings.
  • Competitive pay based on the work performed (not previous salary).
  • Health coverage for you and your family in many locations.
  • Flexible locations and schedules for many roles (distributed/remote-friendly).
  • Generous vacation allowance and ability to craft your calendar.
  • Up to $2,000 in matched donations, up to 40 volunteer hours per year, and a minimum of 16 weeks of parental leave.
