Site Reliability Specialist (Observability & Kubernetes)

18 hours, 58 minutes ago
Full-time
Senior
DevOps and Infrastructure
Everbridge

Everbridge provides a comprehensive software platform that automates and enhances organizations' responses to critical events, ensuring the safety of individuals and the continuity of business operations during emergencies such as natural disasters, cy...

Internet Software & Services
1K-5K employees
Founded 2002

Description

  • Own the design, operation, and evolution of the observability platform.
  • Build and maintain a highly available and scalable observability stack.
  • Standardize instrumentation, dashboards, alerts, and SLOs across engineering teams.
  • Support incident response, root cause analysis, and capacity planning.
  • Operate and scale Grafana and its telemetry components, including Loki, Mimir, Tempo, and Alerting.
  • Maintain the reliability and security of EKS clusters supporting observability services.
  • Manage Kubernetes cluster lifecycle activities, including upgrades.
  • Provision infrastructure using Terraform and automate platform operations.
  • Work with GitLab CI/CD pipelines at scale.
  • Collaborate across teams in AWS and GCP environments.

Requirements

  • 6+ years of experience in SRE or Platform Engineering.
  • Strong experience with the Grafana ecosystem.
  • Hands-on expertise with Kubernetes and Amazon EKS.
  • Proficiency with Terraform.
  • Experience with OpenTelemetry (preferred).
  • Experience operating large-scale observability systems (preferred).
  • Cost optimization experience (preferred).
  • Experience with infrastructure provisioning and automation tools such as HashiCorp Packer and GitLab CI/CD.
  • Ability to work remotely in the United States.
  • Strong communication, collaboration, and professionalism with cross-functional teams.

Benefits

  • Salary range of $118,700 to $145,000 per year, plus possible variable compensation.
  • Healthcare and dental coverage.
  • Parental planning benefits.
  • Mental health benefits.
  • Disability income benefits.
  • Life and AD&D insurance.
  • 401(k) plan with company match.
  • Paid time off and fitness reimbursements.
