Staff Platform Site Reliability Specialist (Observability & Kubernetes)

12 hours, 6 minutes ago
Full-time
Lead
DevOps and Infrastructure
Everbridge

Everbridge provides a comprehensive software platform that automates and enhances organizations' responses to critical events, ensuring the safety of individuals and the continuity of business operations during emergencies such as natural disasters, cy...

Internet Software & Services
1K-5K
Founded 2002

Description

  • Own the design, operation, and ongoing evolution of Everbridge’s observability stack.
  • Build and maintain a highly available, scalable observability platform.
  • Standardize instrumentation, dashboards, alerts, and SLOs across engineering teams.
  • Support incident response, root cause analysis, and capacity planning.
  • Operate and scale Grafana and related telemetry services, including Loki, Mimir, Tempo, and Alerting.
  • Maintain the reliability and security of EKS clusters running the observability platform.
  • Manage Kubernetes cluster lifecycle activities, including upgrades.
  • Use Terraform to provision infrastructure as code.
  • Support automation and CI/CD workflows using HashiCorp Packer and GitLab CI/CD.
  • Collaborate with other teams to keep systems running smoothly and move work forward.

Requirements

  • 6+ years of experience in SRE or Platform Engineering.
  • Strong experience with the Grafana ecosystem.
  • Experience with Kubernetes and Amazon EKS.
  • Proficiency with Terraform.
  • Experience working with cloud technologies in AWS and GCP.
  • Familiarity with observability tooling such as Grafana Loki, Grafana Mimir, Grafana Tempo, and Grafana Alerting (preferred).
  • Experience with HashiCorp Packer and GitLab CI/CD at scale (preferred).
  • Ability to communicate clearly, collaborate effectively, and work respectfully with cross-functional teams.
  • Comfort supporting incident response and reliability-focused operations.
  • Experience in large-scale, cloud-native environments (preferred).

Benefits

  • Salary range of CAD $135,000 to $165,000, with possible variable compensation.
  • Comprehensive healthcare and dental care.
  • Mental health benefits.
  • Disability income benefits.
  • Life and AD&D insurance.
  • Retirement savings plan with employer match.
  • Paid time off.
