Staff Platform Site Reliability Specialist (Observability & Kubernetes)

2 hours, 30 minutes ago
Full-time
Senior
DevOps and Infrastructure
Everbridge

Everbridge provides a comprehensive software platform that automates and enhances organizations' responses to critical events, ensuring the safety of individuals and the continuity of business operations during emergencies such as natural disasters, cy...

Internet Software & Services
1K-5K
Founded 2002

Description

  • Own the design, operation, and evolution of Everbridge’s observability stack.
  • Build and maintain a highly available and scalable observability platform.
  • Standardize instrumentation, dashboards, alerts, and service level objectives (SLOs).
  • Support incident response, root cause analysis, and capacity planning.
  • Operate and scale the Grafana ecosystem, including Grafana Loki, Mimir, Tempo, and Alerting.
  • Maintain the reliability and security of EKS clusters that support the observability platform.
  • Manage Kubernetes cluster lifecycle activities, including upgrades.
  • Provision infrastructure using Terraform.
  • Support infrastructure automation with HashiCorp Packer and GitLab CI/CD at scale.
  • Collaborate professionally with other teams to keep work moving forward and build trust.

Requirements

  • 6+ years of experience in SRE or Platform Engineering.
  • Strong experience with the Grafana ecosystem.
  • Hands-on experience with Kubernetes and Amazon EKS.
  • Proficiency with Terraform.
  • Experience working with AWS and GCP cloud technologies.
  • Experience with infrastructure provisioning and automation tools such as HashiCorp Packer and GitLab CI/CD (preferred).
  • Ability to communicate clearly and collaborate effectively across teams.
  • Comfort working in a highly visible, enterprise-scale cloud-native environment.

Benefits

  • Salary range of CAD $135,000 to $165,000, with possible variable compensation.
  • Comprehensive healthcare and dental care benefits.
  • Mental health benefits.
  • Disability income benefits.
  • Life and AD&D insurance.
  • Retirement savings plan with employer match.
  • Paid time off.

Similar Roles

Lead Engineer - Platform Performance & Reliability

HighLevel 251-1K Internet Software & Services

HighLevel is hiring a Lead Engineer for its Platform Performance & Reliability team to improve the speed, stability, and operational health of a high-traffic global SaaS platform.

AWS ClickHouse Firestore GCP Grafana Kubernetes Microservices MongoDB MySQL Node.js OpenTelemetry PostgreSQL Prometheus Redis
15 minutes ago

Senior Cluster Site Reliability Engineer

The Voleon Group 51-250 Capital Markets

Senior Cluster Site Reliability Engineer at Voleon, responsible for scaling and operating the company’s research compute cluster that supports machine learning research and investment management workloads across on-prem and cloud environments.

Ansible Apache Airflow Apache Spark AWS Docker GCP Grafana Kubeflow Kubernetes Machine Learning OpenTelemetry Podman Prometheus Python PyTorch Ruby TensorFlow Terraform
30 minutes ago

Infrastructure Reliability Engineer

Tecsys 251-1K Air Freight & Logistics

Tecsys is seeking an infrastructure reliability engineer for its NOC to ensure the reliability, performance, and evolution of its critical SaaS platforms on AWS and Kubernetes.

AWS Datadog Kubernetes Terraform
45 minutes ago

Sr. Site Reliability Engineer (Remote, Mexico)

IO Connect Services is seeking a remote Senior Site Reliability Engineer in Mexico to help design, automate, and scale cloud infrastructure and production services for customer deployments across a LATAM engineering team.

Ansible AWS Azure C++ Chef CI/CD Datadog GCP HDFS Java JavaScript Kubernetes PowerShell Puppet Python Ruby Terraform
1 hour ago
