Staff Site Reliability & DevOps Engineer - Observability

2 hours, 20 minutes ago
Full-time
Lead
DevOps and Infrastructure
Brandwatch

Brandwatch is the world’s leading provider of social media monitoring and analysis solutions, offering real-time coverage and reliable data to help global brands and agencies monitor online conversations, gain business insights, conduct market research...

Professional Services
1K-5K
Founded 2005
$83M raised

Description

  • Design, build, and operate observability platforms based on Grafana and Prometheus.
  • Define and maintain metrics standards, dashboards, alerts, and service-level objectives (SLOs).
  • Improve signal quality by reducing alert noise, tuning thresholds, and strengthening runbooks.
  • Support incident response with actionable telemetry and post-incident analysis.
  • Integrate metrics, logs, and traces across distributed systems.
  • Work with engineering teams to instrument services correctly.
  • Automate observability configuration using infrastructure as code.
  • Contribute to reliability improvements through capacity planning and performance analysis.

Requirements

  • Strong experience with Prometheus, including scraping, federation, recording rules, and alerting.
  • Strong experience with Grafana, including dashboards, alerting, templating, and RBAC.
  • Solid Linux and networking fundamentals.
  • Experience running observability stacks in Kubernetes environments.
  • Infrastructure as code experience, with Terraform preferred.
  • Familiarity with incident management and on-call practices.
  • Ability to debug production systems using metrics and logs.
  • Experience with logging and tracing tools such as Loki, Tempo, or OpenTelemetry is preferred.
  • Experience operating large-scale or multi-cluster Kubernetes platforms is preferred.
  • Experience with cloud platforms such as GCP, AWS, or OCI is preferred.
  • Exposure to SRE concepts such as error budgets and SLO-driven prioritisation is preferred.
