Staff Site Reliability & DevOps Engineer - Observability

4 weeks ago
Full-time
Lead
DevOps and Infrastructure
Brandwatch

Brandwatch is the world’s leading provider of social media monitoring and analysis solutions, offering real-time coverage and reliable data to help global brands and agencies monitor online conversations, gain business insights, conduct market research...

Professional Services
1K-5K
Founded 2005
$83M raised

Description

  • Design, build, and operate observability platforms based on Grafana and Prometheus.
  • Define and maintain metrics standards, dashboards, alerts, and service-level objectives (SLOs).
  • Improve signal quality by reducing alert noise, tuning thresholds, and strengthening runbooks.
  • Support incident response with actionable telemetry and post-incident analysis.
  • Integrate metrics, logs, and traces across distributed systems.
  • Work with engineering teams to instrument services correctly.
  • Automate observability configuration using infrastructure as code.
  • Contribute to reliability improvements through capacity planning and performance analysis.

Requirements

  • Strong experience with Prometheus, including scraping, federation, recording rules, and alerting.
  • Strong experience with Grafana, including dashboards, alerting, templating, and RBAC.
  • Solid Linux and networking fundamentals.
  • Experience running observability stacks in Kubernetes environments.
  • Infrastructure as code experience, with Terraform preferred.
  • Familiarity with incident management and on-call practices.
  • Ability to debug production systems using metrics and logs.
  • Experience with logging and tracing tools such as Loki, Tempo, or OpenTelemetry is preferred.
  • Experience operating large-scale or multi-cluster Kubernetes platforms is preferred.
  • Experience with cloud platforms such as GCP, AWS, or OCI is preferred.
  • Exposure to SRE concepts such as error budgets and SLO-driven prioritisation is preferred.
