Staff Software Engineer - Grafana Databases, Managed Services | Canada | Remote

1 day, 3 hours ago
Full-time
Lead
DevOps and Infrastructure
Grafana

Grafana is the open observability platform providing analytics, monitoring, and visualization solutions with a focus on user control and cost efficiency.

IT Services
1K-5K employees
Founded 2014
$535M raised

Description

  • Operate and evolve 100+ multi-cloud streaming clusters and related database infrastructure.
  • Diagnose and resolve cross-layer failures involving storage latency, noisy neighbors, control-plane bottlenecks, and query regressions.
  • Design safe upgrade, rollout, migration, and partitioning strategies at scale.
  • Improve observability, automation, and day-to-day operational ergonomics.
  • Partner with database and platform teams to support safe scaling, consumer fan-out, and query performance.
  • Work hands-on with distributed systems behavior, Kubernetes scheduling, storage engines, and compression trade-offs.
  • Serve as a primary escalation point and participate in on-call incident response.
  • Own relationships with system vendors, including WarpStream Labs.
  • Define and evolve technical direction for operating WarpStream and adjacent shared database systems.
  • Mentor engineers and help mature the team’s technical practices.

Requirements

  • 8+ years of engineering experience, including time in SRE, platform engineering, production engineering, infrastructure engineering, or distributed systems roles.
  • Experience with high-throughput streaming systems, analytical or storage backends, or large-scale database infrastructure such as Kafka, Redpanda, WarpStream, Postgres, ClickHouse, Snowflake, or Cassandra.
  • Strong Kubernetes experience in AWS, GCP, or Azure, plus familiarity with infrastructure-as-code tools such as Helm, Terraform, or Jsonnet.
  • Experience leading or driving complex technical efforts, even without formal management responsibilities.
  • Strong understanding of distributed systems failure modes in multi-cloud environments.
  • Proficiency in at least one systems-oriented language; Go is preferred.
  • Working knowledge of Linux internals, networking, cloud storage, and performance/scaling behavior.
  • Experience participating in blameless incident response and writing high-quality post-incident reviews.
  • Clear communication skills and the ability to collaborate across teams while working autonomously.
  • Must be located in Canadian time zones; role is remote-first.

Benefits

  • Base salary range in Canada: CAD 186,368 to CAD 223,642.
  • Equity and bonus eligibility, where applicable.
  • All roles include Restricted Stock Units (RSUs).
  • 100% remote, global work environment.
  • Global annual leave policy of 30 days per year, including 3 Grafana Shutdown Days.
  • In-person onboarding to help new hires get started.
  • Access to company-funded modern AI coding assistants within security guidelines.
  • Access to frontier AI models for daily development work.
  • Career growth pathways and development opportunities.
