Staff Site Reliability Engineer, Production Engineering

4 weeks ago
Full-time
Lead
DevOps and Infrastructure
Dropbox

Dropbox is a technology company that builds simple, powerful products for individuals and businesses. With over 700 million registered users worldwide, Dropbox offers file sync, sharing, online backup, cloud storage, collaboration tools, and more to st...

Internet Software & Services
1K-5K
Founded 2007

Description

  • Define and evolve Dropbox’s company-wide technical reliability strategy for an AI-assisted engineering environment.
  • Set multi-year goals, standards, and roadmaps for observability, debugging, incident management, service health, and operational readiness.
  • Lead cross-team efforts to reduce reliability risk as deployment velocity, pull request volume, service complexity, and incident volume increase.
  • Partner with engineering leaders and platform teams to improve monitoring, alerting, debugging, SLOs, SLAs, and incident response systems.
  • Identify emerging reliability risks from AI-enabled workflows and design scalable systems, processes, and guardrails to mitigate them.
  • Provide technical leadership and mentorship to engineers across teams to raise engineering quality and operational excellence.
  • Drive communication and alignment with senior stakeholders on reliability priorities, tradeoffs, risks, and execution progress.

Requirements

  • BS degree in Computer Science or a related technical field involving coding, or equivalent technical experience.
  • 12+ years of experience in software engineering, site reliability engineering, infrastructure engineering, or related technical roles.
  • Proven ability to define and deliver multi-year, multi-team reliability, infrastructure, or platform strategies with measurable business and customer impact.
  • Deep experience with distributed systems, production operations, observability, incident response, SLOs/SLAs, debugging, and reliability risk management.
  • Demonstrated ability to diagnose complex technical problems, debug production systems, automate operational workflows, and design resilient software components.
  • Experience influencing engineering roadmaps across multiple teams and making technical decisions for the broader engineering organization.
  • Strong communication and collaboration skills, with the ability to align cross-functional stakeholders through ambiguity and drive execution across teams.
  • Experience adapting reliability strategies, developer tooling, or operational processes for AI-assisted software development workflows (preferred).
  • Experience building or scaling observability, debugging, incident management, or developer productivity platforms for large engineering organizations (preferred).
  • Experience leading reliability improvements in environments with high deployment velocity, complex service dependencies, and large-scale production systems (preferred).
  • Track record of mentoring senior engineers, setting technical standards, and spreading reliability best practices through documentation, reviews, talks, or architecture guidance (preferred).
  • Familiarity with AI-enabled tooling, agentic development workflows, or operational risks introduced by rapid automation in the software development lifecycle (preferred).

Benefits

  • Canada pay range: $204,900 to $277,200 CAD.
  • The role may include on-call rotations, which require availability during both core and non-core business hours.
  • Opportunity to work on company-wide reliability strategy for a major engineering organization.
  • Exposure to shaping long-term platform investments and operational practices at scale.
