Staff Site Reliability Engineer, Production Engineering

19 hours, 26 minutes ago
Full-time
Lead
DevOps and Infrastructure
Dropbox

Dropbox is a technology company that builds simple, powerful products for individuals and businesses. With over 700 million registered users worldwide, Dropbox offers file sync, sharing, online backup, cloud storage, collaboration tools, and more to st...

Internet Software & Services
1K-5K
Founded 2007

Description

  • Define and evolve Dropbox’s company-wide technical reliability strategy for an AI-assisted engineering environment.
  • Set multi-year goals, standards, and roadmaps for observability, debugging, incident management, service health, and operational readiness.
  • Lead cross-team efforts to reduce reliability risk as deployment velocity, pull request volume, service complexity, and incident volume increase.
  • Partner with engineering leaders and platform teams to improve monitoring, alerting, debugging, SLOs, SLAs, and incident response systems.
  • Identify emerging reliability risks from AI-enabled workflows and design scalable systems, processes, and guardrails to mitigate them.
  • Provide technical leadership and mentorship to engineers across teams to raise engineering quality and operational excellence.
  • Drive communication and alignment with senior stakeholders on reliability priorities, tradeoffs, risks, and execution progress.

Requirements

  • BS degree in Computer Science or a related technical field involving coding, or equivalent technical experience.
  • 12+ years of experience in software engineering, site reliability engineering, infrastructure engineering, or related technical roles.
  • Proven ability to define and deliver multi-year, multi-team reliability, infrastructure, or platform strategies with measurable business and customer impact.
  • Deep experience with distributed systems, production operations, observability, incident response, SLOs/SLAs, debugging, and reliability risk management.
  • Demonstrated ability to diagnose complex technical problems, debug production systems, automate operational workflows, and design resilient software components.
  • Experience influencing engineering roadmaps across multiple teams and making technical decisions for the broader engineering organization.
  • Strong communication and collaboration skills, with the ability to align cross-functional stakeholders through ambiguity and drive execution across teams.
  • Experience adapting reliability strategies, developer tooling, or operational processes for AI-assisted software development workflows (preferred).
  • Experience building or scaling observability, debugging, incident management, or developer productivity platforms for large engineering organizations (preferred).
  • Experience leading reliability improvements in environments with high deployment velocity, complex service dependencies, and large-scale production systems (preferred).
  • Track record of mentoring senior engineers, setting technical standards, and spreading reliability best practices through documentation, reviews, talks, or architecture guidance (preferred).
  • Familiarity with AI-enabled tooling, agentic development workflows, or operational risks introduced by rapid automation in the software development lifecycle (preferred).

Benefits

  • Canada pay range: $204,900 to $277,200 CAD.
  • On-call rotations may be part of the role, with availability during both core and non-core business hours.
  • Opportunity to work on company-wide reliability strategy for a major engineering organization.
  • Exposure to shaping long-term platform investments and operational practices at scale.
