Staff Site Reliability Engineer, Production Engineering

1 hour, 39 minutes ago
Full-time
Lead
DevOps and Infrastructure
Dropbox

Dropbox is a technology company that builds simple, powerful products for individuals and businesses. With over 700 million registered users worldwide, Dropbox offers file sync, sharing, online backup, cloud storage, collaboration tools, and more to st...

Internet Software & Services
1K-5K
Founded 2007

Description

  • Define and evolve Dropbox’s company-wide technical reliability strategy for AI-assisted and agentic software development.
  • Set multi-year reliability goals, standards, and roadmaps across observability, debugging, incident management, service health, and operational readiness.
  • Lead cross-team initiatives that reduce reliability risk as delivery velocity, pull request volume, service complexity, and incident volume increase.
  • Partner with engineering leaders and platform teams to improve monitoring, alerting, debugging, SLOs, SLAs, and incident response systems.
  • Identify emerging reliability risks in AI-enabled development workflows and design scalable systems, processes, and guardrails to mitigate them.
  • Provide technical leadership and mentorship to engineers across teams to improve engineering quality and operational excellence.
  • Drive communication and alignment with senior stakeholders on reliability priorities, tradeoffs, risks, and execution progress.
  • Participate in on-call rotations as required on teams that run on-call services.

Requirements

  • BS degree in Computer Science or a related technical field involving coding, or equivalent technical experience.
  • 12+ years of experience in software engineering, site reliability engineering, infrastructure engineering, or related technical roles.
  • Proven ability to define and deliver multi-year, multi-team reliability, infrastructure, or platform strategies with measurable business and customer impact.
  • Deep experience with distributed systems, production operations, observability, incident response, SLOs/SLAs, debugging, and reliability risk management.
  • Demonstrated ability to diagnose complex technical problems, debug production systems, automate operational workflows, and design resilient software components.
  • Experience influencing engineering roadmaps across multiple teams and making technical decisions for the broader organization.
  • Strong communication and collaboration skills with the ability to align cross-functional stakeholders through ambiguity and drive execution.
  • Experience adapting reliability strategies, developer tooling, or operational processes for AI-assisted software development workflows (preferred).
  • Experience building or scaling observability, debugging, incident management, or developer productivity platforms for large engineering organizations (preferred).
  • Experience leading reliability improvements in environments with high deployment velocity, complex service dependencies, and large-scale production systems (preferred).
  • Track record of mentoring senior engineers, setting technical standards, and spreading reliability best practices (preferred).
  • Familiarity with AI-enabled tooling, agentic development workflows, or operational risks introduced by rapid automation in the software development lifecycle (preferred).

Benefits

  • Competitive salary range of $223,400–$302,200 USD for US Zone 2.
  • Competitive salary range of $198,600–$268,600 USD for US Zone 3.
  • Opportunity to work on company-wide reliability strategy for a major engineering organization.
  • Role is focused on shaping reliability practices in an AI-enabled development environment.
  • On-call rotations may be part of the role for teams operating services.

Similar Roles

Senior Site Reliability Engineer

Amwell 1K-5K Diversified Telecommunication Services

Amwell is hiring a Senior Systems Engineer to support and automate infrastructure across its data center and cloud environments for telehealth services.

Active Directory Ansible AWS Azure Bash Elasticsearch ELK Stack GCP Kibana Linux Logstash PowerShell Puppet Python TCP/IP Terraform
2 hours, 13 minutes ago

Senior AIOps Engineer, Incident Response [Remote-US]

Quanata 201-500 Information Technology & Services

Quanata is hiring an experienced production operations and reliability leader to oversee production health, incident response, and operational support for its AI-driven insurance technology platform.

AWS Confluence JIRA
18 hours, 13 minutes ago

Lead Site Reliability Engineer - 10929

Coupa Software 1K-5K Internet Software & Services

Coupa is hiring a Lead Site Reliability Engineer to support and evolve its cloud and GenAI platform infrastructure, with a focus on reliability, automation, and scalable operations.

AWS Azure Bash Chef DNS GCP Git GitHub Actions Helm Kubernetes Linux LLM MySQL New Relic PagerDuty Python SageMaker Terraform
22 hours, 4 minutes ago

Site Reliability Engineer (Remote)

Libertex Group 251-1K Capital Markets

Libertex Group is hiring an SRE Engineer to support and improve the reliability, performance, and availability of its large-scale production systems for its online trading platform.

Ansible Apache Airflow AWS Azure Bash CDN CI/CD DNS Docker GCP GitLab Grafana HTTP Jenkins Kubernetes PowerShell Prometheus Python SQL SQL Server
22 hours, 25 minutes ago
