Director of Site Reliability Engineering (SRE)

3 days, 17 hours ago
Full-time
Executive
DevOps and Infrastructure
Backblaze

Backblaze is a pioneer in robust, scalable low-cost cloud backup and storage services, offering enterprise hot storage, low-cost backup and archive solutions. With the easiest way to back up all files, Backblaze provides unlimited, unthrottled, and unc...

IT Services
251-1K
Founded 2007

Description

  • Lead and mentor a globally distributed SRE organization (15+ technical team members), including Sr. SRE and SRE Level 1 services teams.
  • Own the state of production and be accountable for production infrastructure and performance against key SLOs.
  • Provide 24/7 SRE services and centrally manage incident, change, and asset management processes.
  • Drive continuous improvement by leveraging operational data to prioritize work and enhance core operational competencies.
  • Manage demand forecasts and make strategic decisions regarding infrastructure expansion.
  • Manage the department budget, including the budget for operational tooling and observability.
  • Lead and coordinate strategic initiatives to evolve production support, incident/change/asset management, and related processes.
  • Recruit, coach, and develop team members to meet Backblaze and individual career objectives.
  • Build and maintain strong cross-functional relationships with Infrastructure Engineering, Customer Support, Data Center Operations, Supply Chain, Vendor Management, and Legal.
  • Represent Cloud Operations leadership as an engaged, visible leader and participate in contract renewal and vendor management cycles.

Requirements

  • Proven experience in a leadership role within MSP or Infrastructure-as-a-Service environments.
  • 6+ years of management experience, with at least 3 years at the Director level.
  • 5+ years of hands-on technical experience in a field related to the team’s focus.
  • Significant experience with cloud-scale data center systems, services, and managing mission-critical operations of complex global infrastructure.
  • Experience being accountable for production SLOs and measuring performance against those objectives.
  • Demonstrated experience with incident, change, and operational/process management and continuous improvement.
  • Strong analytical and data-driven decision-making skills and experience establishing department-level objectives/OKRs.
  • Experience managing department budgets, including budgets for operational tooling and observability.
  • Excellent collaboration and communication skills with experience building high-performing, distributed teams.
  • Ability to travel domestically and internationally as needed. Remote work within the continental USA is acceptable for candidates with experience managing teams remotely.
  • Six Sigma training and/or certification is a plus.

Benefits

  • RSU grants for full-time employees
  • Annual company bonus plan
  • Healthcare for family, including dental and vision
  • 401(k) plan
  • Employee Stock Purchase Plan (ESPP)
  • Flexible vacation policy
  • Maternity and paternity leave
  • MacBook Pro plus a generous stipend to personalize your workstation
