Backblaze

Backblaze is a pioneer in robust, scalable low-cost cloud backup and storage services, offering enterprise hot storage, low-cost backup and archive solutions. With the easiest way to back up all files, Backblaze provides unlimited, unthrottled, and unc...

IT Services
251-1K
Founded 2007

Description

  • Own the availability, durability, and performance of critical services across production environments.
  • Lead complex cross-functional projects from problem discovery through resolution.
  • Define and enforce service health standards, including SLIs, SLOs, and error budget policies.
  • Lead incident response and post-incident reviews, turning findings into long-term service improvements.
  • Mentor others and evolve ITIL/OSS processes for incident, change, problem, and capacity management.
  • Design and implement scalable automation to reduce toil and improve operational efficiency.
  • Drive monitoring, logging, and alerting strategy and integrate observability tooling.
  • Build, maintain, and secure CI/CD pipelines, configuration management, and infrastructure as code solutions.
  • Write production-grade code to create reliability tools and improve existing systems.
  • Partner with engineering, product, and operations on resilient system design, production readiness, capacity planning, and disaster recovery.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
  • 8+ years of progressive experience in site reliability, systems engineering, or operations.
  • Experience designing, scaling, and operating large-scale production-grade distributed systems.
  • Expert-level Linux systems administration and advanced troubleshooting skills.
  • Strong security-minded operations experience, including patching, hardening, and vulnerability identification.
  • Deep knowledge of monitoring, alerting, incident response, and root cause analysis.
  • Advanced proficiency in at least one modern scripting or programming language; Python or Go strongly preferred.
  • Experience designing and operating Kubernetes, Docker, and microservices in production.
  • Expert experience with HashiCorp tools such as Nomad, Vault, and Terraform in a production environment.
  • Preferred: experience in SaaS, service provider, or hyperscale distributed systems environments, with familiarity in ITIL/OSS and SLO/SLA practices, plus cloud platforms such as AWS, GCP, or Azure.

Benefits

  • Competitive compensation with an expected salary range of $150,000 - $200,000.
  • 401(k) retirement plan.
  • RSU grants for full-time employees.
  • Employee stock purchase plan (ESPP).
  • Healthcare for family, including dental and vision.
  • Flexible vacation policy.
  • Maternity and paternity leave.
  • MacBook Pro for work plus a generous workstation stipend.
  • Childcare bonus and fertility treatment support.
  • Learning and development program.
  • Commuter benefits.
  • Culture that supports healthy work-life balance.
