Backblaze

Backblaze is a pioneer in robust, scalable low-cost cloud backup and storage services, offering enterprise hot storage, low-cost backup and archive solutions. With the easiest way to back up all files, Backblaze provides unlimited, unthrottled, and unc...

IT Services

Information Technology

251-1K (393)

Founded 2007

16 open positions

Sr. Site Reliability Engineer

1 hour, 14 minutes ago

United States

Full-time

Lead

Site Reliability Engineer (SRE)

DevOps and Infrastructure

Ansible, AWS, Azure, Bash, Docker, ELK Stack, GCP, Go, Grafana, HashiCorp Vault, Jenkins, Kubernetes, Linux, Microservices, Prometheus, Python, Terraform

Description

Own the availability, durability, and performance of critical services across production environments.
Lead complex cross-functional projects from problem discovery through resolution.
Define and enforce service health standards, including SLIs, SLOs, and error budget policies.
Lead incident response and post-incident reviews, turning findings into long-term service improvements.
Mentor others and evolve ITIL/OSS processes for incident, change, problem, and capacity management.
Design and implement scalable automation to reduce toil and improve operational efficiency.
Drive monitoring, logging, and alerting strategy and integrate observability tooling.
Build, maintain, and secure CI/CD pipelines, configuration management, and infrastructure as code solutions.
Write production-grade code to create reliability tools and improve existing systems.
Partner with engineering, product, and operations on resilient system design, production readiness, capacity planning, and disaster recovery.

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
8+ years of progressive experience in site reliability, systems engineering, or operations.
Experience designing, scaling, and operating large-scale, production-grade distributed systems.
Expert-level Linux systems administration and advanced troubleshooting skills.
Strong security-minded operations experience, including patching, hardening, and vulnerability identification.
Deep knowledge of monitoring, alerting, incident response, and root cause analysis.
Advanced proficiency in at least one modern scripting or programming language; Python or Go strongly preferred.
Experience designing and operating Kubernetes, Docker, and microservices in production.
Expert experience with HashiCorp tools such as Nomad, Vault, and Terraform in a production environment.
Preferred: experience in SaaS, service provider, or hyperscale distributed systems environments; familiarity with ITIL/OSS and SLO/SLA practices; and experience with cloud platforms such as AWS, GCP, or Azure.

Benefits

Competitive compensation with an expected salary range of $150,000 - $200,000.
401(k) retirement plan.
RSU grants for full-time employees.
Employee stock purchase plan (ESPP).
Healthcare for family, including dental and vision.
Flexible vacation policy.
Maternity and paternity leave.
MacBook Pro for work plus a generous workstation stipend.
Childcare bonus and fertility treatment support.
Learning and development program.
Commuter benefits.
Culture that supports healthy work-life balance.

Similar Roles

Database Reliability Engineer - Core Team

ClickHouse 51-250 IT Services

ClickHouse is hiring a Site Reliability Engineering team member in Core to help build and lead reliability efforts for ClickHouse Cloud, with a focus on improving the availability, scalability, performance, and operational resilience of its database platform.

Netherlands, United Kingdom, Germany | Full-time | Senior | Site Reliability Engineer (SRE)

AWS, Azure, C++, ClickHouse, GCP, Python, SQL

1 hour, 14 minutes ago

Senior SRE - Data

Lightspeed 1K-5K Professional Services

Lightspeed is hiring a data infrastructure and platform engineer to support its data and AI ecosystem by building secure, reliable, highly available cloud infrastructure and governance foundations.

Canada | Full-time | Senior | DevOps Engineer, Site Reliability Engineer (SRE)

Ansible, Bash, CI/CD, Docker, GCP, GitHub Actions, Go, Kubernetes, Linux, MySQL, PostgreSQL, Puppet, Terraform, Unix

1 hour, 29 minutes ago
