Backblaze

Backblaze is a pioneer in robust, scalable low-cost cloud backup and storage services, offering enterprise hot storage, low-cost backup and archive solutions. With the easiest way to back up all files, Backblaze provides unlimited, unthrottled, and unc...

IT Services
251-1K
Founded 2007

Description

  • Support the availability and durability of critical services across production environments.
  • Monitor service health using SLIs, SLOs, and error budgets, and escalate issues when thresholds are at risk.
  • Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
  • Follow established ITIL/OSS processes for incident, change, problem, and capacity management.
  • Develop automation for common operational tasks to reduce manual intervention and toil.
  • Contribute to monitoring, logging, and alerting frameworks such as Prometheus, Grafana, Catchpoint, and ELK.
  • Work with CI/CD pipelines, configuration management, and infrastructure-as-code tools including Terraform, Ansible, and Jenkins.
  • Write scripts in languages such as Bash, Python, or Go to improve reliability and efficiency.
  • Partner with engineering, product, and operations teams to support resilient system design and operations.
  • Assist in capacity planning, disaster recovery exercises, and vendor troubleshooting; document systems and create runbooks and playbooks.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
  • 2–4 years of experience in site reliability, systems engineering, or operations.
  • Exposure to large-scale, production-grade systems.
  • Solid Linux systems administration and troubleshooting skills.
  • Familiarity with service reliability concepts, including monitoring, alerting, incident response, and root cause analysis.
  • Proficiency in at least one scripting language such as Python, Bash, or Go.
  • Understanding of container technologies such as Docker and Kubernetes, and of microservices concepts.
  • Knowledge of incident response and operational best practices.
  • Experience in a SaaS, service provider, or distributed systems environment (preferred).
  • Familiarity with ITIL/OSS practices and SLO/SLA concepts (preferred).
  • Experience with cloud platforms such as AWS, GCP, or Azure (preferred).
  • Ability to work independently, take ownership, and drive projects from problem discovery through resolution (preferred).

Benefits

  • Backblaze encourages applicants even if they do not meet every requirement.
  • Learning, development, and growth are emphasized as part of the company culture.
  • The company promotes an inclusive workplace where employees can feel comfortable being themselves.
  • Backblaze is committed to diversity, equity, and inclusion across its workforce.
  • Backblaze is an Equal Opportunity Employer.
