Site Reliability Engineer

15 hours, 19 minutes ago
Full-time
Lead
DevOps and Infrastructure
Sitetracker

Sitetracker

Sitetracker provides a comprehensive platform for managing high-volume distributed projects, enabling real-time collaboration, automated reporting, and accurate forecasting to streamline the deployment of infrastructure projects.

Diversified Telecommunication Services
251-1K
Founded 2013
$183M raised

Description

  • Define SLIs, SLOs, and error-budget policies for critical user journeys to guide reliability decisions.
  • Partner with engineers to transition the organization from reactive firefighting to a proactive reliability practice.
  • Lead production incident response as Incident Commander and run blameless postmortems with follow-up actions.
  • Build observability dashboards and actionable alerting that clearly explain system behavior and paging needs.
  • Design and maintain production-readiness and deploy-safety practices for engineering teams.
  • Evaluate infrastructure tooling and lead migrations when new approaches are justified by evidence.
  • Operate and debug AWS-based systems, including network and IAM issues, and support multi-region or regional rollout planning.
  • Mentor engineers through pair debugging, postmortem coaching, and runbook reviews to raise team capability.
  • Work with stakeholders and engineering teams to communicate downtime, infrastructure changes, and reliability trade-offs.
  • Use AI tools and log analysis to accelerate troubleshooting, operational improvements, and delivery.

Requirements

  • SRE experience at the Staff or Senior Staff level (implied by the scope of this role).
  • Strong experience defining SLIs, SLOs, error budgets, and reliability practices.
  • Hands-on AWS experience across VPC, IAM, compute services (such as ECS, EC2, and Lambda), managed data services, and load balancing.
  • Experience managing production incidents, incident command, and blameless postmortems.
  • Ability to build observability tooling, including dashboards and alerts, and to write clear runbooks.
  • Experience working with infrastructure managed through CloudFormation, bash scripts, and GitHub Actions.
  • Ability to debug production AWS issues at the network and IAM level without immediately escalating to AWS support.
  • Experience evaluating and leading infrastructure migrations, with familiarity or interest in Terraform, service mesh, and multi-region architectures.
  • Strong communication skills for writing postmortems, sharing downtime notices, and influencing roadmap decisions.
  • Comfort using AI tooling such as coding agents and log analysis tools to speed up engineering work.

Benefits

  • Salary range of $97,000 to $149,200 per year.
  • Remote work arrangement.
  • Opportunity to build a reliability practice and influence engineering standards from the ground up.
  • Autonomy to set reliability strategy and decide when to adopt new technologies.
  • Impact on enterprise-scale platform reliability and on expanding AI workloads.

Similar Roles

Site Reliability Engineer (Senior or Staff), Atlas

MongoDB 1K-5K Internet Software & Services

MongoDB is hiring a Senior Site Reliability Engineer for its Atlas team to help support, maintain, and grow a multi-cloud platform for customer-facing production workloads.

AWS Azure DNS GCP Go HTTP Linux Python Ruby TLS
21 minutes ago

Site Reliability Engineer (Senior or Staff), Storage Layer Services (SLS)

MongoDB 1K-5K Internet Software & Services

MongoDB’s Storage Layer Services team is hiring a Site Reliability Engineer to help re-architect the cloud storage layer for Atlas and ensure the reliability and operational safety of its distributed storage infrastructure.

AWS Azure DNS GCP Go Kubernetes Linux Python TCP/IP TLS
1 hour, 18 minutes ago

Manager, Software Engineering (Resilience Engineering)

Affirm 1K-5K Diversified Financial Services

Affirm is hiring an Engineering Manager to lead its Resilience Engineering team in building production load testing and chaos engineering capabilities that improve the safety and reliability of its production systems.

AWS Java Kotlin Kubernetes Python
3 hours, 34 minutes ago

Senior Site Reliability Engineer

Civica 1K-5K Internet Software & Services

Civica is hiring a Senior Site Reliability Engineer to own the reliability, performance, security, and automation of the cloud platform supporting its public-sector SaaS products.

Ansible AWS Azure CI/CD CloudFormation Datadog ELK Stack GCP GitHub Actions Go Grafana Jaeger Java Kubernetes .NET OpenSearch OpenShift Packer Prometheus Python Terraform
15 hours, 19 minutes ago
