Staff Site Reliability Engineer

10 hours, 37 minutes ago
Full-time
Lead
DevOps and Infrastructure
SmarterDx

SmarterDx ensures revenue integrity by using AI algorithms to uncover missed opportunities in clinical data, optimizing tasks for accurate ICD-10 coding and guaranteeing a 5:1 ROI.

Professional Services
11-50 employees
Founded 2019

Description

  • Define and evolve reliability standards, including SLIs, SLOs, and error budgets, for the SmarterDx platform.
  • Implement and maintain a reliability platform using Terraform and infrastructure-as-code best practices.
  • Improve observability across metrics, logs, traces, and alerting to reduce detection and resolution times.
  • Lead incident response, facilitate blameless postmortems, and drive systemic improvements to prevent repeat issues.
  • Reduce operational toil through automation, self-healing systems, and better deployment and rollback mechanisms.
  • Provide production support to ensure availability, performance, and data durability across the platform.
  • Research, prototype, and advocate for new reliability practices, tooling, and architectural improvements across engineering.

Requirements

  • 10+ years of software or software reliability engineering experience operating and scaling distributed systems in production.
  • 3+ years of hands-on experience running cloud-native infrastructure in AWS.
  • Deep familiarity with containers, Kubernetes, monitoring, and alerting in live production environments.
  • Proven experience defining and managing SLIs/SLOs, leading incident response, and driving postmortems and reliability improvements.
  • Strong expertise with Terraform and infrastructure-as-code practices for managing production infrastructure safely and reproducibly.
  • Deep experience with Kubernetes architecture and operations, including workload reliability, cluster scaling, networking, and failure modes.
  • Experience working in security-conscious, compliance-oriented environments.
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
  • Reliability engineering experience with production database systems such as Postgres is preferred.
  • Experience with the current tech stack: AWS, Terraform, Kubernetes, Go, Python, TypeScript, and Postgres.

Benefits

  • $230K to $250K base salary.
  • Comprehensive medical, dental, and vision coverage with 75% of premiums covered depending on the plan.
  • Up to 12 weeks of paid parental leave.
  • Remote-first work environment with the ability to work from anywhere in the U.S.
  • Unlimited PTO plus 10 company holidays.
  • 401(k) with Traditional and Roth options and a 4% employer match.
  • Fast-moving environment with minimal bureaucracy.
