Staff Site Reliability Engineer

1 hour, 59 minutes ago
Lead
DevOps and Infrastructure
Puck

Puck helps great teams find great teammates through employer branding, conversations, and authentic candidate engagement, using personalized automation to enhance the candidate experience and improve hiring metrics.

Internet Software & Services
1-10
Founded 2020

Description

  • Lead the development of internal AI-assisted reliability tooling that analyzes tickets, logs, traces, and documentation to speed up outage resolution.
  • Improve observability coverage and signal quality for critical customer-facing systems across the development and support lifecycle.
  • Own incident response end-to-end, from detection through remediation, and improve documentation and learning after incidents.
  • Guide the development of customer- and user-facing observability tools within Domino’s products.
  • Define and mature SLO and SLI frameworks for priority services.
  • Scale cloud operations practices for Domino’s single-tenant SaaS offering.
  • Work with engineering teams to improve the reliability and repeatability of customer deployments and upgrades.
  • Mentor other engineers and help shape SRE workflows, operational readiness, and post-incident learning culture.

Requirements

  • Deep experience in Site Reliability Engineering, platform engineering, or a software engineering role with hands-on operational ownership.
  • Fluency with Kubernetes, Linux, cloud platforms, and observability tooling.
  • Ability to investigate complex real-world production problems using operational tooling and signals.
  • Strong software engineering skills in Python or Go.
  • Track record of building internal tools or services that people rely on.
  • Comfort leading technically ambiguous work and influencing across teams without direct authority.
  • History of improving reliability through engineering and automation.
  • Strong communication skills and experience mentoring engineers or shaping technical decisions.
  • Sound judgment about AI/LLM tooling, including when it is useful and when it adds noise.
  • Bonus: Experience with LLM-based systems, retrieval workflows, SaaS platform operations, or tooling for support or developer teams.

Benefits

  • Remote-first role (tagged #LI-Remote).
  • Opportunity to work on high-impact reliability tooling for AI-driven customer solutions.
  • Chance to help define and shape the SRE practice at Domino.
  • Work on a startup-style team backed by leading investors.

