Staff Platform Reliability Engineer

1 month, 1 week ago
Full-time
Lead
DevOps and Infrastructure
Puck

Puck helps great teams find great teammates through employer branding, conversations, and authentic candidate engagement, using personalized automation to enhance the candidate experience and improve hiring metrics.

Internet Software & Services
1-10 employees
Founded 2020

Description

  • Serve as the technical owner of Tempest, Domino's scale and reliability platform.
  • Diagnose and resolve performance bottlenecks and resource misconfigurations surfaced by scale testing.
  • Profile services and trace root causes using observability data from Prometheus and New Relic.
  • Partner with platform and infrastructure teams to ship durable fixes rather than only filing tickets.
  • Deliver accurate, data-driven sizing recommendations for customer-facing documentation.
  • Strengthen observability by improving instrumentation, dashboards, and queries for scale testing.
  • Establish and operationalize scale testing on cloud platforms with appropriate sizing and configuration guidance.
  • Enable scale and reliability testing across additional cloud providers in partnership with platform teams.
  • Build infrastructure automation that improves operational efficiency as the product and customer base grow.

Requirements

  • Background in SRE, platform engineering, or infrastructure.
  • Hands-on experience operating and troubleshooting distributed systems in production Kubernetes environments.
  • Strong proficiency in Python.
  • Comfort working in a large, modular codebase spanning orchestration, infrastructure automation, and systems integration.
  • Experience with observability stacks such as Prometheus, Grafana, New Relic, or similar.
  • Ability to write queries, build dashboards, and use metrics to diagnose performance and reliability issues.
  • Demonstrated ability to profile services, identify resource bottlenecks, and drive durable fixes with engineering teams.
  • Familiarity with performance and load testing tools or methodologies such as Locust, k6, or similar.
  • Self-directed, accountable ownership mindset.
  • Ability to communicate priorities and status effectively in a remote, async environment.

Benefits

  • Annual US base salary range of $185,000 to $230,000.
  • Additional equity may be included.
  • A company bonus or sales commission may be included.
  • 401(k) plan.
  • Medical, dental, and vision benefits.
  • Wellness stipends.
  • Remote role.
