Staff Platform Reliability Engineer

13 hours, 31 minutes ago
Full-time
Lead
Software Development
Puck

Puck helps great teams find great teammates through employer branding, conversations, and authentic candidate engagement, using personalized automation to enhance the candidate experience and improve hiring metrics.

Internet Software & Services
1-10
Founded 2020

Description

  • Serve as the technical owner of Tempest, Domino's scale and reliability platform, ensuring it remains reliable, extensible, and aligned with evolving infrastructure needs.
  • Diagnose and drive resolution of performance bottlenecks and resource misconfigurations surfaced by scale testing.
  • Work directly with platform and infrastructure teams to implement durable fixes rather than only filing tickets.
  • Deliver accurate, data-driven sizing recommendations for customer-facing documentation based on empirical testing across deployment sizes.
  • Strengthen observability across scale testing by improving Prometheus and New Relic instrumentation.
  • Establish and operationalize scale testing on cloud platforms with appropriate sizing and configuration guidance.
  • Partner with platform teams to enable scale and reliability testing across additional cloud providers.
  • Build infrastructure automation that increases team efficiency as the product and customer base grow.
  • Profile services and trace root causes through observability data during and after multi-day load runs.

Requirements

  • Background in SRE, platform engineering, or infrastructure.
  • Hands-on experience operating and troubleshooting distributed systems in production Kubernetes environments.
  • Strong proficiency in Python.
  • Comfort working in a large, modular codebase spanning orchestration, infrastructure automation, and systems integration.
  • Experience with observability stacks such as Prometheus, Grafana, New Relic, or similar.
  • Ability to write queries, build dashboards, and use metrics to diagnose systems-level performance and reliability issues.
  • Demonstrated ability to profile services, identify resource bottlenecks, and work with engineering teams to ship durable fixes.
  • Familiarity with performance and load testing methodologies and tools such as Locust or k6.
  • Clear ownership mindset with the ability to communicate priorities and status effectively in a remote, async environment.

Benefits

  • Annual US base salary range of $185,000 to $230,000.
  • Additional benefits may include equity, a company bonus, or sales commissions/bonuses.
  • 401(k) plan.
  • Medical, dental, and vision benefits.
  • Wellness stipends.
  • Remote work arrangement (#LI-Remote).
