Senior Site Reliability Engineer (Auth0)

1 week, 4 days ago
Full-time
Senior
DevOps and Infrastructure
Okta

Okta is a leading independent provider of identity solutions for enterprises, offering a comprehensive range of products and services to connect and protect employees, partners, and customers. With a focus on secure access, authentication, and automati...

Professional Services
5K-10K
Founded 2009

Description

  • Design and build custom Go software to improve platform reliability, resiliency, and redundancy.
  • Partner with engineering teams to embed reliability practices that improve availability, performance, and observability.
  • Identify infrastructure and observability improvement opportunities and implement solutions.
  • Participate in the on-call rotation and respond quickly to critical incidents in a 24/7 production environment.
  • Troubleshoot, mitigate, or appropriately escalate production issues.
  • Develop and refine SRE tooling and operational processes with a focus on automation and efficiency.
  • Define, document, and promote reliability best practices across the organization.

Requirements

  • Proven experience supporting large-scale, mission-critical production systems with a high degree of autonomy.
  • Proficiency in at least one programming language, with Go preferred, and experience writing custom applications rather than only scripts.
  • Experience with infrastructure as code, especially Terraform.
  • Experience with container orchestration and deployment tools such as Kubernetes, Docker, and ArgoCD.
  • Demonstrable experience with a major cloud provider such as AWS, Azure, or GCP.
  • Strong understanding of microservices architecture, SQL and NoSQL databases, and networking fundamentals.
  • Understanding of core SRE concepts including SLIs, SLOs, and error budgets.
  • Experience working in an on-call rotation for a 24/7 cloud-based environment.
  • Strong communication and collaboration skills, including effectiveness in a remote, distributed team.
  • A proactive, systematic problem-solving approach with strong ownership.
  • Experience with reliability-focused engineering in a production environment is preferred.

Benefits

  • Remote role based in Europe.
  • Access to Okta’s well-being benefits.
  • Opportunities for professional growth through talent development, connection, and community programs.
  • Immersive in-person onboarding designed to accelerate impact and connection to the mission and team.
  • Equal opportunity employer with inclusive hiring practices.
  • Accommodation support available during the application, interview, or onboarding process.
