Senior Site Reliability Engineer (SRE)

Full-time
Senior
DevOps and Infrastructure
Oowlish

Oowlish

Oowlish provides companies of all sizes access to the best technical talent in Brazil, making innovation more accessible and convenient than ever. Because our mission is to give every company,...

Internet Software & Services
51-250
Founded 2017

Description

  • Design, implement, and improve Site Reliability Engineering practices across production environments.
  • Define, manage, and continuously improve Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
  • Lead and participate in incident response and incident command processes.
  • Build and evolve observability strategies, including monitoring, logging, alerting, and distributed tracing.
  • Improve system reliability, availability, scalability, and operational efficiency.
  • Partner with engineering teams to improve application performance and production readiness.
  • Develop automation solutions that reduce operational overhead and improve reliability.
  • Participate in root cause analysis and post-incident reviews.
  • Drive continuous improvement initiatives based on operational insights and incident learnings.
  • Help establish reliability best practices across teams and services.

Requirements

  • 5+ years of professional experience in Site Reliability Engineering, DevOps, or Production Engineering roles.
  • Strong understanding of Site Reliability Engineering principles and best practices.
  • Experience supporting and operating production systems at scale.
  • Strong knowledge of monitoring, observability, and reliability engineering concepts.
  • Experience working in cloud-based environments.
  • Strong troubleshooting and problem-solving skills.
  • Experience working with distributed systems and modern application architectures.
  • Proven Site Reliability Engineering experience.
  • Experience defining and managing SLOs, SLIs, and error budgets.
  • Experience leading or actively participating in incident command and incident response processes.
  • Experience designing and implementing observability strategies.
  • Hands-on experience with monitoring, logging, alerting, and distributed tracing.
  • Experience improving system reliability, availability, and operational excellence.
  • Experience supporting mission-critical production environments.
  • Experience with cloud platforms (AWS preferred).
  • Strong automation mindset.
  • Experience conducting root cause analysis and postmortems.

Nice to have

  • Kubernetes experience.
  • Infrastructure as Code experience (e.g., Terraform).
  • CI/CD pipeline experience.
  • Experience with containerized environments.
  • Experience with distributed microservices architectures.
  • Experience with performance engineering.
  • Experience mentoring engineers on reliability practices.
  • Multi-cloud experience.
  • Experience working in highly regulated or high-availability environments.

Benefits

  • Remote work / home office.
  • Competitive compensation based on experience.
  • Career plans with extensive growth opportunities.
  • International projects.
  • Oowlish English Program for technical and conversational English.
  • Oowlish Fitness with Total Pass.
  • Games and competitions.

Similar Roles

Platform Database Engineer (MONGO DB)

Valtech 5K-10K Professional Services

A US-remote enterprise role focused on designing, operating, and optimizing MongoDB platforms across cloud-based, mission-critical data environments.

AWS Bash CI/CD EC2 GitOps Kafka Kubernetes Linux MongoDB Prometheus Python Terraform

Senior Software Engineer - Grafana Databases, Managed Services | Germany | Remote

Grafana 1K-5K IT Services

Grafana Labs is hiring a Senior Software Engineer for its Managed Services team to run and improve the production infrastructure behind Grafana Cloud’s next-generation database products.

AWS Azure Cassandra ClickHouse GCP Go Grafana Helm Kafka Kubernetes Linux Microservices PostgreSQL Snowflake Terraform

Staff Reliability Engineer

Anduril Industries 1K-5K Aerospace & Defense

Anduril Industries is hiring Reliability Engineers to support autonomous defense systems across the full product lifecycle, from early design through production and fielded operations.

Senior Reliability Engineer

Anduril Industries 1K-5K Aerospace & Defense

Anduril Industries is hiring a Reliability Engineer to support autonomous defense systems across the full product lifecycle, from concept and design through production and fielded operations.