Senior Site Reliability Engineer (SRE) - GCP

3 weeks, 2 days ago
Full-time
Senior
DevOps and Infrastructure
Devsu

Master the Art of Digital Innovation with Devsu. Learn to create digital solutions that drive change and growth. Devsu provides the tools and resources you need to master the art of digital innovation. Devsu is a technology agency that provides software...

Internet Software & Services
51-250
Founded 2010

Description

  • Own and operate the monitoring and observability stack across on-premises and GCP environments.
  • Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and application visibility.
  • Define, tune, and maintain alerts to improve signal quality and reduce noise.
  • Establish observability standards and best practices across teams.
  • Improve system health, performance, and reliability through monitoring and operational improvements.
  • Apply SRE practices to improve availability, resilience, and performance.
  • Define and track SLIs, SLOs, and error budgets.
  • Participate in on-call rotations and SEV incident response.
  • Lead or contribute to incident investigations and root cause analysis, and drive preventative actions.
  • Support and monitor Kubernetes environments, including GKE and on-prem clusters, and troubleshoot platform issues affecting application reliability.
  • Provide L2/L3 application support coverage during resource shortages, major incidents, or escalations.
  • Triage application issues using runbooks and dashboards, and document actions and resolutions in ServiceNow.

Requirements

  • Strong experience as a Site Reliability Engineer or Reliability Engineer.
  • Deep hands-on expertise with Grafana, including dashboards, alerting, and troubleshooting.
  • Solid experience with monitoring and observability systems.
  • Production experience operating Kubernetes environments.
  • Experience supporting systems in both GCP and on-premises environments is mandatory.
  • Strong Linux systems and troubleshooting skills.
  • Fluent English, written and spoken.
  • Ability to work in the PST time zone.
  • Ability to participate in an on-call rotation that includes one weekend day.
  • Weekend on-call time is compensated with one day off during the week.
  • Experience supporting application teams during SEV incidents is preferred.
  • Knowledge of capacity planning and performance tuning is preferred.
  • Scripting skills in languages such as Python or Bash are preferred.
  • Experience with hybrid infrastructure environments is preferred.
  • Experience with Prometheus, logging platforms, PagerDuty, ServiceNow, Slack, networking, and infrastructure monitoring tools is relevant to the role.

Benefits

  • Stable, long-term contract with opportunities for career growth.
  • Private health insurance.
  • Remote-friendly culture that supports work-life balance.
  • Continuous training, mentorship, and learning programs.
  • Free access to AI training resources and AI tools.
  • Flexible paid time off policy plus paid holidays.
  • Challenging software projects for clients in the US and LatAm.
  • Collaboration with talented engineers across Latin America and the US in a diverse work environment.
