Senior Site Reliability Engineer (SRE) - (GCP)

4 hours, 49 minutes ago
Full-time
Senior
DevOps and Infrastructure
Devsu

Master the Art of Digital Innovation with Devsu. Learn to create digital solutions that drive change and growth. Devsu provides the tools and resources you need to master the art of digital innovation. Devsu is a technology agency that provides software...

Internet Software & Services
51-250
Founded 2010

Description

  • Own and operate the monitoring and observability stack across on-premises and GCP environments.
  • Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and application visibility.
  • Define, tune, and maintain alerts to improve signal quality and reduce noise.
  • Establish observability standards and best practices across teams.
  • Improve system health, performance, and reliability through monitoring and operational improvements.
  • Apply SRE practices to improve availability, resilience, and performance.
  • Define and track SLIs, SLOs, and error budgets.
  • Participate in on-call rotations and SEV incident response.
  • Lead or contribute to incident investigations and root cause analysis, and drive preventative actions.
  • Support and monitor Kubernetes environments, including GKE and on-prem clusters, and troubleshoot platform issues affecting application reliability.
  • Provide L2/L3 application support coverage during resource shortages, major incidents, or escalations.
  • Triage application issues using runbooks and dashboards, and document actions and resolutions in ServiceNow.

Requirements

  • Strong experience as a Site Reliability Engineer or Reliability Engineer.
  • Deep hands-on expertise with Grafana, including dashboards, alerting, and troubleshooting.
  • Solid experience with monitoring and observability systems.
  • Production experience operating Kubernetes environments.
  • Experience supporting systems in both GCP and on-premises environments is mandatory.
  • Strong Linux systems and troubleshooting skills.
  • Fluent English, written and spoken.
  • Ability to work in the PST time zone.
  • Ability to participate in an on-call rotation that includes one weekend day; weekend on-call time is compensated with one day off during the week.
  • Experience supporting application teams during SEV incidents is preferred.
  • Knowledge of capacity planning and performance tuning is preferred.
  • Scripting skills in languages such as Python or Bash are preferred.
  • Experience with hybrid infrastructure environments is preferred.
  • Experience with Prometheus, logging platforms, PagerDuty, ServiceNow, Slack, networking, and infrastructure monitoring tools is relevant to the role.

Benefits

  • Stable, long-term contract with opportunities for career growth.
  • Private health insurance.
  • Remote-friendly culture that supports work-life balance.
  • Continuous training, mentorship, and learning programs.
  • Free access to AI training resources and AI tools.
  • Flexible paid time off policy, plus paid holidays.
  • Challenging software projects for clients in the US and LatAm.
  • Collaboration with talented engineers across Latin America and the US in a diverse work environment.
