Site Reliability Engineer (Hosted Infra) - Platform

10 hours, 58 minutes ago
Full-time
Senior
DevOps and Infrastructure
Elastic

Elastic is a leading platform for search-powered solutions, providing real-time insights and making data usable for developers and enterprises worldwide.

Internet Software & Services
1K-5K
Founded 2010

Description

  • Engineer software and internal tools to automate large-scale systems and reduce operational toil.
  • Optimize host reliability and lifecycle management across multiple cloud providers.
  • Build alerting and monitoring systems that improve incident prevention and observability.
  • Scale global infrastructure and evolve infrastructure management processes to support growing demand.
  • Participate in code reviews, planning, knowledge sharing, and team mentoring.
  • Take part in a balanced SRE on-call rotation, including incident response, runbooks, postmortems, and reliability improvements.
  • Contribute documentation such as software designs, architecture decisions, runbooks, and postmortems.
  • Communicate project status clearly, surface blockers early, and follow through on action items.

Requirements

  • Experience building software with Golang.
  • Experience reviewing code and giving constructive feedback.
  • Production experience operating large-scale cloud compute environments of hundreds of hosts or more, managed through automated workflows.
  • Deep experience with Linux systems and OS-level debugging in the terminal.
  • Experience running containerized workloads in production.
  • A customer-first, systems-thinking approach focused on root causes rather than symptoms.
  • Comfort working across time zones in both real-time and asynchronous collaboration.
  • Ability to create clear, maintainable documentation such as designs, runbooks, architecture diagrams, and postmortems.
  • A sensible approach to using AI tools to reduce operational burden without adding unnecessary complexity.
  • Preferred: production experience with Terraform, Puppet, Ansible, Argo CD, Argo Workflows, CUE, Docker, Kubernetes, Ubuntu, or Ubuntu Livepatch.
  • Preferred: on-call experience during incidents using observability tools such as Elastic Stack, Graphite, Prometheus, or Influx.
  • Preferred: hands-on experience engineering solutions with the Elastic Stack.

Benefits

  • Base salary with a typical starting range of $143,100 to $175,000 USD.
  • Eligibility to participate in Elastic's stock program.
  • Company-matched 401(k) with dollar-for-dollar matching up to 6% of eligible earnings.
  • Health coverage for you and your family in many locations.
  • Flexible locations and schedules for many roles.
  • Generous vacation days each year.
  • Up to 40 hours each year for volunteer projects.
  • Minimum of 16 weeks of parental leave.
