Senior Site Reliability Engineer - Remote

3 hours, 49 minutes ago
Full-time
Lead
DevOps and Infrastructure
ClickHouse

ClickHouse provides a fast, open-source, column-oriented database management system that lets users generate real-time analytical reports with SQL queries, serving industries that need efficient data processing and analysis.

IT Services
51-250
Founded 2021
$300M raised

Description

  • Build and lead processes that improve the reliability, availability, scalability, and performance of ClickHouse Cloud infrastructure.
  • Collaborate with engineering teams across Control Plane, Data Plane, Core, Security, Support, and Operations on distributed system design and implementation.
  • Design and implement scalable, secure, highly available, and fault-tolerant systems.
  • Define and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
  • Implement and maintain monitoring and alerting across infrastructure components to detect and resolve incidents quickly.
  • Improve incident response workflows and conduct blameless post-mortems for outages.
  • Work with Support to communicate clearly with impacted customers during incidents.
  • Continuously improve the reliability and performance of ClickHouse services.
  • Plan and drive chaos engineering initiatives across engineering teams.
  • Manage on-call processes, escalation practices, and downtime reduction efforts.

Requirements

  • Bachelor’s or Master’s degree in Computer Science or a related field.
  • At least 8 years of experience in Site Reliability Engineering or a related field.
  • Hands-on experience with Go and/or Python.
  • Strong knowledge of cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Excellent understanding of distributed databases and SQL; experience with ClickHouse is a major plus.
  • Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
  • Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.
  • Strong problem-solving and production debugging skills.
  • High responsibility, ownership, and accountability.
  • Excellent communication and interpersonal skills.

Benefits

  • Typical starting salary of $141,000–$208,000 USD in the US.
  • Typical starting salary of $157,000–$230,000 USD in US Premium Markets.
  • Flexible, remote-friendly work environment.
  • Employer contributions toward healthcare.
  • Equity in the company through stock options for new team members.
  • Flexible time off in the US and generous leave entitlements in other countries.
  • $500 home office setup stipend for remote employees.
  • Opportunities to attend company-wide global offsites.
