Database Reliability Engineer - Core Team

1 month ago
Full-time
Senior
DevOps and Infrastructure
ClickHouse

ClickHouse

ClickHouse develops a fast, open-source, column-oriented database management system that lets users generate real-time analytical reports with SQL queries. It serves industries that need efficient data processing and analysis.

Industry: IT Services
Company size: 51-250 employees
Founded: 2021
Funding: $300M raised

Description

  • Continuously improve the reliability and performance of ClickHouse Core through monitoring, tuning, and platform changes.
  • Design, create, and refine metrics and alerts to detect and prevent production issues before they affect customers.
  • Investigate recurring and high-impact production problems, identify root causes, and submit bug fixes, issue reports, and improvement proposals.
  • Enhance and run incident response processes and blameless post-mortem analyses for ClickHouse Core outages, coordinating communication with support and Cloud teams to inform impacted customers.
  • Manage on-call processes, coordinate escalations, and establish best practices to resolve issues quickly and minimize customer impact.
  • Plan, enable, and drive chaos engineering initiatives across engineering teams to validate system resilience.
  • Collaborate with Control Plane, Data Plane, Security, Support, and Operations teams to guide and standardize best practices for running ClickHouse for customers.
  • Own engineering escalation management, investigations, and continuous improvement of how ClickHouse is operated and optimized in the cloud.

Requirements

  • Bachelor’s or Master’s degree in Computer Science or a related field.
  • At least 5 years of experience in Reliability Engineering, QA, or customer-facing engineering roles.
  • Experience operating ClickHouse or other SQL databases in production (ClickHouse experience is a major plus).
  • Strong understanding of distributed database internals and SQL.
  • Scripting experience with Shell or Python, and the ability to read and understand C++ code.
  • Knowledge of cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Proven production debugging and problem-solving skills.
  • Ability to work in a fast-paced, global team and partner with the business to drive results.
  • High level of ownership, accountability, and excellent communication skills.

Benefits

  • Flexible, remote-friendly work environment (ClickHouse operates in ~20 countries).
  • Employer contributions toward healthcare.
  • Equity in the company via stock options for new team members.
  • Flexible time off in the US and generous time-off entitlements in other countries.
  • $500 home office setup allowance for remote employees.
  • Opportunities for global company gatherings and offsites.
  • Compensation ranges disclosed for US roles with location-based premium adjustments (e.g., SF Bay Area, NYC); contact paytransparency@clickhouse.com with questions.

Interested in this position?

Apply directly on the company website

Similar Roles

Senior Field Engineer | UK | Remote

Grafana 1K-5K IT Services

Senior Field Engineering Infrastructure role at Grafana Labs, responsible for maintaining and developing the pre-sales demo kit and backend infrastructure, creating technical demos and training, and enabling the Solution Engineering team to scale adoption and close deals.

AWS Azure CI/CD Datadog Elasticsearch GCP Grafana Kubernetes Prometheus Splunk Terraform
1 month ago

Cloud / Platform Engineer (Remote)

Alex Staff Agency 11-50 Professional Services

Cloud/Platform Engineer at a U.S.-based EdTech company operating a global, high-load digital learning platform, responsible for maintaining production reliability and operating multi-region cloud and Kubernetes infrastructure.

AWS Bash CI/CD GCP Go Kubernetes Python Terraform
1 month ago

Customer Reliability Engineer

Sysdig 251-1K IT Services

Customer Reliability Engineer at Sysdig (remote, flexible for Italy/Spain) delivering senior-level technical support and escalation management to ensure customers run and secure cloud/container environments reliably.

AWS Azure Bash Cassandra Elasticsearch GCP Kafka Kubernetes Linux PostgreSQL Python Shell Scripting
1 month ago

Senior Site Reliability Expert

Lightspeed 1K-5K Professional Services

Senior Site Reliability Expert at Lightspeed (Retail) responsible for designing, building, and operating the infrastructure platform that empowers product teams to deliver scalable, highly available production environments and efficient software delivery pipelines.

Argo CD AWS CircleCI Docker DynamoDB GCP Go Jenkins Kubernetes Linux MySQL PostgreSQL Python Redis Ruby Shell Scripting Terraform
1 month ago