Director, Reliability Engineering

1 month ago
Full-time
Executive
DevOps and Infrastructure
HubSpot

HubSpot provides a comprehensive cloud-based CRM platform that integrates marketing, sales, service, and operations tools to help businesses attract, engage, and delight customers effectively.

Media
5K-10K
Founded 2006

Description

  • Lead and develop a team of ~20 reliability engineers, fostering operational excellence, continuous learning, and career growth.
  • Attract, retain, and grow top SRE talent and build clear engineering career paths.
  • Define and drive HubSpot’s reliability roadmap, prioritizing proactive resilience and incident reduction alongside cost and performance tradeoffs.
  • Set and evolve company-wide SLO standards to align engineering effort with customer experience.
  • Lead the strategy and implementation of AI-driven operations, integrating agentic approaches for incident detection, diagnosis, mitigation, and automated runbook execution.
  • Design and build intelligent systems that learn from operational history to surface risks and recommend or execute mitigations while balancing automation with human judgment.
  • Own incident management end-to-end, including response coordination, executive communication during major incidents, and blameless post-incident reviews to drive systemic improvements.
  • Influence engineering culture across 100+ product teams, identify systemic platform risks, and drive cross-functional mitigation efforts in alignment with Infrastructure, Product Engineering, and Security leadership.

Requirements

  • 10+ years of experience in software engineering, SRE, or infrastructure, with 5+ years leading teams.
  • Proven track record of building and scaling reliability functions in environments with significant operational complexity.
  • Deep technical fluency with the ability to participate credibly in architecture discussions, incident analysis, and system design.
  • Experience with, or strong interest in, AIOps, agentic automation, or ML-driven observability, plus curiosity about and a vision for how AI/ML can transform operations.
  • Proven ability to drive cultural and process change across large engineering organizations without relying on top-down mandates.
  • Strong executive communication skills; comfortable leading incident bridges, presenting to leadership, and representing reliability externally.
  • Experience with modern cloud infrastructure (AWS preferred), observability tooling, and incident management practices.
  • A philosophy that balances reliability with velocity, prioritizing sustainable speed over gating.

Benefits

  • Flexible remote-first / hybrid work environment with regional in-person onboarding and periodic in-person events.
  • Support for accommodations due to disability or travel limitations during hiring and onboarding.
  • High-visibility, high-impact leadership role with executive access and strategic influence.
  • Opportunity to shape how AI transforms operational practices across the company and potentially the industry.
  • Work at a globally distributed company recognized for an award-winning culture and focus on employee growth and connection.
