Site Reliability Engineering (SRE) Tech Lead

2 hours, 3 minutes ago
Full-time
Lead
DevOps and Infrastructure
Obsidian Security

Obsidian Security is a Southern California-based company at the forefront of cybersecurity, artificial intelligence, and hybrid cloud environments. They offer a comprehensive security solution for businesses, including advanced threat protection, insid...

Internet Software & Services
51-250
Founded 2017
$30M raised

Description

  • Map and instrument critical system paths for top-tier enterprise customers.
  • Build connector health models to distinguish internal defects, upstream SaaS outages, and expected sparse or low-signal scenarios.
  • Establish tiered incident communication, including a public status page and direct outreach for high-priority accounts.
  • Define and roll out SLI/SLO standards across microservices.
  • Develop self-service instrumentation tooling so engineering teams can own observability.
  • Implement baseline-aware anomaly detection across connectors that goes beyond static thresholds.
  • Mature incident response processes through structured post-mortems and continuous reliability improvements.
  • Lead a unified reliability strategy in partnership with DevOps and Platform Engineering leads.
  • Architect and implement systems for monitoring complex, mission-critical SaaS workloads.

Requirements

  • 7+ years of experience in SRE, production engineering, or a similar role.
  • 2+ years of experience operating as a technical lead.
  • Deep expertise with AWS and/or GCP.
  • Experience with Kubernetes and Helm.
  • Experience with observability tools such as Prometheus and Grafana.
  • Experience with CI/CD systems such as GitLab CI/CD and ArgoCD.
  • Proven experience building monitoring for multi-tenant SaaS systems with complex data pipelines.
  • Strong debugging skills across distributed microservices and legacy systems.
  • Hands-on engineering mindset with the ability to instrument services directly, not just configure tooling.
  • Track record of building or significantly improving incident detection and response systems.
  • Experience in B2B SaaS serving enterprise or financial customers is preferred.
  • Familiarity with third-party SaaS connector ingestion patterns is preferred.
  • Experience building anomaly detection systems or baseline-aware alerting is preferred.
  • Experience implementing customer-facing status pages and incident communication frameworks is preferred.

Benefits

  • Competitive compensation with equity and 401(k).
  • Comprehensive healthcare with dental and vision coverage.
  • Flexible paid time off and paid holidays.
  • 12 weeks of new parent or family leave.
  • Personal and professional development resources.
  • Base salary range of $250,000 to $280,000 USD.
  • Eligibility for equity awards, and possible eligibility for sales commission or incentive compensation.
