Obsidian Security

Obsidian Security is a Southern California-based company at the forefront of cybersecurity, artificial intelligence, and hybrid cloud environments. They offer a comprehensive security solution for businesses, including advanced threat protection, insid...

Internet Software & Services
51-250 employees
Founded 2017
$30M raised

Description

  • Support and maintain the service quality of the customer-facing SaaS security platform.
  • Address scalability, reliability, observability, and cost-efficiency challenges across production systems.
  • Collaborate with Engineering to maintain and improve Helm charts, application deployment, monitoring, and CI/CD pipelines.
  • Embed with the engineering team to develop deep understanding of the application and its runtime behavior.
  • Define service verification strategies and implement them in the CI/CD process to meet SLAs.
  • Improve developer experience by optimizing CI/CD workflows and performance.
  • Participate in the on-call rotation and provide 24/7 support with the global SRE team.
  • Monitor, debug, and optimize production infrastructure and services on AWS and GCP.
  • Own and evolve the observability stack, including metrics pipelines, dashboards, log aggregation, and distributed tracing.
  • Define SLIs and SLOs across services and build alerting strategies that reduce noise and surface actionable signals.
  • Own the Kubernetes infrastructure for Sherlock, including multiple independently scaled worker pools and HPA autoscaling.
  • Design and maintain Sherlock’s Cloud SQL schema, migration pipeline, task queue, and pgvector-based index.
  • Build dashboards for queue depth, worker latency, error rates, accuracy trends, and speed metrics.
  • Own the benchmark CI gate that blocks prompt merges when accuracy or speed regressions exceed thresholds.
  • Deliver capacity planning and cost dashboards for Sherlock’s GKE node pools.
  • Serve as the primary on-call engineer for Sherlock infrastructure by month 3.

Requirements

  • 4+ years of experience in a DevOps or SRE role supporting SaaS services on GCP and/or AWS.
  • Bachelor’s degree in Computer Science or a related field.
  • Production Kubernetes experience, including authoring and owning Deployments, HPAs, and resource limits.
  • Strong proficiency with Kubernetes, microservices architecture, Helm, GitLab CI/CD, and Argo CD.
  • Deep hands-on experience with Grafana observability tools, including Prometheus/Mimir, Loki, and distributed tracing tools such as Tempo, Jaeger, or OpenTelemetry.
  • Ability to design SLI/SLO frameworks, build alerting rules, and reduce alert fatigue across complex microservices.
  • PostgreSQL fluency, including schema design, indexing, migrations, and query optimization.
  • Experience with async or queue-based architectures, including debugging stuck queues, consumer lag, and duplicate processing.
  • Programming proficiency in Python or Go.
  • Strong ownership mindset and comfort with production on-call responsibility.
  • GCP expertise with Cloud SQL, GKE, IAM, and Pub/Sub.
  • Experience with pgvector or other vector databases.
  • CI/CD pipeline ownership experience with GitLab CI or GitHub Actions.
  • Familiarity with LLM APIs such as Anthropic, Amazon Bedrock, or Vertex AI.
  • Understanding of AI agent design patterns and frameworks.
  • Experience with Kafka, Elasticsearch, ScyllaDB, Databricks, Dagster, Sentry, or Kong.
