Obsidian Security

Obsidian Security is a Southern California-based company working across cybersecurity, artificial intelligence, and hybrid cloud environments. It offers a comprehensive security solution for businesses, including advanced threat protection, insid...

Internet Software & Services
51-250
Founded 2017
$30M raised

Description

  • Support and maintain the service quality of the customer-facing SaaS security platform.
  • Address scalability, reliability, observability, and cost-efficiency challenges across production systems.
  • Collaborate with Engineering to maintain and improve Helm charts, application deployment, monitoring, and CI/CD pipelines.
  • Embed with the engineering team to develop deep understanding of the application and its runtime behavior.
  • Define service verification strategies and implement them in the CI/CD process to meet SLAs.
  • Improve developer experience by optimizing CI/CD workflows and performance.
  • Participate in the on-call rotation and provide 24/7 support with the global SRE team.
  • Monitor, debug, and optimize production infrastructure and services on AWS and GCP.
  • Own and evolve the observability stack, including metrics pipelines, dashboards, log aggregation, and distributed tracing.
  • Define SLIs and SLOs across services and build alerting strategies that reduce noise and surface actionable signals.
  • Own the Kubernetes infrastructure for Sherlock, including multiple independently scaled worker pools and HPA autoscaling.
  • Design and maintain Sherlock’s CloudSQL schema, migration pipeline, task queue, and pgvector-based index.
  • Build dashboards for queue depth, worker latency, error rates, accuracy trends, and speed metrics.
  • Own the benchmark CI gate that blocks prompt merges when accuracy or speed regressions exceed thresholds.
  • Deliver capacity planning and cost dashboards for Sherlock’s GKE node pools.
  • Serve as the primary on-call engineer for Sherlock infrastructure by month 3.

Requirements

  • 4+ years of experience in a DevOps or SRE role supporting SaaS services on GCP and/or AWS.
  • Bachelor’s degree in Computer Science or a related field.
  • Production Kubernetes experience, including authoring and owning Deployments, HPAs, and resource limits.
  • Strong proficiency with Kubernetes, microservices architecture, Helm, GitLab CI/CD, and ArgoCD.
  • Deep hands-on experience with the Grafana observability stack, including Prometheus/Mimir and Loki, and with distributed tracing tools such as Tempo, Jaeger, or OpenTelemetry.
  • Ability to design SLI/SLO frameworks, build alerting rules, and reduce alert fatigue across complex microservices.
  • PostgreSQL fluency, including schema design, indexing, migrations, and query optimization.
  • Experience with async or queue-based architectures, including debugging stuck queues, consumer lag, and duplicate processing.
  • Programming proficiency in Python or Go.
  • Strong ownership mindset and comfort with production on-call responsibility.
  • GCP expertise with Cloud SQL, GKE, IAM, and Pub/Sub.
  • Experience with pgvector or other vector databases.
  • CI/CD pipeline ownership experience with GitLab CI or GitHub Actions.
  • Familiarity with LLM APIs such as Anthropic, Bedrock, or Vertex.
  • Understanding of AI agent design patterns and frameworks.
  • Experience with Kafka, Elasticsearch, ScyllaDB, Databricks, Dagster, Sentry, or Kong.

