Obsidian Security

Obsidian Security is a Southern California-based company at the forefront of cybersecurity, artificial intelligence, and hybrid cloud environments. They offer a comprehensive security solution for businesses, including advanced threat protection, insid...

Industry: Internet Software & Services
Company size: 51-250 employees
Founded: 2017
Funding: $30M raised

Description

  • Improve the reliability, availability, and resiliency of production systems and distributed services.
  • Build and maintain monitoring, alerting, dashboards, and observability tooling.
  • Support incident response, on-call operations, troubleshooting, and postmortem processes.
  • Partner with engineering teams to implement SLI/SLO practices and reliability-focused workflows.
  • Automate infrastructure operations, deployment workflows, and platform tooling across Kubernetes, cloud infrastructure, and data pipelines.
  • Collaborate with DevOps, Platform Engineering, and product teams to improve observability, incident response, and service resilience.
  • Help ensure production issues are detected and addressed quickly.
  • Contribute to operational standards and continuous improvement across the platform.

Requirements

  • 3-6 years of experience in Site Reliability Engineering, DevOps, Production Engineering, or related roles.
  • Experience operating and supporting production systems in AWS and/or GCP.
  • Familiarity with Kubernetes and Helm in cloud-native environments.
  • Experience with observability and monitoring tools such as Prometheus, Grafana, Datadog, or similar platforms.
  • Exposure to CI/CD systems such as GitLab CI/CD, GitHub Actions, ArgoCD, or equivalent.
  • Strong troubleshooting and debugging skills across distributed systems and microservices.
  • Experience writing automation or infrastructure tooling using scripting or programming languages.
  • Strong systems thinking and a collaborative engineering mindset.
  • Experience developing AI agents is preferred.
  • Experience supporting SaaS platforms in production environments is preferred.
  • Familiarity with incident management and postmortem practices is preferred.
  • Exposure to infrastructure-as-code and GitOps workflows is preferred.
  • Understanding of SLI/SLO concepts and operational metrics is preferred.
  • Experience with enterprise-scale monitoring or customer-facing production systems is preferred.

Benefits

  • Competitive compensation with equity and 401k.
  • Base salary range of £95,000-£117,000 GBP.
  • Comprehensive healthcare with dental and vision coverage.
  • Flexible paid time off plus paid holidays.
  • 12 weeks of new parent or family leave.
  • Personal and professional development resources.
  • Eligible for equity awards, with possible sales commission or incentive compensation depending on role or function.
