Principal Site Reliability & Forward Deployed Engineer

2 hours, 52 minutes ago
Full-time
Lead
Software Development
Abacus Insights

Abacus Insights simplifies healthcare data with intelligent solutions, unlocking data value and empowering health plans, consumers, and providers.

Insurance
51-250
Founded 2017
$82M raised

Description

  • Act as a senior technical escalation point during production incidents and customer-impacting issues.
  • Lead real-time incident triage, mitigation, recovery, and root cause analysis.
  • Own post-launch reliability, stability, and operational quality of core systems.
  • Investigate and resolve complex production defects, field issues, and escalations.
  • Translate fixes and incident learnings into durable product, platform, and operational improvements.
  • Support strategic customers with deployments, integrations, and production-grade technical challenges.
  • Troubleshoot AWS-hosted systems, including compute, storage, networking, IAM, and security.
  • Debug Databricks jobs, clusters, Spark-based pipelines, performance issues, scalability issues, and data correctness problems.
  • Write production-quality code and automation to improve reliability, observability, and operational efficiency.
  • Provide technical leadership, mentor engineers, and collaborate across Product, Engineering, Data, and Customer teams.

Requirements

  • 10+ years of experience in software engineering, SRE, sustaining engineering, or production operations.
  • Deep hands-on experience operating production systems in AWS.
  • Strong experience troubleshooting Databricks and large-scale data platforms.
  • Proficiency in Python and experience building production services or tooling.
  • Strong understanding of distributed systems, incident management, RCA, monitoring, alerting, observability, and CI/CD pipelines built with Infrastructure as Code.
  • Proven ability to own problems end-to-end from detection to permanent resolution.
  • Excellent communication skills, especially during incidents and customer escalations.
  • Ability to work backward from customer impact to root cause across systems and codebases in environments with minimal documentation.
  • Strong instinct for operational risk and a habit of proactively identifying failure modes before they impact customers.
  • Experience in healthcare, health insurance, or regulated data environments is preferred.
  • Familiarity with Kubernetes (EKS), EMR, Lambda, Spark internals, and Snowflake or similar data warehouses is preferred.
  • Experience with FHIR, MDM systems, or entity resolution is preferred.
  • Prior SWAT, escalation engineering, or tiger-team experience is preferred.
  • Experience contributing to or operating within SRE/on-call programs is preferred.

Benefits

  • Base salary plus eligibility for performance bonuses and equity grants.
  • Unlimited paid time off.
  • Work-from-anywhere flexibility.
  • Comprehensive health coverage with multiple plan options.
  • Equity for every employee.
  • Growth-focused environment with development support.
  • Home office setup allowance.
  • Monthly cell phone allowance.
