Principal Site Reliability & Forward Deployed Engineer

3 weeks, 1 day ago
Full-time
Lead
Software Development
Abacus Insights

Abacus Insights simplifies healthcare data with intelligent solutions, unlocking data value and empowering health plans, consumers, and providers.

Insurance
51-250
Founded 2017
$82M raised

Description

  • Act as a senior technical escalation point during production incidents and customer-impacting issues.
  • Lead real-time incident triage, mitigation, recovery, and root cause analysis.
  • Own post-launch reliability, stability, and operational quality of core systems.
  • Investigate and resolve complex production defects, field issues, and escalations.
  • Translate fixes and incident learnings into durable product, platform, and operational improvements.
  • Support strategic customers with deployments, integrations, and production-grade technical challenges.
  • Troubleshoot AWS-hosted systems, including compute, storage, networking, IAM, and security.
  • Debug Databricks jobs, clusters, Spark-based pipelines, performance issues, scalability issues, and data correctness problems.
  • Write production-quality code and automation to improve reliability, observability, and operational efficiency.
  • Provide technical leadership, mentor engineers, and collaborate across Product, Engineering, Data, and Customer teams.

Requirements

  • 10+ years of experience in software engineering, SRE, sustaining engineering, or production operations.
  • Deep hands-on experience operating production systems in AWS.
  • Strong experience troubleshooting Databricks and large-scale data platforms.
  • Proficiency in Python and experience building production services or tooling.
  • Strong understanding of distributed systems, incident management, RCA, monitoring, alerting, observability, and CI/CD pipelines using Infrastructure as Code.
  • Proven ability to own problems end-to-end from detection to permanent resolution.
  • Excellent communication skills, especially during incidents and customer escalations.
  • Ability to work backward from customer impact to root cause across systems and codebases in environments with minimal documentation.
  • Strong instinct for operational risk and for proactively identifying failure modes before they impact customers.
  • Experience in healthcare, health insurance, or regulated data environments is preferred.
  • Familiarity with Kubernetes (EKS), EMR, Lambda, Spark internals, and Snowflake or similar data warehouses is preferred.
  • Experience with FHIR, MDM systems, or entity resolution is preferred.
  • Prior SWAT, escalation engineering, or tiger-team experience is preferred.
  • Experience contributing to or operating within SRE/on-call programs is preferred.

Benefits

  • Base salary plus eligibility for performance bonuses and equity grants.
  • Unlimited paid time off.
  • Work-from-anywhere flexibility.
  • Comprehensive health coverage with multiple plan options.
  • Equity for every employee.
  • Growth-focused environment with development support.
  • Home office setup allowance.
  • Monthly cell phone allowance.
