Arcadia

Arcadia provides a healthcare data platform that enables organizations to unify diverse data sources, derive actionable insights through analytics, and enhance patient outcomes by delivering high-quality care experiences.

IT Services
251-1K
Founded 2002
$154M raised

Description

  • Act as the technical leader for reliability in one or more domains, setting direction and standards while staying hands-on where needed.
  • Define and drive reliability strategy across critical services, including SLOs, SLIs, error budgets, and reliability KPIs.
  • Lead incident response maturity by managing complex incidents, improving incident command practices, and ensuring high-quality post-incident reviews and remediation.
  • Design and implement automation to reduce toil and risk through runbook automation, self-service tools, and safe operational workflows.
  • Advance GitOps delivery practices using Argo CD, including promotion strategies, progressive delivery, canaries, and deploy guardrails.
  • Scale infrastructure management using Crossplane and Terraform by creating reusable patterns, policy controls, and paved roads for teams.
  • Lead operational readiness and reliability reviews for new features and architectural changes, reinforcing availability, latency, security, and cost requirements.
  • Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services.
  • Champion infrastructure security best practices for environments handling PHI, including least privilege, secrets management, auditability, and defense-in-depth.
  • Mentor Staff and Senior engineers through design reviews, code reviews, pairing, and documentation to raise reliability standards across teams.

Requirements

  • 8+ years of experience in SRE, platform engineering, systems engineering, or a related role operating production services at scale.
  • Demonstrated principal-level impact leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations.
  • Expertise in Kubernetes operations and troubleshooting, including safe rollout and rollback patterns, workload debugging, and operational guardrails.
  • Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows.
  • Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform, including reusable platform patterns and controls.
  • Deep AWS experience across IAM, networking/VPC, compute, storage, managed services, and observability.
  • Strong understanding of reliability and failure modes in cloud systems.
  • Proficiency in Python for building automation, tooling, and reliability improvements.
  • Strong incident management and on-call leadership experience with measurable improvements in availability, MTTR, alert quality, cost, or operational maturity.
  • Excellent communication skills and the ability to translate technical risk and reliability tradeoffs for engineering leadership, product, and stakeholders.
  • Experience with ScyllaDB or similar distributed databases such as Cassandra is preferred.
  • Experience with Spark or other data processing platforms, including reliability and cost considerations for large-scale workloads, is preferred.
  • Familiarity with agentic coding practices and principles, including safe automation, reviewable changes, and guardrail-first workflows, is preferred.
  • Strong infrastructure security knowledge, including threat modeling for cloud/Kubernetes, RBAC/IAM design, secrets management, supply chain security, and security observability, is preferred.

Benefits

  • Competitive salary of $180,000 to $230,000 per year.
  • Flexible, remote-friendly work environment.
  • Mission-driven company focused on transforming healthcare.
  • Employee-driven programs and initiatives for personal and professional development.
  • A diverse, energized, and purpose-driven team community.
