Arcadia

Arcadia provides a healthcare data platform that enables organizations to unify diverse data sources, derive actionable insights through analytics, and enhance patient outcomes by delivering high-quality care experiences.

IT Services
251-1K employees
Founded 2002
$154M raised

Description

  • Act as the technical leader for reliability in one or more domains, setting direction and standards while staying hands-on where needed.
  • Define and drive reliability strategy across critical services, including SLOs, SLIs, error budgets, and reliability KPIs.
  • Lead incident response maturity by managing complex incidents, improving incident command practices, and ensuring high-quality post-incident reviews and remediation.
  • Design and implement automation to reduce toil and risk through runbook automation, self-service tools, and safe operational workflows.
  • Advance GitOps delivery practices using Argo CD, including promotion strategies, progressive delivery, canaries, and deploy guardrails.
  • Scale infrastructure management using Crossplane and Terraform by creating reusable patterns, policy controls, and paved roads for teams.
  • Lead operational readiness and reliability reviews for new features and architectural changes, reinforcing availability, latency, security, and cost requirements.
  • Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services.
  • Champion infrastructure security best practices for environments handling PHI, including least privilege, secrets management, auditability, and defense-in-depth.
  • Mentor Staff and Senior engineers through design reviews, code reviews, pairing, and documentation to raise reliability standards across teams.

Requirements

  • 8+ years of experience in SRE, platform engineering, systems engineering, or a related role operating production services at scale.
  • Demonstrated principal-level impact leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations.
  • Expertise in Kubernetes operations and troubleshooting, including safe rollout and rollback patterns, workload debugging, and operational guardrails.
  • Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows.
  • Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform, including reusable platform patterns and controls.
  • Deep AWS experience across IAM, networking/VPC, compute, storage, managed services, and observability.
  • Strong understanding of reliability and failure modes in cloud systems.
  • Proficiency in Python for building automation, tooling, and reliability improvements.
  • Strong incident management and on-call leadership experience with measurable improvements in availability, MTTR, alert quality, cost, or operational maturity.
  • Excellent communication skills and the ability to translate technical risk and reliability tradeoffs for engineering leadership, product, and stakeholders.
  • Experience with ScyllaDB or similar distributed databases such as Cassandra is preferred.
  • Experience with Spark or other data processing platforms, including reliability and cost considerations for large-scale workloads, is preferred.
  • Familiarity with agentic coding practices and principles, including safe automation, reviewable changes, and guardrail-first workflows, is preferred.
  • Strong infrastructure security knowledge, including threat modeling for cloud/Kubernetes, RBAC/IAM design, secrets management, supply chain security, and security observability, is preferred.

Benefits

  • Competitive salary of $180,000 to $230,000 per year.
  • Flexible, remote-friendly work environment.
  • Mission-driven company focused on transforming healthcare.
  • Employee-driven programs and initiatives for personal and professional development.
  • A diverse, energized, and purpose-driven team community.
