Arcadia

Arcadia provides a healthcare data platform that enables organizations to unify diverse data sources, derive actionable insights through analytics, and enhance patient outcomes by delivering high-quality care experiences.

IT Services

Information Technology

251-1K (540)

Founded 2002

$154M raised

17 open positions

Principal Site Reliability Engineer

14 hours, 7 minutes ago

United States

Full-time

Lead

Site Reliability Engineer (SRE)

DevOps and Infrastructure

Apache Spark, Argo CD, AWS, Cassandra, GitOps, Kubernetes, Python, Secrets Management, Terraform

Description

Act as the technical leader for reliability in one or more domains, setting direction and standards while staying hands-on where needed.
Define and drive reliability strategy across critical services, including SLOs, SLIs, error budgets, and reliability KPIs.
Lead incident response maturity by managing complex incidents, improving incident command practices, and ensuring high-quality post-incident reviews and remediation.
Design and implement automation to reduce toil and risk through runbook automation, self-service tools, and safe operational workflows.
Advance GitOps delivery practices using Argo CD, including promotion strategies, progressive delivery, canaries, and deploy guardrails.
Scale infrastructure management using Crossplane and Terraform by creating reusable patterns, policy controls, and paved roads for teams.
Lead operational readiness and reliability reviews for new features and architectural changes, reinforcing availability, latency, security, and cost requirements.
Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services.
Champion infrastructure security best practices for environments handling PHI, including least privilege, secrets management, auditability, and defense-in-depth.
Mentor Staff and Senior engineers through design reviews, code reviews, pairing, and documentation to raise reliability standards across teams.

Requirements

8+ years of experience in SRE, platform engineering, systems engineering, or a related role operating production services at scale.
Demonstrated principal-level impact leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations.
Expertise in Kubernetes operations and troubleshooting, including safe rollout and rollback patterns, workload debugging, and operational guardrails.
Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows.
Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform, including reusable platform patterns and controls.
Deep AWS experience across IAM, networking/VPC, compute, storage, managed services, and observability.
Strong understanding of reliability and failure modes in cloud systems.
Proficiency in Python for building automation, tooling, and reliability improvements.
Strong incident management and on-call leadership experience with measurable improvements in availability, MTTR, alert quality, cost, or operational maturity.
Excellent communication skills and the ability to translate technical risk and reliability tradeoffs for engineering leadership, product, and stakeholders.
Experience with ScyllaDB or similar distributed databases such as Cassandra is preferred.
Experience with Spark or other data processing platforms, including reliability and cost considerations for large-scale workloads, is preferred.
Familiarity with agentic coding practices and principles, including safe automation, reviewable changes, and guardrail-first workflows, is preferred.
Strong infrastructure security knowledge, including threat modeling for cloud/Kubernetes, RBAC/IAM design, secrets management, supply chain security, and security observability, is preferred.

Benefits

Competitive salary of $180,000 to $230,000 per year.
Flexible, remote-friendly work environment.
Mission-driven company focused on transforming healthcare.
Employee-driven programs and initiatives for personal and professional development.
A diverse, energized, and purpose-driven team community.

Similar Roles

Site Reliability Engineer

TextNow, 51-250, Wireless Telecommunication Services

TextNow is hiring a remote Site Reliability Engineer in Canada to own infrastructure, monitoring, logging, CI/CD, and reliability for the systems supporting its free phone service platform.

Canada, Full-time, Senior, Site Reliability Engineer (SRE)

$113k-$212k

Ansible, AWS, CI/CD, GitHub, System Design, Terraform

6 hours, 7 minutes ago

Senior Application Engineer

Warner Music Group, 5K-10K, Media

Warner Music Group is hiring a Senior Application Engineer to support, improve, and modernize the software systems behind its global music operations.

Canada, Full-time, Senior, Site Reliability Engineer (SRE), Software Engineer

$100k-$145k

Angular, AWS, CI/CD, GitHub Actions, Java, Oracle, PostgreSQL, Python, React, SQL

6 hours, 22 minutes ago

Site Reliability Engineer - Backstage

Spotify, Media

Site Reliability Engineer for Spotify’s Backstage team in New York City, focused on building and operating cloud infrastructure for an external developer portal and internal AI-driven coding workflows.

United States, Full-time, Mid Level, Site Reliability Engineer (SRE)

$133k-$190k

AWS, GCP, Go, Java, LLM, Microservices, Python, React, Terraform, TypeScript

7 hours, 37 minutes ago

Blockchain Site Reliability Engineer

InfStones, 51-250, Internet Software & Services

InfStones is hiring a remote Blockchain Site Reliability Engineer in Dallas to ensure the reliability, availability, and performance of its blockchain node infrastructure.

United States, Contract, Senior, Site Reliability Engineer (SRE)

Docker, Ethereum, Go, Grafana, JavaScript, Kubernetes, Linux, Prometheus, Python, Rust, Solana

8 hours, 22 minutes ago
