Blink Health

Blink Health is a digital health company revolutionizing the prescription medication industry by providing affordable and accessible medications to millions of people across America. Their cloud-based pharmacy platform eliminates traditional roadblocks...

Health Care Providers & Services
251-1K
Founded 2014
$165M raised

Description

  • Evaluate and improve the organization’s disaster recovery posture, including RTO/RPO, dependency mapping, and failure domain analysis.
  • Define, document, and establish disaster recovery standards and best practices across cloud infrastructure, platforms, and application architectures.
  • Partner with SRE, platform, security, and product engineering teams to design resilient, fault-tolerant systems.
  • Lead the disaster recovery roadmap by balancing technical feasibility, cost, risk, and business priorities.
  • Design reference architectures for disaster recovery patterns such as pilot-light, warm standby, hot standby, and active-active.
  • Drive adoption of active-active disaster recovery for critical systems, including traffic management, data replication, consistency, and automated failover.
  • Define and operationalize DR testing strategies, including game days, chaos testing, and regular recovery exercises.
  • Establish documentation, runbooks, and escalation paths to ensure recoverability is clear and not dependent on individuals.
  • Evaluate and recommend platform upgrades, cloud services, and tooling that improve resilience, recovery speed, and reliability.
  • Serve as a technical advisor and mentor on disaster recovery and resilience for leadership and engineering teams.

Requirements

  • Bachelor’s or Master’s degree in Computer Science or equivalent practical experience.
  • 8+ years of experience in cloud infrastructure, platform engineering, SRE, or reliability-focused architecture roles.
  • Deep understanding of disaster recovery concepts including RTO/RPO, blast radius reduction, failure domains, and dependency isolation.
  • Proven experience designing and implementing multi-region and multi-availability zone architectures.
  • Hands-on experience moving systems toward active-active or highly available architectures.
  • Strong grasp of data replication strategies, consistency tradeoffs, and recovery patterns for databases and stateful systems.
  • Extensive experience with major cloud providers, with AWS preferred and GCP/Azure acceptable.
  • Experience with Kubernetes-based platforms, including regional failover, workload portability, and cluster recovery strategies.
  • Experience designing and maintaining Infrastructure as Code using tools such as Terraform, Pulumi, CloudFormation, or Ansible.
  • Experience defining and running DR tests, game days, and failure simulations.

Benefits

  • Opportunity to have a large impact on patients’ access to affordable medications.
  • Work at a fast-growing healthcare technology company with a mission-driven product.
  • Join a highly collaborative, cross-functional team of builders and operators.
  • Equal opportunity employer committed to diversity and inclusion.
  • Optional application status updates by SMS or MMS, sent only if you provide consent.
