Blink Health

Blink Health is a digital health company revolutionizing the prescription medication industry by providing affordable and accessible medications to millions of people across America. Its cloud-based pharmacy platform eliminates traditional roadblocks...

Health Care Providers & Services
251-1K
Founded 2014
$165M raised

Description

  • Evaluate and improve the organization’s disaster recovery posture, including RTO/RPO, dependency mapping, and failure domain analysis.
  • Define, document, and establish disaster recovery standards and best practices across cloud infrastructure, platforms, and application architectures.
  • Partner with SRE, platform, security, and product engineering teams to design resilient, fault-tolerant systems.
  • Lead the disaster recovery roadmap by balancing technical feasibility, cost, risk, and business priorities.
  • Design reference architectures for disaster recovery patterns such as pilot-light, warm standby, hot standby, and active-active.
  • Drive adoption of active-active disaster recovery for critical systems, including traffic management, data replication, consistency, and automated failover.
  • Define and operationalize DR testing strategies, including game days, chaos testing, and regular recovery exercises.
  • Establish documentation, runbooks, and escalation paths to ensure recoverability is clear and not dependent on individuals.
  • Evaluate and recommend platform upgrades, cloud services, and tooling that improve resilience, recovery speed, and reliability.
  • Serve as a technical advisor and mentor on disaster recovery and resilience for leadership and engineering teams.

Requirements

  • Bachelor’s or Master’s degree in Computer Science or equivalent practical experience.
  • 8+ years of experience in cloud infrastructure, platform engineering, SRE, or reliability-focused architecture roles.
  • Deep understanding of disaster recovery concepts including RTO/RPO, blast radius reduction, failure domains, and dependency isolation.
  • Proven experience designing and implementing multi-region and multi-availability zone architectures.
  • Hands-on experience moving systems toward active-active or highly available architectures.
  • Strong grasp of data replication strategies, consistency tradeoffs, and recovery patterns for databases and stateful systems.
  • Extensive experience with major cloud providers, with AWS preferred and GCP/Azure acceptable.
  • Experience with Kubernetes-based platforms, including regional failover, workload portability, and cluster recovery strategies.
  • Experience designing and maintaining Infrastructure as Code using tools such as Terraform, Pulumi, CloudFormation, or Ansible.
  • Experience defining and running DR tests, game days, and failure simulations.

Benefits

  • Opportunity to have a large impact on patients’ access to affordable medications.
  • Work at a fast-growing healthcare technology company with a mission-driven product.
  • Join a highly collaborative, cross-functional team of builders and operators.
  • Equal opportunity employer committed to diversity and inclusion.
  • Potential for application-related SMS or MMS status updates if consent is provided.
