Manager, Software Engineering (Resilience Engineering)

15 hours, 20 minutes ago
Full-time
Lead
Software Development
Affirm

Affirm, founded in 2012 by Max Levchin, offers a transparent buy now, pay later service. No late fees or surprises, just a responsible way to pay over time for your favorite brands.

Diversified Financial Services
1K-5K
Founded 2012

Description

  • Define and drive the resilience engineering vision, with emphasis on production load testing and chaos engineering.
  • Lead and mentor a team of engineers building platforms and tooling for safe production experimentation.
  • Partner with infrastructure, product, and security leaders to embed resilience validation into the software development lifecycle.
  • Own the design and evolution of systems for controlled production load testing and fault injection.
  • Implement safeguards such as isolation boundaries, approval workflows, and automated rollback mechanisms.
  • Build end-to-end observability, traceability, and auditability for resilience experiments.
  • Drive reliability improvements by identifying weaknesses through load testing and chaos experiments.
  • Establish monitoring, alerting, and incident response practices for proactive resilience validation.
  • Work with engineering teams to design and execute safe production load tests and chaos experiments.
  • Enable adoption of resilience practices through reusable tooling, frameworks, and standardized workflows.

Requirements

  • Proven experience leading engineering teams in reliability, infrastructure, or distributed systems.
  • Hands-on experience with production load testing, chaos engineering, or large-scale system validation.
  • Experience with a chaos engineering platform such as Gremlin or Harness, or a similar tool.
  • Strong understanding of distributed system failure modes, including latency, partial failure, and cascading outages.
  • Experience building or operating systems with strong safety guarantees, including isolation, rate limiting, guardrails, and auditability.
  • Familiarity with cloud-native environments such as AWS and Kubernetes, plus observability tooling.
  • Strong programming background in Python, Kotlin, Java, or similar languages.
  • Excellent problem-solving skills and the ability to balance resilience investments with immediate business needs.
  • Strong communication and leadership skills with a track record of influencing engineering practices across teams.
  • Equivalent practical experience or a Bachelor’s degree in a related field is required.

Benefits

  • Remote-first work environment, with most roles available almost anywhere in the country of employment.
  • Affirm covers 100% of medical, dental, and vision premiums for you and your dependents.
  • Monthly stipends for health, wellness, and tech spending.
  • Flexible Spending Wallets for technology, food, lifestyle needs, and family-forming expenses.
  • Competitive vacation and holiday schedules.
  • Employee Stock Purchase Plan (ESPP) with the ability to buy Affirm shares at a discount.
  • Base salary range of $200,000 - $250,000 per year for most U.S. states, or $225,000 - $275,000 in CA, WA, NY, NJ, and CT.
