Manager, Software Engineering (Resilience Engineering)

3 hours, 10 minutes ago
Full-time
Lead
Software Development
Affirm

Affirm, founded in 2012 by Max Levchin, offers a transparent buy now, pay later service. No late fees or surprises, just a responsible way to pay over time for your favorite brands.

Diversified Financial Services
1K-5K
Founded 2012

Description

  • Define and drive the resilience engineering vision, with production load testing and chaos engineering as core practices.
  • Lead and mentor engineers building platforms and tooling for safe production experimentation.
  • Partner with infrastructure, product, and security leaders to embed resilience validation into the development lifecycle.
  • Own the design and evolution of platforms for controlled production load testing and fault injection.
  • Implement safeguards such as isolation boundaries, approval workflows, and automated rollback mechanisms.
  • Build observability, traceability, and auditability into resilience experimentation systems.
  • Work with engineering teams to plan and execute safe production load tests and chaos experiments.
  • Partner with infrastructure teams to establish guardrails around tests and experimentation.
  • Enable adoption of resilience practices through reusable tooling, frameworks, and standardized workflows.
  • Identify systemic weaknesses and drive cross-functional reliability and fault-tolerance improvements.

Requirements

  • Proven experience leading engineering teams in reliability, infrastructure, or distributed systems.
  • Hands-on experience with production load testing, chaos engineering, or large-scale system validation.
  • Experience with chaos engineering tools from a vendor such as Gremlin or Harness, or with similar tools.
  • Strong understanding of distributed system failure modes, including latency, partial failure, and cascading outages.
  • Experience building or operating systems with strong safety guarantees, including isolation, rate limiting, guardrails, and auditability.
  • Familiarity with cloud-native environments such as AWS and Kubernetes, plus observability tooling.
  • Strong programming background in Python, Kotlin, Java, or similar languages.
  • Excellent problem-solving skills and the ability to balance long-term resilience work with immediate business needs.
  • Strong communication and leadership skills with a track record of influencing engineering practices across teams.

Benefits

  • Remote-first work environment with the ability to work almost anywhere within the country of employment.
  • Base salary range of CAD $178,000 to CAD $228,000 per year.
  • Equity rewards may be available as part of the total compensation package.
  • 100% subsidized medical, dental, and vision coverage for employees and dependents.
  • Monthly stipends for health, wellness, and tech spending.
  • Flexible Spending Wallets for technology, food, lifestyle needs, and family-forming expenses.
  • Competitive vacation and holiday time off.
  • Employee stock purchase plan (ESPP) with shares available at a discount.
