Airalo

Airalo is the world's first eSIM store offering travelers access to eSIMs in 200+ countries & regions at affordable prices. With Airalo, travelers can manage their eSIMs, top up on the go, and enjoy pain-free connectivity while traveling. Say goodbye t...

Airlines
51-250
Founded 2019
$67M raised

Description

  • Lead the design of scalable, fault-tolerant, and self-healing systems in a multi-region AWS environment.
  • Define and track service level objectives (SLOs) and service level indicators (SLIs) to inform reliability and error budget decisions.
  • Conduct blameless post-incident reviews, identify systemic root causes, and implement long-term preventive measures.
  • Develop internal tools and automation to eliminate recurring manual operational work.
  • Create and maintain automated runbooks and playbooks for operational tasks and incident response.
  • Improve observability by turning high-cardinality monitoring data into actionable insights.
  • Proactively identify and mitigate operational risk through chaos engineering and architecture reviews.
  • Partner with software engineers early in the SDLC to design for reliability, scalability, and maintainability.
  • Continuously evaluate and optimize system performance, capacity, and cost efficiency.
  • Improve the on-call experience by reducing alert fatigue, lowering MTTR, and supporting sustainable rotation health.

Requirements

  • Bachelor’s degree in Computer Engineering or a similar discipline.
  • 5+ years of experience as a Site Reliability Engineer or in a similar role.
  • 3+ years of experience with AWS services, including strong knowledge of container orchestration.
  • 2+ years of Kubernetes experience.
  • Deep understanding of observability principles and tools such as Prometheus, Datadog, and OpenTelemetry.
  • Experience leading incident management and complex postmortem analysis.
  • Experience with infrastructure as code, preferably Terraform.
  • Experience with chaos engineering and other resilience-testing techniques.
  • Experience with CI/CD tools such as GitHub Actions for automated delivery.
  • Proficiency in at least one programming language such as Python, Go, or Java for automation and internal tooling.
  • Experience with event-driven architectures built on services such as SNS and SQS.
  • Ability to work independently and collaboratively in a fast-paced environment.
  • Strong communication skills and fluency in English.
  • Experience with Scrum or other agile methods (preferred).
  • Certification such as AWS Certified DevOps Engineer or Certified Kubernetes Administrator (CKA) (preferred).
  • Experience with Telco Core Networks, low-latency networking, telecommunications, eSIM, or GSMA technologies (preferred).
  • Experience with AI-driven SRE tools for anomaly detection and reliability improvements (preferred).
  • Contributions to open-source SRE projects or communities (preferred).

Benefits

  • Remote-first work environment with the option to work from anywhere.
  • Health insurance.
  • Work-from-anywhere stipend.
  • Annual wellness and learning credits.
  • Annual all-expenses-paid company retreat in a destination location.
  • Paid on-call rotation with standby fees and overtime pay.
  • No on-call duties during the first 6 months.
  • Guaranteed rest periods and flexible hours after night incidents.
