Airalo

Airalo is the world's first eSIM store offering travelers access to eSIMs in 200+ countries & regions at affordable prices. With Airalo, travelers can manage their eSIMs, top up on the go, and enjoy pain-free connectivity while traveling. Say goodbye t...

Airlines
51-250 employees
Founded 2019
$67M raised

Description

  • Lead the design of scalable, fault-tolerant, self-healing systems in a multi-region AWS environment.
  • Define and track SLOs and SLIs to guide architectural decisions and error budget policies.
  • Conduct blameless post-incident reviews to identify root causes and implement preventive measures.
  • Build internal tools and automation to eliminate manual operational work.
  • Develop and maintain automated runbooks and playbooks for operational tasks and incident response.
  • Improve observability by turning high-cardinality data into proactive, actionable insights.
  • Proactively identify and mitigate operational risks through chaos engineering and architecture reviews.
  • Work with software engineers early in the SDLC to design for reliability, scalability, and maintainability.
  • Continuously optimize system performance, capacity, and cost efficiency.
  • Refine the on-call experience to reduce alert fatigue, improve MTTR, and keep rotations sustainable.

Requirements

  • Bachelor’s degree in Computer Engineering or a similar discipline.
  • 5+ years of experience as a Site Reliability Engineer or in a similar role.
  • 3+ years of experience with AWS services, including strong knowledge of container orchestration.
  • 2+ years of Kubernetes experience.
  • Deep understanding of observability principles and tools such as Prometheus, Datadog, and OpenTelemetry.
  • Experience leading incident management and complex postmortem analysis.
  • Experience with infrastructure as code, especially Terraform.
  • Experience with chaos engineering and other resilience-testing techniques.
  • Experience with CI/CD tools such as GitHub Actions for automated delivery.
  • Proficiency in at least one programming language such as Python, Go, or Java for automation and internal tooling.
  • Event-driven architecture experience with SNS, SQS, or similar technologies.
  • Ability to work independently and collaboratively in a fast-paced environment.
  • Strong communication skills and fluency in English.
  • Prior experience with Scrum or other agile methods (preferred).
  • Certification such as AWS Certified DevOps Engineer or Certified Kubernetes Administrator (CKA) (preferred).
  • Prior experience with Telco Core Networks, low-latency networking, telecommunications, eSIM, or GSMA-related technologies (preferred).
  • Experience with AI-driven SRE tools for anomaly detection and improvements (preferred).
  • Contributions to open-source SRE projects or communities (preferred).

Benefits

  • Fully remote work.
  • Generous PTO.
  • Wellness allowance.
  • Learning allowance.
  • Annual Airalo Away retreat.
  • Standby fees and overtime pay for on-call rotations.
  • Delayed on-call start for the first 6 months.
  • Guaranteed rest periods and flexible hours after night incidents.
