Airalo

Airalo is the world's first eSIM store offering travelers access to eSIMs in 200+ countries & regions at affordable prices. With Airalo, travelers can manage their eSIMs, top up on the go, and enjoy pain-free connectivity while traveling. Say goodbye t...

Airlines
51-250
Founded 2019
$67M raised

Description

  • Lead the design of scalable, fault-tolerant, self-healing systems in a multi-region AWS environment.
  • Define and track SLOs and SLIs to guide architectural decisions and error budget policies.
  • Conduct blameless post-incident reviews to identify root causes and implement preventive measures.
  • Build internal tools and automation to eliminate manual operational work.
  • Develop and maintain automated runbooks and playbooks for operational tasks and incident response.
  • Improve observability by turning high-cardinality data into proactive, actionable insights.
  • Proactively identify and mitigate operational risks through chaos engineering and architecture reviews.
  • Work with software engineers early in the SDLC to design for reliability, scalability, and maintainability.
  • Continuously optimize system performance, capacity, and cost efficiency.
  • Refine the on-call experience to reduce alert fatigue, improve MTTR, and keep rotations sustainable.

Requirements

  • Bachelor’s degree in Computer Engineering or a similar discipline.
  • 5+ years of experience as a Site Reliability Engineer or in a similar role.
  • 3+ years of experience with AWS services, including strong knowledge of container orchestration.
  • 2+ years of Kubernetes experience.
  • Deep understanding of observability principles and tools such as Prometheus, Datadog, and OpenTelemetry.
  • Experience leading incident management and complex postmortem analysis.
  • Experience with infrastructure as code, especially Terraform.
  • Experience with chaos engineering and other resilience-testing techniques.
  • Experience with CI/CD tools such as GitHub Actions for automated delivery.
  • Proficiency in at least one programming language such as Python, Go, or Java for automation and internal tooling.
  • Event-driven architecture experience with SNS, SQS, or similar technologies.
  • Ability to work independently and collaboratively in a fast-paced environment.
  • Strong communication skills and fluency in English.
  • Prior experience with Scrum or other agile methods (preferred).
  • Certification such as AWS Certified DevOps Engineer or Certified Kubernetes Administrator (CKA) (preferred).
  • Prior experience with Telco Core Networks, low-latency networking, telecommunications, eSIM, or GSMA-related technologies (preferred).
  • Experience with AI-driven SRE tools for anomaly detection and improvements (preferred).
  • Contributions to open-source SRE projects or communities (preferred).

Benefits

  • Fully remote work.
  • Generous PTO.
  • Wellness allowance.
  • Learning allowance.
  • Annual Airalo Away retreat.
  • Standby fees and overtime pay for on-call rotations.
  • On-call duties start only after your first 6 months.
  • Guaranteed rest periods and flexible hours after night incidents.
