Airalo

Airalo is the world's first eSIM store offering travelers access to eSIMs in 200+ countries & regions at affordable prices. With Airalo, travelers can manage their eSIMs, top up on the go, and enjoy pain-free connectivity while traveling. Say goodbye t...

Airlines
51-250
Founded 2019
$67M raised

Description

  • Lead the design of scalable, fault-tolerant, and self-healing systems in a multi-region AWS environment.
  • Define and track service level objectives (SLOs) and service level indicators (SLIs) to inform reliability and error budget decisions.
  • Conduct blameless post-incident reviews, identify systemic root causes, and implement long-term preventive measures.
  • Develop internal tools and automation to eliminate recurring manual operational work.
  • Create and maintain automated runbooks and playbooks for operational tasks and incident response.
  • Improve observability by using high-cardinality data to turn monitoring into actionable insights.
  • Proactively identify and mitigate operational risk through chaos engineering and architecture reviews.
  • Partner with software engineers early in the SDLC to design for reliability, scalability, and maintainability.
  • Continuously evaluate and optimize system performance, capacity, and cost efficiency.
  • Improve the on-call experience by reducing alert fatigue, lowering MTTR, and keeping on-call rotations sustainable.

Requirements

  • Bachelor’s degree in Computer Engineering or a similar discipline.
  • 5+ years of experience as a Site Reliability Engineer or in a similar role.
  • 3+ years of experience with AWS services, including strong knowledge of container orchestration.
  • 2+ years of Kubernetes experience.
  • Deep understanding of observability principles and tools such as Prometheus, Datadog, and OpenTelemetry.
  • Experience leading incident management and complex postmortem analysis.
  • Experience with infrastructure as code, preferably Terraform.
  • Experience with chaos engineering and other resilience-testing techniques.
  • Experience with CI/CD tools such as GitHub Actions for automated delivery.
  • Proficiency in at least one programming language such as Python, Go, or Java for automation and internal tooling.
  • Experience with event-driven architectures built on services such as SNS and SQS.
  • Ability to work independently and collaboratively in a fast-paced environment.
  • Strong communication skills and fluency in English.
  • Experience with Scrum or other agile methods (preferred).
  • Certification such as AWS Certified DevOps Engineer or Certified Kubernetes Administrator (CKA) (preferred).
  • Experience with Telco Core Networks, low-latency networking, telecommunications, eSIM, or GSMA technologies (preferred).
  • Experience with AI-driven SRE tools for anomaly detection and improvements (preferred).
  • Contributions to open-source SRE projects or communities (preferred).

Benefits

  • Remote-first work environment with the option to work from anywhere.
  • Health insurance.
  • Work-from-anywhere stipend.
  • Annual wellness and learning credits.
  • Annual all-expenses-paid company retreat in a destination location.
  • Paid on-call rotation with standby fees and overtime pay.
  • No on-call duties during the first 6 months.
  • Guaranteed rest periods and flexible hours after night incidents.
