RapidSOS

RapidSOS is an emergency technology provider that connects life-saving data from devices, apps, and sensors to emergency responders, improving response times and outcomes in critical situations.

Industry: Diversified Telecommunication Services
Company size: 51-250
Founded: 2013
Funding: $281M raised

Description

  • Own the reliability, scalability, and operational health of Kubernetes clusters, shared services, and core AWS infrastructure.
  • Drive infrastructure-as-code standards using Terraform and Atlantis.
  • Partner with engineering managers to define SLOs, error budgets, and service ownership practices.
  • Lead proactive reliability work, including capacity planning, failure mode analysis, runbook quality, chaos engineering, and reliability reviews.
  • Drive blameless postmortems and ensure incidents result in systemic improvements with clear ownership.
  • Run the Tier 1 on-call rotation and coordinate with the third-party NOC.
  • Lead incident command for Sev-1 incidents and keep engineering leadership informed.
  • Mentor and grow the SRE Operations team, including headcount planning as the function expands.
  • Shape the long-term AI strategy for infrastructure and operations through automation and workflow improvements.
  • Own AWS cost management, reserved instance strategy, and reporting on reliability metrics to leadership.
  • Collaborate with Platform SRE on major infrastructure initiatives such as Gateway API adoption, cross-region architecture, and security changes.

Requirements

  • 7+ years of experience in SRE, platform engineering, or DevOps.
  • At least 2 years of experience managing a team.
  • Direct ownership of Kubernetes and AWS infrastructure in production environments where uptime and resilience are critical.
  • Experience shifting teams from reactive operations to engineering-first reliability practices.
  • Experience partnering with engineering teams to improve reliability, scalability, and operational readiness before production issues occur.
  • Ability to write Python and review production-quality scripts and tooling.
  • Hands-on experience with SLOs, error budgets, and blameless postmortems.
  • Hands-on familiarity with Terraform/Atlantis, Kubernetes/Helm/ArgoCD, Datadog, Concourse CI/GitHub Actions, RabbitMQ, and AWS services including EKS, RDS/Aurora, ElastiCache, VPC networking, IAM, KMS, and Route 53.
  • Experience with AI-driven automation or operations tooling is preferred.
  • Experience in mission-critical or high-availability environments is preferred.

Benefits

  • Competitive salary of $185,000 to $215,000.
  • Equity options.
  • Competitive benefits package.
  • Flexible remote work environment.
  • Dynamic, fun startup environment with a highly talented team.
  • Opportunity to work on a mission-driven product with global impact.


