RapidSOS

RapidSOS is an advanced emergency technology provider that connects life-saving data from various devices, apps, and sensors to emergency responders, enhancing response times and improving outcomes in critical situations.

Diversified Telecommunication Services
51-250 employees
Founded 2013
$281M raised

Description

  • Own the reliability, scalability, and operational health of Kubernetes clusters, shared services, and core AWS infrastructure.
  • Drive infrastructure-as-code standards using Terraform and Atlantis.
  • Partner with engineering managers to define SLOs, error budgets, and service ownership practices.
  • Lead proactive reliability work, including capacity planning, failure mode analysis, runbook quality, chaos engineering, and reliability reviews.
  • Drive blameless postmortems and ensure incidents result in systemic improvements with clear ownership.
  • Run the Tier 1 on-call rotation and coordinate with the third-party NOC.
  • Lead incident command for Sev-1 incidents and keep engineering leadership informed.
  • Mentor and grow the SRE Operations team, including headcount planning as the function expands.
  • Shape the long-term AI strategy for infrastructure and operations through automation and workflow improvements.
  • Own AWS cost management, reserved instance strategy, and reporting on reliability metrics to leadership.
  • Collaborate with Platform SRE on major infrastructure initiatives such as Gateway API adoption, cross-region architecture, and security changes.

Requirements

  • 7+ years of experience in SRE, platform engineering, or DevOps.
  • At least 2 years of experience managing a team.
  • Direct ownership of Kubernetes and AWS infrastructure in production environments where uptime and resilience are critical.
  • Experience shifting teams from reactive operations to engineering-first reliability practices.
  • Experience partnering with engineering teams to improve reliability, scalability, and operational readiness before production issues occur.
  • Ability to write Python and review production-quality scripts and tooling.
  • Hands-on experience with SLOs, error budgets, and blameless postmortems.
  • Hands-on familiarity with Terraform/Atlantis, Kubernetes/Helm/ArgoCD, Datadog, Concourse CI/GitHub Actions, RabbitMQ, and AWS services including EKS, RDS/Aurora, ElastiCache, VPC networking, IAM, KMS, and Route 53.
  • Experience with AI-driven automation or operations tooling is preferred.
  • Experience in mission-critical or high-availability environments is preferred.

Benefits

  • Competitive salary of $185,000 to $215,000.
  • Equity options.
  • Benefits package.
  • Flexible remote work environment.
  • Dynamic, fun startup environment with a highly talented team.
  • Opportunity to work on a mission-driven product with global impact.
