RapidSOS

RapidSOS is an advanced emergency technology provider that connects life-saving data from various devices, apps, and sensors to emergency responders, enhancing response times and improving outcomes in critical situations.

Industry: Diversified Telecommunication Services
Company size: 51-250 employees
Founded: 2013
Funding: $281M raised

Description

  • Own performance and reliability outcomes for services operating at scale, including the application-level choices that affect system behavior.
  • Design and implement resilience improvements such as safer deployment patterns, failover strategies, and redundancy.
  • Instrument services with structured logging, metrics, and alerting to improve observability and debugging.
  • Take production incidents from first signal through root cause analysis and resolution, including fixes that strengthen long-term stability.
  • Work across infrastructure-as-code, container orchestration, CI/CD pipelines, and service-level application code to resolve issues end to end.
  • Collaborate with engineering teams to improve reliability and performance across the systems they own.
  • Investigate issues across infrastructure and application layers to identify and fix problems at the source.
  • Help shape the organization’s reliability practices by improving visibility, resilience, and operational readiness.

Requirements

  • 5+ years of professional engineering experience with deep expertise in Python.
  • Experience with AWS infrastructure, including networking, managed databases, IAM, DNS-based routing, failover, and traffic-routing cost implications.
  • Hands-on experience running containerized workloads in production on Kubernetes (EKS), ECS, or Fargate.
  • Strong understanding of distributed systems failure modes, including resource exhaustion, replication lag, and queue backpressure.
  • Experience operating high-throughput messaging systems such as RabbitMQ, Kafka, AWS SNS, or SQS.
  • Experience with infrastructure-as-code tools such as Terraform and CI/CD pipelines, with a focus on reliability and scalability.
  • Experience building or improving observability through logging, metrics, and alerting.
  • Demonstrable experience using AI to safely and securely improve velocity, reliability, and recoverability of services.
  • Strong communication and interpersonal skills, and the ability to collaborate effectively within a team.
  • Strong proficiency in coding best practices and the ability to write clean, maintainable, and testable code.
  • Demonstrated problem-solving ability across both infrastructure and application layers.
  • Ability and willingness to collaborate in person a few times per quarter, or as needed.
  • Preferred: experience supporting production systems in an on-call or similar reliability-focused capacity.
  • Preferred: experience with Datadog, Elasticsearch/OpenSearch, ArgoCD-based GitOps deployments, and modernizing legacy CI/CD tools such as Concourse or Jenkins.

Benefits

  • Salary range of $160,000 to $195,000, depending on experience, skills, education, location, and business needs.
  • Equity options or other equity participation.
  • Competitive salary and benefits package.
  • Flexible, dynamic, and fun startup work environment.
  • Opportunity to work with a passionate, highly talented team on a mission-driven problem.
  • Remote-friendly role.
  • Equal opportunity workplace with inclusive hiring practices.
