Site Reliability Engineer (Senior or Staff), Storage Layer Services (SLS)

Full-time
Senior
Software Development
MongoDB

MongoDB provides a developer data platform that simplifies data management and accelerates application development, enabling businesses to leverage modern database technology for innovative solutions across various industries.

Internet Software & Services
1K-5K
Founded 2007

Description

  • Work on multi-tenant distributed storage systems while balancing long-term infrastructure goals with immediate engineering needs.
  • Build reliable, resilient, fault-tolerant, and self-healing services and infrastructure.
  • Define and configure metrics to detect incidents and measure service health, availability, and performance.
  • Participate in a 24/7 on-call rotation to resolve storage infrastructure issues.
  • Optimize infrastructure performance across the stack, from the application layer down to the kernel.
  • Partner with engineering teams to define SLOs and capacity plans for storage services.
  • Support the operational safety, durability, and consistency of the Atlas storage layer.

Requirements

  • 6+ years of experience in software development and operating distributed systems.
  • Proficiency in Python, Go, or a similar programming language.
  • Experience operating or supporting stateful storage or database systems at scale.
  • Comfort with durability, consistency, and recovery trade-offs in storage systems.
  • Customer-focused mindset.
  • Strong bias toward efficiency and automation over manual processes.
  • Experience using and extending Kubernetes or similar container orchestration technologies.
  • Experience with cloud infrastructure platforms such as AWS, Google Cloud Platform (GCP), or Azure.
  • Understanding of Linux internals and networking concepts including TCP/IP, DNS, TLS, and routing.
  • Preferred: Experience leading major architectural shifts from legacy storage stacks to multi-tenant storage architectures.
  • Preferred: Experience planning and executing large-scale data and workload migrations with tight availability and durability requirements.
  • Preferred: Experience managing and scaling infrastructure across multi-cloud environments.
  • Preferred: Experience designing secure, multi-tenant runtime environments at scale.

Benefits

  • Base salary range of $144,000 to $248,000 USD for U.S.-based candidates.
  • Equity and participation in the employee stock purchase program.
  • Flexible paid time off.
  • 20 weeks of fully paid gender-neutral parental leave.
  • Fertility and adoption assistance.
  • 401(k) plan.
  • Mental health counseling.
  • Access to transgender-inclusive health insurance coverage and other health benefits.
