Site Reliability Engineer (Senior or Staff), Storage Layer Services (SLS)

2 weeks, 6 days ago
Full-time
Senior
Software Development
MongoDB

MongoDB

MongoDB provides a developer data platform that simplifies data management and accelerates application development, enabling businesses to leverage modern database technology for innovative solutions across various industries.

Internet Software & Services
1K-5K
Founded 2007

Description

  • Work on multi-tenant distributed storage systems while balancing long-term infrastructure goals with immediate engineering needs.
  • Build reliable, resilient, fault-tolerant, and self-healing services and infrastructure.
  • Define and configure metrics to detect incidents and measure service health, availability, and performance.
  • Participate in a 24/7 on-call rotation to resolve storage infrastructure issues.
  • Optimize infrastructure performance across the stack, from the application layer down to the kernel.
  • Partner with engineering teams to define SLOs and capacity plans for storage services.
  • Support the operational safety, durability, and consistency of the Atlas storage layer.

Requirements

  • 6+ years of experience in software development and operating distributed systems.
  • Proficiency in Python, Go, or a similar programming language.
  • Experience operating or supporting stateful storage or database systems at scale.
  • Comfort with durability, consistency, and recovery trade-offs in storage systems.
  • Customer-focused mindset.
  • Strong bias toward efficiency and automation over manual processes.
  • Experience using and extending Kubernetes or similar containerization technologies.
  • Experience with cloud infrastructure platforms such as AWS, Google Cloud Platform (GCP), or Azure.
  • Understanding of Linux internals and networking concepts including TCP/IP, DNS, TLS, and routing.
  • Preferred: Experience leading major architectural shifts from legacy storage stacks to multi-tenant storage architectures.
  • Preferred: Experience planning and executing large-scale data and workload migrations with tight availability and durability requirements.
  • Preferred: Experience managing and scaling infrastructure across multi-cloud environments.
  • Preferred: Experience designing secure, multi-tenant runtime environments at scale.

Benefits

  • Base salary range of $144,000 to $248,000 USD for U.S.-based candidates.
  • Equity and participation in the employee stock purchase program.
  • Flexible paid time off.
  • 20 weeks of fully paid gender-neutral parental leave.
  • Fertility and adoption assistance.
  • 401(k) plan.
  • Mental health counseling.
  • Access to transgender-inclusive health insurance coverage and other health benefits.
