Dropbox

Dropbox is a technology company that builds simple, powerful products for individuals and businesses. With over 700 million registered users worldwide, Dropbox offers file sync, sharing, online backup, cloud storage, collaboration tools, and more to st...

Internet Software & Services
1K-5K
Founded 2007

Description

  • Ensure the reliability, scalability, and performance of Dropbox's infrastructure and services.
  • Collaborate with cross-functional teams to define and maintain best practices for monitoring, logging, and incident response.
  • Build, implement, and maintain automation and infrastructure-as-code tooling, including Terraform, Ansible, GitHub Actions, and custom code platforms.
  • Use container orchestration platforms such as Kubernetes, Amazon ECS, and Red Hat OpenShift to manage containers at scale.
  • Manage and optimize monitoring and logging pipelines using tools such as Datadog and Cribl LogStream.
  • Drive service health and visibility improvement projects for stakeholders across technical and business groups.
  • Develop and maintain custom tooling and automation scripts in Bash, Python, and other scripting languages.
  • Handle incidents and occasional on-call duties to address bugs, outages, and other operational issues.
  • Contribute to the evolution of infrastructure while ensuring security and scalability.

Requirements

  • 5+ years of experience in site reliability engineering or a similar engineering role with hands-on coding experience.
  • Strong knowledge of AWS services, including EC2, S3, RDS, Route 53, Lambda, and others.
  • Strong knowledge of Linux administration and internals, including filesystems and volume management, on Ubuntu and RHEL, along with DNS and DHCP.
  • Experience with monitoring and logging tools such as Datadog and pipeline tools such as Vector or Cribl LogStream.
  • Experience driving transformational programs related to metrics and observability.
  • Experience with scripting in a higher-level language, with Python preferred.
  • Experience developing automation for infrastructure-related tasks using Chef, Ansible, or Terraform.
  • Experience with log analysis and building metrics, alerts, and visuals from log data.
  • Strong proficiency in infrastructure-as-code tools, such as Terraform.
  • Strong proficiency in configuration management tools, specifically Ansible Automation Platform and Chef.
  • Experience with containerization technologies such as Docker and orchestration platforms like Kubernetes or Amazon ECS.
  • Knowledge of LDAP, REST APIs, and current authentication systems.
  • Familiarity with GitHub and Git-based workflows.
  • Understanding of RDS databases and network security technologies such as WAF.
  • Strong problem-solving skills and the ability to work well in a fast-paced, collaborative environment.
  • Excellent written and verbal communication skills.
  • Experience managing large-scale multi-cloud or hybrid infrastructure.
  • Strong background in Infrastructure as Code and GitOps workflows.
  • Familiarity with Kubernetes, Docker, and serverless platforms.
  • Proven track record improving observability, reliability, and incident response.
  • Understanding of compliance and security frameworks such as SOC 2, ISO 27001, and FedRAMP.
  • Experience implementing Zero Trust security and access models.
