Remote

Global HR solutions and employment tools for distributed teams. Hire international talent in minutes. Remote is the most disruptive global payroll, tax, HR, and compliance solution for distributed teams. The easier way to employ internationally 🌍...

Professional Services
251-1K employees
Founded 2019
$496M raised

Description

  • Own the technical direction, architecture, tooling, and long-term roadmap for Remote's SRE/platform domain.
  • Define and drive the platform reliability strategy, including SLOs, SLIs, error budgets, observability, and incident management maturity.
  • Lead complex, cross-team infrastructure initiatives from discovery through delivery while aligning work to business goals.
  • Identify and lead AI enablement initiatives that reduce operational toil and improve engineering workflows, incident response, and platform capabilities.
  • Drive AI-powered automation for platform operations, including intelligent alerting, automated incident triage, self-healing infrastructure, and AI-assisted runbooks.
  • Contribute to capacity planning and cost-efficiency across Remote's infrastructure.
  • Mentor senior engineers through code reviews, design feedback, and hands-on guidance.
  • Collaborate with the Security team on platform hardening, threat mitigation, and compliance.
  • Raise engineering quality standards, manage technical debt deliberately, and champion best practices across the SRE team.
  • Contribute to hiring, onboarding, and continuous improvement of how the SRE team operates.

Requirements

  • 8+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
  • Deep expertise in Kubernetes, including operating, designing, and scaling production clusters.
  • Experience designing and managing cloud infrastructure on AWS or other cloud providers at scale.
  • Strong infrastructure-as-code experience with Terraform.
  • Experience defining and operating reliability frameworks such as SLOs, SLIs, error budgets, and alerting strategies.
  • Solid observability experience with Datadog, Grafana/Prometheus, or similar tools.
  • Proficiency with CI/CD platforms such as GitLab CI or GitHub Actions, plus deployment automation.
  • Comfort with Bash and scripting for automation; broader programming skills are a plus.
  • Experience with Docker and the broader container ecosystem.
  • Experience applying AI tools to infrastructure, operations, or developer tooling.
  • Proven track record of driving platform-wide technical initiatives and influencing engineering direction without formal authority.
  • Strong communication skills with the ability to tailor messaging to technical and non-technical audiences.
  • Self-directed with the ability to identify priorities and execute with minimal supervision.
  • Experience mentoring senior engineers and helping others lead and grow.
  • Comfort navigating ambiguity and translating vague requirements into concrete solutions.
  • Ability to evaluate technical decisions through a business lens, considering cost and value.
  • Excellent communication and interpersonal skills (nice to have).
  • Holistic debugging skills (nice to have).
  • Security knowledge and capabilities from both defensive and offensive perspectives (nice to have).

Benefits

  • Base salary range of $188,550 to $212,150.
  • Fair, unbiased compensation with pay equity; compensation is also reviewed for internal moves.
  • Stock options.
  • Work from anywhere / fully remote role.
  • Flexible paid time off.
  • Flexible working hours in an async environment.
  • 16 weeks of paid parental leave.
  • Mental health support services.
  • Learning budget.
  • Home office budget and IT equipment.
  • Budget for local in-person social events or co-working spaces.
