Firstup

Firstup offers an intelligent communication platform designed to engage employees throughout their entire employment journey, providing insights that help organizations support, promote, and retain their workforce effectively.

Professional Services
251-1K employees
Founded 2008
$47M raised

Description

  • Own the availability, performance, and resilience of the multi-region AWS platform.
  • Drive reliability improvements using SLIs/SLOs, error budgets, and proactive engineering practices.
  • Lead efforts to reduce MTTR and improve incident response across the organization.
  • Guide architecture decisions for microservices, Kubernetes (EKS), and serverless workloads.
  • Advance the observability strategy using Datadog to provide actionable insights across infrastructure and applications.
  • Establish and refine incident management practices, including on-call processes, escalation paths, and post-incident reviews.
  • Act as incident commander for critical events and participate in the on-call rotation.
  • Improve operational standards through automation, standardization, and modern best practices.
  • Drive cost optimization across AWS environments without sacrificing performance or reliability.
  • Lead, mentor, and support a distributed CloudOps team across the US and UK.
  • Oversee operations of a legacy .NET-based solution in private data centers in the US and Europe.

Requirements

  • 10+ years of experience in cloud infrastructure, SRE, or DevOps roles.
  • Recent experience leading CloudOps or SRE teams.
  • Proven experience leading operational or platform transformations in a SaaS environment.
  • Experience operating multi-region, customer-facing systems at scale.
  • Strong hands-on experience with AWS multi-region architectures.
  • Hands-on experience with Kubernetes (EKS) and containerized environments.
  • Infrastructure as Code experience, with Terraform preferred.
  • Experience with CI/CD pipelines such as CircleCI or similar tools.
  • Experience with observability platforms such as Datadog or equivalent tools.
  • Solid understanding of microservices and distributed systems design.
  • Familiarity with serverless architectures and modern cloud-native patterns.
  • Deep experience with incident management, on-call operations, and reliability engineering practices.
  • Strong understanding of SLO/SLI frameworks, monitoring strategies, and performance optimization.
  • Demonstrated ability to balance hands-on technical work with team leadership.
  • Collaborative and pragmatic leadership style, with the ability to influence across teams.
  • Passion for building and supporting high-performing teams.
  • Bias toward continuous improvement and measurable outcomes.

Benefits

  • Base salary range of $200,000 to $228,000.
  • Excellent PTO program.
  • Great health benefits.
  • Remote work arrangement.
  • Casual and friendly work environment.
  • Leadership team committed to personal and professional growth.
  • Inclusive, high-growth environment where ideas are rewarded.
  • Opportunity to make a direct impact on reliability, scalability, and customer experience.
