Firstup

Firstup offers an intelligent communication platform designed to engage employees throughout their entire employment journey, providing insights that help organizations support, promote, and retain their workforce effectively.

Professional Services
251-1K
Founded 2008
$47M raised

Description

  • Own the availability, performance, and resilience of the multi-region AWS platform.
  • Drive reliability improvements using SLIs/SLOs, error budgets, and proactive engineering practices.
  • Lead efforts to reduce MTTR and improve incident response across the organization.
  • Guide architecture decisions for microservices, Kubernetes (EKS), and serverless workloads.
  • Advance the observability strategy using Datadog to provide actionable insights across infrastructure and applications.
  • Establish and refine incident management practices, including on-call processes, escalation paths, and post-incident reviews.
  • Act as incident commander for critical events and participate in the on-call rotation.
  • Improve operational standards through automation, standardization, and modern best practices.
  • Drive cost optimization across AWS environments without sacrificing performance or reliability.
  • Lead, mentor, and support a distributed CloudOps team across the US and UK.
  • Oversee operations of a legacy .NET-based solution in private data centers in the US and Europe.

Requirements

  • 10+ years of experience in cloud infrastructure, SRE, or DevOps roles.
  • Recent experience leading CloudOps or SRE teams.
  • Proven experience leading operational or platform transformations in a SaaS environment.
  • Experience operating multi-region, customer-facing systems at scale.
  • Strong hands-on experience with AWS multi-region architectures.
  • Hands-on experience with Kubernetes (EKS) and containerized environments.
  • Infrastructure as Code experience, with Terraform preferred.
  • Experience with CI/CD pipelines such as CircleCI or similar tools.
  • Experience with observability platforms such as Datadog or equivalent tools.
  • Solid understanding of microservices and distributed systems design.
  • Familiarity with serverless architectures and modern cloud-native patterns.
  • Deep experience with incident management, on-call operations, and reliability engineering practices.
  • Strong understanding of SLO/SLI frameworks, monitoring strategies, and performance optimization.
  • Demonstrated ability to balance hands-on technical work with team leadership.
  • Collaborative and pragmatic leadership style, with the ability to influence across teams.
  • Passion for building and supporting high-performing teams.
  • Bias toward continuous improvement and measurable outcomes.

Benefits

  • Base salary range of $200,000 to $228,000.
  • Excellent PTO program.
  • Great health benefits.
  • Remote work arrangement.
  • Casual and friendly work environment.
  • Leadership team committed to personal and professional growth.
  • Inclusive, high-growth environment where ideas are rewarded.
  • Opportunity to make a direct impact on reliability, scalability, and customer experience.
