Firstup

Firstup offers an intelligent communication platform designed to engage employees throughout their entire employment journey, providing insights that help organizations support, promote, and retain their workforce effectively.

Professional Services
251-1K employees
Founded 2008
$47M raised

Description

  • Own the availability, performance, and resilience of the multi-region AWS platform.
  • Drive reliability improvements using SLIs/SLOs, error budgets, and proactive engineering practices.
  • Lead efforts to reduce MTTR and improve incident response across the organization.
  • Guide architecture decisions for microservices, Kubernetes (EKS), and serverless workloads.
  • Advance the observability strategy using Datadog to provide actionable insights across infrastructure and applications.
  • Establish and refine incident management practices, including on-call processes, escalation paths, and post-incident reviews.
  • Act as incident commander for critical events and participate in the on-call rotation.
  • Improve operational standards through automation, standardization, and modern best practices.
  • Drive cost optimization across AWS environments without sacrificing performance or reliability.
  • Lead, mentor, and support a distributed CloudOps team across the US and UK.
  • Oversee operations of a legacy .NET-based solution in private data centers in the US and Europe.

Requirements

  • 10+ years of experience in cloud infrastructure, SRE, or DevOps roles.
  • Recent experience leading CloudOps or SRE teams.
  • Proven experience leading operational or platform transformations in a SaaS environment.
  • Experience operating multi-region, customer-facing systems at scale.
  • Strong hands-on experience with AWS multi-region architectures.
  • Hands-on experience with Kubernetes (EKS) and containerized environments.
  • Infrastructure as Code experience, with Terraform preferred.
  • Experience with CI/CD pipelines such as CircleCI or similar tools.
  • Experience with observability platforms such as Datadog or equivalent tools.
  • Solid understanding of microservices and distributed systems design.
  • Familiarity with serverless architectures and modern cloud-native patterns.
  • Deep experience with incident management, on-call operations, and reliability engineering practices.
  • Strong understanding of SLO/SLI frameworks, monitoring strategies, and performance optimization.
  • Demonstrated ability to balance hands-on technical work with team leadership.
  • Collaborative and pragmatic leadership style, with the ability to influence across teams.
  • Passion for building and supporting high-performing teams.
  • Bias toward continuous improvement and measurable outcomes.

Benefits

  • Base salary range of $200,000 to $228,000.
  • Excellent PTO program.
  • Great health benefits.
  • Remote work arrangement.
  • Casual and friendly work environment.
  • Leadership team committed to personal and professional growth.
  • Inclusive, high-growth environment where ideas are rewarded.
  • Opportunity to make a direct impact on reliability, scalability, and customer experience.
