Firstup

Firstup offers an intelligent communication platform designed to engage employees throughout their entire employment journey, providing insights that help organizations support, promote, and retain their workforce effectively.

Professional Services
251-1K employees
Founded 2008
$47M raised

Description

  • Own the availability, performance, and resilience of the multi-region AWS platform.
  • Drive reliability improvements using SLIs/SLOs, error budgets, and proactive engineering practices.
  • Lead efforts to reduce MTTR and improve incident response across the organization.
  • Guide architecture decisions for microservices, Kubernetes (EKS), and serverless workloads.
  • Advance the observability strategy using Datadog to provide actionable insights across infrastructure and applications.
  • Establish and refine incident management practices, including on-call processes, escalation paths, and post-incident reviews.
  • Act as incident commander for critical events and participate in the on-call rotation.
  • Improve operational standards through automation, standardization, and modern best practices.
  • Drive cost optimization across AWS environments without sacrificing performance or reliability.
  • Lead, mentor, and support a distributed CloudOps team across the US and UK.
  • Oversee operations of a legacy .NET-based solution in private data centers in the US and Europe.

Requirements

  • 10+ years of experience in cloud infrastructure, SRE, or DevOps roles.
  • Recent experience leading CloudOps or SRE teams.
  • Proven experience leading operational or platform transformations in a SaaS environment.
  • Experience operating multi-region, customer-facing systems at scale.
  • Strong hands-on experience with AWS multi-region architectures.
  • Hands-on experience with Kubernetes (EKS) and containerized environments.
  • Infrastructure as Code experience, with Terraform preferred.
  • Experience with CI/CD pipelines such as CircleCI or similar tools.
  • Experience with observability platforms such as Datadog or equivalent tools.
  • Solid understanding of microservices and distributed systems design.
  • Familiarity with serverless architectures and modern cloud-native patterns.
  • Deep experience with incident management, on-call operations, and reliability engineering practices.
  • Strong understanding of SLO/SLI frameworks, monitoring strategies, and performance optimization.
  • Demonstrated ability to balance hands-on technical work with team leadership.
  • Collaborative and pragmatic leadership style with the ability to influence across teams.
  • Passion for building and supporting high-performing teams.
  • Bias toward continuous improvement and measurable outcomes.

Benefits

  • Base salary range of $200,000 to $228,000.
  • Excellent PTO program.
  • Great health benefits.
  • Remote work arrangement.
  • Casual and friendly work environment.
  • Leadership team committed to personal and professional growth.
  • Inclusive, high-growth environment where ideas are rewarded.
  • Opportunity to make a direct impact on reliability, scalability, and customer experience.
