Service Reliability Lead

3 hours, 7 minutes ago
Full-time
Senior
DevOps and Infrastructure
SPD Technology

SPD Technology specializes in custom software product development, focusing on fintech and payment solutions, as well as AI/ML solutions, data engineering, and cloud services to help businesses leverage technology for growth and innovation.

Internet Software & Services
Founded 2006

Description

  • Own the L2/L3/L4 incident escalation path and serve as the senior technical contact for all incidents.
  • Lead incident response end-to-end, including P1 triage, real-time technical decisions, workaround management, and RCA delivery.
  • Build and maintain the monitoring, alerting, and on-call stack using CloudWatch, Grafana, and PagerDuty.
  • Manage SLA measurement, validation, reporting, and service credit/penalty mechanics.
  • Prepare monthly service performance reports covering incident volumes, SLA performance, RCA status, and risk items.
  • Author Root Cause Analysis documents within 5 days of incident resolution.
  • Identify recurring incident patterns and design Service Improvement Plans with corrective actions and delivery timelines.
  • Operate in alignment with the AWS Shared Responsibility Model and help distinguish internal versus third-party failures.
  • Coordinate with vendors, clients, and the support team during incidents, maintenance windows, and escalations.
  • Proactively reduce incident frequency and improve mean time to resolution.

Requirements

  • 5–8 years of experience in production operations or Site Reliability Engineering.
  • Hands-on incident command experience.
  • Strong AWS operational experience with CloudWatch, EKS, RDS, and networking.
  • Experience with monitoring and alerting tools such as Grafana, CloudWatch, and PagerDuty.
  • Working knowledge of PCI DSS.
  • Experience writing RCAs and using structured problem-solving approaches.
  • Experience with SLA management and service credit mechanics.
  • Experience with hypercare or go-live stabilization periods.
  • Experience in fintech or payment systems is preferred.
  • Ability to work within European time zones (UTC+1/UTC+2).

Benefits

  • Fully remote work with a flexible working schedule.
  • Stable workload and stable income.
  • Provided laptop and licensed software.
  • Performance and merit reviews.
  • Personal development plans and individual learning support.
  • Access to a corporate library and public speaking support.
  • Referral bonus program.
  • Company-wide tech and cultural events, plus CSR initiatives.
