Tyk API Management

Tyk is a leading API Management Platform that enables interconnectivity between systems and devices through its fast, scalable, and open-source API Gateway, Analytics, Dev Portal, and Dashboard.

Industry: Internet Software & Services
Company size: 51-250 employees
Founded: 2015
Funding: $40M raised

Description

  • Lead hands-on maintenance and optimization of the global cloud platform within defined SLAs, SLOs, and SLIs.
  • Collaborate with the SRE team to shape strategy and translate it into actionable technical plans.
  • Identify reliability issues, perform root cause analysis, and implement corrective solutions with the squad.
  • Lead performance tuning and fault-finding using OS and application metrics.
  • Design and implement automation for operational tasks and cloud operations workflows.
  • Develop monitoring, alerting, dashboards, and KPIs to improve platform visibility and response.
  • Participate in on-call rotation and support effective incident response, resolution, and postmortems.
  • Document operational findings, maintain runbooks, and drive continuous improvement across processes and practices.
  • Support multi-region and multi-cloud expansion with a focus on scalability and automation.
  • Engage with commercial teams on growth plans and translate them into technical SRE strategy.
  • Coordinate penetration testing and plan software upgrades to improve cloud services.

Requirements

  • Experience in an SRE role.
  • Strong knowledge of cloud technologies and SLA, SLO, and SLI management.
  • Experience with software design, automation, and root cause analysis.
  • Experience supporting production systems on-call with a customer-focused mindset.
  • Excellent communication and leadership skills.
  • Ability to analyze and improve operational processes and performance metrics.
  • Hands-on experience launching and operating production Kubernetes clusters.
  • Experience designing and operating infrastructure on AWS and other cloud providers.
  • Experience operating MongoDB or another document database, Redis or another key-value store, and Linux servers.
  • Experience with Prometheus, Grafana, and logging collection/analysis systems.
  • Advanced knowledge of Go, AWS/EKS, and Linux.
  • Proficient with Terraform and infrastructure as code, plus Helm.
  • Familiarity with monitoring tools such as Prometheus, Grafana, and Thanos.
  • Strong grasp of networking concepts and protocols such as DNS, TCP/IP, HTTP, TLS, UDP, subnets, routing, peering, load balancing, and NAT.
  • Ability to participate in the on-call rotation, including early-morning coverage from 04:00 to 16:00 UTC.
  • Proactive, energetic, innovative, and change-oriented, with a desire to lead or mentor a team.

Benefits

  • Unlimited paid holidays.
  • Remote working from anywhere in the world.
  • Flexible working hours.
  • Employee share scheme.
  • Generous maternity and paternity leave.
  • Volunteering days.
  • Employee wellbeing platform.
