Tyk API Management

Tyk is a leading API Management Platform that enables interconnectivity between systems and devices through its fast, scalable, and open-source API Gateway, Analytics, Dev Portal, and Dashboard.

Internet Software & Services
51-250
Founded 2015
$40M raised

Description

  • Maintain Tyk Cloud availability and help define SLA/SLO/SLI targets.
  • Identify reliability issues and work with the squad to resolve them.
  • Create and improve metrics and dashboards to monitor platform health.
  • Participate in the on-call rotation and serve as first-line incident management support.
  • Conduct post-incident analysis and help define response processes.
  • Automate common operational tasks and improve support workflows.
  • Document operational knowledge, SRE processes, and policies.
  • Support the expansion of the platform across multi-region and multi-cloud environments.
  • Recommend and implement ways to improve operational efficiency and reduce running costs without affecting service.
  • Assist with cloud penetration testing by coordinating with the provider and preparing technical details and environment setup.

Requirements

  • Experience launching and operating production-scale Kubernetes clusters.
  • Experience designing and operating infrastructure on AWS and other cloud providers.
  • Experience operating MongoDB or similar document databases.
  • Experience operating Redis or similar key-value storage clusters.
  • Experience administering Linux servers and maintaining distributed software.
  • Experience operating Prometheus, Grafana, and logging collection/analysis systems.
  • Strong collaboration skills and a proactive, energetic, innovative, change-oriented mindset.
  • Advanced knowledge of Kubernetes and containers, AWS/EKS, and Linux.
  • Proficiency with Terraform (infrastructure as code) and Helm.
  • Familiarity with Go, monitoring tools such as Thanos, and networking concepts including subnets, routing, peering, load balancing, NAT, DNS, TCP/IP, HTTP, TLS, and UDP.
  • Availability to participate in the on-call rotation, including the 16:00–04:00 UTC window.
  • Nice to have: experience with GCP or Azure, bare metal infrastructure, API management, large-scale distributed storage, Rancher, CKA/CKAD/CKS certifications, or production software delivery in Go.

Benefits

  • Unlimited paid holiday.
  • Remote working from anywhere in the world.
  • Flexible working hours.
  • Employee share scheme.
  • Generous maternity and paternity leave.
  • Company retreats.
  • An inclusive, values-driven culture that emphasizes authenticity, respect, responsibility, independence, honesty, diversity, and inclusion.
