Xsolla

Xsolla is an international payment solution provider for online games, offering tools to launch, monetize, and scale games worldwide with local payment methods and fraud prevention.

Internet Software & Services
251-1K employees
Founded 2005

Description

  • Continuously monitor the GTO Operational Dashboard in Datadog to detect anomalies and determine whether they require incident creation or immediate investigation.
  • Triage and investigate production incidents using Datadog, JIRA Service Management, and related observability data to identify blast radius and likely root cause domains.
  • Route incidents to the appropriate team using the smart routing model and escalate unresolved or code-level issues within defined thresholds.
  • Own lower-severity incidents end to end, from detection through resolution, without escalating when possible.
  • Support the TSO Lead during major incidents by surfacing live technical data, maintaining the incident timeline, linking evidence, and executing mitigation actions.
  • Draft internal and external incident communications, including Slack updates, stakeholder notifications, and status page posts.
  • Analyze incident trends, recurring issues, and production bugs, and contribute findings to reports for product and engineering teams.
  • Compile incident timelines, draft initial post-incident review (PIR) documents, and track PIR action items through completion.
  • Build and maintain operational automation such as alert enrichment scripts, incident templates, Slack workflows, and dashboard widgets.
  • Create and maintain runbooks, conduct structured shift handoffs, participate in knowledge transfer, and cover for the TSO Lead when needed.
  • Publish periodic health reports for critical applications.

Requirements

  • 4+ years of experience in SRE, DevOps, production operations, NOC, or technical operations in a high-availability environment.
  • Experience supporting payments, e-commerce, SaaS, or gaming workloads is preferred.
  • Strong troubleshooting and investigation skills across logs, APM traces, infrastructure metrics, database queries, and network paths.
  • Hands-on experience with Datadog or a similar observability platform such as Grafana, Splunk, New Relic, or Elastic.
  • Proficiency in at least one scripting language: Python, Go, or Bash.
  • Clear written and verbal communication skills in English for incident tickets, updates, handoffs, status communications, and PIR drafts.
  • Working knowledge of Kubernetes and cloud infrastructure, with GCP preferred and AWS/Azure acceptable.
  • Understanding of SLOs, error budgets, and burn-rate alerting.
  • Experience with incident management tools such as JIRA/JIRA Service Management, PagerDuty/OpsGenie, Slack, and Confluence.
  • Experience with or strong interest in AI/ML-assisted operations such as anomaly detection, alert correlation, predictive monitoring, or automated remediation.
  • Comfort with 24x7 follow-the-sun shift work and rotating weekend on-call coverage.
  • Nice to have: experience in gaming, payments, or fintech environments.
  • Nice to have: familiarity with Datadog Service Catalog, synthetic monitoring, and RUM.
  • Nice to have: experience debugging distributed systems and cascading microservice failures.
  • Nice to have: exposure to MySQL, PostgreSQL, Redis, or Kafka for incident investigation.
  • Nice to have: familiarity with CI/CD and deployment tools such as GitLab CI, ArgoCD, or Helm.
  • Nice to have: JIRA Service Management administration experience.
  • Nice to have: ITIL Foundation certification.

Benefits

  • Annual salary of $90,000 to $115,000 for candidates in British Columbia, depending on location and experience.
  • Medical, dental, and vision coverage.
  • Paid time off (PTO).
  • A personalized career roadmap for each employee.
  • Training and educational opportunities for professional development.
  • A supportive environment focused on employees’ physical, mental, and emotional well-being.
  • Remote work arrangement.
