Xsolla

Xsolla is an international payment solution provider for online games, offering tools to launch, monetize, and scale games worldwide with local payment methods and fraud prevention.

Internet Software & Services
251-1K employees
Founded 2005

Description

  • Continuously monitor the GTO Operational Dashboard in Datadog to detect anomalies and determine whether they require incident creation or immediate investigation.
  • Triage and investigate production incidents using Datadog, JIRA Service Management, and related observability data to identify blast radius and likely root cause domains.
  • Route incidents to the appropriate team using the smart routing model and escalate unresolved or code-level issues within defined thresholds.
  • Own lower-severity incidents end to end, from detection through resolution, without escalating where possible.
  • Support the TSO Lead during major incidents by surfacing live technical data, maintaining the incident timeline, linking evidence, and executing mitigation actions.
  • Draft internal and external incident communications, including Slack updates, stakeholder notifications, and status page posts.
  • Analyze incident trends, recurring issues, and production bugs, and contribute findings to reports for product and engineering teams.
  • Compile incident timelines, draft initial PIR documents, and track PIR action items through completion.
  • Build and maintain operational automation such as alert enrichment scripts, incident templates, Slack workflows, and dashboard widgets.
  • Create and maintain runbooks, conduct structured shift handoffs, participate in knowledge transfer, and cover for the TSO Lead when needed.
  • Publish periodic health reports for critical applications.

Requirements

  • 4+ years of experience in SRE, DevOps, production operations, NOC, or technical operations in a high-availability environment.
  • Experience supporting payments, e-commerce, SaaS, or gaming workloads is preferred.
  • Strong troubleshooting and investigation skills across logs, APM traces, infrastructure metrics, database queries, and network paths.
  • Hands-on experience with Datadog or a similar observability platform such as Grafana, Splunk, New Relic, or Elastic.
  • Proficiency in at least one scripting language: Python, Go, or Bash.
  • Clear written and verbal communication skills in English for incident tickets, updates, handoffs, status communications, and PIR drafts.
  • Working knowledge of Kubernetes and cloud infrastructure, with GCP preferred and AWS/Azure acceptable.
  • Understanding of SLOs, error budgets, and burn-rate alerting.
  • Experience with incident management tools such as JIRA/JIRA Service Management, PagerDuty/OpsGenie, Slack, and Confluence.
  • Experience with or strong interest in AI/ML-assisted operations such as anomaly detection, alert correlation, predictive monitoring, or automated remediation.
  • Comfort with 24x7 follow-the-sun shift work and rotating weekend on-call coverage.
  • Nice to have: experience in gaming, payments, or fintech environments.
  • Nice to have: familiarity with Datadog Service Catalog, synthetic monitoring, and RUM.
  • Nice to have: experience debugging distributed systems and cascading microservice failures.
  • Nice to have: exposure to MySQL, PostgreSQL, Redis, or Kafka for incident investigation.
  • Nice to have: familiarity with CI/CD and deployment tools such as GitLab CI, ArgoCD, or Helm.
  • Nice to have: JIRA Service Management administration experience.
  • Nice to have: ITIL Foundation certification.

Benefits

  • Annual salary of $90,000 to $115,000 for British Columbia, depending on location and experience.
  • Medical, dental, and vision coverage.
  • PTO.
  • A personalized career roadmap for each employee.
  • Training and educational opportunities for professional development.
  • A supportive environment focused on employees’ physical, mental, and emotional well-being.
  • Remote work arrangement.
