Xsolla

Xsolla is an international payment solution provider for online games, offering tools to launch, monetize, and scale games worldwide with local payment methods and fraud prevention.

Internet Software & Services
251-1K employees
Founded 2005

Description

  • Serve as the primary dashboard monitor during shifts and continuously watch production health signals in Datadog.
  • Detect anomalies by correlating APM, logs, metrics, synthetic tests, and Real User Monitoring data.
  • Triage and investigate production incidents, create incident tickets in JIRA Service Management, and route issues to the correct team.
  • Own lower-severity incidents end-to-end from detection through resolution, including diagnosis and runbook execution.
  • Support the Technical Shift Operations (TSO) Lead during major incidents as a technical partner in the war room.
  • Draft internal and customer-facing incident communications, including Slack updates and status page posts.
  • Analyze incident trends, recurring issues, and production bugs and contribute findings to reports and post-incident reviews.
  • Compile incident timelines, draft initial post-incident review (PIR) documents, and track action items after reviews.
  • Build and maintain operational automation, incident templates, Slack workflows, dashboard widgets, and runbooks.
  • Conduct structured shift handoffs and participate in knowledge transfer sessions to improve independent resolution capability.
  • Cover for the TSO Lead when needed, including severity classification, escalation decisions, and basic incident commander functions.
  • Publish periodic health reports for critical applications.

Requirements

  • 4+ years of experience in SRE, DevOps, production operations, NOC, or technical operations in a high-availability environment.
  • Experience supporting payments, e-commerce, SaaS, or gaming workloads is preferred.
  • Strong troubleshooting and investigation skills across logs, traces, metrics, databases, and network paths.
  • Hands-on experience with Datadog or a similar observability platform such as Grafana, Splunk, New Relic, or Elastic.
  • Proficiency in at least one scripting language: Python, Go, or Bash.
  • Clear written and verbal communication skills in English.
  • Working knowledge of Kubernetes and cloud infrastructure; GCP is preferred, and AWS or Azure is also acceptable.
  • Understanding of SLOs, error budgets, and burn-rate alerting.
  • Experience with JIRA or JIRA Service Management, PagerDuty or OpsGenie, Slack, and Confluence.
  • Interest in or experience with AI/ML-assisted operations such as anomaly detection, alert correlation, predictive monitoring, or automated remediation.
  • Comfort with 24x7 shift-based operations in a follow-the-sun model, including a weekend on-call rotation.
  • Experience in gaming, payments, or fintech environments is a plus.
  • Familiarity with Datadog Service Catalog, synthetic monitoring, and RUM is a plus.
  • Exposure to database and platform tools such as MySQL, PostgreSQL, Redis, Kafka, GitLab CI, ArgoCD, and Helm is a plus.
  • JIRA Service Management administration experience or ITIL Foundation certification is preferred.
