Incident Engineer

1 hour, 11 minutes ago
Full-time
Mid Level
DevOps and Infrastructure
Netomi

Netomi is an AI-first customer service platform revolutionizing customer support with industry-leading resolution rates and multilingual support.

IT Services
51-250
Founded 2015
$52M raised

Description

  • Own the incident lifecycle from detection and triage through escalation, resolution, and postmortems.
  • Act as incident commander during major incidents, running war rooms and delivering stakeholder updates.
  • Define and enforce SLAs/SLOs, incident severity frameworks, and runbooks.
  • Collaborate with Engineering, ML, and Integrations teams to resolve issues quickly.
  • Monitor system health across integrations, including agent desks, LLMs, and ASR/TTS pipelines.
  • Drive root cause analysis and implement preventive actions.
  • Improve observability, alerting, and incident response tooling.
  • Maintain clear internal and customer-facing communication during incidents.

Requirements

  • 3–6 years of experience in Incident Management, SRE, or Production Support roles.
  • Strong understanding of distributed systems, APIs, and cloud environments, especially AWS.
  • Experience with observability tools such as Datadog.
  • Familiarity with AI/ML systems, especially LLM integrations and voice stacks like ASR/TTS, is a plus.
  • Experience with LLM monitoring and tracing tools such as Langfuse.
  • Excellent communication and stakeholder management skills.
  • Ability to stay calm under pressure and drive structured resolution.
  • Exposure to OpenAI or similar LLM platforms is preferred.
  • Experience supporting customer-facing SaaS products is a plus.
  • An automation mindset, including runbooks, alert tuning, and incident tooling, is preferred.
