Omilia

Omilia is a global leader in Conversational AI, offering AI-based self-service solutions for enhanced customer care fulfillment and success.

IT Services
251-1K employees
Founded 2002
$20M raised

Description

  • Ensure reliability and availability across production and pre-production environments through proactive monitoring, alerting, and automation.
  • Act as first response for incidents and contribute to problem management and root cause analysis.
  • Support development teams in improving service reliability and building a reliability-focused culture in the software lifecycle.
  • Develop troubleshooting documentation and operational runbooks for production support.
  • Collaborate with engineering and cloud teams to automate operational tasks and improve delivery processes.
  • Design, implement, and evolve observability solutions using metrics, logs, traces, and dashboards.
  • Use tools such as Prometheus, Grafana, and ELK to monitor platform health and performance.
  • Participate in on-call rotations and improve alert quality and incident response processes.
  • Champion continuous improvement in reliability, performance, and operational practices across teams.

Requirements

  • Bachelor’s degree or MS in Engineering, or equivalent experience.
  • Experience operating at least one container orchestration cluster such as Kubernetes or Docker Swarm.
  • Experience developing or maintaining software for production services at scale.
  • Experience with ELK.
  • Experience with AWS.
  • Experience with Grafana and Prometheus.
  • Strong scripting skills in Bash, Python, or Go.
  • Excellent communication skills and ability to work collaboratively across teams.
  • Agile/lean mindset with a willingness to iterate, learn, and challenge existing approaches.
  • Nice to have: telephony knowledge including SIP and VoIP.
  • Nice to have: Linux administration experience with Red Hat, CentOS, or AL.
  • Nice to have: infrastructure-as-code or configuration management experience with Terraform or Ansible.
  • Nice to have: knowledge of TCP/IP and general networking concepts.
  • Nice to have: RDBMS experience with MySQL or Postgres.
  • Nice to have: NoSQL experience with Redis.

Benefits

  • Fixed compensation.
  • Long-term employment with vacation days.
  • Professional development support including courses and training.
  • Opportunity to work on cutting-edge technology products with global impact.
  • Collaborative and fun team environment.
  • Apple gear provided.
  • Equal opportunity employer with a diverse and inclusive workplace.
