Omilia

Omilia is a global leader in Conversational AI, offering AI-based self-service solutions for enhanced customer care fulfillment and success.

IT Services
251-1K employees
Founded 2002
$20M raised

Description

  • Ensure platform reliability and availability across production and pre-production environments through proactive monitoring, alerting, and automation.
  • Serve as first response for incidents and contribute to problem management and root cause analysis.
  • Support development teams in building a reliability-focused culture within the development lifecycle.
  • Develop troubleshooting documentation and production support materials.
  • Collaborate with engineering teams to create optimized runbooks, operational documentation, and automation for operational tasks.
  • Work with development and cloud engineering teams to embed reliability and performance into the software delivery lifecycle.
  • Design, implement, and evolve observability solutions using metrics, logs, traces, and dashboards.
  • Use tools such as Prometheus, Grafana, and ELK to improve monitoring and visibility.
  • Participate in on-call rotations and continuously improve alert quality and response processes.
  • Champion continuous improvement in reliability and performance across teams.

Requirements

  • Bachelor's degree or MS in Engineering, or equivalent experience.
  • Experience operating at least one container orchestration cluster, such as Kubernetes or Docker Swarm.
  • Experience developing or maintaining software for production services at scale.
  • Experience with the ELK stack (Elasticsearch, Logstash, Kibana).
  • Experience with AWS.
  • Experience with the Grafana/Prometheus stack.
  • Strong scripting skills in Bash, Python, or Go.
  • Excellent communication skills.
  • Ability to think creatively, anticipate challenges, and question existing technologies and procedures.
  • Comfort working in agile/lean methods and iterating collaboratively.
  • Strong team-player mindset and ability to work across product, experience design, engineering, and other functions.
  • Telephony knowledge, including SIP and VoIP, is a plus.
  • Experience in Linux administration, including Red Hat, CentOS, or Amazon Linux (AL), is a plus.
  • Working knowledge of infrastructure-as-code and configuration management tools such as Terraform and Ansible is a plus.
  • Experience with TCP/IP and general networking concepts is a plus.
  • RDBMS knowledge, such as MySQL or Postgres, is a plus.
  • NoSQL knowledge, such as Redis, is a plus.

Benefits

  • Fixed compensation.
  • Long-term employment with vacation days.
  • Professional development support, including courses and training.
  • Opportunity to work on cutting-edge products with global impact in the service industry.
  • A collaborative, fun-to-work-with team.
  • Apple gear provided.
  • Equal opportunity employer with a diverse and inclusive workplace.
