Senior Site Reliability Engineer, AI Research

1 month, 1 week ago
Full-time
Senior
DevOps and Infrastructure
Algolia

Algolia provides a hosted search platform that leverages AI to enhance user experience and developer engagement, enabling enterprises and developers to deliver fast, relevant search results across websites and mobile applications.

Internet Software & Services
251-1K
Founded 2012
$334M raised

Description

  • Support and evolve the reliability of platforms used by the AI Research team.
  • Ensure production services meet expectations for availability, latency, and operational readiness.
  • Design infrastructure and operational patterns that balance iteration speed with production safeguards.
  • Work closely with researchers and engineers as an advisor on infrastructure, reliability, and operations.
  • Participate in team planning and execution from early exploration through production rollout.
  • Help researchers self-serve infrastructure safely and effectively.
  • Build and maintain Kubernetes-based services on Google Cloud Platform using infrastructure-as-code and GitOps.
  • Own and improve CI/CD pipelines for Go-based services and some Python-based services.
  • Design and operate observability systems, including tools such as Datadog.
  • Participate in a light on-call rotation and respond to incidents while improving systems over time.

Requirements

  • Strong experience operating cloud-first infrastructure.
  • Hands-on experience running production services on Kubernetes.
  • Proficiency with infrastructure-as-code, especially Terraform, and CI/CD systems.
  • Experience supporting production services written in Go; Python experience is a plus.
  • Solid grounding in service reliability, incident response, and operational best practices.
  • Comfort working in ambiguous environments where problems are not always well defined.
  • Experience supporting mission-critical internal platforms is preferred.
  • Exposure to research or experimentation-heavy environments is preferred.
  • Experience working alongside researchers or highly specialized domain experts is preferred.
  • AI, ML, or deep learning experience is not required.
  • Model training, tuning, or ML framework expertise such as PyTorch or JAX is not required.

Benefits

  • Remote-friendly work culture with flexibility to work remotely or in a hybrid model.
  • Australia-based role with occasional off-hours collaboration as needed.
  • High-impact work that directly enables new AI-powered capabilities for customers.
  • High agency to help shape what gets built and how it is built.
  • Opportunity to collaborate with experienced SREs, engineers, and PhD researchers.
  • Growth in research-adjacent infrastructure and platform reliability expertise.
