Mistral AI

Mistral AI is a French AI company that builds frontier AI models, assistants, agents, and services for consumers and enterprises. Its mission is to make frontier AI accessible to everyone and to democratize AI through open-source, efficient, and innovative models, products, and solutions.

Artificial Intelligence
201-500 employees
Founded 2023

Description

  • Design, build, and maintain scalable, highly available, and fault-tolerant infrastructure.
  • Operate production systems: troubleshoot incidents, interruptions, and user issues, and address infrastructure scaling needs.
  • Implement and improve monitoring, alerting, and incident response systems to reduce downtime.
  • Build and maintain CI/CD, containerization, orchestration, logging, and observability workflows for APIs and training runs.
  • Participate in on-call rotations and perform root cause analysis for incidents.
  • Drive automation, deployment, and orchestration improvements across the infrastructure stack.
  • Collaborate with software engineers on safe, reproducible model-training experiments and platform abstractions.
  • Develop new tooling, automation scripts, APIs, dashboards, and web apps to improve reliability and performance.
  • Work with the security team to ensure infrastructure meets security and compliance requirements.
  • Document processes and contribute to open-source projects, publications, blogs, and conferences.

Requirements

  • Master’s degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in a DevOps or Site Reliability Engineering role.
  • Strong experience with bare metal infrastructure and highly available distributed systems.
  • Experience handling reliability issues in critical environments, including root cause analysis and in-production troubleshooting.
  • Experience working toward reliability targets such as SLAs, supported by observability and alerting.
  • Hands-on experience with CI/CD, containerization, and orchestration tools such as Docker and Kubernetes.
  • Knowledge of monitoring, logging, alerting, and observability tools such as Prometheus, Grafana, ELK Stack, or Datadog.
  • Familiarity with infrastructure-as-code tools such as Terraform or CloudFormation.
  • Proficiency in programming and scripting languages such as Python, Go, or Bash, with knowledge of software development best practices.
  • Strong understanding of networking, security, and system administration concepts.
  • Experience in an AI/ML environment is preferred.
  • Experience with high-performance computing systems and workload managers such as Slurm is preferred.
  • Experience with modern AI-oriented infrastructure providers such as Fluidstack, CoreWeave, or Vast is preferred.

Benefits

  • Competitive salary and equity.
  • Health insurance.
  • Transportation allowance.
  • Sport allowance.
  • Meal vouchers.
  • Private pension plan.
  • Generous parental leave policy.
  • Visa sponsorship.
