Mistral AI

Mistral AI is a French AI company that builds frontier AI models, assistants, agents, and services for consumers and enterprises. Its mission is to make frontier AI accessible to everyone and to democratize AI through open-source, efficient, and innovative models, products, and solutions.

Industry: Artificial Intelligence
Company size: 201-500 employees
Founded: 2023

Description

  • Design, build, and maintain scalable, highly available, fault-tolerant infrastructure for web services and ML workloads.
  • Keep platform, inference, and model training environments highly available across multiple HPC clusters.
  • Operate production systems, troubleshoot incidents, and handle on-call responses, user administration, data extraction, and infrastructure scaling.
  • Implement and improve monitoring, alerting, and incident response systems to reduce downtime.
  • Maintain CI/CD, containerization, orchestration, logging, and alerting workflows and tools for APIs and large training runs.
  • Participate in on-call rotations and perform root cause analysis to prevent recurring incidents.
  • Improve infrastructure automation, deployment, and orchestration using Kubernetes, Flux, and Terraform.
  • Collaborate with AI/ML researchers to enable safe and reproducible model-training experiments.
  • Build a cloud-agnostic platform that decouples research work from the underlying infrastructure.
  • Document processes and procedures and contribute to open source, publications, blog articles, and conferences.

Requirements

  • Master’s degree in Computer Science, Engineering, or a related field.
  • 7+ years of experience in a DevOps or Site Reliability Engineering role.
  • Strong experience with cloud computing and highly available distributed systems.
  • Experience with root cause analysis, in-production troubleshooting, and on-call rotations in critical environments.
  • Experience working toward reliability targets such as SLAs, supported by observability and alerting.
  • Hands-on experience with CI/CD, containerization, and orchestration tools such as Docker and Kubernetes.
  • Knowledge of monitoring, logging, alerting, and observability tools such as Prometheus, Grafana, ELK Stack, or Datadog.
  • Familiarity with infrastructure-as-code tools such as Terraform or CloudFormation.
  • Proficiency in scripting languages such as Python, Go, or Bash, plus knowledge of software development best practices.
  • Strong understanding of networking, security, and system administration concepts.
  • Excellent problem-solving and communication skills.
  • Self-motivated and able to work effectively in a fast-paced startup environment.
  • Experience in an AI/ML environment is preferred.
  • Experience with high-performance computing systems and workload managers such as Slurm is preferred.
  • Experience with modern AI-oriented infrastructure solutions such as Fluidstack, Coreweave, or Vast is preferred.

Benefits

  • Competitive salary and equity.
  • Health insurance.
  • Transportation allowance.
  • Sport allowance.
  • Meal vouchers.
  • Private pension plan.
  • Generous parental leave policy.
  • Visa sponsorship.
  • Remote-friendly arrangement: travel and accommodation covered for onboarding in Paris; eligible remote hires spend at least 3 days per month in the Paris office.
