Mistral AI

Mistral AI is a French AI company that builds frontier AI models, assistants, agents, and services for consumers and enterprises. Its mission is to make frontier AI accessible to everyone and to democratize AI through open-source, efficient, and innovative models, products, and solutions.

Industry: Artificial Intelligence
Company size: 201-500 employees
Founded: 2023

Description

  • Design, build, and maintain scalable, highly available, fault-tolerant infrastructure for web services and ML workloads.
  • Keep platform, inference, and model training environments highly available across multiple HPC clusters.
  • Operate production systems, troubleshoot incidents, and handle on-call responses, user administration, data extraction, and infrastructure scaling.
  • Implement and improve monitoring, alerting, and incident response systems to reduce downtime.
  • Maintain CI/CD, containerization, orchestration, logging, and alerting workflows and tools for APIs and large training runs.
  • Participate in on-call rotations and perform root cause analysis to prevent recurring incidents.
  • Improve infrastructure automation, deployment, and orchestration using Kubernetes, Flux, and Terraform.
  • Collaborate with AI/ML researchers to enable safe and reproducible model-training experiments.
  • Build a cloud-agnostic platform that shields research work from infrastructure details.
  • Document processes and procedures and contribute to open source, publications, blog articles, and conferences.

Requirements

  • Master’s degree in Computer Science, Engineering, or a related field.
  • 7+ years of experience in a DevOps or Site Reliability Engineering role.
  • Strong experience with cloud computing and highly available distributed systems.
  • Experience with root cause analysis, in-production troubleshooting, and on-call rotations in critical environments.
  • Experience working to reliability targets such as SLAs, supported by observability and alerting.
  • Hands-on experience with CI/CD, containerization, and orchestration tools such as Docker and Kubernetes.
  • Knowledge of monitoring, logging, alerting, and observability tools such as Prometheus, Grafana, ELK Stack, or Datadog.
  • Familiarity with infrastructure-as-code tools such as Terraform or CloudFormation.
  • Proficiency in scripting languages such as Python, Go, or Bash, plus knowledge of software development best practices.
  • Strong understanding of networking, security, and system administration concepts.
  • Excellent problem-solving and communication skills.
  • Self-motivated and able to work effectively in a fast-paced startup environment.
  • Experience in an AI/ML environment is preferred.
  • Experience with high-performance computing systems and workload managers such as Slurm is preferred.
  • Experience with modern AI-oriented infrastructure solutions such as Fluidstack, CoreWeave, or Vast is preferred.

Benefits

  • Competitive salary and equity.
  • Health insurance.
  • Transportation allowance.
  • Sport allowance.
  • Meal vouchers.
  • Private pension plan.
  • Generous parental leave policy.
  • Visa sponsorship.
  • Remote-friendly arrangement: travel and accommodation are covered for onboarding in Paris, and eligible remote hires spend at least 3 days per month in the Paris office.
