Mistral AI

Mistral AI is a French AI company that builds frontier AI models, assistants, agents, and services for consumers and enterprises. Its mission is to make frontier AI accessible to everyone and to democratize AI through open-source, efficient, and innovative models, products, and solutions.

Artificial Intelligence
201-500
Founded 2023

Description

  • Design, build, and maintain scalable, highly available, fault-tolerant infrastructure for web services and ML workloads.
  • Keep platform, inference, and model training environments highly available across multiple HPC clusters.
  • Operate production systems, troubleshoot incidents, and handle on-call responses, user administration, data extraction, and infrastructure scaling.
  • Implement and improve monitoring, alerting, and incident response systems to reduce downtime.
  • Maintain CI/CD, containerization, orchestration, logging, and alerting workflows and tools for APIs and large training runs.
  • Participate in on-call rotations and perform root cause analysis to prevent recurring incidents.
  • Improve infrastructure automation, deployment, and orchestration using Kubernetes, Flux, and Terraform.
  • Collaborate with AI/ML researchers to enable safe and reproducible model-training experiments.
  • Build a cloud-agnostic platform that abstracts infrastructure details away from research work.
  • Document processes and procedures and contribute to open source, publications, blog articles, and conferences.

Requirements

  • Master’s degree in Computer Science, Engineering, or a related field.
  • 7+ years of experience in a DevOps or Site Reliability Engineering role.
  • Strong experience with cloud computing and highly available distributed systems.
  • Experience with root cause analysis, in-production troubleshooting, and on-call rotations in critical environments.
  • Experience working against reliability targets such as SLAs, supported by observability and alerting.
  • Hands-on experience with CI/CD, containerization, and orchestration tools such as Docker and Kubernetes.
  • Knowledge of monitoring, logging, alerting, and observability tools such as Prometheus, Grafana, ELK Stack, or Datadog.
  • Familiarity with infrastructure-as-code tools such as Terraform or CloudFormation.
  • Proficiency in scripting languages such as Python, Go, or Bash, plus knowledge of software development best practices.
  • Strong understanding of networking, security, and system administration concepts.
  • Excellent problem-solving and communication skills.
  • Self-motivated and able to work effectively in a fast-paced startup environment.
  • Experience in an AI/ML environment is preferred.
  • Experience with high-performance computing systems and workload managers such as Slurm is preferred.
  • Experience with modern AI-oriented infrastructure solutions such as Fluidstack, CoreWeave, or Vast is preferred.

Benefits

  • Competitive salary and equity.
  • Health insurance.
  • Transportation allowance.
  • Sport allowance.
  • Meal vouchers.
  • Private pension plan.
  • Generous parental leave policy.
  • Visa sponsorship.
  • Remote-friendly arrangement: eligible remote hires have travel and accommodation covered for onboarding in Paris and spend at least 3 days per month in the Paris office.
