Mistral AI

Mistral AI is a French AI company that builds frontier AI models, assistants, agents, and services for consumers and enterprises. Its mission is to make frontier AI accessible to everyone and to democratize AI through open-source, efficient, and innovative models, products, and solutions.

Artificial Intelligence
201-500
Founded 2023

Description

  • Design, build, and maintain scalable, highly available, and fault-tolerant infrastructure.
  • Operate production systems, troubleshoot incidents, interruptions, and user issues, and address infrastructure scaling needs.
  • Implement and improve monitoring, alerting, and incident response systems to reduce downtime.
  • Build and maintain CI/CD, containerization, orchestration, logging, and observability workflows for APIs and training runs.
  • Participate in on-call rotations and perform root cause analysis for incidents.
  • Drive automation, deployment, and orchestration improvements across the infrastructure stack.
  • Collaborate with software engineers on safe, reproducible model-training experiments and platform abstractions.
  • Develop new tooling, automation scripts, APIs, dashboards, and web apps to improve reliability and performance.
  • Work with the security team to ensure infrastructure meets security and compliance requirements.
  • Document processes and contribute to open-source projects, publications, blogs, and conferences.

Requirements

  • Master’s degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in a DevOps or Site Reliability Engineering role.
  • Strong experience with bare metal infrastructure and highly available distributed systems.
  • Experience handling reliability issues in critical environments, including root cause analysis and in-production troubleshooting.
  • Experience working toward reliability targets such as SLAs, supported by observability and alerting.
  • Hands-on experience with CI/CD, containerization, and orchestration tools such as Docker and Kubernetes.
  • Knowledge of monitoring, logging, alerting, and observability tools such as Prometheus, Grafana, ELK Stack, or Datadog.
  • Familiarity with infrastructure-as-code tools such as Terraform or CloudFormation.
  • Proficiency in scripting languages such as Python, Go, or Bash, with knowledge of software development best practices.
  • Strong understanding of networking, security, and system administration concepts.
  • Experience in an AI/ML environment is preferred.
  • Experience with high-performance computing systems and workload managers such as Slurm is preferred.
  • Experience with modern AI-oriented infrastructure solutions such as Fluidstack, CoreWeave, or Vast is preferred.

Benefits

  • Competitive salary and equity.
  • Health insurance.
  • Transportation allowance.
  • Sport allowance.
  • Meal vouchers.
  • Private pension plan.
  • Generous parental leave policy.
  • Visa sponsorship.

