Description

About the Role

At Mistral AI, we are seeking an experienced Applied AI Engineer, Site Reliability Engineer to join our team. As a key member of our Applied AI team, you will be responsible for building and operating the framework to ensure Mistral's solution delivery is reliable and sustainable.

Responsibilities

Design for a fleet of Mistral platforms and apps, building in proactive measures that reduce reactive work.
Productize reliability by authoring runbooks, creating SLO templates, and implementing observability.
Operate the Tier-1 customer environments that Mistral is contracted to operate, ensuring SLO compliance, owning on-call and incident response, managing drift, partnering with Technical Support as L3 escalation, and championing high-signal post-mortems.
Productize how Mistral deploys, secures, and scales our Applied AI solutions, engineering on-demand provisioning, authoring security baseline packages, embedding security guardrails, and automating everything.
Own the security operations layer for our customer-side deployments, leading CVE response across the fleet, shipping supply-chain integrity controls (SBOM, signed images, provenance), co-paging with InfoSec on security incidents, and enforcing secure-config baselines.

How We Work in Applied AI

We care about people and outputs.
What matters is what you ship, not the time you spend on it.
Bureaucracy is where urgency goes to vanish. You talk to whoever you need to talk to. The best idea wins, whether it comes from a principal engineer or someone in their first week.
Always ask why. The best solutions come from deep understanding, not from copying what worked before.
We say what we mean. Feedback is direct, timely, and given because we care.
No politics. Low ego, high standards.
We embrace an unstructured environment and find joy in it.

About You

Fluent in English.
5+ years in SRE, Production Engineering, or DevOps, with a record of shipping tooling.
Strong multi-tenant Kubernetes fluency: namespace segmentation, network policy, RBAC, admission control, and operations at scale.
On-call discipline: incident response, blameless post-mortem culture, runbook-first mindset.
Observability stack in production: Prometheus, Grafana, OpenTelemetry, Loki, Tempo, SigNoz.
Infrastructure as code: Terraform, Ansible (or close equivalents).
Proficient in Python and/or Golang for tooling and automation.
Security mindset: you treat secure-SDLC, CVE response, and supply-chain integrity as reliability properties of the shipped artifact, not as someone else's job.
Strong written communication skills: runbooks, post-mortems, and customer-facing incident comms are core deliverables of this role.
Comfortable operating with high autonomy in an ambiguous, fast-paced environment, and disciplined enough to defend the team's scope when work tries to spill in.
Solid Linux internals, network debugging, and distributed-systems fundamentals.

Strong Plus

Cloud or application security background (AppSec, K8s security, supply chain, SBOM, cosign, SLSA).
Experience operating LLM / model-serving stacks in production.
Experience with multi-cloud or on-prem hybrid customer environments (AWS, GCP, Azure, sovereign clouds).
Open-source contributions, particularly in SRE, observability, or security tooling.

To apply, use the employer's original posting: https://jobs.lever.co/mistral/a93b2891-9aaa-4c18-855e-37ef159d4eed