Description
About the Role
At Mistral AI, we are seeking an experienced Applied AI Engineer, Site Reliability Engineer to join our team. As a key member of our Applied AI team, you will build and operate the framework that keeps Mistral's solution delivery reliable and sustainable.
Responsibilities
- Design for a fleet of Mistral platforms and apps, building proactive practices that reduce reactive work.
- Productize reliability: author runbooks, create SLO templates, and implement observability.
- Operate the Tier-1 customer environments that Mistral is contracted to operate: ensure SLO compliance, own on-call and incident response, manage drift, partner with Technical Support as the L3 escalation point, and champion high-signal post-mortems.
- Productize how Mistral deploys, secures, and scales its Applied AI solutions: engineer on-demand provisioning, author security baseline packages, embed security guardrails, and automate everything.
- Own the security operations layer for our customer-side deployments: lead CVE response across the fleet, ship supply-chain integrity controls (SBOM, signed images, provenance), co-page with InfoSec on security incidents, and enforce secure-config baselines.
How We Work in Applied AI
- We care about people and outputs.
- What matters is what you ship, not the time you spend on it.
- Bureaucracy is where urgency goes to vanish. You talk to whoever you need to talk to. The best idea wins, whether it comes from a principal engineer or someone in their first week.
- Always ask why. The best solutions come from deep understanding, not from copying what worked before.
- We say what we mean. Feedback is direct, timely, and given because we care.
- No politics. Low ego, high standards.
- We embrace an unstructured environment and find joy in it.
About You
- Fluent in English.
- 5+ years in SRE, Production Engineering, or DevOps, with a record of shipping tooling.
- Strong multi-tenant Kubernetes fluency: namespace segmentation, network policy, RBAC, admission control, and operations at scale.
- On-call discipline: incident response, blameless post-mortem culture, runbook-first mindset.
- Observability stack in production: Prometheus, Grafana, OpenTelemetry, Loki, Tempo, SigNoz.
- Infrastructure as code: Terraform, Ansible (or close equivalents).
- Proficient in Python and/or Golang for tooling and automation.
- Security mindset: you treat secure-SDLC, CVE response, and supply-chain integrity as reliability properties of the shipped artifact, not as someone else's job.
- Strong written communication skills: runbooks, post-mortems, and customer-facing incident comms are core deliverables of this role.
- Comfortable operating with high autonomy in an ambiguous, fast-paced environment, and disciplined enough to defend the team's scope when work tries to spill in.
- Solid Linux internals, network debugging, and distributed-systems fundamentals.
Strong Plus
- Cloud or application security background (AppSec, K8s security, supply chain, SBOM, cosign, SLSA).
- Experience operating LLM / model-serving stacks in production.
- Experience with multi-cloud or on-prem hybrid customer environments (AWS, GCP, Azure, sovereign clouds).
- Open-source contributions, particularly in SRE, observability, or security tooling.
This listing is enriched and indexed by YubHub. To apply, use the employer's original posting:
https://jobs.lever.co/mistral/a93b2891-9aaa-4c18-855e-37ef159d4eed