Description
We're looking for an infrastructure engineer to own and evolve the security infrastructure that underpins our foundation models. In this role, you'll work across compute, storage, networking, and data platforms, making sure our systems are secure, reliable, and built to scale.
You'll shape controls, architecture, and tooling so that security is part of how the platform works by default. You'll partner closely with research and product teams, enabling them to move quickly while keeping our models, data, and environments protected.
Key responsibilities include:
Architecting security patterns for platforms and services, including network segmentation, service-to-service authentication, RBAC, and policy enforcement in Kubernetes and cloud environments.
Managing identity and access for humans and services alike: workload and cross-cloud identity, least-privilege IAM, and secrets management.
Building secure platforms for data ingestion, processing, and curation: classification, encryption, access controls, and safe sharing patterns across teams.
Writing threat models and reviewing designs with researchers and engineers to help them ship features and experiments in a safe, scalable way.
Automating security checks and building guardrails: policy-as-code, secure infrastructure baselines, validation in CI/CD, and tools that make the secure path the easiest one.
Requirements include:
Bachelor's degree in engineering or a related field, or equivalent practical experience.
Strong background in containers and orchestration (e.g., Kubernetes) and in securing them: namespaces, network policies, pod security standards, and admission controls.
Practical experience with Infrastructure as Code (Terraform or similar), including secure patterns for provisioning networks, IAM, and shared services.
Solid understanding of cloud networking and security: VPCs, load balancers, service discovery, mTLS, firewalls, and zero-trust-style architectures.
Proficiency in a systems language such as Rust, plus Python scripting, for building platform components and internal tools.
Evidence of owning complex, production-critical systems, including debugging issues that span infra, security, and application layers.
Preferred qualifications include experience with ML infrastructure, GPU clusters, or large-scale training environments, as well as a background in AI labs, HPC environments, or ML-heavy organizations.