Description
We're seeking a highly skilled Member of Technical Staff to join our Compute Infrastructure team. As a key member of this team, you will design, build, and operate massive-scale clusters and orchestration platforms that power frontier AI training, inference, and agent workloads at unprecedented scale.
In this role, you will push the boundaries of container orchestration far beyond existing systems like Kubernetes, manage exascale compute resources, optimize for high-performance training runs and production serving, and collaborate closely with research and systems teams to deliver reliable, ultra-scalable infrastructure that enables xAI's next-generation models and applications.
Responsibilities include building and managing massive-scale clusters, designing, developing, and extending an in-house container orchestration platform, collaborating with research teams to architect and optimize compute clusters, profiling, debugging, and resolving complex system-level performance bottlenecks, and owning end-to-end infrastructure initiatives.
To succeed in this role, you will need deep expertise in virtualization technologies and advanced containerization/sandboxing, strong proficiency in systems programming languages such as C/C++ and Rust, and proven track record profiling, debugging, and optimizing complex system-level performance issues.
Preferred skills and experience include experience in Linux kernel development, hypervisor extensions, or low-level system programming for compute-intensive workloads, operating or designing large-scale AI training/inference clusters, and familiarity with performance tools, tracing, and debugging in production distributed environments.