# Member of Technical Staff - Compute Infrastructure

**Company**: xAI
**Location**: Palo Alto, CA
**Work arrangement**: onsite
**Experience**: staff
**Job type**: full-time
**Salary**: $180,000 - $440,000 USD
**Category**: Engineering
**Industry**: Technology
**Wikidata**: https://www.wikidata.org/wiki/Q120599684

**Apply**: https://job-boards.greenhouse.io/xai/jobs/5052040007
**Canonical**: https://yubhub.co/jobs/job_24176cb8-311

## Description

We're seeking a highly skilled Member of Technical Staff to join our Compute Infrastructure team. As a key member of this team, you will design, build, and operate massive-scale clusters and orchestration platforms that power frontier AI training, inference, and agent workloads at unprecedented scale.

In this role, you will push the boundaries of container orchestration far beyond existing systems like Kubernetes, manage exascale compute resources, optimize for high-performance training runs and production serving, and collaborate closely with research and systems teams to deliver reliable, ultra-scalable infrastructure that enables xAI's next-generation models and applications.

Responsibilities include:

- Building and managing massive-scale clusters
- Designing, developing, and extending an in-house container orchestration platform
- Collaborating with research teams to architect and optimize compute clusters
- Profiling, debugging, and resolving complex system-level performance bottlenecks
- Owning end-to-end infrastructure initiatives

To succeed in this role, you will need deep expertise in virtualization technologies and advanced containerization/sandboxing, strong proficiency in systems programming languages such as C/C++ and Rust, and a proven track record of profiling, debugging, and optimizing complex system-level performance issues.

Preferred qualifications include experience in Linux kernel development, hypervisor extensions, or low-level systems programming for compute-intensive workloads; experience operating or designing large-scale AI training/inference clusters; and familiarity with performance tools, tracing, and debugging in production distributed environments.

## Skills

### Required
- Deep expertise in virtualization technologies (KVM, Xen, QEMU) and advanced containerization/sandboxing (Kata, Firecracker, gVisor, Sysbox, or equivalent)
- Strong proficiency in systems programming languages such as C/C++ and Rust
- Proven track record of profiling, debugging, and optimizing complex system-level performance issues, with deep knowledge of Linux kernel internals, resource management, scheduling, memory management, and low-level engineering
- Hands-on experience building or significantly enhancing distributed compute platforms, orchestration systems, or high-performance infrastructure at scale

### Nice to have
- Experience in Linux kernel development, hypervisor extensions, or low-level system programming for compute-intensive workloads
- Proven track record of operating or designing large-scale AI training/inference clusters (GPU/TPU scale)
- Experience with custom runtimes, isolation techniques, or bespoke platforms for specialized AI compute
- Familiarity with performance tools, tracing, and debugging in production distributed environments
