xAI

Member of Technical Staff - Infrastructure Reliability

xAI
onsite staff full-time $180,000 - $400,000 USD Palo Alto, CA
Apply →

First indexed 18 Apr 2026

Description

We are seeking a Member of Technical Staff - Infrastructure Reliability to join our team. As a key member of our infrastructure team, you will own the availability, performance, and evolution of our core compute, storage, and networking infrastructure. This is a joint xAI/X role: you will own 24×7 reliability for the world's largest GPU training superclusters and one of the highest-QPS production systems on the planet.

You will define and execute the technical strategy for infrastructure reliability and scalability, build and maintain the automation, observability, and control planes that keep multi-datacenter, hybrid cloud/on-prem environments healthy, lead incident response, deep-dive root cause analysis, and post-mortems that drive real fixes, identify, instrument, and eliminate systemic failure patterns, design and implement high-leverage systems software in Python and Rust, and push the state of the art in large-scale GPU cluster operations and AI workload reliability.

To succeed in this role, you will need 5+ years shipping production software and/or operating distributed infrastructure at scale, expert-level knowledge of Linux systems, TCP/IP networking, and systems programming, strong coding skills with proven production experience in Rust (strongly preferred) and at least one of Python, Go, or C++, deep experience with large-scale distributed systems in on-prem and cloud environments, hands-on expertise with container orchestration, container runtimes, and infrastructure-as-code, intimate understanding of common failure modes in distributed systems and how to mitigate them, and a track record of participating in (or building) effective on-call rotations in high-stakes environments.

In addition to a competitive base salary, you will receive equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://job-boards.greenhouse.io/xai/jobs/4801451007