# Member of Technical Staff - Multimodal Understanding

**Company**: xAI
**Location**: Palo Alto, CA
**Work arrangement**: onsite
**Experience**: staff
**Job type**: full-time
**Salary**: $180,000 - $440,000 USD
**Category**: Engineering
**Industry**: Technology
**Wikidata**: https://www.wikidata.org/wiki/Q120599684

**Apply**: https://job-boards.greenhouse.io/xai/jobs/5111374007
**Canonical**: https://yubhub.co/jobs/job_540ce49c-271

## Description

## About the Role

You will join the multimodal team to push toward superhuman multimodal intelligence. Advance understanding and generation across modalities (image, video, audio, and text), spanning the full stack: data curation/acquisition, tokenizer training, large-scale pre-training, post-training/alignment, infrastructure/scaling, evaluation, tooling/demos, and end-to-end product experiences.

Collaborate cross-functionally with pre-training, post-training, reasoning, data, applied, and product teams to deliver frontier capabilities in multimodal reasoning, world modeling, tool use, agentic behaviors, and interactive human-AI collaboration. Contribute to building models that can see, hear, reason about, and interact with the world in real time at unprecedented levels.

## Responsibilities

- Design, build, and optimize large-scale distributed systems for multimodal pre-training, post-training, inference, data processing, and tokenization at web/petabyte scale.

- Develop high-throughput pipelines for data acquisition, preprocessing, filtering, generation, decoding, loading, crawling, visualization, and management (images, video, audio, and text).

- Advance multimodal capabilities including spatial-temporal compression, cross-modal alignment, world modeling, reasoning, emergent abilities, audio/image/video understanding & generation, real-time video processing, and noisy data handling.

- Drive data quality and studies: curation (human/synthetic), filtering techniques, analysis, and scalable pipelines to support trillion-parameter models.

- Create evaluation frameworks, internal benchmarks, reward models, and metrics that capture real-world usage, failure modes, interactive dynamics, and human-AI synergy.

- Innovate on algorithms, modeling approaches, hardware/software/algorithm co-design, and scaling paradigms for state-of-the-art performance.

- Build research tooling, user-friendly interfaces, prototypes/demos, and full-stack applications, and enable rapid iteration based on feedback.

- Work across the stack (pre-training → SFT/RL/post-training) to enable reasoning, tool calling, agentic behaviors, orchestration, and seamless real-time interactions.

## Basic Qualifications

- Hands-on experience with multimodal pre-training, post-training, or fine-tuning (vision, audio, video, or cross-modal).

- Expert-level proficiency in Python (core language), with strong experience in at least one of JAX, PyTorch, or XLA.

- Proven track record building or optimizing large-scale distributed ML systems (training/inference optimization, GPU utilization, multi-GPU/TPU setups, hardware co-design).

- Deep experience designing and running data pipelines at scale (curation, filtering, generation, and quality studies), especially for noisy, real-world multimodal data.

- Strong fundamentals in evaluation design, benchmarks, reward modeling, or RL techniques (particularly for interactive/agentic behaviors).

- Proactive self-starter who thrives in high-intensity environments and is passionate about pushing multimodal AI frontiers.

- Willingness to own end-to-end initiatives and do whatever it takes to deliver breakthrough user experiences.

## Preferred Skills and Experience

- Experience leading major improvements in model capabilities through better data, modeling, algorithms, or scaling.

- Familiarity with the state of the art in multimodal LLMs, scaling laws, tokenizers, compression techniques, reasoning, or agentic systems.

- Proficiency in Rust and/or C++ for performance-critical components.

- Hands-on work with large-scale orchestration tools such as Spark, Ray, or Kubernetes.

- Background building full-stack tooling: performant interfaces, real-time research demos/apps, or end-to-end product ownership.

- Passion for end-to-end user experience in interactive, real-time multimodal AI systems.

## Skills

### Required
- Multimodal pre-training
- Post-training
- Fine-tuning
- Python
- JAX
- PyTorch
- XLA
- Large-scale distributed ML systems
- Data pipelines
- Evaluation design
- Benchmarks
- Reward modeling
- RL techniques

### Nice to have
- State-of-the-art in multimodal LLMs
- Scaling laws
- Tokenizers
- Compression techniques
- Reasoning
- Agentic systems
- Rust
- C++
- Spark
- Ray
- Kubernetes
- Full-stack tooling