# Research Engineer, Infrastructure, Numerics

**Company**: Thinking Machines Lab
**Location**: San Francisco
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Salary**: $350,000 - $475,000 USD
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/thinkingmachines/jobs/5013937008
**Canonical**: https://yubhub.co/jobs/job_07a3c83e-51e

## Description

We're looking for an infrastructure research engineer to design and build the core systems that enable efficient large-scale model training, with a focus on numerics. You will improve the numerical foundations of our distributed training stack, from precision formats and kernel optimizations to communication frameworks that make training trillion-parameter models stable, scalable, and fast.

This role is ideal for someone who thrives at the intersection of research and systems engineering: a builder who understands both the math of optimization and the realities of distributed compute.

Responsibilities:

- Design and optimize distributed training infrastructure for large-scale LLMs, focusing on performance, stability, and reproducibility across multi-GPU and multi-node setups.

- Implement and evaluate low-precision numerics (for example, BF16, MXFP8, NVFP4) to improve efficiency without sacrificing model quality.

- Develop kernels and communication primitives that use hardware-level support for mixed and low-precision arithmetic.

- Collaborate with research teams to co-design model architectures and training recipes that align with emerging numeric formats and stability constraints.

- Prototype and benchmark scaling strategies such as data, tensor, and pipeline parallelism that integrate precision-adaptive computation and quantized communication.

- Contribute to the design of our internal orchestration and monitoring systems to ensure that thousands of distributed experiments can run efficiently and reproducibly.

- Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure.

Skills and Qualifications:

Minimum qualifications:

- Bachelor’s degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or similar.

- Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.

- Thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.

- A bias for action: you take initiative across different stacks and different teams wherever you spot an opportunity to make sure something ships.

- Strong engineering skills: the ability to contribute performant, maintainable code and to debug complex codebases in areas such as floating-point numerics, low-precision arithmetic, and distributed systems.

Preferred qualifications (we encourage you to apply if you meet some but not all of these):

- Familiarity with distributed frameworks such as PyTorch/XLA, DeepSpeed, or Megatron-LM.

- Experience implementing FP8, INT8, or block floating-point (MX) formats and understanding their numerical trade-offs.

- Prior contributions to open-source deep learning infrastructure such as PyTorch, DeepSpeed, or XLA.

- Publications, patents, or projects related to numerical optimization, communication-efficient training, or systems for large models.

- Experience training and supporting large-scale AI models.

- Track record of improving research productivity through infrastructure design or process improvements.

Logistics:

- Location: This role is based in San Francisco, California.

- Compensation: Depending on background, skills, and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.

- Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.

- Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.

## Skills

### Required
- Bachelor’s degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or similar
- Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures
- Thriving in a highly collaborative environment involving many different cross-functional partners and subject matter experts
- Strong engineering skills: the ability to contribute performant, maintainable code and to debug complex codebases in areas such as floating-point numerics, low-precision arithmetic, and distributed systems
- Familiarity with distributed frameworks such as PyTorch/XLA, DeepSpeed, or Megatron-LM

### Nice to have
- Experience implementing FP8, INT8, or block floating-point (MX) formats and understanding their numerical trade-offs
- Prior contributions to open-source deep learning infrastructure such as PyTorch, DeepSpeed, or XLA
- Publications, patents, or projects related to numerical optimization, communication-efficient training, or systems for large models
- Experience training and supporting large-scale AI models
- Track record of improving research productivity through infrastructure design or process improvements
