Description
We're looking for an infrastructure research engineer to design and build the core systems that enable efficient large-scale model training, with a focus on numerics. You will improve the numerical foundations of our distributed training stack, from precision formats and kernel optimizations to communication frameworks that make training trillion-parameter models stable, scalable, and fast.
This role is ideal for someone who thrives at the intersection of research and systems engineering: a builder who understands both the math of optimization and the realities of distributed compute.
Responsibilities:
- Design and optimize distributed training infrastructure for large-scale LLMs, focusing on performance, stability, and reproducibility across multi-GPU and multi-node setups.
- Implement and evaluate low-precision numerics (for example, BF16, MXFP8, NVFP4) to improve efficiency without sacrificing model quality.
- Develop kernels and communication primitives that use hardware-level support for mixed and low-precision arithmetic.
- Collaborate with research teams to co-design model architectures and training recipes that align with emerging numeric formats and stability constraints.
- Prototype and benchmark scaling strategies such as data, tensor, and pipeline parallelism that integrate precision-adaptive computation and quantized communication.
- Contribute to the design of our internal orchestration and monitoring systems to ensure that thousands of distributed experiments can run efficiently and reproducibly.
- Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure.
Skills and Qualifications:
Minimum qualifications:
- Bachelor’s degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or a similar field.
- Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.
- Ability to thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.
- A bias for action: you take the initiative to work across different stacks and teams wherever you spot an opportunity to make sure something ships.
- Strong engineering skills, including the ability to contribute performant, maintainable code and to debug complex codebases in areas such as floating-point numerics, low-precision arithmetic, and distributed systems.
Preferred qualifications (we encourage you to apply if you meet some but not all of these):
- Familiarity with distributed frameworks such as PyTorch/XLA, DeepSpeed, or Megatron-LM.
- Experience implementing FP8, INT8, or block floating-point (MX) formats and understanding their numerical trade-offs.
- Prior contributions to open-source deep learning infrastructure such as PyTorch, DeepSpeed, or XLA.
- Publications, patents, or projects related to numerical optimization, communication-efficient training, or systems for large models.
- Experience training and supporting large-scale AI models.
- Track record of improving research productivity through infrastructure design or process improvements.
Logistics:
- Location: This role is based in San Francisco, California.
- Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.
- Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.
- Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.