Description
We're seeking an infrastructure research engineer to design and build scalable, efficient training systems for large models. As a key member of our team, you'll take ownership of the training stack end-to-end, ensuring every GPU cycle drives scientific progress. Your goal is to make experimentation and training at Thinking Machines fast and reliable, allowing our research teams to focus on science, not system bottlenecks.
Key responsibilities include designing, implementing, and optimizing distributed training systems, developing high-performance optimizations, and establishing standards for reliability, maintainability, and security. You'll collaborate with researchers and engineers to build scalable infrastructure and publish learnings through internal documentation, open-source libraries, or technical reports.
We're looking for someone who blends deep systems and performance expertise with a curiosity for machine learning at scale. A strong understanding of deep learning frameworks such as PyTorch, along with experience in distributed training for large models, is preferred. A track record of improving research productivity through infrastructure design or process improvements is a plus.
This role is based in San Francisco, California, and offers a competitive salary in the range of $350,000 - $475,000 USD per year, depending on background, skills, and experience. We sponsor visas and offer generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.