Thinking Machines Lab

Research Engineer, Infrastructure, RL Systems

Onsite · Senior · Full-time · $350,000 - $475,000 USD · San Francisco

First indexed 18 Apr 2026

Description

We're looking for an infrastructure research engineer to design and build the core systems that enable scalable, efficient training of large models through reinforcement learning.

This role sits at the intersection of research and large-scale systems engineering. It suits a builder who understands both the algorithms behind RL and the realities of distributed training and inference at scale. You'll wear many hats, from optimising rollout and reward pipelines to enhancing reliability, observability, and orchestration, and you'll collaborate closely with researchers and infra teams to make reinforcement learning stable, fast, and production-ready.

Responsibilities:

  • Design, build, and optimise the infrastructure that powers large-scale reinforcement learning and post-training workloads.
  • Improve the reliability and scalability of RL training pipelines and distributed RL workloads, and increase training throughput.
  • Develop shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility for RL systems.
  • Collaborate with researchers to translate algorithmic ideas into production-grade training pipelines.
  • Build evaluation and benchmarking infrastructure that measures model progress on helpfulness, safety, and factuality.
  • Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure.

We're looking for someone with strong engineering skills and the ability to contribute performant, maintainable code and to debug complex codebases. You should have a good understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.

Experience training or supporting large-scale language models with tens of billions of parameters or more is a plus. Familiarity with monitoring and observability tools (Prometheus, Grafana, OpenTelemetry) is also a plus.

Logistics:

  • Location: This role is based in San Francisco, California.
  • Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.
  • Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.
  • Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://job-boards.greenhouse.io/thinkingmachines/jobs/5013930008