# Tech Lead Manager - MLRE, ML Systems

**Company**: Scale
**Location**: San Francisco, CA; New York, NY
**Work arrangement**: hybrid
**Experience**: senior
**Job type**: full-time
**Salary**: $264,800-$331,000 USD
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/scaleai/jobs/4618046005
**Canonical**: https://yubhub.co/jobs/job_539e2a23-ddf

## Description

You will lead the development of our internal distributed framework for large language model training. The platform enables MLEs, researchers, data scientists, and operators to train and evaluate LLMs quickly and automatically. It also serves as the underlying training framework for the data quality evaluation pipeline.

You will work closely with Scale’s ML teams and researchers to build the foundational platform that supports all of our ML research and development work. You will build and optimise the platform to enable our next-generation LLM training, inference, and data curation.

Key responsibilities include:

- Building, profiling and optimising our training and inference framework.

- Collaborating with ML and research teams to accelerate their research and development, and enable them to develop the next generation of models and data curation.

- Researching and integrating state-of-the-art technologies to optimise our ML system.

The ideal candidate will have:

- A passion for system optimisation.

- Experience with multi-node LLM training and inference.

- Experience with developing large-scale distributed ML systems.

- Experience with post-training methods such as RLHF and RLVR, and related algorithms such as PPO and GRPO.

- Strong software engineering skills and proficiency in frameworks and tools such as CUDA, PyTorch, transformers, and flash attention.

Nice-to-haves include demonstrated expertise in post-training methods and/or next-generation use cases for large language models, including instruction tuning, RLHF, tool use, reasoning, agents, and multimodal.

Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position, determined by work location and additional factors, including job-related skills, experience, interview performance, and relevant education or training.

## Skills

### Required
- system optimisation
- multi-node LLM training and inference
- large-scale distributed ML systems
- post-training methods
- software engineering skills
- CUDA
- PyTorch
- transformers
- flash attention

### Nice to have
- next-generation use cases for large language models
- instruction tuning
- RLHF
- tool use
- reasoning
- agents
- multimodal