# Evals Engineer, Applied AI

**Company**: Scale AI
**Location**: San Francisco, CA; New York, NY
**Work arrangement**: hybrid
**Experience**: mid
**Job type**: full-time
**Salary**: $216,000-$270,000 USD
**Category**: Engineering
**Industry**: Technology
**Wikidata**: https://www.wikidata.org/wiki/Q112629176

**Apply**: https://job-boards.greenhouse.io/scaleai/jobs/4629589005
**Canonical**: https://yubhub.co/jobs/job_9a42f26c-511

## Description

We are seeking a technically rigorous and driven AI Research Engineer to join our Enterprise Evaluations team. This high-impact role is critical to our mission of delivering the industry's leading GenAI Evaluation Suite.

As a hands-on contributor to the core systems that ensure the safety, reliability, and continuous improvement of LLM-powered workflows and agents for the enterprise, you will partner with Scale's Operations team and enterprise customers to translate ambiguity into structured evaluation data. This involves guiding the creation and maintenance of gold-standard human-rated datasets and expert rubrics that anchor AI evaluation systems.

Your responsibilities will also include analyzing feedback and collected data to identify patterns, refine evaluation frameworks, and establish iterative improvement loops that enhance the quality and relevance of human-curated assessments. You will design, research, and develop LLM-as-a-Judge autorater frameworks and AI-assisted evaluation systems, including models that critique, grade, and explain agent outputs.
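To illustrate the LLM-as-a-Judge pattern referenced above, the sketch below shows one common shape such an autorater can take: a rubric of weighted criteria is rendered into a grading prompt, and the judge model's reply is parsed into per-criterion scores and a weighted aggregate. This is a minimal, hypothetical illustration, not Scale's implementation; the `Criterion` type, prompt template, and `name: score` reply format are all assumptions, and the actual judge-model call is left out.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One rubric criterion with a relative weight (hypothetical format)."""
    name: str
    description: str
    weight: float

def build_judge_prompt(rubric: list[Criterion], agent_output: str) -> str:
    """Assemble a grading prompt from rubric criteria and the agent output."""
    lines = ["Grade the response below on each criterion from 1 to 5.", ""]
    for c in rubric:
        lines.append(f"- {c.name}: {c.description}")
    lines += ["", "Response:", agent_output, "",
              "Reply with one line per criterion: <name>: <score>"]
    return "\n".join(lines)

def parse_verdict(rubric: list[Criterion], judge_reply: str):
    """Parse per-criterion scores from the judge's reply and
    compute a weight-normalized aggregate score."""
    known = {c.name for c in rubric}
    scores: dict[str, float] = {}
    for line in judge_reply.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in known:
            scores[name.strip()] = float(value)
    total_weight = sum(c.weight for c in rubric)
    aggregate = sum(scores[c.name] * c.weight
                    for c in rubric if c.name in scores) / total_weight
    return scores, aggregate
```

In practice the prompt would be sent to a judge model and the reply fed to `parse_verdict`; structured critiques and explanations would extend the reply format beyond bare scores.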

To succeed in this role, you will need a strong foundational knowledge of large language models, a passion for tackling complex evaluation challenges, and the ability to thrive in a dynamic, fast-paced research environment. You should be able to think outside the box, stay current with the latest literature in AI evaluation, and be passionate about integrating novel research ideas into our workflows to build best-in-class evaluation systems.

In addition to your technical expertise, you will need excellent communication and collaboration skills, as you will work closely with cross-functional teams to drive project success.

If you are a motivated and detail-oriented individual with a passion for AI research and evaluation, we encourage you to apply for this exciting opportunity.

## Skills

### Required
- Python
- PyTorch
- TensorFlow
- Large Language Models
- Generative AI
- Machine Learning
- Applied Research
- Evaluation Infrastructure

### Nice to have
- Advanced degree in Computer Science, Machine Learning, or a related quantitative field
- Published research in leading ML or AI conferences
- Experience designing, building, or deploying LLM-as-a-Judge frameworks or other automated evaluation systems
- Experience collaborating with operations or external teams to define high-quality human annotator guidelines
- Expertise in ML research engineering, stochastic systems, observability, or LLM-powered applications for model evaluation and analysis
