Description

Join us in building the future of finance.

Our mission is to democratize finance for all. An estimated $124 trillion of assets will be inherited by younger generations in the next two decades, the largest transfer of wealth in human history.

We are building an elite team, applying frontier technologies to the world's biggest financial problems. We're looking for bold thinkers. Sharp problem-solvers. Builders who are wired to make an impact.

As a Staff Machine Learning Engineer (IC6), you will define and uphold the quality bar for agentic systems across the organization. You will design evaluation frameworks, guide model selection, and partner with product, data science, and engineering teams to ensure systems meet clear standards for correctness, safety, latency, and user satisfaction.

This role is based in our Bellevue, WA or Menlo Park, CA office, with in-person attendance expected at least 3 days per week.

At Robinhood, we believe in the power of in-person work to accelerate progress, spark innovation, and strengthen community. Our office experience is intentional, energizing, and designed to fully support high-performing teams.

Responsibilities:

Define and implement evaluation frameworks that measure agent performance, including task success, correctness, tool usage reliability, latency, safety, and user satisfaction

Evaluate frontier and fine-tuned models across quality, latency, cost, and edge cases to determine appropriate use cases

Partner with product managers, data scientists, and engineers to translate evaluation results into clear launch criteria for agentic systems

Analyze production issues, identify root causes, and prioritize improvements to increase system reliability and performance

Build visibility into agent performance through metrics, monitoring, and reporting that inform roadmap decisions

What you bring:

You have deep experience defining and measuring quality for agentic or machine learning systems using evaluation frameworks, datasets, and scorecards

You have experience evaluating large language models or similar systems, including understanding tradeoffs in performance, cost, and latency

You have demonstrated ability to analyze production issues and lead initiatives that improve system quality across multiple teams

You are comfortable working with engineers, data scientists, and product partners to deliver measurable improvements in system performance

You have experience building or operating systems in regulated environments or working with AI evaluation and observability tools (nice to have)

What we offer:

Challenging, high-impact work to grow your career

Performance-driven compensation with multipliers for outsized impact, bonus programs, equity ownership, and 401(k) matching

Best-in-class benefits to fuel your work, including 100% paid health insurance for employees with 90% coverage for dependents

Lifestyle wallet - a highly flexible benefits spending account for wellness, learning, and more

Employer-paid life & disability insurance, fertility benefits, and mental health benefits

Time off to recharge, including company holidays, paid time off, sick time, parental leave, and more!

Experience Level: staff
Employment Type: full-time
Workplace Type: hybrid
Category: Engineering
Industry: Finance
Salary Range: $255,000-$300,000 USD
Salary Min: 255000
Salary Max: 300000
Salary Currency: USD
Salary Period: year
Required Skills: ["evaluation frameworks", "large language models", "regulated environments", "AI evaluation and observability tools"]
Preferred Skills: []

To apply, use the employer's original posting: https://job-boards.greenhouse.io/robinhood/jobs/7676714