Description
We're seeking a Staff / Senior Software Engineer, AI Reliability to join our team. As a key member of our AI Reliability Engineering (AIRE) team, you will partner with teams across Anthropic to improve the reliability of our most critical serving paths. You will develop Service Level Objectives for large language model serving systems, design and implement monitoring and observability systems, assist in designing and implementing high-availability serving infrastructure, lead incident response for critical AI services, and support the reliability of safeguard model serving.
You may be a good fit for this role if you have a strong background in distributed systems, infrastructure, or reliability; are curious and brave; think holistically about how systems compose and where the seams are; can build lasting relationships across teams; care about users and feel ownership over outcomes; have excellent communication and collaboration skills; and bring diverse experience.
Strong candidates may also have experience operating large-scale model serving or training infrastructure, experience with one or more ML hardware accelerators, an understanding of ML-specific networking optimizations, expertise in AI-specific observability tools and frameworks, experience with chaos engineering and systematic resilience testing, and contributions to open-source infrastructure or ML tooling.
We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues. We value impact and believe that the highest-impact AI research will be big science. We work as a single cohesive team on just a few large-scale research efforts, and we value communication skills.
If you're interested in this role, please submit an application even if you don't believe you meet every single qualification. We value diversity and strive to include a wide range of perspectives on our team.