Description

We are seeking a Linux OS and System Programming Subject Matter Expert to join our Infrastructure team. In this role, you'll accelerate and optimize the virtualization stack and VM workloads that power our AI infrastructure.

Your expertise in low-level system programming, kernel optimization, and virtualization technologies will be crucial in ensuring Anthropic can scale its compute infrastructure efficiently and reliably for training and serving frontier AI models.

Responsibilities:

Optimize our virtualization stack, improving performance, reliability, and efficiency of our VM environments

Design and implement kernel modules, drivers, and system-level components to enhance our compute infrastructure

Investigate and resolve performance bottlenecks in virtualized environments

Collaborate with cloud engineering teams to optimize interactions between our workloads and underlying hardware

Develop tooling for monitoring and improving virtualization performance

Work with our ML engineers to understand their computational needs and optimize our systems accordingly

Contribute to the design and implementation of our next-generation compute infrastructure

Share knowledge with team members on low-level systems programming and Linux kernel internals

Partner with cloud providers to influence hardware and platform features for AI workloads

You may be a good fit if you:

Have experience with Linux kernel development, system programming, or related low-level software engineering

Understand virtualization technologies (KVM, Xen, QEMU, etc.) and their performance characteristics

Have experience optimizing system performance for compute-intensive workloads

Are familiar with modern CPU architectures and memory systems

Have strong C/C++ programming skills and, ideally, experience with a systems language such as Rust

Understand Linux resource management, scheduling, and memory management

Have experience profiling and debugging system-level performance issues

Are comfortable diving into unfamiliar codebases and technical domains

Are results-oriented, with a bias towards practical solutions and measurable impact

Care about the societal impacts of AI and are passionate about building safe, reliable systems

Strong candidates may also have experience with:

GPU virtualization and acceleration technologies

Cloud infrastructure at scale (AWS, GCP)

Container technologies and their underlying implementation (Docker, containerd, runc, OCI)

eBPF programming and kernel tracing tools

OS-level security hardening and isolation techniques

Developing custom scheduling algorithms for specialized workloads

Performance optimization for ML/AI-specific workloads

Network stack optimization and high-performance networking

TPUs, custom ASICs, or other ML accelerators

Representative projects:

Optimizing kernel parameters and VM configurations to reduce inference latency for large language models

Implementing custom memory management schemes for large-scale distributed training

Developing specialized I/O schedulers to prioritize ML workloads

Creating lightweight virtualization solutions tailored for AI inference

Building monitoring and instrumentation tools to identify system-level bottlenecks

Enhancing communication between VMs for distributed training workloads

To apply, use the employer's original posting: https://job-boards.greenhouse.io/anthropic/jobs/5025591008