Anthropic

Staff / Senior Software Engineer, Compute Capacity

Hybrid · Staff · Full-time · San Francisco, CA | New York City, NY

First indexed 8 Mar 2026

Description

About Anthropic

Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the Role

Anthropic manages one of the largest and fastest-growing accelerator fleets in the industry — spanning multiple accelerator families and clouds. The Accelerator Capacity Engineering (ACE) team is responsible for making sure every chip in that fleet is accounted for, well-utilized, and efficiently allocated. We own the data, tooling, and operational systems that let Anthropic plan, measure, and maximize utilization across first-party and third-party compute.

As an engineer on ACE, you will build the production systems that power this work: data pipelines that ingest and normalize telemetry from heterogeneous cloud environments, observability tooling that gives the org real-time visibility into fleet health, and performance instrumentation that measures how efficiently every major workload uses the hardware it’s running on. You will be expected to write production-quality code every day, operate alongside Kubernetes-native infrastructure at meaningful scale, and directly influence decisions around one of Anthropic’s largest areas of spend.

You’ll collaborate closely with research engineering, infrastructure, inference, and finance teams. The work requires someone who can move between data engineering, systems engineering, and observability with comfort — and who thrives in a high-autonomy, high-ambiguity environment.

What This Team Owns

The team’s work spans three functional areas. Depending on your background and interests, you’ll focus primarily in one, but the boundaries are fluid and the problems overlap:

  • Data infrastructure — collecting, normalizing, and serving the fleet-wide data that powers everything else. This means building pipelines that ingest occupancy and utilization telemetry from Kubernetes clusters, normalizing billing and usage data across cloud providers, and maintaining the BigQuery layer that the rest of the org queries against. Correctness, completeness, and latency matter here.
  • Fleet observability — making the state of the accelerator fleet legible and actionable in real time. This means building cluster health tooling, capacity planning platforms, alerting on occupancy drops and allocation problems, and driving systemic improvements to scheduling and fragmentation. The work sits at the intersection of Kubernetes operations and cross-team coordination.
  • Compute efficiency — measuring and improving how effectively every major workload uses the hardware it’s running on. This means instrumenting utilization metrics across training, inference, and eval systems, building benchmarking infrastructure, establishing per-config baselines, and collaborating directly with system-owning teams to close efficiency gaps.
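To make the normalization problem in the data-infrastructure area concrete, here is a minimal sketch of collapsing provider-specific billing rows into one shared record type. Everything here is an illustrative assumption, not Anthropic's actual schema: the `UsageRecord` fields are invented, and the input column names merely echo the style of AWS Cost and Usage Report and GCP BigQuery billing exports.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Hypothetical normalized row; a real pipeline tracks many more fields."""
    provider: str
    sku: str
    usage_hours: float
    cost_usd: float


def normalize_aws(row: dict) -> UsageRecord:
    # Column names follow the AWS Cost and Usage Report style;
    # treat them as illustrative, not an exact schema.
    return UsageRecord(
        provider="aws",
        sku=row["lineItem/UsageType"],
        usage_hours=float(row["lineItem/UsageAmount"]),
        cost_usd=float(row["lineItem/UnblendedCost"]),
    )


def normalize_gcp(row: dict) -> UsageRecord:
    # GCP's BigQuery billing export uses flat snake_case columns instead.
    return UsageRecord(
        provider="gcp",
        sku=row["sku_description"],
        usage_hours=float(row["usage_amount"]),
        cost_usd=float(row["cost"]),
    )
```

Once every provider's rows collapse into one record type, the completeness checks and the BigQuery serving layer only have to reason about a single schema.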

What You’ll Do

  • Build and operate data pipelines that ingest accelerator occupancy, utilization, and cost data from multiple cloud providers into BigQuery. Own data completeness, latency SLOs, gap detection, and backfill automation.
  • Develop and maintain observability infrastructure — Prometheus recording rules, Grafana dashboards, and alerting systems — that surfaces actionable signals about fleet health, occupancy, and efficiency.
  • Instrument and analyze compute efficiency metrics across training, inference, and eval workloads. Build benchmarking infrastructure, establish per-config baselines, and work with system-owning teams to improve utilization.
  • Build internal tooling and platforms that enable capacity planning, workload attribution, and cluster debugging. The consumers are other engineering teams, finance, and leadership — not external users.
  • Operate Kubernetes-native systems at scale — deploying data collection agents, managing workload labeling infrastructure, and understanding how taints, reservations, and scheduling affect capacity.
  • Normalize and reconcile data across heterogeneous sources — including AWS, GCP, and Azure billing exports, vendor-specific telemetry formats, and internal systems with different schemas and billing arrangements.
  • Collaborate across organizational boundaries with research engineering, infrastructure, inference, and finance teams. Gather requirements from technical stakeholders, translate them into useful systems, and communicate trade-offs to non-technical audiences.
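To give a flavor of the gap-detection side of that work, here is a minimal sketch, assuming telemetry arrives as timestamped samples at a known cadence. The function and parameter names are invented for illustration; a production pipeline would run this per cluster and stream and feed the detected gaps into backfill jobs.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive samples are farther
    apart than tolerance * expected_interval.  A sketch only."""
    ts = sorted(timestamps)
    max_delta = expected_interval * tolerance
    return [(prev, curr) for prev, curr in zip(ts, ts[1:]) if curr - prev > max_delta]


# Example: samples expected every minute, with one three-minute hole.
base = datetime(2026, 3, 1)
samples = [base + timedelta(minutes=m) for m in (0, 1, 2, 5, 6)]
gaps = find_gaps(samples, expected_interval=timedelta(minutes=1))
# gaps -> [(base + 2 minutes, base + 5 minutes)]
```

The tolerance factor keeps ordinary scrape jitter from raising alerts while still flagging genuine collection outages that need backfilling.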

You May Be a Good Fit If You Have

  • 5+ years of software engineering experience with a strong track record building and operating production systems. You write code every day — this is a hands-on engineering role, not a planning or coordination role.
  • Kubernetes fluency at operational depth — you’ve operated production K8s at meaningful scale, not just written manifests. Comfort with scheduling, taints, labels, node management, and cluster debugging.
  • Experience with data engineering and observability — you’ve built data pipelines, normalized data across heterogeneous sources, and developed observability infrastructure.
  • Strong communication and collaboration skills — you can gather requirements from technical stakeholders, translate them into useful systems, and communicate trade-offs to non-technical audiences.
  • Ability to thrive in a high-autonomy, high-ambiguity environment — you can move between data engineering, systems engineering, and observability with comfort and make decisions with minimal guidance.
This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://job-boards.greenhouse.io/anthropic/jobs/5126702008