# Principal Engineer - Perf and Benchmarking

**Company**: CoreWeave
**Location**: Sunnyvale, CA / Bellevue, WA
**Work arrangement**: hybrid
**Experience**: senior
**Job type**: full-time
**Salary**: $206,000 to $333,000
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/coreweave/jobs/4627302006
**Canonical**: https://yubhub.co/jobs/job_ff4d3a91-b20

## Description

We're looking for a Principal Engineer to be the technical lead of CoreWeave's Benchmarking & Performance team. You will be responsible for our planet-scale performance data warehouse: ingesting, storing, transforming, and analyzing performance events from every data center across our global infrastructure.

You will also be integral to our industry-leading, end-to-end performance benchmarking publications. If MLPerf (Training & Inference) and working closely with NVIDIA (Megatron-LM, TensorRT-LLM & DGX Cloud) and the open-source community (llm-d, vLLM and all popular ML frameworks) speak to you, come help us demonstrate CoreWeave's performance and reliability leadership in the field.

**Responsibilities**

- Strategy & Leadership - Define the multi-year benchmarking strategy and roadmap; prioritize models/workloads (LLMs, diffusion, vision, speech) and hardware tiers. Build, lead, and mentor a high-performing team of performance engineers and data analysts. Establish governance for claims: documented methodologies, versioning, reproducibility, and audit trails.

- Perf Ownership - Lead end-to-end MLPerf Inference and Training submissions: workload selection, cluster planning, runbooks, audits, and result publication. Coordinate optimization tracks with NVIDIA (CUDA, cuDNN, TensorRT/TensorRT-LLM, Triton, NCCL) to hit competitive results; drive upstream fixes where needed.

- Internal Latency & Throughput Benchmarks - Design a Kubernetes-native, repeatable benchmarking service that exercises CoreWeave stacks across SUNK (Slurm on Kubernetes), Kueue, and Kubeflow pipelines. Measure and report p50/p95/p99 latency, jitter, tokens/s, time-to-first-token, cold-start/warm-start, and cost-per-token/request across models, precisions (BF16/FP8/FP4), batch sizes, and GPU types. Maintain a corpus of representative scenarios (streaming, batch, multi-tenant) and data sets; automate comparisons across software releases and hardware generations.

- Tooling & Automation - Build CI/CD pipelines and K8s controllers/operators to schedule benchmarks at scale; integrate with observability stacks (Prometheus, Grafana, OpenTelemetry) and results warehouses. Implement supply-chain integrity for benchmark artifacts (SBOMs, Cosign signatures).

- Cross-functional & Community - Partner with NVIDIA, key ISVs, and OSS projects (vLLM, Triton, KServe, PyTorch/DeepSpeed, ONNX Runtime) to co-develop optimizations and upstream improvements. Support Sales/SEs with authoritative numbers for RFPs and competitive evaluations; brief analysts and press with rigorous, defensible data.

**Requirements**

- 10+ years building distributed systems or HPC/cloud services, with deep expertise in large-scale ML training or similar high-performance workloads.

- Proven track record of architecting or building planet-scale data systems (e.g., telemetry platforms, observability stacks, cloud data warehouses, large-scale OLAP engines).

- Deep understanding of GPU performance (CUDA, NCCL, RDMA, NVLink/PCIe, memory bandwidth), model-server stacks (Triton, vLLM, TensorRT-LLM, TorchServe), and distributed training frameworks (PyTorch FSDP/DeepSpeed/Megatron-LM).

- Proficient with Kubernetes and ML control planes; familiarity with SUNK, Kueue, and Kubeflow in production environments.

- Excellent communicator able to interface with executives, customers, auditors, and OSS communities.

**Nice to have**

- Experience with time-series databases, log-structured merge trees (LSM), or custom storage engine development.

- Experience running MLPerf submissions (Inference and/or Training) or equivalent audited benchmarks at scale.

- Contributions to MLPerf, Triton, vLLM, PyTorch, KServe, or similar OSS projects.

- Experience benchmarking multi-region fleets and large clusters (thousands of GPUs).

- Publications/talks on ML performance, latency engineering, or large-scale benchmarking methodology.

## Skills

### Required
- Distributed systems
- HPC/cloud services
- Large-scale ML training
- GPU performance
- Model-server stacks
- Distributed training frameworks
- Kubernetes
- ML control planes

### Nice to have
- Time-series databases
- Log-structured merge trees
- Custom storage engine development
- MLPerf submissions
- Audited benchmarks
- Contributions to OSS projects
- Benchmarking multi-region fleets
- Large clusters
- Publications/talks on ML performance
