# Principal Engineer, Cluster Orchestration

**Company**: CoreWeave
**Location**: Bellevue, WA / Sunnyvale, CA
**Work arrangement**: hybrid
**Experience**: senior
**Job type**: full-time
**Salary**: $206,000 to $303,000
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/coreweave/jobs/4658799006
**Canonical**: https://yubhub.co/jobs/job_0f249232-d14

## Description

As a Principal Engineer in AI Infrastructure, you will lead the design and evolution of the cluster orchestration systems that make this possible. This includes Slurm, Kubernetes, SUNK, and the control planes that support AI training, inference, and model onboarding at scale.

You will define long-term architecture, solve hard scaling problems, and set technical direction across teams. Your work will directly affect how quickly customers can run models, how efficiently we use GPUs, and how reliably the platform behaves at scale.

Key responsibilities include:

- Defining the long-term architecture for CoreWeave's orchestration platforms across Kubernetes, Slurm, SUNK, Kueue, and related systems.

- Acting as a technical authority on scheduling, quota enforcement, fairness, pre-emption, and multi-tenant GPU isolation.

- Making design decisions that balance performance, reliability, cost, and operational complexity.

In addition to these responsibilities, you will also lead the evolution of Kubernetes-native control planes, including SUNK and custom operators, and design systems that support workload admission, validation, and rollout, including model onboarding flows.

You will work closely with cross-functional teams to ensure that the systems you design and implement meet the needs of our customers and are scalable, reliable, and efficient.

If you have a passion for building large-scale distributed systems and are looking for a challenging and rewarding role, we encourage you to apply.

## Skills

### Required
- Kubernetes
- Slurm
- SUNK
- Go
- Cloud-native systems development
- GPU-heavy platforms for AI training, inference, or HPC workloads

### Nice to have
- Kueue
- Kubeflow
- Argo Workflows
- Ray
- Istio
- Knative
