Description
CoreWeave is The Essential Cloud for AI. As a Senior Software Engineer on the Cluster Orchestration team, you will advance CoreWeave's orchestration platform, including SUNK (Slurm on Kubernetes), ensuring AI workloads run seamlessly and efficiently across massive GPU clusters.
Responsibilities:
- Own multiple services within the orchestration platform
- Lead design and code reviews, decompose projects into milestones, and drive measurable improvements in reliability and performance
- Define SLIs/SLOs for services, strengthen operational practices, and mentor junior engineers
- Ensure customers see consistent improvements in throughput, latency, and system resilience
Requirements:
- 3-5 years of professional software engineering experience building distributed systems or cloud services
- Strong coding skills in Go (Python or C++ a plus) and solid CS fundamentals
- Hands-on experience running Kubernetes at production scale
- Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry)
- Proven ability to improve service reliability and performance using metrics
Preferred Qualifications:
- Familiarity with orchestration and workflow technologies such as Ray, Kubeflow, Kueue, Istio, Knative, or Argo Workflows
- Experience with distributed workloads, GPU-based applications, or ML pipelines
- Knowledge of scheduling concepts such as quota enforcement, preemption, and scaling strategies
- Exposure to reliability practices including SLOs, alarms, and post-incident reviews
Benefits:
- Competitive salary ($139,000 to $204,000)
- Discretionary bonus, equity awards, and comprehensive benefits program
- Medical, dental, and vision insurance (100% paid)
- Flexible Spending Account, Health Savings Account, Tuition Reimbursement, and more
To apply, use the employer's original posting:
https://job-boards.greenhouse.io/coreweave/jobs/4666814006