Description
CoreWeave is seeking a Senior GPU Infrastructure Software Engineer to join our team. As a senior engineer, you will be responsible for leading designs, raising engineering standards, and delivering measurable improvements to latency, throughput, and reliability across multiple services. You will partner with fleet, product, and hardware teams to evolve our GPU performance testing platform to ensure we deliver a reliable and performant experience to our customers.
Key responsibilities include:
- Design and implement solutions to problems of scale for testing and validation of CoreWeave's global infrastructure.
- Design and develop Kubernetes-native controllers and operators to automate infrastructure workflows.
- Build and maintain scalable backend services and APIs (gRPC/REST) in Go or Python.
- Develop performance tests and automation workflows to expand hardware validation across the CoreWeave fleet.
- Write and maintain Kubernetes custom controllers and operators to automate infrastructure testing.
- Adapt and extend open source tooling to enhance visibility into system metrics, performance, and health.
To be successful in this role, you should have:
- ~5 to 8 years experience.
- Proficiency in Go and/or Python software development.
- Hands-on experience with writing Kubernetes operators/controllers.
Nice to have:
- Experience testing hardware at scale.
- HPC Experience.
- Experience with AI/ML infrastructure and training / inference.
This listing is enriched and indexed by YubHub. To apply, use the employer's original posting:
https://job-boards.greenhouse.io/coreweave/jobs/4627277006