Description
We're building the Applied Training team to fix a specific problem: researchers spend their first month on cluster setup instead of research. You'll be an early member of a small team, responsible for our Kubernetes-native research cluster platform, the sandbox client for agentic training and evaluation, or possibly a new project altogether.
Your responsibilities will include contributing to the roadmap for Applied Training, designing and building a complete research cluster experience, owning the Python SDK, and writing documentation for running popular OSS training frameworks on CoreWeave.
You'll work directly with infrastructure teams and customers, learning how they structure their internal supercomputing stacks and bringing that knowledge back to what we build.
As a staff software engineer, you'll have 8-12+ years of experience building distributed systems, ML infrastructure, or developer platforms, with real Kubernetes experience and a passion for rigorous engineering enabled by AI-based workflows.
You'll be a good communicator, able to work with customers, translate researcher complaints into system designs, and contribute to the growth and success of our team.
If you're excited about this opportunity, please apply!