Description
The Fleet Reliability Operations team is responsible for the day-to-day provisioning, management, and uptime of CoreWeave's ever-expanding fleet of server nodes. This team plays a central role in CoreWeave's growth strategy, configuring, updating, and remotely troubleshooting our highest-tier supercomputing clusters and their networking, delivery platform, and tooling dependencies.
We are seeking curious, creative, and persistent problem solvers to join our Fleet Reliability Operations team. In this role, you will help drive batches of server nodes through our provisioning and validation processes while efficiently and effectively troubleshooting node or cluster problems as they arise.
Key responsibilities include:
- Configuring and maintaining large-scale high-performance supercomputing clusters running state-of-the-art GPUs
- Troubleshooting hardware and software issues; escalating and coordinating as needed with data center, network, hardware, and platform teams to drive resolution
- Monitoring and analyzing system performance and taking appropriate remediation actions to maintain cloud health
- Approaching work with flexibility and optimism, anticipating shifting business and technical priorities
- Creating and maintaining documentation of team processes, knowledge, and best practices for system management
- Thinking critically about day-to-day work and working collaboratively to improve team processes and efficiency
As a member of our team, you will be part of a dynamic and fast-paced environment where you will have the opportunity to grow and develop your skills. We offer a competitive salary range of $83,000 to $110,000, as well as a comprehensive benefits package, including medical, dental, and vision insurance, company-paid life insurance, and flexible PTO.
If you are a motivated and detail-oriented individual who is passionate about working with cutting-edge technology, we encourage you to apply for this exciting opportunity.