# Operations Engineer, Fleet Reliability

**Company**: CoreWeave
**Location**: New York, NY / Plano, TX / Bellevue, WA / Sunnyvale, CA
**Work arrangement**: hybrid
**Experience**: mid
**Job type**: full-time
**Salary**: $83,000 to $110,000
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/coreweave/jobs/4617382006
**Canonical**: https://yubhub.co/jobs/job_2ab9c635-07a

## Description

The Fleet Reliability Operations team is responsible for the day-to-day provisioning, management, and uptime of CoreWeave's ever-expanding fleet of server nodes. This team plays a central role in CoreWeave's growth strategy: it configures, updates, and remotely troubleshoots our highest-tier supercomputing clusters along with the networking, delivery platforms, and tooling they depend on.

We are seeking curious, creative, and persistent problem solvers to join our Fleet Reliability Operations team to help drive batches of server nodes through our provisioning and validation processes while efficiently and effectively troubleshooting node or cluster problems as they arise.

Key responsibilities include:

- Configuring and maintaining large-scale high-performance supercomputing clusters running state-of-the-art GPUs

- Troubleshooting hardware and software issues; escalating and coordinating as needed with data center, network, hardware, and platform teams to drive resolution

- Monitoring and analyzing system performance and taking appropriate remediation actions for cloud health

- Approaching work with flexibility and optimism, anticipating shifting business and technical priorities

- Creating and maintaining documentation of team processes, knowledge, and best practices for system management

- Thinking critically about day-to-day work and working collaboratively to improve team processes and efficiency

As a member of our team, you will be part of a dynamic and fast-paced environment where you will have the opportunity to grow and develop your skills. We offer a competitive salary range of $83,000 to $110,000, as well as a comprehensive benefits package, including medical, dental, and vision insurance, company-paid life insurance, and flexible PTO.

If you are a motivated and detail-oriented individual who is passionate about working with cutting-edge technology, we encourage you to apply for this exciting opportunity.

## Skills

### Required
- Linux system administration
- Troubleshooting hardware and software issues
- System maintenance tasks
- Scripting languages (Bash, Python, PowerShell, etc.)
- Grafana, Prometheus, PromQL queries, or similar observability platforms

### Nice to have
- Kubernetes administration
- HPC, including administering GPU-related workloads
- Data center environments including server racks, HVAC systems, fiber trays