# Operations Engineer, Fleet Reliability

**Company**: fal
**Location**: Remote
**Work arrangement**: remote
**Experience**: mid
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/fal/jobs/4248332009?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_f0ce3a0d-f14

## Description

As generative media reshapes industries, fal is hiring Operations Engineers to keep the fleet alive. This hands-on role involves provisioning, validating, and troubleshooting GPU nodes across clusters. You'll be on-call, comfortable with ambiguity, and able to script your way out of repetitive work.

Responsibilities:

- Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters
- Troubleshoot hardware and software issues across compute, network, and storage
- Monitor fleet health, take remediation action, push fixes upstream when needed
- Write the runbooks. Improve the ones that exist. Delete the ones that don't work

You're a fit if you've:

- Administered Linux systems in the critical path before
- Troubleshot GPU node issues: NVLink, NCCL, IB, driver and firmware bugs
- Worked with observability systems like Grafana and Prometheus
- Scripted your way out of repetitive work (bash, python, go, whatever)

Who you are:

- Curious. You don't accept 'it's flaky' as a root cause
- Comfortable with ambiguity. The runbook doesn't exist yet for half of what you'll do
- On-call doesn't scare you
- You'd rather automate a problem than fix it twice

## Skills

### Required
- Linux Systems Administration
- GPU Node Troubleshooting
- Observability Systems
- Scripting (bash, python, go)

---

Source: [Apply at job-boards.greenhouse.io](https://job-boards.greenhouse.io/fal/jobs/4248332009?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
