# Operations Engineer, HPC Networking

**Company**: fal
**Location**: Remote
**Work arrangement**: Remote
**Experience**: Senior
**Job type**: Full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/fal/jobs/4248335009?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_edecae6c-ca0

## Description

We're hiring an Operations Engineer for HPC Networking to keep our InfiniBand and Ethernet fabrics healthy as we scale.

This is a hands-on role. You'll bring up new fabrics alongside DC ops, monitor the ones in production, and chase down the weird stuff: link flaps, congestion, NCCL stalls, firmware bugs that only show up at scale.

### You're a fit if you've:

- Operated InfiniBand fabrics in production: subnet manager, routing, partitioning, monitoring.

- Debugged the full stack: cables, transceivers, switch firmware, HCAs, drivers, NCCL.

- Brought up new fabrics from cable pull through validation.

- Scripted your way through repetitive operational work (Bash, Python, Go, whatever).

- Nice to have: Ethernet RoCE, Spectrum-X, or large-scale GPU cluster networking.

### Who you are:

- Detail-oriented. Cable plant hygiene is a personality trait.

- Calm under fire. A fabric incident during a customer training run doesn't rattle you.

- You read vendor release notes for fun, or at least out of self-defense.

- You'd rather find the root cause than reboot the switch.

### Responsibilities:

- Monitor health and performance of InfiniBand and Ethernet fabrics: switches, HCAs, transceivers, links.

- Investigate and resolve fabric issues: connectivity, congestion, performance regressions.

- Support fabric bring-up alongside DC ops and customer-facing teams.

- Run maintenance and upgrades on switches and control plane components.

- Partner with cluster ops on cross-domain incidents where the line between compute and network is blurry.

- Improve the tooling and runbooks so the next incident resolves faster than the last.

## Skills

### Required
- InfiniBand
- Ethernet
- subnet manager
- routing
- partitioning
- monitoring
- cables
- transceivers
- switch firmware
- HCAs
- drivers
- NCCL

### Nice to have
- Ethernet RoCE
- Spectrum-X
- large-scale GPU cluster networking

