Description
We're hiring an Operations Engineer for HPC Networking to keep our InfiniBand and Ethernet fabrics healthy as we scale.
This is a hands-on role. You'll bring up new fabrics alongside DC ops, monitor the ones in production, and chase down the weird stuff: link flaps, congestion, NCCL stalls, firmware bugs that only show up at scale.
You're a fit if you've:
- Operated InfiniBand fabrics in production: subnet manager, routing, partitioning, monitoring.
- Debugged the full stack: cables, transceivers, switch firmware, HCAs, drivers, NCCL.
- Brought up new fabrics from cable pull through validation.
- Scripted your way through repetitive operational work (Bash, Python, Go, whatever).
- Nice to have: experience with RoCE, Spectrum-X, or large-scale GPU cluster networking.
Who you are:
- Detail-oriented. Cable plant hygiene is a personality trait.
- Calm under fire. A fabric incident during a customer training run doesn't rattle you.
- You read vendor release notes for fun, or at least out of self-defense.
- You'd rather find the root cause than reboot the switch.
Responsibilities:
- Monitor health and performance of InfiniBand and Ethernet fabrics: switches, HCAs, transceivers, links.
- Investigate and resolve fabric issues: connectivity, congestion, performance regressions.
- Support fabric bring-up alongside DC ops and customer-facing teams.
- Run maintenance and upgrades on switches and control plane components.
- Partner with cluster ops on cross-domain incidents where the line between compute and network is blurry.
- Improve the tooling and runbooks so the next incident resolves faster than the last.
To apply, use the employer's original posting:
https://job-boards.greenhouse.io/fal/jobs/4248335009