# Senior Software Engineer - NVLink Rack Scale Stability and Reliability

**Company**: NVIDIA
**Location**: Santa Clara
**Work arrangement**: remote
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Software-Engineer---NVLink-Rack-Scale-Stability-and-Reliability_JR2018426?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_c8eba919-8ad

## Description

We are looking for highly motivated Senior Software Engineers to join our Fabric Networking team, focusing on NVLink Rack-Scale Systems Stability & Reliability.

In this role, you will partner closely with architects and developers building our next-generation NVLink and NVSwitch systems, helping transform first-of-their-kind platforms into stable, reliable systems ready for volume production. You will work on complex system-level challenges spanning resiliency, diagnostics, recovery, and large-scale AI infrastructure, contributing directly to the software foundation powering next-generation datacenter deployments.

### Key Responsibilities

- Drive platform bring-up, feature enablement, end-to-end software validation, and debugging for next-generation NVLink-based GPU and rack-scale systems.

- Develop tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet support.

- Lead reliability and MTBI validation through stress testing, telemetry analysis, failure injection, and issue resolution.

- Triage complex software, firmware, networking, and platform issues across validation, deployment, and production environments.

- Collaborate with architecture, hardware, firmware, software, and customer engagement teams to improve system quality and reliability.

- Build and maintain SRE-style validation infrastructure, including provisioning, monitoring, and operational readiness.

- Create automation, dashboards, runbooks, and debug workflows that improve root-cause analysis and operational efficiency.

### Requirements

- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.

- 5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.

- Strong programming skills in C/C++ and Python; Bash/Shell scripting experience is a plus.

- Strong system-level debugging skills across software, firmware, hardware, and networking layers.

- Solid networking fundamentals, including TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.

- Experience with large-scale AI systems, including platform bring-up, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging.

- Ability to triage complex multi-domain issues using logs, telemetry, experiments, and structured debugging methods.

- Strong communication and collaboration skills across engineering, customer, and operations teams.

### Nice to Have

- Experience with NVIDIA GPU systems, NVLink, NVSwitch, CUDA, and large-scale AI/HPC clusters such as NVIDIA GB200 NVL72.

- Strong understanding of large-scale AI system architecture, including PCIe, memory hierarchy, DMA, high-speed interconnects, and distributed training/inference systems.

- Experience with server management technologies, data center operations, cluster provisioning, scaling, and fleet monitoring.

- Proven experience building diagnostics, automation, CI/CD pipelines, dashboards, and reliability tooling.

## Skills

### Required
- C/C++
- Python
- System-level debugging
- Networking fundamentals
- Large-scale AI systems
- Reliability engineering
- Stress testing
- Telemetry analysis
- Root-cause debugging

### Nice to have
- Bash/Shell scripting
- NVIDIA GPU systems
- NVLink
- NVSwitch
- CUDA
- Large-scale AI/HPC clusters
- Server management technologies
- Data center operations
- Cluster provisioning
- Scaling
- Fleet monitoring
- Diagnostics
- Automation
- CI/CD pipelines
- Dashboards
- Reliability tooling

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Software-Engineer---NVLink-Rack-Scale-Stability-and-Reliability_JR2018426?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)