NVIDIA

Systems Quality and Reliability Lead

onsite senior full-time Santa Clara

First indexed 18 May 2026

Description

We are seeking a Lead Systems Quality and Reliability Engineer to join our LPU team! You will own, build, and manage return merchandise authorization (RMA) and failure analysis (FA) debug and root-cause analysis for existing and new NVIDIA AI/ML products.

Responsibilities:

  • Conduct and lead debug and root-cause analysis of field RMAs. Collaborate with systems, hardware, software, and operations engineers as required
  • Scale root-cause FA capabilities within your organization
  • Create FA result reports that align with the standard 8D process or a similar one
  • Analyze RMA, FA and repair data. Identify trends and raise quality alerts when necessary. Drive resolution, containment, and mitigation plans for such quality alerts
  • Oversee hardware quality performance, monitoring field quality data and associated metrics including RMA rates, MTBF, and Reliability Ratio
  • Manage the operational performance of FA at contract manufacturers (CMs), ensuring partners achieve key performance indicators including FA cycle times, fault duplication rates, and fault isolation rates
  • Oversee the setup of new products into Failure Analysis operations

Requirements:

  • BS/MS in EE, Physics, or a related field (or equivalent experience)
  • 8+ years of hands-on systems test and/or validation engineering experience
  • Proven hands-on management and leadership experience
  • Competence using lab equipment such as oscilloscopes, logic analyzers, and power analyzers
  • Experience enabling reliability tests such as HTOL and quality tests such as burn-in
  • Ideally, working knowledge of FA techniques and tools such as FIB, SEM, TDR, VNA, and CSAM
  • Strong knowledge of fault isolation techniques such as OBIRCH, DLS/LADA, LVP, and LVI
  • Proficiency with high-speed interfaces (SerDes, PCIe, DDR)
  • Proficiency in Python, Perl, C++, or other languages on UNIX/Linux
  • Excellent knowledge of PCB, card, and system-level test and debug, and the ability to manage factory floor partners (CMs) for RMA/FA activities