NVIDIA

AI and Systems Software Intern

Onsite, entry-level internship, 20 USD - 71 USD per hour, Santa Clara

First indexed 28 May 2026

Description

We are looking for an intern to join our AI and Systems Software team for datacenter applications. As an intern, you will be deeply involved in system-level debugging, analyzing the reliability of our large-scale infrastructure, and correlating complex failure modes with underlying hardware or system issues.

Your responsibilities will include investigating and triaging failures within large-scale compute clusters and performing deep-dive analysis to distinguish among software glitches, configuration errors, and hardware faults. You will also analyze logs and telemetry to correlate specific job failures with system-level issues and diagnostic test failures, helping to reduce noise and identify root causes.

Additionally, you will assist with tracking, calculating, and reporting key reliability metrics, specifically Mean Time Between Failures (MTBF) and Mean Time Between Interruptions (MTBI), to drive infrastructure improvements. You will also assist in analyzing large-scale workload issues, looking for application and infrastructure improvements that help jobs run as fast and as reliably as possible.

As an intern, you will work closely with a mentor to learn about hardware validation suite architecture, document debugging methodologies, and help the team make intelligent, data-backed engineering decisions.