NVIDIA

Senior Product Manager, AI Factory Infra

hybrid senior full-time Santa Clara

First indexed 28 May 2026

Description

You will lead all aspects of resilient automation at AI Factory, managing break-fix automation, developing product strategy, improving operator experience, and guiding the roadmap for professionals. You will build a scalable, reliable product from a strong engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs.

Responsibilities:

  • Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
  • Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
  • Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently.
  • Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
  • Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.
  • Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.

Requirements:

  • 12+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.
  • BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.
  • Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.
  • Track record of owning products with real-world operational consequences: you understand blast radius and build accordingly.
  • Strong operator UX instincts, with a proven ability to translate complex system state into workflows that on-call engineers can act on under pressure.
  • Ability to build alignment across engineering, SRE, and external vendor partner teams.

Nice to Have:

  • Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments.
  • Experience with RMA logistics, vendor SLA oversight, and hardware repair processes on a large scale.
  • Background in reliability engineering, SLO design, or chaos/fault-injection testing.
  • Prior experience at a cloud service provider or on a hyperscaler infrastructure team.
  • Experience building agentic AI workflow software.
This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Product-Manager--AI-Factory-Infra_JR2018887