Description

You will lead all aspects of resilient automation at AI Factory, managing break-fix automation, developing product strategy, improving operator experience, and guiding the roadmap for professionals. You will build a scalable, reliable product from a strong engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs.

Responsibilities:

Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently.
Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.
Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.

Requirements:

12+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.
BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.
Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.
Track record of owning products with real-world operational consequences; you understand blast radius and build accordingly.
Strong operator UX instincts, with a proven ability to translate complex system state into workflows that on-call engineers can act on under pressure.
Ability to build alignment across engineering, SRE, and external vendor partner teams.

Nice to Have:

Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments.
Experience with RMA logistics, vendor SLA oversight, and hardware repair processes at scale.
Background in reliability engineering, SLO design, or chaos/fault-injection testing.
Prior experience at a cloud service provider or on a hyperscaler infrastructure team.
Experience building agentic AI workflow software.

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Product-Manager--AI-Factory-Infra_JR2018887