Description
You will lead all aspects of resilient automation at AI Factory: managing break-fix automation, developing product strategy, improving the operator experience, and guiding the product roadmap. Building on a strong engineering foundation, you will deliver a scalable, reliable product that NVIDIA Cloud Partners depend on to uphold their SLAs.
Responsibilities:
- Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
- Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
- Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently.
- Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
- Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.
- Collaborate with NVIDIA Cloud Partner (NCP) operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.
Requirements:
- 12+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.
- BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.
- Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.
- Track record of owning products with real-world operational consequences; you understand blast radius and build accordingly.
- Strong operator UX instincts, with a proven ability to translate complex system state into workflows that on-call engineers can act on under pressure.
- Ability to build alignment across engineering, SRE, and external vendor partner teams.
Nice to Have:
- Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments.
- Experience with RMA logistics, vendor SLA oversight, and hardware repair processes at scale.
- Background in reliability engineering, SLO design, or chaos/fault-injection testing.
- Prior experience at a cloud service provider or on a hyperscaler infrastructure team.
- Experience building agentic AI workflow software.
This listing is enriched and indexed by YubHub. To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Product-Manager--AI-Factory-Infra_JR2018887