# Senior Product Manager, AI Factory Infra

**Company**: NVIDIA
**Location**: Santa Clara
**Work arrangement**: hybrid
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Product-Manager--AI-Factory-Infra_JR2018887?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_0623f199-730

## Description

You will lead all aspects of resilient automation at AI Factory: managing break-fix automation, developing product strategy, improving the operator experience, and guiding the roadmap. Building on a strong engineering foundation, you will deliver a scalable, reliable product that NVIDIA Cloud Partners (NCPs) depend on to uphold their SLAs.

**Responsibilities:**

- Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.

- Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.

- Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently.

- Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.

- Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.

- Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.

**Requirements:**

- 12+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.

- BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.

- Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.

- Track record of owning products with real-world operational consequences; you understand blast radius and build accordingly.

- Strong operator UX instincts and a proven ability to translate complex system state into workflows that on-call engineers can act on under pressure.

- Ability to build alignment across engineering, SRE, and external vendor partner teams.

**Nice to Have:**

- Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments.

- Experience with RMA logistics, vendor SLA oversight, and hardware repair processes at scale.

- Background in reliability engineering, SLO design, or chaos/fault-injection testing.

- Prior experience at a cloud service provider or on a hyperscaler infrastructure team.

- Experience building agentic AI workflow software.

## Skills

### Required
- product management
- infrastructure
- platform
- MLOps
- distributed systems
- workflow orchestration
- automation
- operator UX
- RMA logistics
- vendor SLA oversight
- hardware repair

### Nice to have
- GPU infrastructure
- datacenter operations
- AI factory environments
- reliability engineering
- SLO design
- chaos/fault-injection testing
- cloud service provider
- hyperscaler infrastructure team
- agentic AI workflow software

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Product-Manager--AI-Factory-Infra_JR2018887?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
