# Principal Software Engineer, DGX Cloud Production Engineering

**Company**: NVIDIA
**Location**: Santa Clara
**Work arrangement**: remote
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Principal-Software-Engineer--DGX-Cloud-Production-Engineering_JR2018233?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_f2bc7ba5-258

## Description

### Job Description

We are looking for a Principal Software Engineer to help shape the technical direction for production engineering, Kubernetes-based operations, automation, and reliability across large-scale GPU clusters.

As a senior technical leader, you will define architecture, lead through influence, build critical systems, and turn ambiguous infrastructure problems into durable software and operating models.

### Responsibilities

- Define and execute the technical strategy for DGX Cloud cluster operations, building the automation, GitOps, and Day 2 reliability needed to operate large-scale GPU clusters across NVIDIA Cloud Partners (NCPs) and on-prem environments.

- Lead design and implementation of systems for cluster lifecycle, validation, repair, upgrades, observability, and readiness.

- Establish patterns for Kubernetes-based GPU cluster operations across partner and on-prem environments.

- Identify and eliminate operational toil through software, APIs, automation, and agent-assisted workflows.

- Set technical standards for production readiness, SLOs, incident response, handoff gates, and operational acceptance.

- Mentor engineers and influence platform, infrastructure, storage, networking, security, and workload teams.

### Requirements

- 15+ years of experience building and operating large-scale distributed systems or cloud infrastructure.

- Deep experience with Kubernetes, Linux, infrastructure automation, and production operations.

- Strong programming experience in Go, Python, or a similar language.

- Proven ability to lead complex cross-org technical initiatives.

- Experience designing reliable systems with clear SLOs, observability, incident response, and automation.

- BS/MS in Computer Science or equivalent experience.

### Benefits

- This role is eligible for equity and benefits.

## Skills

### Required
- Kubernetes
- Linux
- Infrastructure automation
- Production operations
- Go
- Python

### Nice to have
- GPU clusters
- AI/ML infrastructure
- Kubernetes operators
- GitOps
- BMaaS/VMaaS
- Managed Kubernetes
- Multi-cloud fleet operations

