NVIDIA

Principal Software Engineer, DGX Cloud Production Engineering

Remote · Senior · Full-time · Santa Clara

First indexed 20 May 2026

Job Description

We are looking for a Principal Software Engineer to help shape the technical direction for production engineering, Kubernetes-based operations, automation, and reliability across large-scale GPU clusters.

As a senior technical leader, you will define architecture, lead through influence, build critical systems, and turn ambiguous infrastructure problems into durable software and operating models.

Responsibilities

  • Define and execute the technical strategy for DGX Cloud cluster operations, building the automation, GitOps, and Day 2 reliability needed to operate large-scale GPU clusters across NVIDIA Cloud Partners (NCPs) and on-prem environments.
  • Lead design and implementation of systems for cluster lifecycle, validation, repair, upgrades, observability, and readiness.
  • Establish patterns for Kubernetes-based GPU cluster operations across partner and on-prem environments.
  • Identify and eliminate operational toil through software, APIs, automation, and agent-assisted workflows.
  • Set technical standards for production readiness, SLOs, incident response, handoff gates, and operational acceptance.
  • Mentor engineers and influence platform, infrastructure, storage, networking, security, and workload teams.

Requirements

  • 15+ years of experience building and operating large-scale distributed systems or cloud infrastructure.
  • Deep experience with Kubernetes, Linux, infrastructure automation, and production operations.
  • Strong programming experience in Go, Python, or similar.
  • Proven ability to lead complex cross-org technical initiatives.
  • Experience designing reliable systems with clear SLOs, observability, incident response, and automation.
  • BS/MS in Computer Science or equivalent experience.

Benefits

  • Eligible for equity and benefits.