Job Description
We are looking for a Principal Software Engineer to help shape the technical direction for production engineering, Kubernetes-based operations, automation, and reliability across large-scale GPU clusters.
As a senior technical leader, you will define architecture, lead through influence, build critical systems, and turn ambiguous infrastructure problems into durable software and operating models.
Responsibilities
- Define and execute the technical strategy for DGX Cloud cluster operations, building the automation, GitOps workflows, and Day 2 reliability practices needed to operate large-scale GPU clusters across NVIDIA Cloud Partners (NCPs) and on-prem environments.
- Lead design and implementation of systems for cluster lifecycle, validation, repair, upgrades, observability, and readiness.
- Establish patterns for Kubernetes-based GPU cluster operations across partner and on-prem environments.
- Identify and eliminate operational toil through software, APIs, automation, and agent-assisted workflows.
- Set technical standards for production readiness, SLOs, incident response, handoff gates, and operational acceptance.
- Mentor engineers and influence platform, infrastructure, storage, networking, security, and workload teams.
Requirements
- 15+ years of experience building and operating large-scale distributed systems or cloud infrastructure.
- Deep experience with Kubernetes, Linux, infrastructure automation, and production operations.
- Strong programming experience in Go, Python, or a similar language.
- Proven ability to lead complex cross-org technical initiatives.
- Experience designing reliable systems with clear SLOs, observability, incident response, and automation.
- BS/MS in Computer Science or equivalent experience.
Benefits
- This role is eligible for equity and benefits.
This listing is enriched and indexed by YubHub. To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Principal-Software-Engineer--DGX-Cloud-Production-Engineering_JR2018233