Description
We are looking for Senior Software Engineers to help build the automation, tooling, and operational systems that make GPU clusters reliable, scalable, and safe to run.
As a Senior Software Engineer on our DGX Cloud Production Engineering team, you will build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments. You will develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations. You will improve Day 0 / Day 1 / Day 2 workflows for cluster bring-up, handoff, and production operations. You will reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows. You will participate in on-call, incident response, debugging, and durable follow-up work. You will partner with platform, storage, networking, security, and workload teams to make infrastructure production-ready.
To succeed in this role, you will need 8+ years of experience building or operating production infrastructure, along with strong programming skills in Python, Go, or a similar language. You will need experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation, and the ability to troubleshoot distributed systems in production. You will also need clear communication, the ability to work across teams, and a BS/MS in Computer Science or equivalent experience.
In addition to the above requirements, experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation is highly desirable, as is experience with SLOs, on-call, incident response, observability, and reliability practices. Exposure to BMaaS, VMaaS, managed Kubernetes, or multi-cloud infrastructure is also highly desirable.