Description
We are looking for Senior Software Engineers to help build the automation, tooling, and operational systems that make GPU clusters reliable, scalable, and safe to run.
This role is part of a production engineering team focused on Kubernetes-based infrastructure, GPU cluster operations, reliability, automation, GitOps, and Day 2 operability across DGX Cloud environments.
Key responsibilities include:
- Building and operating automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments.
- Developing tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations.
- Improving Day 0 / Day 1 / Day 2 workflows for cluster bring-up, handoff, and production operations.
- Reducing manual production touches through APIs, GitOps, automation, and agent-assisted workflows.
- Participating in on-call, incident response, debugging, and durable follow-up work.
- Partnering with platform, storage, networking, security, and workload teams to make infrastructure production-ready.
Requirements include:
- 8+ years of experience building or operating production infrastructure.
- Strong programming skills in Python, Go, or a similar language.
- Experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation.
- Ability to troubleshoot distributed systems in production.
- Clear communication and ability to work across teams.
- BS/MS in Computer Science or equivalent experience.
Preferred qualifications include experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, Argo CD, or fleet automation.
To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Software-Engineer--DGX-Cloud-Production-Engineering_JR2019319