NVIDIA

Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

remote senior full-time US

First indexed 18 May 2026

Description

We are hiring experienced software engineers to help scale up our AI infrastructure. As a Senior Software Engineer, Distributed Systems Engineer, you will be part of the DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads.

Your primary responsibilities will include designing and developing a massively distributed, scalable platform to identify, diagnose, and remediate non-performant GPU assets. You will work with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance. You will also evaluate system failures and improve services based on a well-defined incident management process.

To succeed in this role, you will need direct experience in a software engineering role within a highly technical organization, with demonstrable impact from your work. You should be highly motivated with strong communication skills, able to work successfully with multi-functional teams, principals, and architects, and able to coordinate effectively across organizational boundaries and geographies.

The ideal candidate will have 12+ years of experience in similar roles, including experience with large-scale production systems. They should possess a BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience. They should also have strong technical knowledge, including proficiency in a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.

Additionally, the successful candidate will have technical competency in managing and automating large-scale distributed systems independent of cloud providers. They should have advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Base Command Manager). Prior experience in asynchronous workflows and/or event-driven architecture is also desirable.

As a Senior Software Engineer, Distributed Systems Engineer, you will be eligible for equity and benefits. Applications for this job will be accepted at least until May 22, 2026.