# Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

**Company**: NVIDIA
**Location**: US
**Work arrangement**: remote
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-NC-Remote/Senior-Software-Engineer--Distributed-Systems-Engineer---DGX-Cloud_JR2017916?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_f8c0adc0-fb4

## Description

We are hiring experienced software engineers to help scale up our AI infrastructure. As a Senior Software Engineer, Distributed Systems Engineer, you will be part of the DGX Cloud team responsible for production systems that enable large-scale GPU clusters to be used for a variety of AI workloads.

Your primary responsibilities will include designing and developing a massively distributed, scalable platform to identify, diagnose, and remediate non-performant GPU assets. You will work with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance. You will also evaluate system failures and improve services based on a well-defined incident management process.

To succeed in this role, you will need direct experience in a software engineering role within a highly technical organization, with demonstrable impact from your work. You should be highly motivated with strong communication skills, able to work successfully with multi-functional teams, principals, and architects, and able to coordinate effectively across organizational boundaries and geographies.

The ideal candidate will have 12+ years of experience in similar roles, including experience with large-scale production systems. They should possess a BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience. They should also be proficient in a systems programming language (Go, Python) and have a solid understanding of data structures and algorithms.

Additionally, the successful candidate will have technical competency in managing and automating large-scale distributed systems independent of cloud providers. They should have advanced hands-on experience with, and a deep understanding of, cluster management systems (Kubernetes, Slurm, Base Command Manager). Prior experience with asynchronous workflows and/or event-driven architecture is also desirable.

As a Senior Software Engineer, Distributed Systems Engineer, you will be eligible for equity and benefits. Applications for this job will be accepted at least until May 22, 2026.

## Skills

### Required
- software engineering
- cluster operations
- operator development
- node health monitoring
- GPU resource scheduling
- Kubernetes
- Slurm
- Base Command Manager
- asynchronous workflows
- event-driven architecture

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-NC-Remote/Senior-Software-Engineer--Distributed-Systems-Engineer---DGX-Cloud_JR2017916?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
