Description
We are looking for a Principal Software Engineer to join our DGX Cloud team and build the foundational systems that drive NVIDIA's high-performance GPU infrastructure. You will play a meaningful role in crafting scalable automation solutions, integrating diverse systems, and enabling seamless workflows across global cloud operations.
As a Principal Engineer in DGX Cloud, you will be at the pinnacle of technical leadership. You will directly craft the platform that fuels the future of AI and cloud computing.
Responsibilities:
- Lead the design and development of next-generation APIs, state management, and workflow orchestration systems that automate fleet lifecycle operations at massive scale.
- Drive technical alignment across dependent systems and partner teams to ensure cohesive integration, clear interfaces, and reliable end-to-end workflows, with a strong focus on delivery.
- Act as a force-multiplier by coaching, mentoring, and encouraging senior engineers, elevating the technical standards and guidelines across the organization.
- Maintain a sharp focus on the customer experience and product requirements, translating deep technical insight into high-impact business solutions.
- Partner with executive and engineering leadership to codify critical business processes into self-measuring, scalable, and operationally consistent platforms, drastically reducing manual toil.
- Direct the integration strategy for key technologies, including common AI schedulers (e.g., Kubernetes, Slurm) and innovative observability systems (e.g., Prometheus, OpenTelemetry, Grafana).
Requirements:
- 16+ years of progressive industry experience.
- Master's or Bachelor's degree, or equivalent experience defining and shipping complex distributed systems.
- Deep, hands-on expertise in establishing, operating, and scaling services in a fast-paced, high-reliability environment.
- Ability to thrive in ambiguous, fast-paced environments by rapidly testing ideas, iterating toward working solutions, and then hardening the winners into reliable, scalable systems.
- Outstanding proficiency in modern systems programming languages such as Go, Java, or Python.
- Proven track record of defining, owning, and evolving the architecture of high-scale distributed systems, including advanced patterns for APIs, control planes, and data pipelines.
- Deep understanding of global cloud infrastructure (AWS, GCP, Azure) and container ecosystems (Docker, Kubernetes).
- Demonstrated ability to drive technical strategy and influence outcomes across organizational boundaries.
- Outstanding ability to communicate complex technical concepts, drive organizational consensus, and mentor high-performing engineers.
Benefits:
- Widely considered to be one of the technology world's most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package.
- This role is eligible for equity and benefits.
To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Principal-Software-Engineer---DGX-Cloud_JR2012048-1