# Senior AI Infrastructure Software Engineer - DGX Cloud

**Company**: NVIDIA
**Location**: Santa Clara
**Work arrangement**: remote
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-AI-Infrastructure-Software-Engineer---DGX-Cloud_JR2018042?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_5fecc5a7-6a4

## Description

Joining NVIDIA's DGX Cloud Lepton team means contributing to the leading cloud product that powers innovative AI research and serves AI developers. We focus on building an AI/ML platform that improves productivity and optimizes the efficiency and resiliency of AI workloads, and on developing scalable AI infrastructure services globally.

As a senior DGX Cloud AI Infrastructure software engineer at NVIDIA, you will have the opportunity to work on innovative technologies that power the future of AI and be part of a dynamic and supportive team that values learning and growth. The role provides the autonomy to work on meaningful projects with the support and mentorship needed to succeed, and contributes to a culture of blameless postmortems, iterative improvement, and risk-taking.

**Responsibilities:**

- Develop platforms and tools for large-scale AI, LLM, and GenAI infrastructure.

- Develop and optimize tools to improve AI/ML workload efficiency and resiliency.

- Root-cause, analyze, and triage failures from the application level to the hardware level.

- Enhance infrastructure and products underpinning NVIDIA's AI platforms.

- Co-design and implement APIs for integration with NVIDIA's resiliency stacks on the platform.

- Define meaningful and actionable reliability metrics to track and improve system and service reliability.

**Requirements:**

- 8+ years of experience developing software infrastructure for large-scale AI systems.

- Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).

- Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level.

- Proven track record in building and scaling large-scale distributed systems.

- Experience with AI training, inference, and data infrastructure services.

- Familiarity with Kubernetes and with operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).

- Proficiency in programming languages such as Python and C/C++, as well as scripting languages.

- Excellent communication and collaboration skills; a commitment to diversity, intellectual curiosity, problem solving, and openness is essential.

**Nice to Have:**

- Experience working with large-scale AI clusters and cloud-native infrastructure.

- Strong understanding of NVIDIA GPUs and network technologies (RDMA, IB, NCCL).

- Good understanding of the internals of DL frameworks such as PyTorch, TensorFlow, JAX, Dynamo, and Ray.

- Experience with root cause analysis of failures at datacenter scale.

- Strong background in software design and development.

## Skills

### Required
- Python
- C/C++
- scripting languages
- Kubernetes
- ELK
- Prometheus
- Loki
- PyTorch
- TensorFlow
- JAX
- Dynamo
- Ray

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-AI-Infrastructure-Software-Engineer---DGX-Cloud_JR2018042?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
