Description
Joining NVIDIA's DGX Cloud Lepton Team means contributing to the leading cloud product that powers innovative AI researchers and developers. We focus on building an AI/ML platform that improves productivity, optimizes the efficiency and resiliency of AI workloads, and delivers scalable AI infrastructure services globally.
As a senior DGX Cloud AI Infrastructure software engineer at NVIDIA, you will have the opportunity to work on innovative technologies that power the future of AI and be part of a dynamic and supportive team that values learning and growth. The role provides the autonomy to work on meaningful projects with the support and mentorship needed to succeed, and contributes to a culture of blameless postmortems, iterative improvement, and risk-taking.
Responsibilities:
- Develop platform and tools for large-scale AI, LLM, and GenAI infrastructure.
- Develop and optimize tools to improve AI/ML workload efficiency and resiliency.
- Root-cause, analyze, and triage failures from the application level to the hardware level.
- Enhance infrastructure and products underpinning NVIDIA's AI platforms.
- Co-design and implement APIs for integration with NVIDIA's resiliency stacks on the platform.
- Define meaningful and actionable reliability metrics to track and improve system and service reliability.
Requirements:
- 8+ years of experience developing software infrastructure for large-scale AI systems.
- Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
- Strong debugging skills and experience analyzing and triaging failures in AI applications from the application level to the hardware level.
- Proven track record in building and scaling large-scale distributed systems.
- Experience with AI training, inference, and data infrastructure services.
- Familiarity with Kubernetes and experience operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
- Proficiency in programming languages such as Python and C/C++, as well as scripting languages.
- Excellent communication and collaboration skills, along with a commitment to diversity, intellectual curiosity, problem solving, and openness.
Nice to Have:
- Experience working with large-scale AI clusters and cloud-native infrastructure.
- Strong understanding of NVIDIA GPUs and network technologies (RDMA, IB, NCCL).
- Good understanding of the internals of DL frameworks such as PyTorch, TensorFlow, JAX, Dynamo, and Ray.
- Experience with root cause analysis of failures at datacenter scale.
- Strong background in software design and development.