Description
NVIDIA's AI Networking Codesign and Benchmarking R&D group is seeking a senior software engineer to profile, analyze, and optimize AI workloads on large-scale GPU and CPU clusters used for distributed deep learning (LLM training and inference). The role focuses on collective communication and networking across hardware components and software layers.
Responsibilities:
- Characterize AI workloads and deep learning models for large-scale LLM training and inference on NVIDIA supercomputers, focusing on distributed systems with high-performance networking and NVIDIA communication libraries.
- Benchmark, profile, and analyze performance to identify bottlenecks and areas for improvement, particularly in networking.
- Develop a PyTorch trace-based toolset for profiling, analysis, and replay, to support benchmarking, debugging, and co-design of network systems for LLM workloads.
- Collaborate with multiple teams to provide performance analysis insights.
- Define performance test plans, set performance expectations, and work to achieve performance targets.
Requirements:
- B.Sc. in Computer Science or Software Engineering, or equivalent experience.
- 15+ years of experience with high-performance networking (RDMA, MPI, NCCL, SHARP).
- Demonstrated ability with performance evaluation techniques.
- Experience with NVIDIA GPUs and CUDA, deep learning frameworks such as TensorFlow or PyTorch, and collective communication libraries such as NCCL.
- Proficiency in programming languages: Python, Bash, and C++.
- Experience with container-based development environments.
Benefits:
- Competitive salaries
- Generous benefits package
- Equity eligibility
This listing is enriched and indexed by YubHub. To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Principal-Developer--AI-Networking_JR2019187