Description

NVIDIA's AI Networking Codesign and Benchmarking R&D group is seeking a senior software engineer to profile, analyze, and optimize AI workloads on large-scale GPU and CPU clusters used for distributed deep learning (LLM training and inference). The role focuses on collective communication and networking across hardware components and software layers.

Responsibilities:

Characterize AI workloads and deep learning models for large-scale LLM training and inference on NVIDIA supercomputers, focusing on distributed systems with high-performance networking and NVIDIA communication libraries.
Benchmark, profile, and analyze performance to identify bottlenecks and areas for improvement, particularly in networking.
Develop a PyTorch trace-based toolset for profiling, analysis, and replay, used to benchmark, debug, and co-design network systems for LLM workloads.
Collaborate with multiple teams to provide performance analysis insights.
Define performance test plans, set performance expectations, and work to achieve performance targets.

Requirements:

B.Sc. in Computer Science or Software Engineering, or equivalent experience.
15+ years of experience with high-performance networking (RDMA, MPI, NCCL, SHARP).
Demonstrated ability in performance evaluation techniques and approaches.
Experience with NVIDIA GPUs and CUDA, deep learning frameworks such as TensorFlow or PyTorch, and collective communication libraries such as NCCL.
Proficiency in Python, Bash, and C++.
Experience with container-based development environments.

Benefits:

Competitive salaries
Generous benefits package
Equity eligibility

To apply, use the employer's original posting: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Principal-Developer--AI-Networking_JR2019187