# Principal Developer, AI Networking

**Company**: NVIDIA
**Location**: Santa Clara, CA
**Work arrangement**: remote
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Principal-Developer--AI-Networking_JR2019187?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_e84cdb21-ac9

## Description

NVIDIA's AI Networking Codesign and Benchmarking R&D group is seeking a senior software engineer to profile, analyze, and optimize AI workloads on large-scale GPU and CPU clusters used for distributed deep learning LLM training and inference. The role focuses on collective communication and networking across hardware components and software layers.

**Responsibilities:**

- Characterize AI workloads and deep learning models for large-scale LLM training and inference on NVIDIA supercomputers, focusing on distributed systems with high-performance networking and NVIDIA communication libraries.

- Benchmark, profile, and analyze performance to identify bottlenecks and areas for improvement, particularly in networking.

- Develop a PyTorch trace-based profiling, analysis, and replay toolset for benchmarking, debugging, and co-designing network systems for LLM workloads.

- Collaborate with multiple teams to provide performance analysis insights.

- Define performance test plans, set performance expectations, and work to achieve performance targets.

**Requirements:**

- B.Sc. in Computer Science or Software Engineering, or equivalent experience.

- 15+ years of experience with high-performance networking (RDMA, MPI, NCCL, SHARP).

- Demonstrated ability in performance evaluation techniques and approaches.

- Experience with NVIDIA GPUs and CUDA, deep learning frameworks such as TensorFlow or PyTorch, and collective communication libraries such as NCCL.

- Proficiency in programming languages: Python, Bash, and C++.

- Experience with container-based development environments.

**Benefits:**

- Competitive salaries

- Generous benefits package

- Equity eligibility

## Skills

### Required
- High-performance networking
- RDMA
- MPI
- NCCL
- SHARP
- NVIDIA GPUs
- CUDA
- Deep learning frameworks
- Python
- Bash
- C++
- Container-based development

### Nice to have
- PyTorch
- TensorFlow
- AI workloads
- Benchmarking
- Performance analysis

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Principal-Developer--AI-Networking_JR2019187?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
