Description

We optimize and benchmark GenAI inference on NVIDIA's latest accelerators, defining the industry's performance standards across language models, video generation, and speech workloads. This team sits at the intersection of GPU performance engineering and public accountability.

As an AI Inference Performance Engineer, you will:

Drive industry benchmark results: Own the end-to-end optimization pipeline, implementing and integrating optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.

Define and optimize cutting-edge workloads: Identify and shape next-generation inference benchmarks for multi-turn coding, agentic workflows, and other emerging AI use cases. Collaborate with framework and kernel teams to push performance to its limits on large-scale LLM-MoE models, vision-language models, video diffusion models, recommendation, and speech workloads.

Architect distributed inference: Design and optimize execution from a single GPU to rack-scale clusters, managing performance at every scale in between.

Establish performance methodology: Apply roofline analysis and systematic profiling to decompose bottlenecks across CUDA kernels, frameworks, and serving layers.

Influence the ecosystem: Contribute to TensorRT-LLM, vLLM, SGLang, and other open-source projects. Partner with architecture, kernel, and compiler teams to shape GPU roadmaps based on real workload data.

Provide technical leadership: Raise the technical bar, drive cross-functional execution on tight benchmark timelines, and lead a world-class team.

To apply, use the employer's original posting: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/AI-Inference-Performance-Engineer---New-College-Grad-2026_JR2014441