Description
Are you passionate about pushing the limits of real-time large language model inference? Join NVIDIA's TensorRT Edge-LLM team and help shape the next generation of edge AI for automotive and robotics.
We build the software stack that enables Large Language, Vision-Language, and Multimodal (LLM/VLM/VLA) models to run efficiently on embedded and edge platforms, delivering cutting-edge generative AI experiences directly on-device.
Responsibilities:
- Develop and evolve a state-of-the-art inference framework in modern C++ that extends TensorRT with autoregressive model serving capabilities, including speculative decoding, LoRA, MoE, and KV cache management.
- Design and implement compiler and runtime optimizations tailored for transformer-based models running on constrained, real-time platforms.
- Collaborate with teams across CUDA, kernel libraries, compilers, and robotics to deliver high-performance, production-ready solutions.
- Contribute to CUDA kernel and operator development for critical transformer components such as attention, GEMM, and MoE.
- Benchmark, profile, and optimize inference performance across diverse embedded and automotive environments.
- Stay ahead of the rapidly evolving LLM/VLM ecosystem and bring emerging techniques into product-grade software.
Requirements:
- BS, MS, PhD, or equivalent experience in Computer Science, Electrical/Computer Engineering, or a closely related field.
- 4+ years of relevant software development experience.
- Deep understanding of transformer models and inference optimization techniques (e.g., quantization, tensor parallelism, or memory-efficient scheduling).
- Proficiency in modern C++ (C++11/14/17 and beyond).
- Familiarity with popular LLM frameworks and libraries such as TensorRT, TensorRT-LLM, vLLM, SGLang, MLC-LLM, or FlashInfer.
- A track record of strong software design, execution, and cross-disciplinary collaboration.
Preferred Qualifications:
- Demonstrated development experience or open-source contributions to LLM inference frameworks and libraries, such as SGLang, vLLM, or FlashInfer.
- Proficiency with CUDA, including efficient kernel development, performance profiling, and GPU architecture fundamentals.
- Prior work on autoregressive LLM serving systems, including speculative decoding or KV cache management.
- Familiarity with compiler infrastructure for large language model inference.
- Exposure to robotics or embedded AI pipelines, including optimizing for low-latency, resource-constrained systems.
NVIDIA is widely considered to be one of the technology world's most desirable employers. We hire some of the most brilliant and forward-thinking people in the world. If you thrive on innovation, autonomy, and technical excellence, come join us to shape the future of edge AI.