Description
We are looking for a Senior Deep Learning Software Engineer to design and build our automated inference and deployment solution. As part of the team, you will be instrumental in defining a scalable architecture for DL inference, with an emphasis on ease of use and compute efficiency.
Your work will span multiple layers of the DL deployment stack: developing features in high-level frameworks like PyTorch and JAX, designing and implementing a high-performance execution environment, performing low-level GPU optimizations, and developing custom GPU kernels in CUDA and/or Triton.
This is an exceptional opportunity for software engineers straddling the boundaries of research and engineering, with a strong background in both machine learning fundamentals and software architecture & engineering.
Responsibilities:
- Play a pivotal role in defining a modular, scalable platform that seamlessly bridges training and deployment workflows, enabling tight integration of deployment tooling with training frameworks such as Megatron and NeMo.
- Leverage and build upon the PyTorch 2.0 ecosystem (TorchDynamo, torch.export, torch.compile, etc.) to analyze and extract a standardized model graph representation from arbitrary PyTorch models for our automated deployment solution.
- Develop support for inference optimization techniques such as speculative decoding and LoRA.
- Collaborate with teams across NVIDIA to use performant kernel implementations within the automated deployment solution.
- Analyze and profile GPU kernel-level performance to identify hardware and software optimization opportunities.
- Continuously innovate on inference performance to ensure NVIDIA's inference software solutions (TRT, TRT-LLM, TRT Model Optimizer) maintain and increase their leadership in the market.
Requirements:
- Master's, PhD, or equivalent experience in Computer Science, AI, Applied Math, or a related field.
- 8+ years of relevant work or research experience in Deep Learning.
- Excellent software design skills, including debugging, performance analysis, and test design.
- Strong proficiency in Python, PyTorch, and related ML tools.
- Strong algorithms and programming fundamentals.
- Good written and verbal communication skills and the ability to work independently and collaboratively in a fast-paced environment.
Nice to Have:
- Contributions to PyTorch, JAX, or other machine learning frameworks.
- Knowledge of GPU architecture and the compilation stack, and the ability to understand and debug end-to-end performance.
- Familiarity with NVIDIA's deep learning SDKs such as TensorRT.
- Prior experience writing high-performance GPU kernels for machine learning workloads using CUDA, CUTLASS, or Triton.