NVIDIA

Deep Learning Performance Architect, CUTLASS DSL

Onsite, mid-level, full-time, Shanghai

First indexed 2 Jun 2026

Description

Are you passionate about programming languages, compiler technology, and GPU performance? We are looking for outstanding engineers to build CUTLASS DSL, a Python-native language for high-performance GPU kernel development, along with the MLIR dialects and lowering passes behind it.

Responsibilities:

  • Design, develop, and optimize CUTLASS DSL, a Python-native language for high-performance GPU kernel development
  • Build and advance the MLIR dialects, lowering passes, and code generation flows that power the CUTLASS DSL stack
  • Drive innovations that improve kernel compilation speed while maintaining performance on par with CUTLASS C++

Requirements:

  • MS, PhD, or equivalent experience in Computer Science, Software Engineering, or a related field
  • 2+ years of relevant work experience
  • Excellent programming skills in Python and strong proficiency in C++
  • Hands-on experience with DSLs, compilers, or code generation systems
  • Strong command of the MLIR/LLVM stack, including IR design and pass optimization

Preferred Qualifications:

  • Deep understanding of the CUDA GPU programming model, GPU microarchitecture, and performance analysis and optimization techniques
  • Familiarity with key high-performance computing abstractions such as Layout, Tile, MMA, and TMA in the CuTe ecosystem