# Deep Learning Performance Architect, CUTLASS DSL

**Company**: NVIDIA
**Location**: Shanghai
**Work arrangement**: onsite
**Experience**: mid
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/China-Shanghai/Deep-Learning-Performance-Architect--CUTLASS-DSL_JR2018773?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_2ae33630-365

## Description

Are you passionate about programming languages, compiler technology, and GPU performance? We are looking for outstanding engineers to build CUTLASS DSL, a Python-native language for GPU kernel development, along with the MLIR dialects and lowering passes behind it.

Responsibilities:

- Design, develop, and optimize CUTLASS DSL, a Python-native language for high-performance GPU kernel development

- Build and advance the MLIR dialects, lowering passes, and code generation flows that power the CUTLASS DSL stack

- Drive innovations that improve kernel compilation speed while maintaining performance on par with CUTLASS C++

Requirements:

- MS, PhD, or equivalent experience in Computer Science, Software Engineering, or a related field

- 2+ years of relevant work experience

- Excellent programming skills in Python and strong proficiency in C++

- Hands-on experience with DSLs, compilers, or code generation systems

- Strong command of the MLIR/LLVM stack, including IR design and pass optimization

Preferred Qualifications:

- Deep understanding of the CUDA GPU programming model, GPU microarchitecture, and performance analysis and optimization techniques

- Familiarity with key high-performance computing abstractions such as Layout, Tile, MMA, and TMA in the CuTe ecosystem

## Skills

### Required
- Python
- C++
- MLIR
- LLVM
- GPU kernel development
- Compiler technology
- Performance analysis and optimization

### Nice to have
- CUDA
- GPU microarchitecture
- Layout
- Tile
- MMA
- TMA

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/China-Shanghai/Deep-Learning-Performance-Architect--CUTLASS-DSL_JR2018773?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
