# Staff Software Engineer - GenAI Performance and Kernel

**Company**: Databricks
**Location**: San Francisco, California
**Work arrangement**: onsite
**Experience**: staff
**Job type**: full-time
**Salary**: $190,900-$232,800 USD per year
**Category**: Engineering
**Industry**: Technology
**Wikidata**: https://www.wikidata.org/wiki/Q18350420

**Apply**: https://job-boards.greenhouse.io/databricks/jobs/8202700002
**Canonical**: https://yubhub.co/jobs/job_faffae87-882

## Description

As a staff software engineer for GenAI Performance and Kernel, you will own the design, implementation, optimization, and correctness of the high-performance GPU kernels powering our GenAI inference stack. You will lead development of highly tuned, low-level compute paths, manage trade-offs between hardware efficiency and generality, and mentor others in kernel-level performance engineering.

Key responsibilities include:

- Leading the design, implementation, benchmarking, and maintenance of core compute kernels optimized for various hardware backends (GPU, accelerators)

- Driving the performance roadmap for kernel-level improvements: vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, auto-tuning, etc.

- Integrating kernel optimizations with higher-level ML systems

- Building and maintaining profiling, instrumentation, and verification tooling to detect correctness, performance regressions, numerical issues, and hardware utilization gaps

- Leading performance investigations and root-cause analysis on inference bottlenecks, e.g. memory bandwidth, cache contention, kernel launch overhead, tensor fragmentation

- Establishing coding patterns, abstractions, and frameworks to modularize kernels for reuse, cross-backend portability, and maintainability

- Influencing system architecture decisions to make kernel improvements more effective (e.g. memory layout, dataflow scheduling, kernel fusion boundaries)

- Mentoring and guiding other engineers working on lower-level performance, providing code reviews, and helping set best practices

- Collaborating with infrastructure, tooling, and ML teams to roll out kernel-level optimizations into production, and monitoring their impact
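As a small, CPU-side illustration of the tiling and blocking ideas named in the performance roadmap above, the sketch below blocks a matrix multiply so each inner product works on cache-sized tiles. It is a toy NumPy analogue of GPU shared-memory tiling, not part of the role description; the function name and tile size are illustrative.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: iterate over tile-sized sub-blocks so each
    working set stays small -- the same blocking idea behind shared-memory
    tiling in GPU kernels. Illustrative sketch only."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, tile):          # tile over rows of A / C
        for j in range(0, N, tile):      # tile over columns of B / C
            for k in range(0, K, tile):  # accumulate over the shared dimension
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C
```

On a GPU the same loop structure maps tiles of `A` and `B` into shared memory and assigns each `(i, j)` tile to a thread block; choosing the tile size is the occupancy-versus-reuse trade-off the role calls out.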

Requirements include:

- BS/MS/PhD in Computer Science or a related field

- Deep hands-on experience writing and tuning compute kernels (CUDA, Triton, OpenCL, LLVM IR, assembly, or similar) for ML workloads

- Strong knowledge of GPU/accelerator architecture: warp structure, memory hierarchy (global, shared, register, L1/L2 caches), tensor cores, scheduling, SM occupancy, etc.

- Experience with advanced optimization techniques: tiling, blocking, software pipelining, vectorization, fusion, loop transformations, auto-tuning

- Familiarity with ML-specific kernel libraries (cuBLAS, cuDNN, CUTLASS, oneDNN, etc.) or open-source kernel implementations

- Strong debugging and profiling skills (Nsight, nvprof, perf, VTune, custom instrumentation)

- Experience reasoning about numerical stability, mixed precision, quantization, and error propagation

- Experience in integrating optimized kernels into real-world ML inference systems; exposure to distributed inference pipelines, memory management, and runtime systems

- Experience building high-performance products leveraging GPU acceleration

- Excellent communication and leadership skills; able to drive design discussions, mentor colleagues, and make trade-offs visible

- A track record of shipping performance-critical, high-quality production software

- Bonus: publications in systems/ML performance venues (e.g. MLSys, ASPLOS, ISCA, PPoPP); experience with custom accelerators or FPGAs; experience with sparsity or model compression techniques
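To make the quantization and error-propagation reasoning in the requirements concrete, here is a minimal symmetric int8 round-trip in NumPy. It is an illustrative sketch, not a description of Databricks' stack; the function names and the per-tensor scaling choice are assumptions for the example.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: pick a scale so the largest
    magnitude in x maps to 127, then round to integers. Illustrative sketch."""
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to float32 using the stored scale."""
    return q.astype(np.float32) * scale

x = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
q, s = quantize_int8(x)
err = np.abs(dequantize(q, s) - x)
# Rounding error for in-range values is bounded by half a quantization step.
assert np.all(err <= s / 2 + 1e-7)
```

The half-step error bound is the starting point for reasoning about how quantization noise propagates through matmuls and accumulations in a mixed-precision inference path.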

The pay range for this role is $190,900-$232,800 USD per year, depending on location and experience.

## Skills

### Required
- Compute kernels
- GPU/accelerator architecture
- Advanced optimization techniques
- ML-specific kernel libraries
- Debugging and profiling skills
- Numerical stability
- Mixed precision
- Quantization
- Error propagation
- Distributed inference pipelines
- Memory management
- Runtime systems
- High-performance products
- GPU acceleration
