Description

We are seeking highly motivated and skilled systems engineers to join our team and help develop an AI Platform that offers efficient infrastructure for inference and training of large-scale models.

As a systems engineer, you will play a crucial role in building a unified solution that brings our innovative NVIDIA technologies, such as high-performance inference/training frameworks, ML compilers, the performance predictor, and the cluster scheduler, into a single, cohesive platform.

Responsibilities:

Take part in the development of NVIDIA's AI platform for training, fine-tuning, and serving the latest AI models with the best performance and efficiency.

Design and build solutions for scheduling large-scale AI training and inference workloads on GPU clusters across multiple cloud infrastructures.

Explore and find solutions for open problems like industry-scale resource management, GPU scheduling, performance prediction, and live workload migration.

Work with and contribute to adjacent teams, including TensorRT/Dynamo inference engine, ML compiler, KAI/Grove scheduler, and Lepton cloud.

Requirements:

Bachelor's degree or equivalent experience in Computer Science, Computer Engineering, or a relevant technical field.

5+ years of experience.

Experience building large-scale systems from scratch. Prior experience with container-based deployment systems like Kubernetes is beneficial.

Strong coding skills in programming languages like Python, Go, Rust, and/or C/C++.

Solid foundation in other computer science and computer engineering topics: algorithms and data structures, operating systems, computer architecture, etc. Strong understanding of AI and related technologies is a huge plus.

Ability to quickly grasp new concepts and thrive in evolving situations.

Ways to stand out from the crowd:

Graduate-level education or relevant practical background, particularly in research, is beneficial.

Practical experience in building and optimizing AI applications is highly desired.

Proficiency in container software such as containerd, CRI-O, Linux namespaces, and CRIU, and in NVIDIA GPU technology such as CUDA graphs and the driver/runtime, is greatly advantageous.

You will also be eligible for equity and benefits.

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/Canada-Toronto/DL-System-Software-Engineer---AI-Platform_JR2002456