Description
We are seeking highly motivated and skilled systems engineers to join our team and help develop an AI Platform that offers efficient infrastructure for training and inference of large-scale models.
As a systems engineer, you will play a crucial role in building a unified solution that brings our innovative NVIDIA technologies, such as high-performance inference/training frameworks, ML compilers, a performance predictor, and a cluster scheduler, into a single, cohesive platform.
Responsibilities:
- Take part in the development of NVIDIA's AI platform for training, fine-tuning, and serving the latest AI models with the best performance and efficiency.
- Design and build solutions for scheduling large-scale AI training and inference workloads on GPU clusters across many cloud infrastructures.
- Explore and find solutions for open problems like industry-scale resource management, GPU scheduling, performance prediction, and live workload migration.
- Work with and contribute to adjacent teams such as the TensorRT/Dynamo inference engine, ML compiler, KAI/Grove scheduler, and Lepton cloud teams.
Requirements:
- Bachelor's degree or equivalent experience in Computer Science, Computer Engineering, or a relevant technical field.
- 5+ years of experience.
- Experience building large-scale systems from scratch. Prior experience with container-based deployment systems like Kubernetes is beneficial.
- Strong coding skills in programming languages like Python, Go, Rust, and/or C/C++.
- Solid foundation in other computer science and computer engineering topics: algorithms and data structures, operating systems, computer architecture, etc. Strong understanding of AI and related technologies is a huge plus.
- Ability to quickly grasp new concepts and thrive in evolving situations.
Ways to stand out from the crowd:
- Graduate-level education or relevant practical background, particularly in research, is beneficial.
- Practical experience in building and optimizing AI applications is highly desired.
- Proficiency in container software (such as containerd, CRI-O, Linux namespaces, and CRIU) and NVIDIA GPU technology (such as CUDA graphs and the driver/runtime) is greatly advantageous.
You will also be eligible for equity and benefits.
To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/Canada-Toronto/DL-System-Software-Engineer---AI-Platform_JR2002456