Description
NVIDIA is seeking a Solutions Architect, Networking to help design and deploy large-scale AI Factories across Canada. In this role, you will collaborate with customers to build end-to-end infrastructure. You will become a trusted technical advisor working on exciting projects, focused on how high-performance networking enables generative AI, large language models, and production AI inference pipelines. You will also collaborate with a diverse set of internal engineering, product, and business teams on performance analysis and modeling of these large GPU clusters. You should be comfortable working in a dynamic environment and have hands-on experience with NVIDIA networking and GPU technologies. This is an excellent opportunity to be at the center of Canada's rapidly growing AI infrastructure landscape.
Key Responsibilities:
- Become the trusted technical advisor for NVIDIA Cloud Partners in Canada to rapidly bring NVIDIA Data Center GPU and networking platforms to market at scale.
- Collaborate directly with customers to build, deploy, and optimize large-scale AI training and inference infrastructure using NVIDIA technology.
- Analyze deployment and performance data to identify product health trends, system bottlenecks, and operational risks.
- Solve challenging technical problems involving GPUs, networking, drivers, containers, firmware, and distributed system interactions.
- Deliver streamlined executive-level communication on status, risks, progress, and required decisions.
Requirements:
- Bachelor's degree in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or a related field (or equivalent experience).
- 5+ years of experience in Solution Architecture (or a similar role in Sales Engineering, Systems Engineering, Cloud Engineering, or Solution Engineering).
- Understanding of high-performance networking technologies (e.g., RDMA, congestion control, high-bandwidth interconnects) and their role in distributed AI workloads.
- Hands-on experience with bring-up and validation of large-scale NVIDIA GPU platforms, including multi-GPU and multi-node architectures.
- Familiarity with NVIDIA system software stacks: CUDA, NCCL, NVSwitch/NVLink, driver behavior, and performance tuning.
- Ability to identify performance bottlenecks at the cluster, node, accelerator, network, or application layer.
- Strong Linux fundamentals across drivers, kernel subsystems, cgroups, containers, and node-level performance analysis.
- Excellent presentation, communication, and collaboration skills.
Nice to Have:
- Prior experience deploying or optimizing deep learning training and inference at scale in production environments on large GPU clusters.
- Familiarity with NVIDIA hardware (such as GPUs, networking, storage) and systems technology such as NCCL, DCGM, UFM, Mission Control, Base Command Manager.
- Demonstrated leadership resolving multi-team infrastructure challenges across engineering, product, and customer groups.
- A consistent record of taking GPU or infrastructure products from pilot to high-volume deployment in large data center environments.