Description
NVIDIA is looking for a Senior Solutions Architect to join its NVIDIA Infrastructure Specialist Team. The successful candidate will be responsible for building AI/HPC infrastructure for new and existing customers, supporting operational and reliability aspects of large-scale AI clusters, and engaging in the whole lifecycle of services from inception and design through deployment, operation, and refinement.
Primary responsibilities will include:
- Building AI/HPC infrastructure for new and existing customers
- Supporting operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, real-time monitoring, logging, and alerting
- Engaging in and improving the whole lifecycle of services, from inception and design through deployment, operation, and refinement
- Maintaining services once they are live by measuring and monitoring availability, latency, and overall system health
The ideal candidate will have:
- A BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields
- At least 5 years of professional experience in networking fundamentals, in Ethernet or InfiniBand environments
- Hands-on experience with network switch/router platforms such as Cumulus Linux, SONiC, IOS, Junos OS, and EOS
- Solid working knowledge of Ethernet/InfiniBand/RDMA core principles
- Proficiency in end-to-end IB/Eth cluster deployment, adapter configuration, and firmware maintenance, plus the ability to conduct professional performance benchmarking with mainstream RDMA testing tools
- Ability to independently diagnose and troubleshoot typical IB/Eth network anomalies, including link flapping, connection failures, and bandwidth and latency jitter
- Mastery of practical RDMA network optimization strategies such as QP tuning, MTU configuration, and congestion control optimization
- Hands-on working experience in RDMA-accelerated business scenarios, including distributed storage and high-performance computing clusters
- Extensive experience delivering automated network provisioning solutions using tools like Ansible, Salt, and Python
- Ability to develop CI/CD pipelines for network operations
Preferred qualifications include:
- Familiarity with cloud networks (AWS, GCP, Azure)
- Advanced Linux or networking certifications
- Experience with high-performance computing architectures and an understanding of how job schedulers (Slurm, PBS) work
- Knowledge of Lustre management technologies (bonus credit for BCM, Base Command Manager)
- Experience with GPU (Graphics Processing Unit) focused hardware/software
NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry.
To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/India-Pune/Senior-Solutions-Architect--Infiniband-and-Networking-Ethernet---NVIS_JR2019584