Description
We are looking for a Senior Solutions Architect specializing in Data Center Systems & Performance to join our solutions architecture team. In this role, you will work at the intersection of groundbreaking hardware and complex software stacks. As a Solutions Architect, you will act as a pivotal technical expert connecting engineering, field teams, and customers with highly demanding requirements. You will be responsible for analyzing and optimizing the performance of world-class AI, deep learning, and HPC ecosystems.
Responsibilities:
- Work together with our partners and customers to identify, analyze, and resolve complex performance bottlenecks across interconnected GPU, CPU, and networking systems.
- Develop and maintain robust performance benchmarking suites to stress-test high-performance clusters and establish performance baselines.
- Apply industry-standard performance tools to monitor hardware performance counters and extract deep system telemetry.
- Investigate system and software configurations in depth to find and fix subtle discrepancies that impact peak performance.
- Partner closely with internal engineering units, external collaborators, and customers to jointly develop solutions and boost infrastructure performance.
Requirements:
- BS or MS in Engineering, Electrical Engineering, Physics, or Computer Science (or equivalent experience).
- 8+ years of work-related experience in the high-tech industry, particularly in system build, performance analysis, and technical customer-facing roles.
- A strong understanding of how CPUs, GPUs, and high-speed networking fabrics interact within massive clusters.
- Practical experience with performance counters, profiling tools, and telemetry collection systems (e.g., perf, eBPF, Prometheus, Grafana).
- Practical experience working with containers, cloud provisioning, and scheduling tools such as Docker, Docker Swarm, Kubernetes, SLURM, and Ansible.
- Proven track record of transforming raw logs and telemetry into structured time series data, dashboards, and heat maps.
- The ability to translate complex, low-level technical performance anomalies into clear, actionable narratives for cross-functional teams.
- Strong collaborative skills and a proven history of building successful relationships across diverse engineering and operations teams.
Preferred Qualifications:
- Deep knowledge of multi-GPU communication libraries like NCCL, and how they optimize communication across inter-GPU topologies.
- Deep, hands-on experience working directly with NVIDIA hardware architectures, NVLink, NVSwitch, or NVIDIA Nsight tools.
- Practical experience optimizing distributed AI training workloads, LLMs, or large-scale high-performance computing environments.
- Experience developing or integrating Agentic AI frameworks to autonomously parse telemetry logs, diagnose configuration drifts, or automate cluster triage.
This position is eligible for equity and benefits.
To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Solutions-Architect--AI-Cluster-Performance-and-Telemetry_JR2019329