# Senior Solutions Architect, AI Cluster Performance and Telemetry

**Company**: NVIDIA
**Location**: Santa Clara
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Solutions-Architect--AI-Cluster-Performance-and-Telemetry_JR2019329?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_93173b43-49e

## Description

We are looking for a Senior Solutions Architect specializing in Data Center Systems & Performance to join our elite solutions architecture team. In this role, you will work at the intersection of groundbreaking hardware and complex software stacks, acting as a key technical expert who connects engineering, field teams, and customers with highly demanding requirements. You will be responsible for analyzing and optimizing the performance of world-class AI, deep learning, and HPC ecosystems.

**Responsibilities:**

- Work together with our partners and customers to identify, analyze, and resolve complex performance bottlenecks across interconnected GPU, CPU, and networking systems.

- Build and maintain robust performance benchmarking suites that stress-test high-performance clusters and establish performance baselines.

- Apply industry-standard performance tools to monitor hardware performance counters and extract deep system telemetry.

- Deeply investigate system and software configurations to find and fix subtle discrepancies that impact peak performance.

- Partner closely with internal engineering teams, external collaborators, and customers to jointly develop solutions and improve infrastructure performance.

**Requirements:**

- BS or MS in Engineering, Electrical Engineering, Physics, or Computer Science (or equivalent experience).

- 8+ years of relevant experience in the high-tech industry, particularly in system build, performance analysis, and technical customer-facing roles.

- A strong understanding of how CPUs, GPUs, and high-speed networking fabrics interact within massive clusters.

- Practical experience with performance counters, profiling tools, and telemetry collection systems (e.g., Perf, eBPF, Prometheus, Grafana).

- Practical experience working with containers, cloud provisioning, and scheduling tools such as Docker, Docker Swarm, Kubernetes, SLURM, and Ansible.

- Proven track record of transforming raw logs and telemetry into structured time series data, dashboards, and heat maps.

- The ability to translate complex, low-level technical performance anomalies into clear, actionable narratives for cross-functional teams.

- Strong collaborative skills and a proven history of building successful relationships across diverse engineering and operations teams.

**Preferred Qualifications:**

- Deep knowledge of multi-GPU communication libraries like NCCL, and how they optimize inter-GPU topologies.

- Deep, hands-on experience working directly with NVIDIA hardware architectures, NVLink, NVSwitch, or NVIDIA Nsight tools.

- Practical experience optimizing distributed AI training workloads, LLMs, or large-scale high-performance computing environments.

- Experience developing or integrating Agentic AI frameworks to autonomously parse telemetry logs, diagnose configuration drifts, or automate cluster triage.

This position is eligible for equity and benefits.

## Skills

### Required
- performance counters
- profiling tools
- telemetry collection systems
- containers
- cloud provisioning
- scheduling tools
- Docker
- Docker Swarm
- Kubernetes
- SLURM
- Ansible
- raw logs
- structured time series data
- dashboards
- heat maps

### Preferred
- multi-GPU communication libraries
- NCCL
- NVIDIA hardware architectures
- NVLink
- NVSwitch
- NVIDIA Nsight tools
- distributed AI training workloads
- LLMs
- large-scale high-performance computing environments
- Agentic AI frameworks

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Solutions-Architect--AI-Cluster-Performance-and-Telemetry_JR2019329?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
