# Senior HPC Cluster Engineer

**Company**: NVIDIA
**Location**: Santa Clara
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-HPC-Cluster-Engineer_JR2014289?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_1bcdc0d4-53c

## Description

We are seeking a highly skilled and experienced HPC Cluster Engineer to design, deploy, and operate GPU Compute Clusters for EDA (Electronic Design Automation) and high-performance computing workloads used across multiple teams and projects.

As an HPC Cluster Engineer, you will be responsible for developing and enhancing our ecosystem around GPU-accelerated computing, including building scalable automation solutions. You will continuously improve infrastructure provisioning, management, observability, and day-to-day operations through automation.

Key responsibilities include:

- Providing technical leadership and strategic guidance for managing large-scale HPC systems, including the deployment of compute, networking, and storage.

- Fostering strong customer and multi-functional partnerships to ensure consistent cluster support and rapidly adapt to evolving user needs.

- Supporting researchers in running their EDA workloads, including performance analysis and optimization.

- Conducting root cause analysis and recommending corrective actions. Proactively identifying and resolving potential issues before they affect users.

- Building innovative tooling to accelerate researchers' velocity, debugging, and software performance at scale.

Requirements include:

- Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.

- Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.

- Experience with AI/HPC job schedulers and orchestrators, such as Slurm, LSF, PBS, or K8s. Applied experience with AI/HPC workflows that use MPI and NCCL.

- Proficiency with Linux, including Rocky/CentOS/RHEL and/or Ubuntu distributions. A solid understanding of container technologies such as Enroot and Docker.

- Proficiency in Python and Bash.

- Experience analysing and tuning performance for a variety of EDA workloads. Excellent problem-solving skills, with the ability to analyse complex systems, identify bottlenecks, and implement scalable solutions.

- Excellent communication and collaboration skills, with the ability to work effectively with various teams and individuals.

- Passion for continual learning and for staying ahead of new technologies and effective approaches in the HPC infrastructure field.

## Skills

### Required
- HPC Cluster Engineer
- GPU Compute Clusters
- EDA Workloads
- High-Performance Computing
- Linux
- Python
- Bash
- Slurm
- LSF
- PBS
- K8s
- MPI
- NCCL
- Enroot
- Docker

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-HPC-Cluster-Engineer_JR2014289?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
