Description

As a member of the Hardware Infrastructure Farm team, you will provide leadership in the design and implementation of groundbreaking compute clusters that power all silicon development across NVIDIA. You will be responsible for building and operating these clusters for high reliability, efficiency, and performance, and for driving foundational improvements and automation that raise engineers' productivity.

Your responsibilities will include troubleshooting incoming support requests in a large-scale HPC environment; contributing enhancements to existing deployment automation, configuration management, observability, and operational monitoring; and ensuring compute servers are running the correct operating system and configuration.

You will also troubleshoot complex issues end to end, from bare metal to the application level, to ensure system reliability and efficiency, and collaborate with specialist teams to drive issues to closure.

Additionally, you will collaborate with domain experts to improve how our chip development process utilizes our infrastructure, directly contribute to the overall quality of our next-generation chips and improve their time to market, and ensure that our systems work together in a way that supports efficient and effective operations.

We are looking for someone who is proficient in administering CentOS/RHEL Linux distributions, understands container technologies such as Docker, is proficient in Python and UNIX scripting languages such as bash, and has excellent problem-solving, communication, and teamwork skills.

You should have a BS in Computer Science or a similar degree, 2+ years of relevant post-degree experience, and a solid understanding of cluster configuration management tools such as Ansible. Familiarity with job scheduler administration, knowledge of key Linux technologies, and experience building and operating large-scale compute infrastructure are also desirable.

To apply, use the employer's original posting: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Site-Reliability-Engineer--HPC-and-LSF_JR2006583