Description

We are looking for a Senior Cloud Infrastructure/DevOps Solutions Architect to join our NVIDIA Infrastructure Specialist Team. In this role, you will design, implement, and maintain large-scale cloud infrastructure and DevOps solutions. You will analyze, define, and implement large-scale networking projects that combine networking, system design, and automation. You will work with customers, partners, and internal teams to ensure seamless delivery of our solutions.

Key Responsibilities:

Maintain large-scale HPC/AI clusters with monitoring, logging, and alerting
Manage Linux job/workload schedulers and orchestration tools
Develop and maintain continuous integration and delivery pipelines
Develop tooling to automate deployment and management of large-scale infrastructure environments
Deploy monitoring solutions for servers, network, and storage
Perform troubleshooting from bare metal to application level
Develop, refine, and document standard methodologies to share with internal teams

Requirements:

Bachelor's degree in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields
At least 8 years of professional experience with networking fundamentals, the TCP/IP stack, and data center architecture
Knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software
Extensive knowledge and hands-on experience with Kubernetes, including container orchestration for AI/ML workloads, resource scheduling, scaling, and integration with HPC environments
Experience in managing and installing HPC clusters, including deployment, optimization, and troubleshooting
Experience with job scheduling and workload orchestration technologies such as Slurm, Kubernetes, and Singularity
Excellent knowledge of Windows and Linux systems, including internals, ACLs, OS-level security protections, and common protocols such as TCP, DHCP, and DNS
Experience with multiple storage solutions, including Lustre, GPFS, ZFS, and XFS
Proficiency in Python programming and bash scripting
Knowledge of CI/CD pipelines for software deployment and automation
Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef

Preferred Qualifications:

Knowledge of CPU and/or GPU architecture
Knowledge of Kubernetes and container-related microservice technologies
Experience with GPU-focused hardware/software (DGX, CUDA)
Background with RDMA (InfiniBand or RoCE) fabrics

To apply, use the employer's original posting: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/Japan-Remote/Senior-Solutions-Architect--Cloud-Infrastructure-and-DevOps---NVIS_JR1997336