Description
We are looking for a Senior Cloud Infrastructure/DevOps Solutions Architect to join our NVIDIA Infrastructure Specialist Team. In this role, you will design, implement, and maintain large-scale cloud infrastructure and DevOps solutions. You will apply your expertise to analyze, define, and implement large-scale networking projects that combine networking, system design, and automation. You will work with customers, partners, and internal teams to ensure seamless delivery of our solutions.
Key Responsibilities:
- Maintain large-scale HPC/AI clusters with monitoring, logging, and alerting
- Manage Linux job/workload schedulers and orchestration tools
- Develop and maintain continuous integration and delivery pipelines
- Develop tooling to automate deployment and management of large-scale infrastructure environments
- Deploy monitoring solutions for servers, network, and storage
- Perform troubleshooting from bare metal to application level
- Develop, refine, and document standard methodologies to share with internal teams
Requirements:
- Bachelor's degree in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields
- At least 8 years of professional experience with networking fundamentals, the TCP/IP stack, and data center architecture
- Knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software
- Extensive knowledge and hands-on experience with Kubernetes, including container orchestration for AI/ML workloads, resource scheduling, scaling, and integration with HPC environments
- Experience in managing and installing HPC clusters, including deployment, optimization, and troubleshooting
- Experience with job scheduling, orchestration, and container technologies such as Slurm, Kubernetes, and Singularity
- Excellent knowledge of Windows and Linux systems, including internals, ACLs, OS-level security protections, and common protocols such as TCP, DHCP, and DNS
- Experience with multiple storage solutions, including Lustre, GPFS, ZFS, and XFS
- Proficiency in Python programming and bash scripting
- Knowledge of CI/CD pipelines for software deployment and automation
- Comfortable with automation and configuration management tools such as Jenkins, Ansible, Puppet, and Chef
Preferred Qualifications:
- Knowledge of CPU and/or GPU architecture
- Knowledge of Kubernetes and container-related microservice technologies
- Experience with GPU-focused hardware/software (DGX, CUDA)
- Background with RDMA (InfiniBand or RoCE) fabrics
This listing is enriched and indexed by YubHub. To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/Japan-Remote/Senior-Solutions-Architect--Cloud-Infrastructure-and-DevOps---NVIS_JR1997336