NVIDIA

Senior Solutions Architect, Cloud Infrastructure and DevOps

Remote | Senior | Full-time | Competitive salary and benefits package | Japan

First indexed 29 May 2026

Description

We are looking for a Senior Cloud Infrastructure/DevOps Solutions Architect to join the NVIDIA Infrastructure Specialist Team. In this role, you will design, implement, and maintain large-scale cloud infrastructure and DevOps solutions. You will analyze, define, and implement large-scale networking projects that combine networking, system design, and automation. You will work with customers, partners, and internal teams to ensure seamless delivery of our solutions.

Key Responsibilities:

  • Maintain large-scale HPC/AI clusters with monitoring, logging, and alerting
  • Manage Linux job/workload schedulers and orchestration tools
  • Develop and maintain continuous integration and delivery pipelines
  • Develop tooling to automate deployment and management of large-scale infrastructure environments
  • Deploy monitoring solutions for servers, network, and storage
  • Perform troubleshooting from bare metal to application level
  • Develop, redefine, and document standard methodologies to share with internal teams

Requirements:

  • Bachelor's degree in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields
  • At least 8 years of professional experience with networking fundamentals, the TCP/IP stack, and data center architecture
  • Knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software
  • Extensive knowledge and hands-on experience with Kubernetes, including container orchestration for AI/ML workloads, resource scheduling, scaling, and integration with HPC environments
  • Experience in managing and installing HPC clusters, including deployment, optimization, and troubleshooting
  • Experience with job scheduling and workload orchestration technologies such as Slurm, Kubernetes, and Singularity
  • Excellent knowledge of Windows and Linux systems, including internals, ACLs, OS-level security protections, and common protocols such as TCP, DHCP, and DNS
  • Experience with multiple storage solutions, including Lustre, GPFS, ZFS, and XFS
  • Proficiency in Python programming and bash scripting
  • Knowledge of CI/CD pipelines for software deployment and automation
  • Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef

Preferred Qualifications:

  • Knowledge of CPU and/or GPU architecture
  • Knowledge of Kubernetes and container-related microservice technologies
  • Experience with GPU-focused hardware/software (DGX, CUDA)
  • Background with RDMA (InfiniBand or RoCE) fabrics