NVIDIA

System Software Engineer, Platform Operations

Onsite, senior, full-time, Shanghai

First indexed 28 Apr 2026

Description

We're seeking an operationally focused System Software Engineer to ensure the stability, reliability, and flawless execution of all NVIDIA Deep Learning Institute (DLI) training events. You will also oversee the day-to-day operational health of the entire learning platform. Your operational acumen will be instrumental in powering our latest educational experiences focused on safe, trustworthy, and ethical AI, ensuring a seamless experience for instructors and learners.

Join a close-knit team where your contributions truly matter. As a core member of our learning systems platform team, you'll collaborate with creative educators to ensure our hands-on training sets the standard for user experience. You'll play a crucial role in making our purpose-built Learning Management System (LMS) platform a delightful and efficient tool that empowers both learners and instructors.

What you'll be doing:

  • Develop comprehensive operational plans and de-risking strategies to ensure flawless execution of technical training events.
  • Provide expert, hands-on technical leadership during live training events, managing deployments and rapidly resolving emergent issues for an optimal user experience.
  • Oversee the stability, scalability, and reliability of the DLI learning platform, implementing SRE principles and leading incident response to maintain optimal performance.
  • Lead cross-functional coordination, establish and enforce operational best practices, and drive continuous improvement initiatives to enhance platform services.

What we need to see:

  • Bachelor's degree in Computer Science or a related technical field, or equivalent experience, plus over 5 years of DevOps experience optimizing, deploying, and running containerized applications (Docker, Kubernetes) across AWS, Azure, and GCP, including hands-on work with EKS, AKS, and GKE.
  • Proficient in Python and Linux shell scripting for automation, application development, system administration, and problem resolution.
  • Validated experience architecting, implementing, and managing cloud infrastructure using Terraform.
  • Demonstrated ability as a meticulous problem-solver with strong analytical skills, capable of diagnosing and resolving complex technical challenges under pressure.
  • Excellent communication, teamwork, and collaboration skills, with an ability to articulate technical concepts clearly to diverse audiences and lead technical responses during incidents.

Ways to stand out from the crowd:

  • Proven experience designing and implementing event-driven architectures using pub/sub patterns with platforms like AWS SNS/SQS, Google Cloud Pub/Sub, or Azure Service Bus.
  • Knowledge of generative AI architectures (LLMs, diffusion models) and concepts such as Retrieval-Augmented Generation (RAG) and vector databases.
  • Hands-on experience with the NVIDIA AI stack (NeMo, Triton Inference Server, TensorRT) for model development, serving, and optimization. Production experience with NVIDIA NIM is a strong plus.
  • Experience building and running CI/CD pipelines (Jenkins, GitLab CI) and managed software development environments, applying SRE principles to automate operations, enhance reliability, and improve performance.
  • Familiarity with Python-based Learning Management Systems (LMS) such as Open edX.