# System Software Engineer, Platform Operations

**Company**: NVIDIA
**Location**: Shanghai
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/China-Shanghai/System-Software-Engineer--Platform-Operations_JR2012011?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_3d698cbb-386

## Description

We're seeking an operationally focused System Software Engineer to ensure the stability, reliability, and flawless execution of all NVIDIA Deep Learning Institute (DLI) training events. You will also oversee the day-to-day operational health of the entire learning platform. Your operational acumen will be instrumental in powering our latest educational experiences focused on safe, trustworthy, and ethical AI, ensuring a seamless experience for instructors and learners.

Join a close-knit team where your contributions truly matter. As a core member of our learning systems platform team, you'll collaborate with creative educators to ensure our hands-on training sets the standard for user experience. You'll play a crucial role in making our purpose-built Learning Management System (LMS) platform a delightful and efficient tool that empowers both learners and instructors.

**What you'll be doing:**

- Develop comprehensive operational plans and de-risking strategies to ensure flawless execution of technical training events.

- Provide expert, hands-on technical leadership during live training events, managing deployments and rapidly resolving emergent issues for an optimal user experience.

- Oversee the stability, scalability, and reliability of the DLI learning platform, implementing SRE principles and leading incident response to maintain optimal performance.

- Lead cross-functional coordination, establish and enforce operational best practices, and drive continuous improvement initiatives to enhance platform services.

**What we need to see:**

- Bachelor's degree in Computer Science or a related technical field, or equivalent experience. Over 5 years of DevOps experience optimizing, deploying, and running containerized applications (Docker, Kubernetes) across AWS, Azure, and GCP, including hands-on work with EKS, AKS, and GKE.

- Proficiency in Python and Linux shell scripting for automation, application development, system administration, and problem resolution.

- Validated experience architecting, implementing, and managing cloud infrastructure using Terraform.

- Demonstrated ability as a meticulous problem-solver with strong analytical skills, capable of diagnosing and resolving complex technical challenges under pressure.

- Excellent communication, teamwork, and collaboration skills, with an ability to articulate technical concepts clearly to diverse audiences and lead technical responses during incidents.

**Ways to stand out from the crowd:**

- Proven experience designing and implementing event-driven architectures using pub/sub patterns with platforms like AWS SNS/SQS, Google Pub/Sub, or Azure Service Bus.

- Knowledge of generative AI architectures (LLMs, diffusion models) and concepts such as Retrieval Augmented Generation (RAG) and vector databases.

- Hands-on experience with the NVIDIA AI stack (NeMo, Triton Inference Server, TensorRT) for model development, serving, and optimization. Production experience with NVIDIA NIM is a strong plus.

- Experience building and running CI/CD pipelines (Jenkins, GitLab CI) and managed software development environments, applying SRE principles to automate operations, enhance reliability, and improve performance.

- Familiarity with Python-based Learning Management Systems (LMS) such as Open edX.

## Skills

### Required
- DevOps
- Docker
- Kubernetes
- AWS
- Azure
- GCP
- EKS
- AKS
- GKE
- Python
- Linux shell scripting
- Terraform
- Cloud infrastructure

### Nice to have
- Event-driven architectures
- Pub/sub patterns
- Generative AI architectures
- LLMs
- Diffusion models
- Retrieval Augmented Generation
- Vector databases
- NVIDIA AI stack
- NeMo
- Triton Inference Server
- TensorRT
- CI/CD pipelines
- Managed software development environments
- SRE principles

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/China-Shanghai/System-Software-Engineer--Platform-Operations_JR2012011?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
