# Senior Systems Software Engineer, Kubernetes Node Lifecycle - DGX Cloud

**Company**: NVIDIA
**Location**: Santa Clara, CA
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Systems-Software-Engineer--Kubernetes-Node-Lifecycle---DGX-Cloud_JR2019395?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_5b24025b-b0a

## Description

NVIDIA is looking for a Senior Systems Software Engineer with strong experience in Kubernetes node engineering, OS image packaging, and cloud infrastructure. The ideal candidate has deep, hyperscaler-level knowledge of the entire node lifecycle, including Cluster API (CAPI) providers, bring-your-own-node onboarding, OS image build pipelines, packaging, and nodepool management.

This role centers on node provisioning and lifecycle for NVIDIA Kubernetes Engine (NKE) across DGX Cloud and NCP environments.

Key responsibilities include:

- Leading the development and refinement of CAPI providers for NVIDIA Kubernetes Engine

- Maintaining reliable, consistent, and scalable node provisioning across DGX Cloud and NCP environments

- Developing and maintaining bring-your-own-node workflows that allow customers to integrate different NVIDIA hardware into NKE clusters

- Coordinating OS image generation, packaging, deployment, and update processes for NKE nodes

In addition, the successful candidate will:

- Handle nodepool lifecycle at scale, including provisioning, upgrades, drain and cordon workflows, and seamless node replacement across very large clusters with diverse NVIDIA hardware

- Investigate and resolve node-layer faults in production NKE clusters and determine their root causes

- Partner with upstream communities, including Cluster API, Kubernetes, and CNCF projects, to establish node provisioning and lifecycle standards that meet NKE requirements

Requirements include:

- 8 years of experience in systems software, cloud infrastructure, or Kubernetes node engineering

- Bachelor’s or Master’s degree in Engineering (Electrical, Computer Engineering, Computer Science) or equivalent experience

- Deep expertise in Cluster API (CAPI), including provider development and full machine lifecycle from provisioning to deletion

- Extensive experience with OS image build pipelines, node image packaging, and delivery systems for Kubernetes nodes

- Practical experience with bring-your-own-node models and integrating diverse hardware into live Kubernetes environments

- Strong understanding of kubelet configuration, node bootstrap, and the Kubernetes node registration lifecycle

- Experience with node image security, including vulnerability scanning, patch automation, and compliance gating as part of image build pipelines

- Proficiency in Golang and/or Python, and hands-on experience with at least one major public cloud provider

Preferred qualifications include:

- Direct experience building or maintaining node image pipelines for a hyperscaler Kubernetes distribution

- Experience with supply chain security and hardening for node images

- Experience with automated node provisioning and optimal sizing at scale

- Strong operational experience working with immutable OS image distributions

- Proven track record of upstream contributions to Cluster API, Kubernetes, or related CNCF projects

## Skills

### Required
- Kubernetes
- Cluster API
- OS image build pipelines
- Node image packaging
- Delivery systems for Kubernetes nodes
- Bring-your-own-node models
- Kubelet configuration
- Node bootstrap
- Kubernetes node registration lifecycle
- Node image security
- Vulnerability scanning
- Patch automation
- Compliance gating
- Golang
- Python
- Public cloud providers

### Nice to have
- Node image pipelines for a hyperscaler Kubernetes distribution
- Supply chain security
- Hardening for node images
- Automated node provisioning
- Optimal sizing at scale
- Immutable OS image distributions
- Upstream contributions to Cluster API, Kubernetes, or related CNCF projects

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Systems-Software-Engineer--Kubernetes-Node-Lifecycle---DGX-Cloud_JR2019395?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
