# Senior Systems Software Engineer, Data Center Infrastructure Management - EngOps

**Company**: NVIDIA
**Location**: Austin
**Work arrangement**: remote
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-TX-Remote/Senior-Systems-Software-Engineer--Data-Center-Infrastructure-Management---EngOps_JR2015514?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_9d81bab5-057

## Description

Join our team of innovative engineers who develop and maintain software facilitating GPU communication, driving groundbreaking solutions in High Performance Computing and Deep Learning.

We are seeking a highly motivated EngOps Engineer (5+ years of experience) to join our advanced infrastructure software team. In this role, you will be responsible for maintaining high-performance, rack-scale management solutions for datacenter environments. You will work directly with our Infrastructure Service software development team to support the deployment and debugging of our hardware and Infrastructure Manager.

Responsibilities:

- Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal cluster availability and performance.

- Manage updates to the site controller management nodes.

- Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions.

Requirements:

- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.

- 5+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure.

- Experience with deployment and configuration of operating systems, computer networks, and high-performance applications.

- Proven ability to work effectively with developers and test engineers across different teams and time zones.

- Experience deploying services in Kubernetes.

- Datacenter or computer architecture experience is required; you should understand server, rack, and network topologies, as well as hardware/firmware/software interactions.

- Background with hardware management protocols and controllers (Redfish, IPMI, BMC) and firmware update automation.

- Experience configuring and debugging complex data center networks.

- Experience developing scripts to automate recovery actions for management controllers and datacenter systems.

Ways to stand out from the crowd:

- Direct experience with industry-standard alerting tools and emergency response practices. Experience with observability tools such as Grafana.

- Hands-on experience with GPU-focused hardware and software, such as DGX systems and Compute Clusters.

- Proficiency in designing large-scale networking technologies and handling the associated challenges. Experience with OpenStack and Foreman.

## Skills

### Required
- deployment and administration of clusters
- servers
- switches
- operating systems
- computer networks
- high-performance applications
- Kubernetes
- datacenter architecture
- hardware management protocols
- firmware update automation
- complex data center networks
- scripting

### Nice to have
- GPU-focused hardware and software
- DGX systems
- Compute Clusters
- OpenStack
- Foreman
- Grafana

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-TX-Remote/Senior-Systems-Software-Engineer--Data-Center-Infrastructure-Management---EngOps_JR2015514?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
