# Senior Service Reliability Engineer - EDA Infrastructure

**Company**: NVIDIA
**Location**: Bengaluru
**Work arrangement**: hybrid
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/India-Bengaluru/Senior-Service-Reliability-Engineer---EDA-Infrastructure_JR2016779-1?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_26ec81d7-7a7

## Description

We are seeking a Senior Service Reliability Engineer to join our team. As a key member of our Service Reliability Operations Center, you will be responsible for ensuring the scalability, resilience, and near 100% availability of our global hardware infrastructure. You will collaborate with SRE, Security, and DevOps teams to improve reliability, reduce incident frequency and impact, and drive rapid resolution when issues occur. You will also partner with development teams to implement monitoring, alerting, and observability solutions that proactively detect issues and enhance the customer experience.

Key Responsibilities:

- Operate in a 24/7 follow-the-sun support model spanning multiple continents, reporting directly to a U.S.-based manager

- Work a 4-day, 10-hour schedule, including either Saturday or Sunday, with flexible early or late shifts to ensure continuous global coverage across U.S. and India teams

- Monitor and manage large-scale production compute and storage environments to ensure high availability and performance

- Utilize alerts, alarms, and observability tools to proactively detect, prevent, and respond to incidents

- Apply deep systems knowledge to analyze logs, metrics, and system behavior to diagnose issues, identify root causes, and implement effective resolutions

Requirements:

- Highly motivated, with strong communication skills and the ability to work successfully with multi-functional teams, principals, and architects, coordinating effectively across organizational boundaries and geographies

- 5+ years of experience administering large-scale production systems

- 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC)

- BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience

- Expert-level knowledge of Linux system administration and automation using Ansible and/or Python

Preferred Qualifications:

- Advanced hands-on experience with Kubernetes, SLURM, and large-scale cluster management

- Familiarity with GPU hardware and high-performance computing environments

- Experience with observability and incident management tools (Grafana, OpenTelemetry, PagerDuty, JIRA)

- Cloud experience (AWS, Azure, GCP) is a plus, though on-prem expertise is strongly preferred

## Skills

### Required
- Linux system administration
- Ansible
- Python
- Kubernetes
- SLURM
- large-scale cluster management
- GPU hardware
- high-performance computing environments
- observability and incident management tools

### Nice to have
- cloud experience
- AWS
- Azure
- GCP

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/India-Bengaluru/Senior-Service-Reliability-Engineer---EDA-Infrastructure_JR2016779-1?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
