# Senior Site Reliability Engineer, AIOps

**Company**: NVIDIA
**Location**: Santa Clara
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Salary**: $100,000–$150,000
**Category**: Engineering
**Industry**: Technology

**Apply**: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-DevOps-Engineer--AIOPs_JR2012791?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_8eea40aa-3b1

## Description

We're hiring a Senior Site Reliability Engineer to join our team of innovative engineers who are building an AI Data Center AIOps platform. As a Senior Site Reliability Engineer, you will be responsible for ensuring the reliability and performance of our platform, which turns raw, high-volume telemetry into reliable, job-centric insights and automation for GPU fleets.

**Key Responsibilities:**

- Continuously monitor platform health via dashboards/logs/metrics, automate recurring checks, and keep reliability + resource efficiency on track.

- Own Kubernetes deployments end-to-end (runbooks, canary checks, post-deploy validation), and lead rollbacks/remediations when needed.

- Lead first-level incident triage: collect diagnostics, identify likely root causes, and hand off clear, actionable findings to engineering.

- Build and maintain runbooks/SOPs/checklists, pushing continuous improvement through automation.

- Manage deployment infrastructure and packaging (Helm + Terraform/IaC) to keep environments scalable, consistent, and reproducible.

**Requirements:**

- BS/MS in CS/CE (or equivalent experience) and 5+ years operating production distributed systems as SRE/DevOps/Platform Ops.

- Proven ownership of reliability for an observability/AIOps platform: SLOs/SLIs, on-call, incident response, and post-incident reviews that drive measurable improvements.

- Deep Kubernetes + containers experience (deploying, debugging, scaling) for telemetry-heavy microservices spanning ingestion, processing, storage, APIs, and UI.

- Automation-first approach: solid scripting (Python/Bash), CI/CD, and infrastructure-as-code (Terraform + Helm) to deliver safe rollouts (canaries/rollbacks), reproducible environments, and minimal toil.

**Nice to Have:**

- Strong Linux + networking fundamentals, distributed systems instincts, and hands-on ops for Kubernetes/services/streaming stacks are ideal; bonus for experience with observability platforms at scale.

- Experience building safe automation that operators trust: canary releases, automated rollback criteria, “monitoring for the monitoring” (lag/drop/error budgets), and replay/backfill pipelines with correctness checks.

- Strong distributed/streaming systems operations experience (Kafka/Pulsar, Flink/Spark, ClickHouse/Elastic/TSDBs, object storage), with the ability to reason about backpressure, hotspots, and failure domains end-to-end.

- Proven programming experience building automation tools or services, ideally in Python or a similar language, to simplify operations and scale recurring processes.

- Proven experience running large-scale production deployments and multiple Kubernetes environments or clusters across teams or customers, coordinating changes and rollouts with minimal disruption.

- Hands-on experience with observability tools: you know your way around dashboards, metrics, logs, and traces using platforms like Prometheus, Grafana, or similar.

**What We Offer:**

- Competitive salaries and a generous benefits package

- Eligibility for equity

- Opportunity to work with a talented team of engineers

- Collaborative and dynamic work environment

- Professional development opportunities

- Flexible work arrangements

- Recognition and rewards for outstanding performance

**How to Apply:**

If you're a motivated and experienced Site Reliability Engineer looking for a new challenge, please submit your application, including your resume and a cover letter, through the Apply link at the top of this posting. We look forward to hearing from you!

## Skills

### Required
- Site Reliability Engineering
- Kubernetes
- Containers
- Automation
- Scripting
- CI/CD
- Infrastructure-as-Code
- Terraform
- Helm
- Linux
- Networking
- Distributed Systems

### Nice to have
- Observability Platforms
- Streaming Systems Operations
- Object Storage
- Distributed Systems Instincts
- Hands-on Ops
- Automation Tools
- Programming (Python or similar languages)

---

Source: [Apply at nvidia.wd5.myworkdayjobs.com](https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-DevOps-Engineer--AIOPs_JR2012791?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)