NVIDIA

Senior Site Reliability Engineer, AIOPs

Onsite | Senior | Full-time | $100,000–$150,000 | Santa Clara

First indexed 18 May 2026

Description

We're hiring a Senior Site Reliability Engineer to join our team of innovative engineers who are building an AI Data Center AIOps platform. As a Senior Site Reliability Engineer, you will be responsible for ensuring the reliability and performance of our platform, which turns raw, high-volume telemetry into reliable, job-centric insights and automation for GPU fleets.

Key Responsibilities:

  • Continuously monitor platform health via dashboards/logs/metrics, automate recurring checks, and keep reliability + resource efficiency on track.
  • Own Kubernetes deployments end-to-end (runbooks, canary checks, post-deploy validation), and lead rollbacks/remediations when needed.
  • Lead first-level incident triage: collect diagnostics, identify likely root causes, and hand off clear, actionable findings to engineering.
  • Build and maintain runbooks/SOPs/checklists, pushing continuous improvement through automation.
  • Manage deployment infrastructure and packaging (Helm + Terraform/IaC) to keep environments scalable, consistent, and reproducible.

Requirements:

  • BS/MS in CS/CE (or equivalent experience) and 5+ years operating production distributed systems as SRE/DevOps/Platform Ops.
  • Proven ownership of reliability for an observability/AIOps platform: SLOs/SLIs, on-call, addressing incidents, and follow-up evaluations that drive measurable improvements.
  • Deep Kubernetes + containers experience (deploying, debugging, scaling) for telemetry-heavy microservices (ingestion, processing, storage, APIs, and UI).
  • Automation-first approach: solid scripting (Python/Bash), CI/CD, and infrastructure-as-code (Terraform + Helm) to deliver safe rollouts (canaries/rollbacks), reproducible environments, and minimal toil.

Nice to Have:

  • Strong Linux + networking fundamentals, distributed systems instincts, and hands-on ops experience with Kubernetes/services/streaming stacks; experience with observability platforms at scale is a bonus.
  • Experience building safe automation that operators trust: canary releases, automated rollback criteria, “monitoring for the monitoring” (lag/drop/error budgets), and replay/backfill pipelines with correctness checks.
  • Strong distributed/streaming systems operations skills (Kafka/Pulsar, Flink/Spark, ClickHouse/Elastic/TSDBs, object storage), including the ability to reason about backpressure, hotspots, and failure domains end-to-end.
  • Proven programming experience building automation tools or services, ideally in Python or a similar language, to simplify operations and scale recurring processes.
  • Proven experience running large-scale production deployments and multiple Kubernetes environments or clusters across teams or customers, coordinating changes and rollouts with minimal disruption.
  • Hands-on experience with observability tools: you know your way around dashboards, metrics, logs, and traces using platforms like Prometheus, Grafana, or similar.

What We Offer:

  • Competitive salaries and a generous benefits package
  • Eligibility for equity
  • Opportunity to work with a talented team of engineers
  • Collaborative and dynamic work environment
  • Professional development opportunities
  • Flexible work arrangements
  • Recognition and rewards for outstanding performance

How to Apply:

If you're a motivated and experienced Site Reliability Engineer looking for a new challenge, please submit your application, including your resume and a cover letter, through the employer's original posting linked below. We look forward to hearing from you!

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-DevOps-Engineer--AIOPs_JR2012791