Description

We're hiring a Senior Site Reliability Engineer to join our team of innovative engineers who are building an AI Data Center AIOps platform. As a Senior Site Reliability Engineer, you will be responsible for ensuring the reliability and performance of our platform, which turns raw, high-volume telemetry into reliable, job-centric insights and automation for GPU fleets.

Key Responsibilities:

Continuously monitor platform health via dashboards/logs/metrics, automate recurring checks, and keep reliability + resource efficiency on track.
Own Kubernetes deployments end-to-end (runbooks, canary checks, post-deploy validation), and lead rollbacks/remediations when needed.
Lead first-level incident triage: collect diagnostics, identify likely root causes, and hand off clear, actionable findings to engineering.
Build and maintain runbooks/SOPs/checklists, pushing continuous improvement through automation.
Manage deployment infrastructure and packaging (Helm + Terraform/IaC) to keep environments scalable, consistent, and reproducible.

Requirements:

BS/MS in CS/CE (or equivalent experience) and 5+ years operating production distributed systems as SRE/DevOps/Platform Ops.
Proven ownership of reliability for an observability/AIOps platform: SLOs/SLIs, on-call, addressing incidents, and follow-up evaluations that drive measurable improvements.
Deep Kubernetes + containers experience (deploying, debugging, scaling) for telemetry-heavy microservices: ingestion, processing, storage, APIs, and UI.
Automation-first approach: solid scripting (Python/Bash), CI/CD, and infrastructure-as-code (Terraform + Helm) to deliver safe rollouts (canaries/rollbacks), reproducible environments, and minimal toil.

Nice to Have:

Strong Linux + networking fundamentals, distributed systems instincts, and hands-on ops for Kubernetes/services/streaming stacks are ideal; bonus for experience with observability platforms at scale.
Experience building safe automation that operators trust: canary releases, automated rollback criteria, “monitoring for the monitoring” (lag/drop/error budgets), and replay/backfill pipelines with correctness checks.
Strong distributed/streaming systems operations skills (Kafka/Pulsar, Flink/Spark, ClickHouse/Elastic/TSDBs, object storage), with the ability to reason about backpressure, hotspots, and failure domains end-to-end.
Proven programming experience building automation tools or services, ideally in Python or a similar language, to simplify operations and scale recurring processes.
Proven experience running large-scale production deployments and multiple Kubernetes environments or clusters across teams or customers, coordinating changes and rollouts with minimal disruption.
Hands-on experience with observability tools: you know your way around dashboards, metrics, logs, and traces using platforms like Prometheus, Grafana, or similar.

What We Offer:

Competitive salaries and a generous benefits package
Eligibility for equity
Opportunity to work with a talented team of engineers
Collaborative and dynamic work environment
Professional development opportunities
Flexible work arrangements
Recognition and rewards for outstanding performance

How to Apply:

If you're a motivated and experienced Site Reliability Engineer looking for a new challenge, please submit your application, including your resume and a cover letter, to [insert contact information]. We look forward to hearing from you!

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-DevOps-Engineer--AIOPs_JR2012791