Description
We're hiring a Senior Site Reliability Engineer to join our team of innovative engineers who are building an AI Data Center AIOps platform. As a Senior Site Reliability Engineer, you will be responsible for ensuring the reliability and performance of our platform, which turns raw, high-volume telemetry into reliable, job-centric insights and automation for GPU fleets.
Key Responsibilities:
- Continuously monitor platform health via dashboards/logs/metrics, automate recurring checks, and keep reliability + resource efficiency on track.
- Own Kubernetes deployments end-to-end (runbooks, canary checks, post-deploy validation), and lead rollbacks/remediations when needed.
- Lead first-level incident triage: collect diagnostics, identify likely root causes, and hand off clear, actionable findings to engineering.
- Build and maintain runbooks/SOPs/checklists, pushing continuous improvement through automation.
- Manage deployment infrastructure and packaging (Helm + Terraform/IaC) to keep environments scalable, consistent, and reproducible.
Requirements:
- BS/MS in CS/CE (or equivalent experience) and 5+ years operating production distributed systems as SRE/DevOps/Platform Ops.
- Proven ownership of reliability for an observability/AIOps platform: SLOs/SLIs, on-call, incident response, and post-incident reviews that drive measurable improvements.
- Deep Kubernetes + containers experience (deploying, debugging, scaling) for telemetry-heavy microservices: ingestion, processing, storage, APIs, and UI.
- Automation-first approach: solid scripting (Python/Bash), CI/CD, and infrastructure-as-code (Terraform + Helm) to deliver safe rollouts (canaries/rollbacks), reproducible environments, and minimal toil.
Nice to Have:
- Strong Linux + networking fundamentals, distributed systems instincts, and hands-on ops experience with Kubernetes, services, and streaming stacks; experience with observability platforms at scale is a bonus.
- Experience building safe automation that operators trust: canary releases, automated rollback criteria, “monitoring for the monitoring” (lag/drop/error budgets), and replay/backfill pipelines with correctness checks.
- Strong distributed/streaming systems operations experience (Kafka/Pulsar, Flink/Spark, ClickHouse/Elastic/TSDBs, object storage), with the ability to reason about backpressure, hotspots, and failure domains end-to-end.
- Proven programming experience building automation tools or services, ideally in Python or a similar language, to simplify operations and scale recurring processes.
- Proven experience running large-scale production deployments and multiple Kubernetes environments or clusters across teams or customers, coordinating changes and rollouts with minimal disruption.
- Hands-on experience with observability tools: you know your way around dashboards, metrics, logs, and traces using platforms like Prometheus, Grafana, or similar.
What We Offer:
- Competitive salaries and a generous benefits package
- Eligibility for equity
- Opportunity to work with a talented team of engineers
- Collaborative and dynamic work environment
- Professional development opportunities
- Flexible work arrangements
- Recognition and rewards for outstanding performance
How to Apply:
If you're a motivated and experienced Site Reliability Engineer looking for a new challenge, please submit your application, including your resume and a cover letter, to [insert contact information]. We look forward to hearing from you!
To apply, use the employer's original posting:
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-DevOps-Engineer--AIOPs_JR2012791