NVIDIA

Principal Site Reliability Engineer - Observability and Telemetry Platform

Remote · Senior · Full-time · $150,000–$200,000 · Santa Clara

First indexed 20 May 2026

Description

We're looking for a Principal Site Reliability Engineer to join our team in Santa Clara. As a Principal Site Reliability Engineer, you will be responsible for designing, implementing, and supporting operational and reliability aspects of large-scale Observability & Telemetry collection platforms. You will engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement. You will also support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews. Once services are live, you will maintain them by measuring and monitoring availability, latency, and overall system health. You will scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity. You will practice sustainable incident response and blameless postmortems. You will also be part of an on-call rotation to support production systems.

We're looking for someone with a BS degree in Computer Science or a related technical field, or equivalent experience. You should have 15+ years of experience with infrastructure automation, distributed systems design, and designing and developing tools for running large-scale private or public cloud systems in production. You should also have 8+ years of experience delivering foundational infrastructure and observability platforms. You should have experience in one or more of the following languages: Python, Go, Perl, or Ruby. You should have in-depth knowledge of Linux, networking, and containers.

If you're interested in crafting, analyzing, and fixing large-scale distributed systems, and you have a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive, we'd love to hear from you.