Description

We're looking for a Senior Site Reliability Engineer to join our team. In this role, you will design, implement, and support the operational and reliability aspects of large-scale Observability & Telemetry collection platforms. You will engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement: supporting services before they go live, maintaining them once they are live, and scaling systems sustainably through mechanisms like automation.

Key responsibilities include:

Design, implement, and support operational and reliability aspects of large-scale Observability & Telemetry collection platforms with a focus on performance at scale, real-time monitoring, logging, and alerting.
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems.
Be part of an on-call rotation to support production systems.

Requirements include:

A Bachelor's degree in Computer Science or a related technical field, or equivalent experience.
8+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production.
5+ years of experience delivering foundational infrastructure and observability platforms.
Experience in one or more of the following programming languages: Python, Go, Perl, or Ruby.
In-depth knowledge of Linux, networking, and containers.

Nice to have:

Interest in crafting, analyzing, and fixing large-scale distributed systems.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive. Ability to debug and optimize code and automate routine tasks.
Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker. Experience running Grafana, OpenTelemetry, Prometheus, and similar observability-focused tools.

You will also be eligible for equity and benefits.

To apply, use the employer's original posting: https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Senior-Site-Reliability-Engineer---Observability-and-Telemetry-Platform_JR2017559