Description
As a Staff Site Reliability Engineer at Reddit, you will lead reliability engineering initiatives for critical user-facing systems at internet scale. You will partner closely with product and infrastructure teams to improve availability, latency, scalability, and operational excellence across Reddit's most business-critical experiences.
In this role, you will:
- Lead Reliability Engineering for User Experience
- Drive reliability, scalability, and operational excellence for critical user-facing systems and services
- Architect for Scale
- Reduce Operational Risk
- Drive Automation
- Incident Management
- Influence Engineering Standards
- Mentor and Multiply Impact
To succeed in this role, you will need:
- 8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems
- Strong collaboration and communication skills with the ability to influence technical direction across teams
- Strong experience supporting high-traffic, user-facing production environments
- Deep understanding of one or more: distributed systems, networking, Linux systems, cloud-native architectures
- Experience designing highly available systems with strong operational and reliability practices
- Strong programming skills in languages such as Go, Python, or similar
- Strong understanding of observability systems including metrics, logging, tracing, and alerting
- Experience improving reliability through SLOs, automation, incident management, and performance optimization
Nice to Have:
- Experience operating systems at internet scale traffic volumes
- Experience with Kubernetes, containers, cloud infrastructure, and modern deployment platforms
- Familiarity with technologies such as Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, Redis, or similar distributed infrastructure technologies
- Experience with CDN optimization, edge reliability, traffic engineering, or global infrastructure
- Contributions to open-source software or participation in technical communities
- Experience leading large-scale incident response and operational transformation initiatives
Why Join Reddit? You'll help shape the reliability and performance of one of the internet's largest platforms, influencing experiences used by millions of people every day. This is an opportunity to solve deeply complex engineering problems at massive scale while helping define the future of reliability engineering for a modern consumer platform.