Description

We're looking for a Senior Site Reliability Engineer to join our team. As a Senior SRE, you will be a technical leader in designing, observing, and operating our systems in production. You will focus on how services behave as a whole: reliability, performance, failure modes, and the experience of the engineers who build them.

Responsibilities:

Design systems with resilience, graceful degradation, and capacity in mind.
Define and measure SLOs and SLIs that actually reflect what our customers feel.
Use Datadog (logging, metrics, APM) together with CloudWatch to build signal-heavy, noise-light observability.
Configure alerting and routing that reach engineers through incident.io, where we run incident management and on-call, so that when a human gets paged, it really matters.
Continuously improve our incident lifecycle, from fast detection and solid triage, through clear communication, to blameless, actionable follow-ups.

Requirements:

Bachelor's or master's degree in computer science, or equivalent industry experience.
4+ years of experience in an SRE or Software Engineering role.
Hands-on coding experience in Python and/or Go.
Distributed Systems Expertise: Proven experience designing, operating, and shepherding large-scale distributed systems from design through production, including incident learnings that make on-call quieter over time.
Reliability Engineering Mindset: Deep fluency in SLOs, SLIs, error budgets, and MTTR, using them to drive decisions and explain tradeoffs, not just decorate dashboards.
Observability & Incident Response: Treats observability as essential, not optional; stays calm under pressure; can diagnose incidents from logs and metrics and translate findings into durable process and technical improvements.
Cross-functional Communication: Able to work across technical and non-technical teams, reduce silos through documentation and runbooks, and explain reliability concepts in plain language.
Operational Tooling & AI Fluency: Selects the right tools for production management and leverages AI-assisted development to reduce toil, accelerate RCA, and streamline infrastructure-as-code workflows.
Leadership & Mentorship: Can plan and lead strategic reliability initiatives across engineering, and invests in mentoring engineers as a high-leverage path to long-term reliability improvements.

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://job-boards.greenhouse.io/earnin/jobs/7895718