# Staff Site Reliability Engineer

**Company**: EarnIn
**Location**: Mountain View, US
**Work arrangement**: Hybrid
**Experience**: Staff
**Job type**: Full-time
**Salary**: $252,000-$308,000
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/earnin/jobs/7944568?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_933a3dd0-625

## Description

We're looking for a Staff Site Reliability Engineer to lead our shift to AI-first reliability engineering. As a key member of our team, you will define how AI transforms on-call, incident response, alert triage, postmortems, and production investigations across SRE and product engineering teams.

Your primary responsibilities will include:

- Setting a reliability strategy with AI at the center, defining SLIs, SLOs, and error budgets across critical services, and using AI to surface trends, predict capacity risks, and auto-generate reliability scorecards.

- Redesigning the incident lifecycle around AI-assisted speed, leading high-severity incident response as incident commander (IC), building AI-driven alert correlation and triage that reduces noise and accelerates root-cause identification, and driving adoption of AI-generated postmortems that surface systemic patterns and automatically track corrective actions through to completion.

- Making on-call fundamentally better through automation, building AI agents that draft runbook responses, pull relevant context from Datadog, incident.io, and Slack during pages, and recommend remediation steps, so on-call engineers spend less time deciding and searching.

- Pushing AI-first operations into product engineering teams, partnering with product engineering to embed AI-assisted investigation, alerting, and production readiness into their workflows, making AI tooling the default path for every team that owns a service, not an SRE-only capability.

- Architecting for resilience at scale, guiding service designs for graceful degradation, failure isolation, and capacity planning across EarnIn's AWS footprint (EKS, Kafka, DynamoDB, RDS, SQS), and using AI-driven analysis to identify architectural weak points before they become incidents.

- Raising the bar through mentorship and standards, coaching engineers on reliability practices, running design and incident reviews, and building documentation and tooling that make reliability knowledge accessible, setting the expectation that AI-assisted workflows are how EarnIn operates, not an experiment.

We're looking for someone with 7+ years of experience in SRE, Software Engineering, or Infrastructure Engineering, with a track record of KPI-driven reliability and operational excellence improvements at scale. You should have demonstrated experience applying AI/LLMs to operational workflows in production; significant expertise with SLOs/SLIs, error budgets, incident command, and blameless postmortems in large-scale distributed systems; and meaningful software engineering ability (Python, Go, or similar).

Additionally, you should have deep observability experience (Datadog, CloudWatch, OpenTelemetry) with pragmatic, signal-heavy alerting designed for real human response and enhanced by AI-driven noise reduction; solid infrastructure-as-code proficiency (Terraform, Kubernetes, AWS) with safe, reversible deployment practices; and proficiency with AI-assisted development tools (Cursor, Claude Code, Copilot) to accelerate your own engineering work and to model that behavior for the teams you partner with.

## Skills

### Required
- SRE
- Software Engineering
- Infrastructure Engineering
- AI/LLMs
- SLOs/SLIs
- Error Budgets
- Incident Command
- Blameless Postmortems
- Python
- Go
- Datadog
- CloudWatch
- OpenTelemetry
- Terraform
- Kubernetes
- AWS
- AI-Assisted Development Tools

---

Source: [Apply at job-boards.greenhouse.io](https://job-boards.greenhouse.io/earnin/jobs/7944568?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
