# Site Reliability Engineer II

**Company**: EarnIn
**Location**: Bengaluru, India
**Work arrangement**: hybrid
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/earnin/jobs/7913314?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_966d8d00-2cb

## Description

## About EarnIn

EarnIn is a financial technology company that provides earned wage access to individuals with unique financial needs. Our community members access their earnings as they earn them, with options to spend, save, and grow their money without mandatory fees, interest rates, or credit checks.

## Position Summary

We have a real passion for delivering the best product experience for our community members. We work closely with all teams and share responsibility for rapidly delivering production-ready features to our community. We build or contribute to infrastructure, reliability tooling, and practices that help teams ship quickly and safely.

## Responsibilities

- Design systems with resilience, graceful degradation, and capacity in mind.

- Define and measure SLOs and SLIs that actually reflect what our customers feel.

- Use Datadog (logging, metrics, APM) together with CloudWatch to build signal-heavy, noise-light observability.

- Configure alerting and routing that reach engineers through incident.io, where we run incident management and on-call, so that when a human gets paged, it really matters.

- Continuously improve our incident lifecycle, from fast detection and solid triage, through clear communication, to blameless, actionable follow-ups.

- Combine solid software fundamentals with reliability thinking so our systems are highly available, easy to debug, and a joy to work on.

- Be calm and collected, cool under pressure, and not afraid to voice your opinion even in the heat of an incident.

- Explain SLOs and error budgets in plain language.

- Have experience working with large-scale, secure, and performant distributed systems, including the fun parts like retries, backoff, and timeouts that actually work together.

- Be genuinely excited about AI, not just as a productivity tool, but as a platform for building smarter, more autonomous SRE operations.

- Be passionate about learning new technologies and adopting the right tools to manage services in production, keeping SLAs and MTTR in mind at all times.

- Plan and execute on reliability and operability initiatives for the team, with an eye toward growing your scope and impact over time.

## What We're Looking For

- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

- 3+ years of experience in an SRE or Software Engineering role.

- Hands-on coding experience in any two programming languages.

- Experience successfully managing production environments, and an understanding that it takes more than a for-loop and SSH to make it happen.

- A strong belief that observability is critical to running highly available and performant services, not an optional nice-to-have.

- Experience using SLOs, SLIs, and KPIs to guide decisions, prioritize work, and explain tradeoffs, not just decorate dashboards or slide decks.

- Proficiency with AI-assisted development tools (e.g., GitHub Copilot, Cursor, ChatGPT, or similar tools) or prompt engineering as part of your software development workflow, applied to reducing operational toil, accelerating incident root cause analysis, and optimizing infrastructure-as-code workflows.

- Demonstrated experience building or meaningfully contributing to agentic AI workflows: runbook automation, AI-assisted alert triage, LLM-driven postmortem generation, or similar.

- Hands-on experience shepherding services from design to production, through incident learnings, and into a state where on-call actually gets quieter over time.

- Experience tackling production incidents, learning from them, and turning those lessons into concrete technical and process changes that make it much harder for the same problem to happen again.

- Interest in mentoring peers and a belief that investing in people is one of the highest-leverage ways to improve reliability and reduce toil.

## Salary Range

Not specified

## Required Skills

- Datadog

- CloudWatch

- incident.io

- SLOs

- SLIs

- KPIs

- AI-assisted development tools

- Agentic AI workflows

- Runbook automation

- AI-assisted alert triage

- LLM-driven postmortem generation

- Large-scale, secure, and performant distributed systems

- Retries

- Backoff

- Timeouts

- Observability

## Preferred Skills

- Hands-on coding experience in any two programming languages (specific languages not specified)

---

Source: [Apply at job-boards.greenhouse.io](https://job-boards.greenhouse.io/earnin/jobs/7913314?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
