Description

We are looking for a Site Reliability Engineer who thinks like a software engineer first. You will own critical production systems end-to-end, designing, building, and improving them rather than simply operating them. You will write production-quality code that keeps the platform reliable at scale, embed with product engineering teams to influence architecture from the start, and build the internal tooling that every engineer at Hebbia depends on.

Responsibilities:

Own critical production services end-to-end, from design and code review through deployment, operation, and incident response
Profile, benchmark, and rewrite hot paths to eliminate bottlenecks as Hebbia scales
Lead incident response and drive post-mortem culture, translating findings into code changes and architectural improvements rather than runbooks
Design and build observability frameworks from scratch, writing custom instrumentation, alerting logic, and debugging tooling that surfaces production issues before customers feel them
Define and enforce SLOs across platform services and build the feedback loops that keep engineering teams accountable to them
Own capacity planning and cost efficiency: model growth, right-size infrastructure, and write automation that prevents over-provisioning and resource exhaustion
Build robust, well-tested internal platforms and deployment tooling held to the same engineering standards as customer-facing code
Own and continuously improve CI/CD systems so engineering teams can ship safely and quickly
Embed with product engineering teams as a peer software engineer, contributing directly to production codebases and co-designing systems for reliability from the start
Partner on infrastructure security through threat modeling, hardening, and automated compliance tooling

Who You Are:

5+ years of software development experience, with a track record of writing, shipping, and maintaining production services, not just operating infrastructure
Production-grade proficiency in at least one systems or backend language: Go, Python, C++, or Rust
Proven experience as a Production Engineer, SRE, or software engineer with a deep infrastructure focus, and comfort owning services end-to-end across the full stack
Deep understanding of distributed systems
Container orchestration expertise and hands-on experience debugging complex distributed failures in production
Working knowledge of OS-level concepts
Cloud platform fluency (AWS preferred)
Experience building and maintaining observability stacks
Strong CI/CD pipeline expertise and a track record of improving developer velocity without sacrificing safety
Background at a company with a Production Engineering or software-focused SRE culture is a strong plus
Experience building platforms for AI/ML workloads or high-throughput document processing pipelines is a plus

Compensation: The salary range for this role is $160,000 to $300,000. This range may span several career levels at Hebbia and will be narrowed during the interview process based on the candidate’s experience and qualifications. Adjustments outside this range may be considered for candidates whose qualifications differ significantly from those outlined in the job description.

To apply, use the employer's original posting: https://job-boards.greenhouse.io/hebbia/jobs/4666955005