# Senior/Staff Site Reliability Engineer

**Company**: Fal
**Location**: San Francisco
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Salary**: $180,000-250,000
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/fal/jobs/4146019009
**Canonical**: https://yubhub.co/jobs/job_198d64d4-207

## Description

You are a seasoned SRE who keeps production infrastructure running at scale. You own the reliability and availability of customer-facing systems, from Kubernetes clusters to deployment pipelines to the networking layer that connects it all. You think in SLOs, automate ruthlessly, and treat every incident as a chance to make the system better.

## Key Responsibilities

- Own and operate our Kubernetes infrastructure: cluster lifecycle, upgrades, networking, and multi-tenant isolation for customer workloads

- Build and maintain CI/CD pipelines and deployment infrastructure

- Use AI extensively to automate the analysis and resolution of production issues and to improve software development speed, reliability, and maintainability

- Build dashboards, alerting, and anomaly detection across our systems

- Define and enforce SLOs and build out incident response processes

- Manage and improve our networking, load balancing, and service mesh configurations

- Drive reliability improvements across the stack through automation, runbooks, and chaos engineering

## Requirements

- 5+ years of experience managing critical production systems and software development workflows

- Strong production experience setting up and operating Kubernetes at scale, using infrastructure-as-code (Terraform, Ansible)

- Deep knowledge of Linux networking, container networking (CNI plugins, VXLAN, BGP), and DNS

- Experience building CI/CD systems and GitOps workflows (FluxCD, ArgoCD)

- Proficiency in Python and either Go or Bash for tooling and automation

- Strong experience with logging, monitoring, and alerting (Prometheus, Grafana, Loki, Thanos, VictoriaMetrics, Datadog)

- Excellent communication and ability to drive technical decisions across teams

- Self-starter who executes quickly, takes ownership, and constantly seeks improvement

## Nice to have

- Experience with managing GPU and AI/ML workloads

- Experience with kernel-based monitoring and routing (eBPF, XDP)

- Experience with security tooling (Falco, Coroot, SIEM)

- Experience with bare metal Kubernetes networking (Calico, Cilium, MetalLB)

- Experience with distributed storage systems (Ceph, Longhorn, etc.)

## Compensation

- $180,000-250,000 plus equity and benefits

## Benefits

- Interesting and challenging work

- Plenty of learning and growth opportunities

- Regular team events and offsites

- Health, dental, and vision insurance (US)

- Visa sponsorship and relocation assistance

## Skills

### Required
- Kubernetes
- Infrastructure-as-code
- Linux networking
- Container networking
- CI/CD systems
- GitOps workflows
- Python
- Go
- Bash
- Logging
- Monitoring
- Alerting

### Nice to have
- GPU and AI/ML workloads
- Kernel-based monitoring and routing
- Security tooling
- Bare metal Kubernetes networking
- Distributed storage systems
