Anduril Industries

Senior Production Engineer

Onsite | Senior | Full-time | $191,000-$287,000 USD | Costa Mesa, California, United States; Washington, District of Columbia, United States

First indexed 30 May 2026

Description

We are looking for a Senior Production Engineer to join our team in Costa Mesa, CA (or DC). In this role, you will be responsible for diagnosing and fixing stability vulnerabilities in core platform services that cause cascading failures in multi-tenant cloud deployments. You will write production Go to implement resilience patterns (leader election, circuit breakers, failure domain isolation) directly in service code.

This role requires deep experience with distributed systems, debugging complex failure modes across service boundaries, and writing production-quality Go. If you thrive on fixing hard reliability problems in live systems rather than building greenfield projects, this role is for you.

Responsibilities

  • Diagnose and fix stability vulnerabilities in core platform services that cause cascading failures under multi-replica, multi-tenant operation
  • Implement resilience patterns (leader election, circuit breakers, failure domain isolation) directly in service code
  • Design multi-replica support for services that currently assume single-instance operation
  • Collaborate with service owners on contract testing and upgrade validation
  • Trace cascading failures across service boundaries and drive them to root-cause fixes
  • Contribute to observability platform improvements to support service stability
  • Light infrastructure work: Terraform/Kubernetes changes to support service fixes (~20% of time)

Requirements

  • Production-quality Go: you'll be modifying core platform services, not writing scripts
  • Practical experience with distributed systems: leader election, consensus, replication, failure modes
  • Kubernetes: enough to understand how services run (not necessarily cluster administration)
  • Debugging complex systems: tracing cascading failures across service boundaries
  • 4+ years in SRE, platform engineering, or backend development roles
  • Must be a U.S. Person due to required access to U.S. export controlled information or facilities

Nice-to-have Qualifications

  • Rust (some platform services use it)
  • Experience fixing reliability problems in production services (not just building greenfield)
  • Familiarity with gRPC service architectures
  • HashiCorp Consul or similar service discovery/mesh
  • FedRAMP/IL5 compliance environment experience
  • ArgoCD / GitOps workflows

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://job-boards.greenhouse.io/andurilindustries/jobs/5150350007