Description

Anduril Industries is a defense technology company with a mission to transform U.S. and allied military capabilities with advanced technology.

The Production Engineering team is a newly formed organization within Anduril's Software Platform, dedicated to ensuring the reliability, performance, and scalability of mission-critical systems that directly support our warfighters in the field.

As a Senior Site Reliability Engineer, you will work closely with platform engineering teams, product developers, and field operations to proactively identify reliability risks, implement defensive strategies, and continuously improve the operational excellence of our software platform.

Responsibilities:

Design and implement comprehensive monitoring, observability, and alerting systems to ensure early detection of reliability issues across the Lattice platform.
Drive incident response and conduct blameless postmortems to identify systemic improvements and prevent recurrence of production issues.
Build and maintain infrastructure automation using tools like Terraform, Kubernetes operators, and custom tooling to manage large-scale distributed systems.
Establish and track Service Level Objectives (SLOs) and error budgets to balance feature velocity with system reliability.
Partner with software engineering teams to improve system architecture for reliability, applying patterns and practices such as circuit breakers, graceful degradation, and chaos engineering.
Develop capacity planning models and performance testing frameworks to ensure systems can handle growth and peak operational demands.
Create runbooks, documentation, and training materials to enable teams to operate production systems effectively.
Lead cross-functional efforts to improve deployment safety through progressive rollouts, automated testing, and rollback capabilities.
Implement security best practices and compliance controls for production environments handling sensitive defense data.
Build tooling and automation to reduce toil and improve operational efficiency for the engineering organization.
Participate in on-call rotations and serve as an escalation point for critical production incidents.

Requirements:

7+ years of engineering experience, including at least 3 years focused on SRE, production operations, or infrastructure engineering.
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
Deep expertise with Kubernetes in production environments, including operational challenges at scale (100+ nodes).
Strong programming skills in one or more languages such as Go, Python, Rust, or Java, with the ability to build production-grade tooling.
Proven experience designing and implementing observability stacks (metrics, logging, tracing) using tools like Prometheus, Grafana, ELK/EFK, or equivalent.
Hands-on experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code practices.
Demonstrated ability to debug complex distributed systems issues across multiple layers of the stack.
Track record of improving system reliability through architectural changes, not just operational band-aids.
Strong incident management and communication skills, with experience leading responses to critical outages.
Must be a U.S. Person due to required access to U.S. export-controlled information or facilities.
Eligible to obtain and maintain an active U.S. Secret security clearance.

Preferred Qualifications:

Experience with defense, aerospace, or other mission-critical systems where downtime has severe consequences.
Expertise in performance optimization and capacity planning for high-throughput, low-latency systems.
Knowledge of chaos engineering principles and experience implementing resilience testing frameworks.
Experience with service mesh technologies (Istio, Linkerd) and advanced traffic management patterns.
Background in database operations and optimization (PostgreSQL, Cassandra, or similar at scale).
Familiarity with CI/CD platforms and deployment automation (ArgoCD, FluxCD, Spinnaker, Jenkins).
Understanding of networking fundamentals including load balancing, DNS, TLS/SSL, and network security.
Experience with configuration management and secrets management solutions (Vault, Sealed Secrets, SOPS).
Strong written and verbal communication skills, with the ability to explain technical concepts to non-technical stakeholders.
Active Secret or higher security clearance.

Benefits:

Healthcare Benefits: Comprehensive medical, dental, and vision plans.
Income Protection: Life and disability insurance.
Generous Time Off: Highly competitive PTO plans.
Family Planning & Parenting Support: Coverage for fertility treatments, adoption, and gestational carriers.
Mental Health Resources: Free access to mental health resources 24/7.
Professional Development: Annual reimbursement for professional development.
Commuter Benefits: Company-funded commuter benefits.
Relocation Assistance: Available depending on role eligibility.
Retirement Savings Plan: Traditional 401(k), Roth, and after-tax options.

To apply, use the employer's original posting: https://job-boards.greenhouse.io/andurilindustries/jobs/5093563007