# Product Reliability Engineer - Defense

**Company**: Palantir
**Location**: Washington, D.C.
**Work arrangement**: hybrid
**Experience**: mid
**Job type**: full-time
**Salary**: $96,000 - $140,000/year
**Category**: Engineering
**Industry**: Technology

**Apply**: https://jobs.lever.co/palantir/57699414-b373-4a5e-9be8-6fb7de41ea72?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_2ce1768b-c47

## Description

A Product Reliability Engineer at Palantir is responsible for the health, performance, and stability of the services that power Palantir. You will tackle critical issues for key customers, introduce observability into complex systems, address tech debt in essential codebases, and inform strategic investments in core products.

The role involves deep-dive troubleshooting, strong ownership of problems, and an appreciation for the urgency of customer-facing outages. You will spend the majority of your time on forward-looking product work, including infrastructure migrations, product contributions that improve stability and observability, and codebase enhancements that increase resilience.

During periodic on-call shifts, you will respond to automated alerts, investigate issues reported by customers, and share technical expertise with adjacent product teams. You will play a central role in resolving technical issues, seeking not just a one-time fix but a permanent solution.

We provide new team members with an experienced mentor and a clear onboarding framework to set them up for success in the role.

## Core Responsibilities

- Continuously invest in documentation, metrics, monitors, and other troubleshooting tools

- Participate in on-call rotations during business hours and occasional weekends

- Diagnose, resolve, and prevent issues encountered in the field

- Deliver end-to-end improvements to core products based on issues encountered in the field

- Improve observability by refactoring codepaths and introducing telemetry

- Identify and implement data-driven opportunities for improved service resilience

- Develop strategic opinions on stability investments and inform the vision for long-term product stability

## What We Value

- Comfortable with and curious about large-scale production systems and technologies

- Confidence in troubleshooting complex issues independently using observability tools and stack traces

- Familiarity with monitoring tools such as Prometheus, and with health checks

- Experience coding with Java, Go, and/or web technologies

- Track record of identifying bugs in codebases and contributing fixes leading to long-term service stability

- Demonstrated ability to make data-driven decisions and engage with stakeholders on strategy

## What We Require

- Engineering background in Computer Science, Mathematics, Software Engineering, Physics, or a similar field

- Ability to work with a high degree of ownership and a strong sense of urgency in a dynamic environment

- Experience writing code in backend languages such as Java, whether in a past role or in personal projects

- Familiarity with storage and data processing systems and cloud infrastructure

- Strong written and verbal communication skills, and the ability to iterate quickly with teammates and incorporate feedback

- Eligibility and willingness to obtain a U.S. security clearance

## Additional Information

- Salary: The estimated salary range for this position is $96,000 - $140,000/year.

- Total compensation for this position may also include restricted stock units, a sign-on bonus, and other potential future incentives.

## Skills

### Required
- Java
- Go
- Prometheus
- health checks
- backend languages
- storage and data processing systems
- cloud infrastructure

