Elastic

Senior Site Reliability Engineer (Resilience) - Platform Resilience

Remote, senior, full-time, $154,800-$195,600 USD, United States

First indexed 18 Apr 2026

Description

We're seeking a Senior Site Reliability Engineer (SRE) to join our Platform Engineering department. As an SRE, you will lead technical initiatives to automate system engineering efforts, ensuring the reliability of our global infrastructure. You will grow our global Platform infrastructure to meet increasing scaling demands by developing and maintaining software, tooling, and automations.

Responsibilities:

  • Develop and maintain software, tooling, and automations to ensure the reliability and scalability of our global infrastructure.
  • Lead technical initiatives to automate system engineering efforts, ensuring the reliability of our global infrastructure.
  • Collaborate with engineers to identify, implement, and deliver solutions that meet the needs of our customers.
  • Champion an environment focused on collaboration, operational excellence, and uplifting others.
  • Respond to major incidents and prevent repeated customer impact through prioritized problem management.

Requirements:

  • Successes and lessons learned from striving for 'progress, not perfection' in the name of Platform reliability.
  • A software engineering background that lets you collaborate with engineers to identify, implement, and deliver solutions.
  • Experience in public cloud and managed Kubernetes services is advantageous.
  • Passion for developing solutions that involve inclusive communication methods to grow and strengthen partner and team relationships.

Preferred Qualifications:

  • Operated a SaaS product in a public cloud, ideally one built using Infrastructure-as-Code tooling such as Crossplane or Terraform.
  • Built or operated a Kubernetes-at-scale infrastructure, ideally across multiple cloud providers, and the vital automation to support it.
  • Written non-trivial programs in Golang or other programming languages.
  • Worked with containerized services (such as Docker).
  • Proven experience in leading and improving alerting and major incident management processes, and in using metrics systems (e.g. Elastic Stack, Graphite, Prometheus, Influx) to diagnose issues and quantify impact for audiences at varying levels of the organization.
  • Experienced in system administration with professional skills in Linux on distributed systems at scale.
  • Diagnosed issues with the Elastic Stack, or designed and implemented solutions built on it.
  • Thrived in a self-organizing, sharing culture within a globally distributed team.
  • Helped team members bring out the best in each other through coaching and mentoring.

Compensation:

  • This role is eligible to participate in Elastic's stock program.
  • The total rewards package includes a company-matched 401(k) with dollar-for-dollar matching up to 6% of eligible earnings, along with a range of other benefits offered with a holistic emphasis on employee well-being.
  • Typical starting salary range for this role is $154,800-$195,600 USD.
This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://job-boards.greenhouse.io/elastic/jobs/7794016