Description
Secure Every Identity
We are looking for a Senior Site Reliability Engineer to join our SRE team based in Europe. As a Senior Site Reliability Engineer, you'll ensure our production systems are not only operational but also resilient, scalable, and ready for exponential growth.
This isn't just about keeping the lights on; it's about directly contributing to the platform's core resiliency and robustness. You'll be a hands-on builder, crafting solutions that make our system more reliable by design.
Responsibilities
- Design and build custom software in Go to enhance the platform's reliability, resiliency, and redundancy.
- Partner with engineering teams to embed reliability principles, improving the availability, performance, and observability of our services.
- Use your deep understanding of infrastructure and observability principles to identify opportunities for improvement within the product and implement solutions.
- Contribute to our on-call rotation, providing a rapid, effective response to critical incidents and using your expertise to troubleshoot, mitigate, or accurately escalate production issues.
- Develop and refine our SRE tooling and processes, focusing on automation and operational efficiency.
- Define, document, and champion reliability best practices across the organisation.
What you'll need to be successful
This role requires a unique blend of a software engineer's mindset and operational expertise. You'll thrive in this role if you have:
- A proactive and systematic approach to problem-solving, with a high degree of ownership.
- Proven experience supporting large-scale, mission-critical applications in production, working with a high degree of autonomy.
- Proficiency in at least one programming language, with a preference for Go. You should be comfortable writing custom applications, not just scripts.
- Experience with infrastructure as code (Terraform), containers and container orchestration (Docker, Kubernetes), and GitOps (Argo CD).
- Demonstrable expertise in a major cloud provider (Azure, AWS, or GCP).
- A strong grasp of microservices architecture, databases (SQL, NoSQL), and networking fundamentals, so you can understand how custom code can solve platform-level issues.
- An understanding of core SRE principles, including SLIs, SLOs, and error budgets.
- Experience in an on-call rotation for a 24/7 cloud-based environment.
- Exceptional communication and collaboration skills, with a proven ability to work effectively in a remote, distributed team where work may be self-directed.
The Okta Experience
- Supporting Your Well-Being
- Driving Social Impact
- Developing Talent and Fostering Connection + Community
To apply, use the employer's original posting:
https://job-boards.greenhouse.io/okta/jobs/7418982