Description
We are seeking a highly technical Staff Site Reliability Engineer with deep expertise in Splunk and Grafana to own and evolve our observability ecosystem.
As a Staff Site Reliability Engineer, you will move beyond simple monitoring to architect a comprehensive, scalable telemetry platform. You will be our subject-matter expert in Splunk optimisation, ensuring our logging architecture is performant, cost-effective, and deeply integrated with our automated workflows.
Key responsibilities include:
- Splunk Architecture & Optimisation: Lead the design and tuning of Splunk environments. Optimise indexer performance, search efficiency, and data models to ensure rapid troubleshooting and cost-efficiency.
- Advanced Visualisation: Architect and maintain sophisticated Grafana dashboards that correlate disparate data sources into a single pane of glass for real-time system health.
- Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using tools like Terraform.
- Pipeline Engineering: Optimise the collection, processing, and storage of telemetry data (Metrics, Logs, Traces) to ensure high reliability and low latency.
- Workflow Automation: Develop custom Splunk workflows and integrations that trigger automated responses to system events, reducing Mean Time to Resolution (MTTR).
- Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements through 'observability-driven development.'
Required skills and experience include:
- Splunk Mastery: Deep, hands-on experience with Splunk administration, search optimisation (SPL), and architecting complex data pipelines.
- Grafana Expertise: Proven ability to build actionable, intuitive dashboards in Grafana that go beyond simple charts to provide deep operational insights.
- SRE Mindset: Minimum 8+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.
- Programming Proficiency: Strong coding skills in Go, Python, or Ruby for building internal tools and automating observability workflows.
- Telemetry Standards: Hands-on experience with OpenTelemetry (OTel), Prometheus, or similar frameworks for instrumenting applications.
- Distributed Systems: Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/EKS).
Bonus skills include:
- Tracing: Implementation of distributed tracing (Jaeger, Tempo, or Honeycomb) to visualise request flow across microservices.
- Security Observability: Experience using Splunk for security orchestration (SOAR) or SIEM-related workflows.
- Cloud Platforms: Experience managing observability native tools within AWS, Azure, or GCP.
This listing is enriched and indexed by YubHub. To apply, use the employer's original posting:
https://job-boards.greenhouse.io/okta/jobs/6874616