# Staff Site Reliability Engineer - Splunk Expert

**Company**: Okta
**Location**: Bengaluru, India
**Work arrangement**: hybrid
**Experience**: staff
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://job-boards.greenhouse.io/okta/jobs/6874616
**Canonical**: https://yubhub.co/jobs/job_491db8e9-776

## Description

We are seeking a highly technical Staff Site Reliability Engineer with deep expertise in Splunk and Grafana to own and evolve our observability ecosystem.

As a Staff Site Reliability Engineer, you will move beyond simple monitoring to architect a comprehensive, scalable telemetry platform. You will be our subject-matter expert in Splunk optimisation, ensuring our logging architecture is performant, cost-effective, and deeply integrated with our automated workflows.

Key responsibilities include:

- Splunk Architecture & Optimisation: Lead the design and tuning of Splunk environments. Optimise indexer performance, search efficiency, and data models to ensure rapid troubleshooting and cost-efficiency.

- Advanced Visualisation: Architect and maintain sophisticated Grafana dashboards that correlate disparate data sources into a single pane of glass for real-time system health.

- Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using infrastructure-as-code tools such as Terraform.

- Pipeline Engineering: Optimise the collection, processing, and storage of telemetry data (Metrics, Logs, Traces) to ensure high reliability and low latency.

- Workflow Automation: Develop custom Splunk workflows and integrations that trigger automated responses to system events, reducing Mean Time to Resolution (MTTR).

- Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements through 'observability-driven development.'

Required skills and experience include:

- Splunk Mastery: Deep, hands-on experience with Splunk administration, search optimisation (SPL), and architecting complex data pipelines.

- Grafana Expertise: Proven ability to build actionable, intuitive dashboards in Grafana that go beyond simple charts to provide deep operational insights.

- SRE Mindset: 8+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.

- Programming Proficiency: Strong coding skills in Go, Python, or Ruby for building internal tools and automating observability workflows.

- Telemetry Standards: Hands-on experience with OpenTelemetry (OTel), Prometheus, or similar frameworks for instrumenting applications.

- Distributed Systems: Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/EKS).

Bonus skills include:

- Tracing: Implementation of distributed tracing (Jaeger, Tempo, or Honeycomb) to visualise request flow across microservices.

- Security Observability: Experience using Splunk for security orchestration (SOAR) or SIEM-related workflows.

- Cloud Platforms: Experience managing native observability tools within AWS, Azure, or GCP.

## Skills

### Required
- Splunk
- Grafana
- SRE
- Go
- Python
- Ruby
- OpenTelemetry
- Prometheus
- Linux
- Networking
- Container Orchestration

### Nice to have
- Tracing
- Security Observability
- Cloud Platforms