# Senior Site Reliability Engineer

**Company**: Razer
**Location**: Bangsar South
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://razer.wd3.myworkdayjobs.com/en-US/Careers/job/Bangsar-South/Site-Reliability---Engineering-3_JR2025005670?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_6e91fa49-4f0

## Description

Joining Razer will place you on a global mission to revolutionize the way the world games. As a Senior Site Reliability Engineer, you will be part of a growing infrastructure and platform engineering team. The ideal candidate will have hands-on experience with Amazon Web Services (AWS), strong troubleshooting skills, and a passion for building scalable, observable, and resilient systems using modern Infrastructure as Code (IaC) and automation tools.

**Responsibilities:**

- Design, implement, and maintain Infrastructure as Code (IaC) solutions using Terraform and/or CloudFormation across multi-account AWS environments.

- Collaborate with developers, architects, and DevOps teams to build scalable, secure, and observable cloud infrastructure.

- Lead and participate in architecture design sessions, focusing on system reliability, scalability, security, and performance.

- Implement and manage robust monitoring, alerting, and observability solutions (e.g., CloudWatch, Prometheus, ELK, Datadog).

- Set and monitor Key Performance Indicators (KPIs) for system uptime, latency, throughput, and overall reliability.

- Drive incident response processes, including coordination, triaging, resolution, documentation, and post-incident reviews (PIRs).

- Supervise and mentor junior SREs and infrastructure engineers, fostering knowledge-sharing and team growth.

- Collaborate across development, operations, and security teams to ensure secure and compliant deployments.

- Automate manual tasks and workflows through scripting and tooling (Python, Node.js, Bash, Ruby, JSON/YAML).

- Troubleshoot complex infrastructure issues across Linux, Windows, Docker, and cloud-native environments.

- Provide IaC and CI/CD best practices to ensure repeatability, scalability, and compliance across all environments.

- Provide on-call support, participate in incident rotations, and lead technical investigations during outages or degradations.

- Demonstrate a strong understanding of, and hands-on experience with, Disaster Recovery (DR).

- Provide support and solutions for assigned incidents and tickets.

## Skills

### Required
- Amazon Web Services (AWS)
- Terraform
- CloudFormation
- Infrastructure as Code (IaC)
- Automation tools
- CloudWatch
- Prometheus
- ELK
- Datadog
- Key Performance Indicators (KPIs)
- System uptime
- Latency
- Throughput
- Reliability
- Incident response
- Coordination
- Triage
- Resolution
- Documentation
- Post-incident reviews (PIRs)
- Junior SREs
- Infrastructure engineers
- Knowledge-sharing
- Team growth
- Development
- Operations
- Security
- Compliant deployments
- Scripting
- Tooling
- Python
- Node.js
- Bash
- Ruby
- JSON
- YAML
- Linux
- Windows
- Docker
- Cloud-native environments
- Disaster Recovery (DR)

---

Source: [Apply at razer.wd3.myworkdayjobs.com](https://razer.wd3.myworkdayjobs.com/en-US/Careers/job/Bangsar-South/Site-Reliability---Engineering-3_JR2025005670?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
