# Site Reliability Engineer

**Company**: Razer
**Location**: Chengdu, Bangsar South
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology

**Apply**: https://razer.wd3.myworkdayjobs.com/en-US/Careers/job/Chengdu/Site-Reliability-Engineer_JR2026007407?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_353b1961-eb6

## Description

Joining Razer will place you on a global mission to revolutionize the way the world games. We are seeking a skilled and driven Senior Site Reliability Engineer (SRE) to join our growing infrastructure and platform engineering team.

The ideal candidate will have hands-on experience in Amazon Web Services (AWS), strong troubleshooting capabilities, and a passion for building scalable, observable, and resilient systems using modern Infrastructure as Code (IaC) and automation tools.

**Responsibilities:**

- Design, develop, and maintain Infrastructure as Code (IaC) using tools like Terraform or AWS CloudFormation, leveraging AI coding assistants to accelerate development and enforce best practices.

- Implement and operate reliable, scalable cloud infrastructure primarily on AWS (e.g., EC2, ECS, RDS, S3, Lambda, ElastiCache, SQS, SES, Auto Scaling, Load Balancers).

- Lead and participate in architecture reviews focusing on the reliability, scalability, security, performance, and cost-efficiency of infrastructure.

- Develop and manage robust monitoring, alerting, and logging solutions (e.g., CloudWatch, Prometheus, Grafana, ELK), incorporating AIOps tools for predictive alerting, anomaly detection, and reducing alert fatigue.

- Perform incident management, postmortems, and root cause analysis, and implement continuous improvement strategies, using AI-driven analytics to rapidly summarize logs and traces during outages.

- Collaborate with software engineering teams to improve CI/CD pipelines, deployment automation, release management, and the deployment lifecycles of machine learning models.

- Automate infrastructure operations, reduce manual toil, and improve reliability using scripting (Python, Bash, Node.js, or Ruby) and AI-powered workflow automation.

- Maintain and troubleshoot environments involving web servers, databases, firewalls, DNS, load balancers, and networking.

- Ensure systems are compliant with security standards, including patching, hardening, secure access policies, and data privacy constraints specific to AI training data.

- Provide on-call support and participate in incident rotations.

- Monitor and maintain service-level objectives (SLOs), SLAs, and error budgets to ensure reliability targets are met.

- Provide support and resolution for assigned incidents and tickets.

## Skills

### Required
- Amazon Web Services (AWS)
- Infrastructure as Code (IaC)
- Terraform
- CloudFormation
- Automation
- Scripting (Python, Bash, Node.js, or Ruby)
- AIOps
- Monitoring
- Alerting
- Logging
- CloudWatch
- Prometheus
- Grafana
- ELK
- CI/CD pipelines
- Deployment automation
- Release management
- Machine learning models
- Web servers
- Databases
- Firewalls
- DNS
- Load balancers
- Networking
- Security standards
- Patching
- Hardening
- Secure access policies
- Data privacy constraints

---

Source: [Apply at razer.wd3.myworkdayjobs.com](https://razer.wd3.myworkdayjobs.com/en-US/Careers/job/Chengdu/Site-Reliability-Engineer_JR2026007407?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)