Razer

Site Reliability Engineer

Onsite, senior, full-time. Locations: Chengdu, Bangsar South

First indexed 26 May 2026

Description

Joining Razer will place you on a global mission to revolutionize the way the world games. We are seeking a skilled and driven Senior Site Reliability Engineer (SRE) to join our growing infrastructure and platform engineering team.

The ideal candidate will have hands-on experience with Amazon Web Services (AWS), strong troubleshooting capabilities, and a passion for building scalable, observable, and resilient systems using modern Infrastructure as Code (IaC) and automation tools.

Responsibilities:

  • Design, develop, and maintain Infrastructure as Code (IaC) using tools like Terraform or AWS CloudFormation, leveraging AI coding assistants to accelerate development and enforce best practices.
  • Implement and operate reliable, scalable cloud infrastructure primarily on AWS (e.g., EC2, ECS, RDS, S3, Lambda, ElastiCache, SQS, SES, Auto Scaling, Load Balancers).
  • Lead and participate in architecture reviews focusing on the reliability, scalability, security, performance, and cost-efficiency of infrastructure.
  • Develop and manage robust monitoring, alerting, and logging solutions (e.g., CloudWatch, Prometheus, Grafana, ELK), incorporating AIOps tools for predictive alerting, anomaly detection, and reducing alert fatigue.
  • Manage incidents, run postmortems and root cause analysis, and implement continuous improvement strategies, using AI-driven analytics to rapidly summarize logs and traces during outages.
  • Collaborate with software engineering teams to improve CI/CD pipelines, deployment automation, release management, and the deployment lifecycles of machine learning models.
  • Automate infrastructure operations, reduce manual toil, and improve reliability using scripting (Python, Bash, Node.js, or Ruby) and AI-powered workflow automation.
  • Maintain and troubleshoot environments involving web servers, databases, firewalls, DNS, load balancers, and networking.
  • Ensure systems are compliant with security standards, including patching, hardening, secure access policies, and data privacy constraints specific to AI training data.
  • Provide on-call support and participate in incident rotations.
  • Monitor and maintain service-level objectives (SLOs), SLAs, and error budgets to ensure reliability targets are met.
  • Support and resolve assigned incidents and tickets.
This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://razer.wd3.myworkdayjobs.com/en-US/Careers/job/Chengdu/Site-Reliability-Engineer_JR2026007407