Description

Joining Razer will place you on a global mission to revolutionize the way the world games. We are seeking a skilled and driven Senior Site Reliability Engineer (SRE) to join our growing infrastructure and platform engineering team.

The ideal candidate will have hands-on experience with Amazon Web Services (AWS), strong troubleshooting capabilities, and a passion for building scalable, observable, and resilient systems using modern Infrastructure as Code (IaC) and automation tools.

Responsibilities:

Design, develop, and maintain Infrastructure as Code (IaC) using tools like Terraform or AWS CloudFormation, leveraging AI coding assistants to accelerate development and enforce best practices.
Implement and operate reliable, scalable cloud infrastructure primarily on AWS (e.g., EC2, ECS, RDS, S3, Lambda, ElastiCache, SQS, SES, Auto Scaling, Load Balancers).
Lead and participate in architecture reviews focusing on the reliability, scalability, security, performance, and cost-efficiency of infrastructure.
Develop and manage robust monitoring, alerting, and logging solutions (e.g., CloudWatch, Prometheus, Grafana, ELK), incorporating AIOps tools for predictive alerting, anomaly detection, and reducing alert fatigue.
Perform incident management, postmortems, and root cause analysis, and implement continuous improvement strategies, using AI-driven analytics to rapidly summarize logs and traces during outages.
Collaborate with software engineering teams to improve CI/CD pipelines, deployment automation, release management, and the deployment lifecycles of machine learning models.
Automate infrastructure operations, reduce manual toil, and improve reliability using scripting (Python, Bash, Node.js, or Ruby) and AI-powered workflow automation.
Maintain and troubleshoot environments involving web servers, databases, firewalls, DNS, load balancers, and networking.
Ensure systems are compliant with security standards, including patching, hardening, secure access policies, and data privacy constraints specific to AI training data.
Provide on-call support and participate in incident rotations.
Monitor and maintain service-level objectives (SLOs), service-level agreements (SLAs), and error budgets to ensure reliability targets are met.
Support and resolve assigned incidents and tickets.

To apply, use the employer's original posting: https://razer.wd3.myworkdayjobs.com/en-US/Careers/job/Chengdu/Site-Reliability-Engineer_JR2026007407