Description
Joining Razer will place you on a global mission to revolutionize the way the world games. We are seeking a skilled and driven Senior Site Reliability Engineer (SRE) to join our growing infrastructure and platform engineering team.
The ideal candidate will have hands-on experience with Amazon Web Services (AWS), strong troubleshooting capabilities, and a passion for building scalable, observable, and resilient systems using modern Infrastructure as Code (IaC) and automation tools.
Responsibilities:
- Design, develop, and maintain Infrastructure as Code (IaC) using tools like Terraform or AWS CloudFormation, leveraging AI coding assistants to accelerate development and enforce best practices.
- Implement and operate reliable, scalable cloud infrastructure primarily on AWS (e.g., EC2, ECS, RDS, S3, Lambda, ElastiCache, SQS, SES, Auto Scaling, Load Balancers).
- Lead and participate in architecture reviews focusing on the reliability, scalability, security, performance, and cost-efficiency of infrastructure.
- Develop and manage robust monitoring, alerting, and logging solutions (e.g., CloudWatch, Prometheus, Grafana, ELK), incorporating AIOps tools for predictive alerting, anomaly detection, and reducing alert fatigue.
- Manage incidents, conduct postmortems and root cause analysis, and implement continuous improvement strategies, using AI-driven analytics to rapidly summarize logs and traces during outages.
- Collaborate with software engineering teams to improve CI/CD pipelines, deployment automation, release management, and the deployment lifecycles of machine learning models.
- Automate infrastructure operations, reduce manual toil, and improve reliability using scripting (Python, Bash, Node.js, or Ruby) and AI-powered workflow automation.
- Maintain and troubleshoot environments involving web servers, databases, firewalls, DNS, load balancers, and networking.
- Ensure systems are compliant with security standards, including patching, hardening, secure access policies, and data privacy constraints specific to AI training data.
- Provide on-call support and participate in incident rotations.
- Monitor and maintain service-level objectives (SLOs), service-level agreements (SLAs), and error budgets to ensure reliability targets are met.
- Support and resolve assigned incidents and tickets.
To apply, use the employer's original posting:
https://razer.wd3.myworkdayjobs.com/en-US/Careers/job/Chengdu/Site-Reliability-Engineer_JR2026007407