Description

Joining Razer will place you on a global mission to revolutionize the way the world games. We are seeking an experienced Senior AIOps Engineer to enhance the reliability, scalability, and operational intelligence of mission-critical payment platform infrastructure and services.

This role focuses on leveraging automation, advanced analytics, and AI-driven operational tooling to improve system observability, incident response efficiency, performance optimization, and proactive risk detection across high-throughput transaction processing environments.

The successful candidate will work closely with DevOps, SRE, Engineering, and Platform teams to design intelligent operational workflows that reduce manual intervention, improve service availability, and support continuous platform growth.

Key Responsibilities:

AIOps Platform Development & Automation

Design and implement intelligent automation solutions to improve operational efficiency and reduce repetitive infrastructure and application support tasks.
Develop tools and pipelines for automated incident triage, alert enrichment, and operational diagnostics.
Integrate AI/ML capabilities into monitoring, logging, and event management platforms.
Improve the alert signal-to-noise ratio by optimizing alerting strategies and anomaly detection mechanisms.
Other duties as assigned.

Observability Engineering & Operational Intelligence

Enhance monitoring frameworks covering infrastructure, applications, transaction flows, and distributed system dependencies.
Build intelligent dashboards and predictive insights to support proactive reliability management.
Analyze large-scale operational datasets including logs, metrics, traces, and transaction telemetry.
Define and track SLIs, SLOs, and reliability indicators for critical payment services.

Incident Prediction & Reliability Optimization

Implement predictive models and heuristics to identify early indicators of system degradation or failure.
Collaborate with SRE and platform teams to automate remediation workflows and self-healing mechanisms.
Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR) through intelligent automation and operational playbooks.
Contribute to resilience engineering initiatives including chaos testing and reliability simulations.

Platform Performance & Capacity Intelligence

Develop analytics to forecast workload growth, capacity requirements, and scaling thresholds.
Provide recommendations for infrastructure tuning, cost efficiency, and performance optimization.
Support engineering teams in identifying performance bottlenecks across compute, database, messaging, and network layers.

Security, Compliance & Governance Support

Ensure AI-driven operational tooling aligns with secure engineering practices and regulated environment requirements.
Support audit readiness through improved operational visibility and traceability.
Contribute to anomaly detection use cases related to infrastructure misuse or unusual operational patterns.

AI Innovation & Research Collaboration

Evaluate emerging AIOps tools, frameworks, and techniques for suitability in high-availability payment environments.
Prototype intelligent operational capabilities such as predictive incident correlation, automated runbook execution, intelligent deployment risk analysis, log summarization and pattern clustering, and transaction degradation early-warning signals.
Promote responsible AI adoption and knowledge sharing across engineering teams.

To apply, use the employer's original posting: https://razer.wd3.myworkdayjobs.com/en-US/Careers/job/Shah-Alam/Senior-AIOps-Engineer_JR2026007450