Description

Joining Razer will place you on a global mission to revolutionize the way the world games. We are seeking an experienced AIOps Engineer to enhance the reliability, performance, and operational intelligence of mission-critical payment platform infrastructure and services.

This role focuses on designing and implementing intelligent automation solutions, improving observability practices, and leveraging AI-assisted operational capabilities to proactively detect system risks and optimize platform performance. The AIOps Engineer will work closely with DevOps, SRE, and Software Engineering teams to reduce operational overhead, improve incident response efficiency, and support scalable transaction processing environments.

Operational Intelligence & Observability Engineering

Design and improve monitoring strategies covering infrastructure, applications, transaction flows, and distributed system dependencies.
Build dashboards and alerting frameworks that provide actionable operational insights.
Analyze logs, metrics, traces, and telemetry data to identify performance degradation patterns and reliability risks.
Define and track service reliability indicators such as SLIs and SLOs.

Automation & Intelligent Workflow Implementation

Develop automation scripts and workflows to reduce repetitive operational tasks and improve system recovery speed.
Implement intelligent alert enrichment and automated incident triage mechanisms.
Improve signal-to-noise ratio by tuning anomaly detection thresholds and alert correlation logic.
Support creation of self-healing mechanisms for common infrastructure or application failure scenarios.

Incident Response & Reliability Improvement

Participate in incident investigations and lead technical diagnosis for complex operational issues.
Identify systemic reliability weaknesses and recommend engineering improvements.
Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR) through improved tooling and automation.
Contribute to resilience engineering initiatives such as failover validation or controlled fault simulations.

Performance & Capacity Intelligence

Analyze operational data to forecast infrastructure capacity requirements and scaling thresholds.
Provide recommendations for performance optimization across compute, database, messaging, and networking components.
Support cost efficiency initiatives through workload behavior analysis.

AI-Enabled Operational Innovation

Implement AI use cases such as:
Anomaly detection for infrastructure metrics
Log clustering and automated incident summarization
Predictive scaling signals based on workload patterns
Deployment risk analysis using historical operational data
Operational insight dashboards for proactive decision-making
Evaluate emerging AIOps tools and integrate suitable capabilities into monitoring and automation platforms.
Collaborate with R&D and platform teams to validate intelligent automation solutions in production environments.

Security & Compliance Awareness

Ensure operational tooling and automation workflows align with secure engineering practices and regulatory expectations.
Support audit readiness by maintaining visibility and traceability of monitoring configurations and operational actions.
Contribute to detection use cases related to abnormal infrastructure or system behavior.

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://razer.wd3.myworkdayjobs.com/en-US/Careers/job/Shah-Alam/AIOps-Engineer_JR2026007451