Description

We are a technology organisation operating high-performance, large-scale Linux production environments that support critical platforms and engineering teams. Our focus is on operational excellence, service reliability, automation, and continuous improvement. We run 24x7 operations and partner closely with platform, network, security, and engineering teams to deliver stable, secure, and scalable infrastructure.

You will lead and manage a 24x7 L1 Linux Engineering / SRE team operating in rotational shifts. Your responsibilities will include owning hiring, onboarding, performance management, coaching, and career development for L1 engineers. You will also own L1 production support operations for Linux systems in a 24x7 environment, acting as the first leadership escalation point during major production incidents.

Key responsibilities include ensuring adherence to SLAs, OLAs, and operational KPIs such as availability and MTTR. You will provide technical oversight across Linux OS, bare metal and virtualized platforms, and monitoring/logging systems. Driving automation adoption using Ansible, Bash, and Python to reduce manual toil is also a key aspect of this role.

You will partner with platform, network, security, and engineering teams to improve system reliability and resilience. Your impact will show in four areas: stable, reliable, and efficient 24x7 L1 Linux/SRE operations; lower incident recurrence with faster incident response and resolution; a skilled, motivated, and well-governed L1 engineering team; and greater operational maturity through automation, standardization, and documentation.

To succeed in this role, you will need 10–14+ years of experience in IT Infrastructure, Linux Operations, or SRE, with 4–6+ years of people management experience, preferably managing 24x7 support teams. You will also need a strong hands-on background in Linux system administration and production support, along with experience in incident management, on-call models, and rotational shifts. On the technical side, the role calls for advanced knowledge of Linux OS internals; experience with virtualization platforms (VMware, KVM, OpenStack, oVirt); knowledge of monitoring and logging tools (e.g., Nagios, ELK); experience with automation and configuration management (Ansible); and scripting skills in Bash and/or Python.

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://careers.synopsys.com/job/bengaluru/site-reliability-manager/44408/94212497792