Description
We are looking for a highly motivated and skilled Incident Response Engineer to join our Facility Operations Center (FOC) team. In this critical role, you will be responsible for coordination and presentation within NVIDIA’s datacenters, with a specific focus on incident response, vendor support, and maintenance performance.
The primary role is to coordinate and communicate across NVIDIA’s datacenter portfolio from an operations perspective on incidents, maintenance, and reporting/monitoring. Develop standards and programs in support of reliability and operations initiatives, including Problem and Change Control. Define and maintain a health score for sites and environments; this includes testing methods to predict and isolate points of failure, assessing and advising on maintenance strategies, and providing related reporting and metrics.
Study failure data and work with machine learning and AI teams and tools to predict future failures. Facilitate reliability studies such as criticality assessments, RAM (reliability, availability, and maintainability) models, and RCM (reliability-centered maintenance) studies. Identify and drive automation & process improvement opportunities across catalog quality workflows and reporting.
Coordinate disaster recovery tests, liaise during audits, and collaborate with internal partners to drive progress on business continuity and compliance. Perform risk assessments to ensure compliance with policies, procedures, rules & regulations, and data center standards.
Own and present end-to-end key business metrics related to incident response, and own and represent the related internal and external tooling. Lead root cause analysis for outages and update documentation, workflows, and operating procedures to prevent recurrence.
Assess process improvement & transformation opportunities and partner with process owners & collaborators to scope opportunities, define problem statements and objectives, and structure projects and teams.
Work cross-functionally with other team members and groups within the organization, and develop strong, productive relationships across peer organizations that further the organization's business objectives. This includes training, coaching, and mentoring Operations teams as needed so they can use operations tools and systems to meet daily business needs.
Other projects and duties as assigned.