Description
Summary
Microsoft AI are looking for a talented Member of Technical Staff, HPC Operations Engineering Manager to join their MAI SuperIntelligence Team. This role sits at the heart of strategic decision-making, turning market data into actionable insights for a company that's revolutionising haptic entertainment technology. You'll work directly with leadership to shape the company's direction in the cinema and simulation markets.
About the Role
In this role, you'll lead a team of Site Reliability Engineers who blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You'll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power training, fine-tuning, and serving generative AI models.
Accountabilities
- Conduct in-depth market research across cinema and simulation sectors, identifying emerging trends, competitive threats, and partnership opportunities that directly inform the company's quarterly strategic planning sessions
- Lead a team of experienced SREs to ensure uptime, resiliency and fault tolerance of AI model training and inference systems
The Candidate we're looking for
Experience:
- 8+ years technical engineering experience with Site Reliability Engineering, DevOps, or Infrastructure Engineering Leadership roles
Technical skills:
- Kubernetes, Docker, and container orchestration
- Public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
Personal attributes:
- Low ego individual
Benefits
- Competitive salary
- Benefits and other compensation