# Member of Technical Staff, HPC Operations Engineering Manager

**Company**: Microsoft AI
**Location**: Mountain View
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Salary**: USD $139,900 – $274,800 per year
**Category**: Engineering
**Industry**: Technology

**Apply**: https://microsoft.ai/job/member-of-technical-staff-hpc-operations-engineering-manager-mai-superintelligence-team/
**Canonical**: https://yubhub.co/jobs/job_2b3a3ab9-2bc

## Description

## Summary

Microsoft AI is looking for a talented Member of Technical Staff, HPC Operations Engineering Manager to join the MAI SuperIntelligence Team. The role leads the Site Reliability Engineering team responsible for the uptime and resiliency of the infrastructure behind AI model training and inference.

## About the Role

In this role, you'll lead a team of Site Reliability Engineers who blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You'll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power training, fine-tuning, and serving generative AI models.

## Accountabilities

- Lead a team of experienced SREs to ensure uptime, resiliency and fault tolerance of AI model training and inference systems

## The Candidate We're Looking For

**Experience:**

- 8+ years of technical engineering experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering leadership roles

**Technical skills:**

- Kubernetes, Docker, and container orchestration

- Public cloud platforms like Azure/AWS/GCP and infrastructure-as-code

**Personal attributes:**

- Low-ego individual

## Benefits

- Competitive salary

- Benefits and other compensation

## Skills

### Required
- Kubernetes
- Docker
- container orchestration
- public cloud platforms
- infrastructure-as-code

### Nice to have
- monitoring & observability tools
- Grafana
- Datadog
- OpenTelemetry
