Description
Summary
Microsoft AI is looking for a talented Member of Technical Staff, Pre-Training Infrastructure, to help build the next wave of capabilities for our personalized AI assistant, Copilot. We’re seeking someone who brings an abundance of positive energy, empathy, and kindness to the team every day, in addition to being highly effective.
About the Role
We are seeking a highly skilled and experienced engineer to join our team as a Member of Technical Staff, Pre-Training Infrastructure. The successful candidate will be responsible for designing, implementing, testing, and optimizing distributed training infrastructure in Python and C++ for large-scale GPU clusters. They will also profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems.
Accountabilities
- Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters.
- Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems.
The Candidate We're Looking For
Experience:
- Bachelor’s Degree in Computer Science or a related technical field AND 6+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.
Technical skills:
- Experience in distributed computing and large-scale systems.
- Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch.
Personal attributes:
- Proven ability to profile, benchmark, and optimize performance-critical systems.
- Experience in leading technical projects and supporting architectural decisions with data.
Benefits
- Competitive salary and benefits package.
- Opportunity to work on cutting-edge AI projects.
- Collaborative and dynamic work environment.