Description
We are seeking a seasoned Principal Engineer to join the Core of Databricks Infrastructure. As the Technical Lead for Compute Fleet Management, you will set the standard for how Databricks consumes and optimizes compute across all three major clouds (AWS, Azure, and GCP).
Your mandate includes pioneering fleet optimization, delivering hyper-scale resilience, and owning the critical path. This is a mission-critical role with direct impact on our gross margin and customer experience.
Key responsibilities include:
- Provisioning and pooling on the order of billions of cloud resources to achieve peak workload performance, industry-leading efficiency, and robust resource isolation.
- Building the architecture that guarantees horizontal scaling and resilience against zonal or even cloud account-level failures, ensuring Databricks is always on.
- Leading the development of the lowest-dependency systems required to bootstrap and manage our massive compute platform.
The ideal candidate will have a track record of leading transformative projects, mastery of distributed systems, the ability to influence without authority, and disciplined execution.
In addition, highly desirable experience includes managing and scaling a massive fleet of GPUs for AI/ML workloads, as well as developing and operating large-scale distributed systems across all major clouds (AWS, Azure, and GCP).
Databricks is committed to fair and equitable compensation practices. The pay range for this role is $264,300-$322,300 USD.