Description
As a Sr. Manager of the Data & AI Support Engineering team, you will lead and manage a team of Technical Solutions Engineers responsible for driving deep technical resolutions for complex customer issues across Spark, AI/ML, Streaming, and Lakehouse platforms.
You will help customers realise business value from Databricks Ecosystem products through strong technical leadership, AI-first operational innovation and customer-centric execution.
Mission:
Lead and scale a world-class AI-first Data & AI Support Engineering organisation that combines deep technical expertise, operational excellence, intelligent automation and customer-centric support to accelerate issue resolution, improve platform reliability and drive exceptional customer outcomes across enterprise-scale Data and AI workloads.
Key Responsibilities:
- Build AI-enabled support workflows and reusable automations to improve resolution speed and support quality.
- Safely use agentic AI systems, logs, telemetry, observability platforms and internal systems to accelerate troubleshooting and root-cause analysis.
- Create reusable runbooks, prompts, and agentic workflows that scale operational efficiency across teams.
- Ensure strong AI governance, customer data safety, validation practices, auditability, and human-in-the-loop controls.
- Partner with Engineering and Product teams to drive AI-first support innovation and operational excellence.
Outcomes:
- Drive AI-first support transformation initiatives that improve resolution speed, case quality, operational efficiency and customer experience.
- Partner with Engineering and Product teams to operationalise AI-assisted diagnostics, observability insights, and intelligent escalation management for enterprise customers.
- Build and scale reusable AI-enabled workflows, automations, runbooks, and operational intelligence frameworks across the support organisation.
- Lead and manage Technical Solutions Engineers, Team Leads, and support operations personnel across AMER support functions based out of the Dallas location.
- Own and improve operational KPIs including customer satisfaction, escalation management, backlog health, resolution efficiency, and support quality.
- Act as a senior escalation point for customers and internal teams while driving operational excellence and process optimisation.
- Lead hiring, onboarding, mentoring, technical assessments, training, and career development for support engineers and technical leads.
- Conduct regular one-on-ones, annual reviews, and career development discussions with direct reports.
- Be a hands-on technical leader supporting complex issues related to Spark Core, Spark SQL, Structured Streaming, Delta Lake, Lakehouse architecture, and Databricks Runtime technologies.
- Guide customers on Spark runtime optimisation, distributed systems performance, and best practices for scalable Data & AI workloads.
- Own Engineering JIRA escalations and proactively drive faster resolutions for customer-reported product issues.
- Maintain internal operational documentation, runbooks, and customer-facing knowledge base assets.
- Coordinate closely with Engineering, Backline Support Engineering, and customer experience intelligence teams to identify, reproduce, and report product defects effectively.
- Act as a strong customer advocate and collaborate with cloud partners to support mutual customer success.
- Participate in major incident management, escalation handling, on-call rotations, and critical production support activities.
Requirements:
- 10+ years of experience designing, building, troubleshooting, and supporting large-scale Data & AI applications using Python, Java, Scala, Spark, or related distributed technologies.
- Strong working experience with AI-enabled support workflows, agentic AI systems, Claude Skills workflows, RAG architectures, vector databases, and other operational automation frameworks.
- Proven development and delivery experience at production scale with Databricks technologies such as Model Serving, Lakehouse, Delta, DLT, Lakeflow, and Lakebase is a strong plus.
- Experience using AI tools for troubleshooting, root-cause analysis, observability analysis, and support workflow acceleration.
- Strong hands-on expertise in Apache Spark, Spark SQL, Structured Streaming, Delta Lake, and distributed data processing systems.
- Experience leading production-scale workloads across Big Data, Hadoop, AI/ML, Kafka, Streaming, Data Science, or Analytics platforms.
- Strong troubleshooting and performance tuning experience for Spark and JVM-based distributed systems, including memory management, garbage collection, heap analysis, and thread dump analysis.
- Hands-on experience with AWS, Azure, or GCP cloud platforms.
- Proven experience managing globally distributed technical teams and handling high-severity customer escalations.
- Strong analytical, debugging, problem-solving, and distributed systems troubleshooting skills.
- Excellent written and verbal communication skills with strong customer-facing leadership abilities.
- Strong organisational, multitasking, stakeholder management, and operational leadership capabilities.