# Director, Site Reliability Engineer | Senior Engineering Team Director

**Company**: BlackRock
**Location**: England
**Work arrangement**: hybrid
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Finance
**Wikidata**: https://www.wikidata.org/wiki/Q219635

**Apply**: https://jobs.workable.com/view/cLBuSgz7avHiG3cKzS91ZB/director%2C-site-reliability-engineer-%7C-senior-engineering-team-director-in-england-at-blackrock
**Canonical**: https://yubhub.co/jobs/job_49ef318f-90a

## Description

We're seeking a Site Reliability Engineering (SRE) Lead to design, build, and maintain resilient, high-scale systems supporting BlackRock's Private Markets platform. In this hands-on leadership role, you'll apply deep engineering expertise to solve complex challenges, guide a global team, shape technical direction, and communicate effectively with senior stakeholders, ensuring the reliability of mission-critical systems that power private market investment workflows and decision-making. You will drive the adoption of AI-driven solutions to accelerate incident detection and triage, reduce toil, improve forecasting and capacity planning, and strengthen end-to-end observability and resilience.

Key Responsibilities:

- Take ownership of project priorities, deadlines and deliverables using Agile methodologies, with clear outcomes around reliability automation and AI-enabled operations

- Understand and refine business and functional requirements, translating them into SLOs/SLIs and AI-assisted observability and support capabilities

- Take a hands-on approach to getting work done: this role requires a “roll your sleeves up” mentality, including building and operationalizing reliability tooling and automation that measurably reduces toil and improves stability

- Be a leader with vision and a partner in brainstorming solutions for team productivity and efficiency to improve engineering effectiveness

- Drive priority setting of the engineering teams, balancing foundational reliability work with delivery of new product features

- Improve Engineering culture by encouraging continuous focus on reliability across the entire application lifecycle, and by adopting AI-enabled SRE practices (e.g., intelligent alerting, automated diagnosis, and self-healing where appropriate)

- Participate proactively in architectural and design decisions, including AI-ready telemetry, data quality, and model integration patterns for operational analytics

- Design and implement end-to-end monitoring solutions for application and infrastructure components, leveraging modern observability platforms plus AI/ML techniques for anomaly detection, correlation, and alert noise reduction

- Drive the engineering of capacity management and demand forecasting solutions, including predictive analytics/ML approaches where they add measurable value

- Act as a culture carrier and leader, passing on SRE knowledge and best practices to the engineering team

- Drive detailed root cause investigations for production incidents with rigorous focus on issue avoidance, using AI-assisted correlation/analysis to accelerate time-to-insight

- Create/coordinate retros for significant incidents, ensuring learnings are captured in automated/AI-assisted runbooks and embedded into prevention mechanisms

- Perform additional core engineering functions, such as adding custom telemetry (metrics, logs, and traces) to the code base of in-scope applications to enable AI/ML-driven operational insights

- Anticipate new opportunities to continuously evolve the resiliency profile of scoped applications and infrastructure

Requirements:

- B.S. / M.S. degree in Computer Science, Engineering or a related discipline with 10+ years of experience

- Experience leading high-performing engineering/SRE teams, with a track record of driving continuous improvement through automation and AI-enabled operations

- Demonstrated ability to represent engineering/SRE priorities, status, and risk to senior leadership stakeholders with clear, executive-ready communication

- Hands-on experience building or operating AI-assisted capabilities (AIOps, ML-based anomaly detection, or GenAI workflows) in an engineering/production environment

- A passion for providing engineering support for highly available, performant full-stack applications with a “Student of Technology” attitude

- Experience with relational and NoSQL databases (e.g., Redis, Apache Cassandra)

Benefits:

- Retirement investment and tools designed to help you build a sound financial future

- Access to education reimbursement

- Comprehensive resources to support your physical health and emotional well-being

- Family support programs

- Flexible Time Off (FTO) so you can relax, recharge and be there for the people you care about

Hybrid Work Model:

- BlackRock’s hybrid work model is designed to enable a culture of collaboration and apprenticeship that enriches the experience of our employees, while supporting flexibility for all

- Employees are currently required to work at least 4 days in the office per week, with the flexibility to work from home 1 day a week

- Some business groups may require more time in the office due to their roles and responsibilities

- We remain focused on increasing the impactful moments that arise when we work together in person – aligned with our commitment to performance and innovation

About BlackRock:

- At BlackRock, we are all connected by one mission: to help more and more people experience financial well-being

- Our clients, and the people they serve, are saving for retirement, paying for their children’s educations, buying homes and starting businesses

- Their investments also help to strengthen the global economy: support businesses small and large; finance infrastructure projects that connect and power cities; and facilitate innovations that drive progress

## Skills

### Required
- Site Reliability Engineering
- Agile Methodologies
- Reliability Automation
- AI-Enabled Operations
- Business Requirements
- Functional Requirements
- SLOs/SLIs
- Observability
- Support Capabilities
- Reliability Tooling
- Automation
- Stability
- Leadership
- Vision
- Team Productivity
- Efficiency
- Engineering Effectiveness
- Priority Setting
- Foundational Reliability
- New Product Features
- Engineering Culture
- Reliability Across Application Lifecycle
- AI-Enabled SRE Practices
- Intelligent Alerting
- Automated Diagnosis
- Self-Healing
- Architectural Decisions
- AI-Ready Telemetry
- Data Quality
- Model Integration Patterns
- Operational Analytics
- Monitoring Solutions
- Application Components
- Infrastructure Components
- Anomaly Detection
- Correlation
- Alert Noise Reduction
- Capacity Management
- Demand Forecasting
- Predictive Analytics
- ML Approaches
- Root Cause Investigations
- Production Incidents
- Issue Avoidance
- AI-Assisted Correlation
- Time-To-Insight
- Retros
- Significant Incidents
- Learnings
- Runbooks
- Prevention Mechanisms
- Custom Telemetry Metrics
- Logs
- Traces
- AI/ML-Driven Operational Insights
- Resiliency Profile
- Scoped Applications
- Infrastructure
- Relational Database
- NoSQL Database
- Redis
- Apache Cassandra
