Description
We're seeking a Site Reliability Engineering (SRE) Lead to design, build, and maintain resilient, high-scale systems supporting BlackRock's Private Markets platform. In this hands-on leadership role, you'll apply deep engineering expertise to solve complex challenges, guide a global team, shape technical direction, and communicate effectively with senior stakeholders, ensuring the reliability of mission-critical systems that power private market investment workflows and decision-making. You will drive the adoption of AI-driven solutions to accelerate incident detection and triage, reduce toil, improve forecasting and capacity planning, and strengthen end-to-end observability and resilience.
Key Responsibilities:
- Take ownership of project priorities, deadlines and deliverables using Agile methodologies, with clear outcomes around reliability automation and AI-enabled operations
- Understand and refine business and functional requirements, translating them into SLOs/SLIs and AI-assisted observability and support capabilities
- Take a hands-on approach to getting work done: this role requires a “roll your sleeves up” mentality, including building and operationalizing reliability tooling and automation that measurably reduces toil and improves stability
- Lead with vision and partner with the team in brainstorming solutions that improve productivity, efficiency, and engineering effectiveness
- Drive priority setting of the engineering teams, balancing foundational reliability work with delivery of new product features
- Improve Engineering culture by encouraging continuous focus on reliability across the entire application lifecycle, and by adopting AI-enabled SRE practices (e.g., intelligent alerting, automated diagnosis, and self-healing where appropriate)
- Participate proactively in architectural and design decisions, including AI-ready telemetry, data quality, and model integration patterns for operational analytics
- Design and implement end-to-end monitoring solutions for application and infrastructure components, leveraging modern observability platforms plus AI/ML techniques for anomaly detection, correlation, and alert noise reduction
- Drive the engineering of capacity management and demand forecasting solutions, including predictive analytics/ML approaches where they add measurable value
- Act as a culture carrier and leader, passing on SRE knowledge and best practices to the engineering team
- Drive detailed root cause investigations for production incidents with rigorous focus on issue avoidance, using AI-assisted correlation/analysis to accelerate time-to-insight
- Create/coordinate retros for significant incidents, ensuring learnings are captured in automated/AI-assisted runbooks and embedded into prevention mechanisms
- Perform additional core engineering functions, such as adding custom telemetry (metrics, logs, and traces) to the code base of in-scope applications to enable AI/ML-driven operational insights
- Anticipate new opportunities to continuously evolve the resiliency profile of scoped applications and infrastructure
Requirements:
- B.S. / M.S. degree in Computer Science, Engineering or a related discipline with 10+ years of experience
- Experience leading high performing engineering/SRE teams, with a track record of driving continuous improvement through automation and AI-enabled operations
- Demonstrated ability to represent engineering/SRE priorities, status, and risk to senior leadership stakeholders with clear, executive-ready communication
- Hands-on experience building or operating AI-assisted capabilities (AIOps, ML-based anomaly detection, or GenAI workflows) in an engineering/production environment
- A passion for providing engineering support for highly available, performant full stack applications with a “Student of Technology” attitude
- Experience with relational databases and NoSQL databases (e.g., Redis, Apache Cassandra)
Benefits:
- Retirement investment and tools designed to help you build a sound financial future
- Access to education reimbursement
- Comprehensive resources to support your physical health and emotional well-being
- Family support programs
- Flexible Time Off (FTO) so you can relax, recharge and be there for the people you care about
Hybrid Work Model:
- BlackRock’s hybrid work model is designed to enable a culture of collaboration and apprenticeship that enriches the experience of our employees, while supporting flexibility for all
- Employees are currently required to work at least 4 days in the office per week, with the flexibility to work from home 1 day a week
- Some business groups may require more time in the office due to their roles and responsibilities
- We remain focused on increasing the impactful moments that arise when we work together in person – aligned with our commitment to performance and innovation
About BlackRock:
- At BlackRock, we are all connected by one mission: to help more and more people experience financial well-being
- Our clients, and the people they serve, are saving for retirement, paying for their children’s education, buying homes and starting businesses
- Their investments also help to strengthen the global economy: support businesses small and large; finance infrastructure projects that connect and power cities; and facilitate innovations that drive progress