Description

We're looking for an experienced, passionate engineer to play a key role in site reliability engineering and cloud operations for our global cloud infrastructure.

As a Site Reliability Engineer at VGS, you will be responsible for designing and optimizing infrastructure for high availability, fault tolerance, and performance across distributed systems. You will also lead incident management and root cause analysis, ensuring swift resolution of issues and driving post-incident improvements to prevent recurrences.

You will build and maintain automated monitoring, alerting, and self-healing systems that improve system health, reduce manual intervention, and minimize downtime. You will also identify bottlenecks and optimization opportunities, and implement scaling strategies to handle traffic spikes and growing workloads efficiently.

You will collaborate with cross-functional teams, including software engineers, product teams, and DevOps, to enhance system reliability and delivery pipelines. You will also champion continuous improvement initiatives in deployment, scaling, and performance testing, while advocating for the adoption of SRE best practices across the organization.

As a mentor and leader, you will provide technical guidance to junior engineers, contribute to strategic infrastructure decisions, and ensure best practices are implemented at scale.

We rely on your feedback to build a world-class product, and we believe in the core values of transparency, collaboration, grit, and humility. We're a remote-first organization, but we also value connection, collaboration, and the energy that comes from a great brainstorm, a team lunch, or celebrating a big win in person.

This listing is enriched and indexed by YubHub. To apply, use the employer's original posting: https://jobs.lever.co/verygoodsecurity/a9c9ae14-c48a-41de-a0d4-4ad8a720e7ee