Description
This is a small but growing team responsible for the infrastructure and operations behind core developer tools used across the entire engineering organization. You'll own the full lifecycle (patching, upgrades, backups, scaling, and incident response) for services that every engineer depends on daily. The role blends DevOps, SRE, and software engineering, and is ideal for engineers who want high ownership and company-wide impact. You should have a mindset of continuous improvement: if something is manual and repetitive, your instinct should be to automate it away. As the company's on-prem infrastructure footprint grows, this team will expand its scope to provide SRE capabilities for on-prem systems, making this an opportunity to help shape that practice from the ground up.
- Own the lifecycle of core self-hosted developer tools (e.g., GitHub Enterprise Server, CircleCI, JFrog Artifactory/Xray)
- Design and implement automated systems for patching, backups (with validation), and upgrades
- Scale infrastructure to support a fast-growing engineering org
- Use Infrastructure-as-Code (Terraform) to manage environments
- Operate and troubleshoot systems using Docker, Kubernetes, and cloud platforms (AWS, GCP, Azure)
- Define and maintain SLOs for service availability, reliability, and performance
- Build and maintain monitoring, alerting, and observability for developer tool services
- Lead and participate in incident response and root cause analysis
- Work cross-functionally with platform, security, infrastructure (on-prem and cloud), and software teams