Description
As a Senior Software Engineer within our Compute Architecture organization, you will help build the software control plane for hardware lifecycle management across large-scale GPU data centers. The METALDEV team builds Go-based distributed services that bring infrastructure online, monitor production hardware health, automate safe operational workflows, and give operators the observability and control needed to manage GPU servers and rack-scale systems with reliability and confidence.
This is a software-first role at the intersection of distributed systems, production reliability, and hardware-aware automation, ideal for engineers who want their code to operate real-world infrastructure at massive scale.
Key responsibilities include:
- Design, build, and operate Go-based services that manage the lifecycle of large-scale GPU data center infrastructure.
- Build automation for data center bring-up, hardware discovery, health monitoring, remediation, and production operations.
- Develop reliable APIs, services, and workflows for managing baseboard management controllers (BMCs), firmware state, server health, and rack-level infrastructure.
- Improve observability, alerting, and operational tooling so production issues can be detected, understood, and resolved quickly.
- Translate incidents and hardware failure modes into software improvements that make the platform more resilient.
- Partner with hardware-adjacent, infrastructure, operations, and software teams to design systems that work safely at fleet scale.
We're looking for someone with 5+ years of experience building and operating infrastructure or backend systems, strong proficiency in Go, and experience designing and building gRPC and REST APIs. Experience running Kubernetes and containerized workloads in production is a plus.