Description
We are seeking a Senior Manager to lead the design, scaling, and operations of high-performance networking for GPU-based cloud infrastructure. This role is critical to enabling cloud gaming workloads, AI/ML training, and inference platforms by delivering ultra-low-latency, high-throughput, and highly reliable interconnects across data centers and cloud environments.
Key responsibilities include building and mentoring a specialized team of network architects, overseeing the design of intra-cluster and inter-cluster connectivity, driving technical tuning to reduce latency and increase throughput, defining the roadmap for networking strategies, engaging with ISPs to optimize low-latency edge networks, and implementing Infrastructure as Code (IaC) and observability frameworks.
The ideal candidate will have 12+ years of proven experience in networking, cloud infrastructure, or distributed systems, with 5+ years of experience directly managing technical teams. They should have mastery of data center networking, including Clos/spine-leaf architectures and RDMA-based high-performance fabrics such as RoCE or InfiniBand.
Additionally, the candidate should have hands-on experience with BGP, EVPN/VXLAN, and kernel-level development for routing and switching, and should be skilled in using Ansible or Terraform for infrastructure automation, paired with monitoring tools such as Prometheus and Grafana.
A Bachelor’s or Master’s degree in Computer Science or a related engineering field is required, and relevant top-tier certifications, such as CCIE or specialized cloud networking designations, are a plus.