Description
We are seeking a highly skilled Senior Performance Engineer to join our Performance and R&D organisations. In this role, you will help build and evolve systems that support performance analysis, telemetry, and optimisation for large-scale GPU- and CPU-based clusters used in AI and high-performance computing environments.
You will work closely with hardware, networking, firmware, and software teams to collect, analyse, and interpret performance data from live systems. This is a fast-paced R&D environment where system behaviour and requirements evolve rapidly, so the role calls for adaptable engineering solutions and strong analytical thinking.
Key responsibilities include:
- Profiling, benchmarking, and analysing AI and HPC workloads on GPU and CPU clusters
- Exploring performance characteristics of high-performance networking and collective communications (e.g., NCCL, RDMA, MPI, RoCE)
- Identifying performance bottlenecks across networking, compute, memory, and system architecture
- Developing and enhancing performance analysis, benchmarking, and diagnostic tools
- Defining performance test plans and establishing expectations for new technologies and platforms
- Collaborating across hardware, firmware, networking, systems, and software teams to provide actionable performance insights
Requirements include:
- B.Sc. or M.Sc. in Computer Science, Computer Engineering, Software Engineering, or equivalent experience
- 5+ years of experience in performance analysis, systems engineering, or HPC/AI infrastructure
- Demonstrated expertise in performance analysis methodologies
- Hands-on experience with high-performance networking (RDMA, MPI, NCCL, congestion control)
- Strong understanding of system performance metrics (latency, throughput, resource utilisation)
- Exposure to hardware, firmware, or embedded telemetry environments
- Strong analytical, problem-solving, and communication skills
- Ability to work effectively in cross-functional, fast-paced R&D teams
Preferred qualifications include knowledge of NCCL internals and congestion control algorithms, as well as experience with NVIDIA GPUs, CUDA, and deep learning frameworks such as PyTorch or TensorFlow.