Description
As a Staff Machine Learning Engineer on the Machine Learning Platform team, you will be a key technical leader architecting and scaling our Generative AI and LLM platform capabilities.
Training and deploying foundation models places unprecedented demands on our systems. You will define the technical strategy and build the core infrastructure that enables machine learning engineers and researchers to seamlessly train, evaluate, and iterate on large language models at Reddit scale.
- Drive GenAI Infrastructure Strategy: Propose, design, and lead the architecture of our next-generation LLM platform, significantly advancing our capabilities to support large-scale foundation models that serve millions of redditors.
- Design Resilient, Large-Scale Distributed Systems: Architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters.
- Build Self-Serve LLM Workflows: Design and implement robust, production-grade pipelines for LLM fine-tuning (e.g., SFT, RLHF/DPO).
- Develop Comprehensive Evaluation & Benchmarking Infrastructure: Treat model evaluation as a first-class platform capability.
- Architect Advanced Data Ingestion Pipelines: Extend our distributed data platforms to natively and efficiently handle the massive, multimodal datasets (text, image, video) required for modern GenAI workloads.
You will have 10+ years of experience in production software development or building complex distributed data systems, plus a degree in ML, Engineering, Computer Science, or a related discipline.
- GenAI/LLM Infrastructure Expertise: Proven track record of designing and operating large-scale ML systems, specifically working with distributed training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) and LLM serving/inference optimization (e.g., vLLM, TensorRT-LLM).
- Distributed Systems Mastery: Hands-on experience managing fault-tolerant, petabyte-scale distributed systems and multi-node/multi-GPU training clusters.
- Advanced MLOps Knowledge: Deep understanding of modern ML orchestration, fine-tuning pipelines, and model evaluation methodologies.
- GPU Experience: Hands-on experience with CUDA environments, GPU virtualization/containerization, and running GPU workloads on Kubernetes.
- Production Engineering Fundamentals: Hands-on experience with Kubernetes, Docker, and building production-quality, object-oriented code in Python and/or Go.
- Strong focus on scalability, reliability, performance, and ease of use.
- A tireless advocate for platform users, with deep intuition for the machine learning development lifecycle.
- Strong organizational and communication skills.