Description
As a Senior Machine Learning Engineer on the Machine Learning Platform team at Reddit, you will be instrumental in architecting, implementing, and maintaining the foundational Machine Learning (ML) infrastructure that powers Feeds Ranking, Content Understanding, Recommendations, and more.
You will deliver a self-service ML platform that enables continuous iteration on and improvement of systems that use ML techniques, including Deep Learning, Natural Language Processing, Recommendation Systems, Representation Learning, and Computer Vision.
Key responsibilities include:
- Leading the building, testing, and maintenance of ML training infrastructure at Reddit
- Designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
- Evolving the Machine Learning Engineer (MLE) experience, from provisioning interactive GPU environments through large-scale training, with support for on-demand and self-service workflows
You will work closely with the underlying compute team to ensure that MLEs have efficient access to training hardware and that resource contention is handled gracefully.
Beyond the technical work, you will treat internal MLEs as your customers: conducting user research, reducing friction in the 'Idea-to-Prototype' loop, and standardizing software environments (Docker images, Python dependency management).
To be successful in this role, you will need 5+ years of software engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems. You will also need deep Kubernetes expertise, knowledge of the Jupyter ecosystem, strong coding skills in Python and Go, and experience with GPU environments, cloud providers, and distributed training frameworks.