Description
We are seeking a Research Engineer/Research Scientist to join our Audio team. As a member of this team, you will work across the full stack of audio ML, developing audio codecs and representations, sourcing and synthesizing high-quality audio data, training large-scale speech language models and large audio diffusion models, and developing novel architectures for incorporating continuous signals into LLMs.
Our team focuses primarily, but not exclusively, on speech, building advanced steerable systems that span end-to-end conversational models, speech and audio understanding models, and speech synthesis. The team works closely with collaborators across pretraining, finetuning, reinforcement learning, production inference, and product to take advanced audio technologies from early research to high-impact real-world deployments.
Responsibilities:
- Develop and train audio models, including conversational speech-to-speech, speech translation, speech recognition, text-to-speech, diarization, codecs, and generative audio models
- Work across abstraction levels, from signal processing fundamentals to large-scale model training and inference optimization
- Collaborate with teams across the company to develop and deploy audio technologies
- Communicate clearly and effectively with colleagues and stakeholders
Strong candidates may also have experience with:
- Large language model pretraining and finetuning
- Training diffusion models for image and audio generation
- Reinforcement learning for large language models and diffusion models
- End-to-end system optimization, from performance benchmarking to kernel optimization
- GPUs, Kubernetes, PyTorch, or distributed training infrastructure
Representative projects:
- Training state-of-the-art neural audio codecs for 48 kHz stereo audio
- Developing novel algorithms for diffusion pretraining and reinforcement learning
- Scaling audio datasets to millions of hours of high-quality audio
- Creating robust evaluation methodologies for hard-to-measure qualities such as naturalness or expressiveness
- Studying training dynamics of mixed audio-text language models
- Optimizing latency and inference throughput for deployed streaming audio systems