SoundStream: An End-to-End Neural Audio Codec
- 👤 Speaker: Neil Zeghidour (Google)
- 📅 Date & Time: Monday 04 April 2022, 12:00 - 13:00
- 📍 Venue: Zoom: https://eng-cam.zoom.us/j/81927138251?pwd=TVd3MXliV003dUdYVlFwU2NDWGpmdz09
Abstract: Audio codecs (MP3, Opus) are compression algorithms used whenever one needs to transmit audio, whether streaming a song or holding a conference call. In this talk, I will present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3 kbps to 18 kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low-latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at a 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.
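As a rough illustration of the residual vector quantizer mentioned in the abstract, the sketch below shows the general technique in plain Python: each layer quantizes whatever residual the previous layers left, and using only the first few layers lowers the bitrate. This is a minimal sketch of generic residual vector quantization, not the SoundStream implementation. All function names are hypothetical, the codebooks are toy values, and the frame rate (75 frames per second) and codebook size (1024 entries) are illustrative assumptions that do not come from this announcement.

```python
import math


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(codebooks, vec, n_active=None):
    """Quantize vec with the first n_active layers.

    Each layer codes the residual left over by the layers before it.
    """
    n_active = len(codebooks) if n_active is None else n_active
    residual = list(vec)
    indices = []
    for codebook in codebooks[:n_active]:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next layer sees what is still unexplained.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct a vector by summing the selected entry of each active layer."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def bitrate_bps(frames_per_second, n_active, codebook_size):
    """Bits per second when every frame sends one index per active layer."""
    return frames_per_second * n_active * math.log2(codebook_size)


if __name__ == "__main__":
    # Toy two-layer quantizer over 2-D vectors (assumed values, for illustration only).
    toy_codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],   # coarse layer
        [[0.0, 0.0], [0.5, 0.0]],   # refinement layer
    ]
    frame = [1.4, 1.0]
    for n in (1, 2):
        codes = rvq_encode(toy_codebooks, frame, n_active=n)
        print(n, codes, rvq_decode(toy_codebooks, codes))
    # Assumed settings: 75 frames/s and 1024-entry codebooks (10 bits per index).
    print(bitrate_bps(75, 4, 1024), bitrate_bps(75, 24, 1024))
```

Under these assumed settings, the number of active layers is the only knob that changes the bitrate, which is one way to read the abstract's point that dropping quantizer layers during training lets a single model serve several bitrates.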
Bio: Neil Zeghidour is a Senior Research Scientist at Google Brain in Paris, and teaches automatic speech processing at Ecole Normale Supérieure. He holds a PhD in Machine Learning from Ecole Normale Supérieure in Paris, completed jointly with Facebook AI Research. His main research interest is integrating signal processing and deep learning into fully learnable architectures for audio understanding and generation.
Series: This talk is part of the CUED Speech Group Seminars series.
Included in Lists
- Cambridge Forum of Science and Humanities
- Cambridge Language Sciences
- Cambridge talks
- Chris Davis' list
- CUED Speech Group Seminars
- Guy Emerson's list
- Information Engineering Division seminar list
- PhD related
- Zoom: https://eng-cam.zoom.us/j/81927138251?pwd=TVd3MXliV003dUdYVlFwU2NDWGpmdz09
Note: Ex-directory lists are not shown.