LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
Abstract
The paper presents LiveSpeech, a fully autoregressive transformer-based approach for zero-shot text-to-speech (TTS) that enables low-latency streaming of the output audio. The key contributions are:
- An adaptive codebook loss weighting mechanism that redistributes the model capacity across codebooks to address the content accuracy vs. voice quality trade-off.
- Parallel processing of codebook groups to enhance the modeling capacity of each decoding step without significantly increasing inference time.
Q&A
[01] Low-Latency Zero-shot Text-to-Speech
1. What are the key challenges in adapting existing zero-shot TTS approaches to low-latency scenarios? Existing zero-shot TTS approaches, such as those that apply generative language models to audio tokens, remain difficult to adapt to low-latency scenarios, either because some models are non-autoregressive or because others incur a high inference time per decoding step.
2. How does the LiveSpeech model address these challenges? LiveSpeech is a fully autoregressive transformer-based approach that enables low-latency streaming of the output audio. It introduces two key techniques:
- Adaptive codebook loss weighting to redistribute the model capacity across codebooks and address the content accuracy vs. voice quality trade-off.
- Parallel processing of codebook groups to enhance the modeling capacity of each decoding step without significantly increasing inference time.
3. What are the advantages of the LiveSpeech model compared to existing approaches? Experiments show that LiveSpeech achieves results competitive with state-of-the-art baselines in content accuracy, speaker similarity, audio quality, and inference speed, while also being suitable for low-latency streaming applications.
[02] Audio Tokenization and Generation
1. How does the audio tokenization component work in the LiveSpeech model? The audio tokenization component uses a neural audio codec, such as EnCodec, to encode the raw audio into parallel sequences of discrete codes, one per codebook. Codes in the early codebooks capture the coarse content of the audio, while codes in the later codebooks capture fine-grained acoustic details.
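A minimal sketch of this tokenization step, using Meta's `encodec` package (at 24 kHz, a 6 kbps bandwidth yields 8 codebooks at 75 frames per second; the file name is illustrative):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pretrained 24 kHz EnCodec model; 6 kbps selects 8 codebooks.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Resample/remix the input to match the codec's expected format.
wav, sr = torchaudio.load("speech.wav")  # illustrative file name
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

# Encode to discrete codes: shape (batch, n_codebooks, n_frames).
with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([codebook for codebook, _ in frames], dim=-1)
print(codes.shape)  # e.g. torch.Size([1, 8, 75 * seconds])
```

Codebook 0 in `codes` carries the coarsest information; each subsequent codebook quantizes the residual left by the previous ones.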
2. How does the LiveSpeech model generate audio from the discrete tokens? The LiveSpeech model uses a fully autoregressive transformer decoder to generate the sequences of discrete codes. It employs a delayed generation pattern, in which each decoding step predicts multiple codes, one per codebook but each belonging to a different frame, reducing the number of transformer steps required.
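A sketch of such a delay pattern, in the style popularized by MusicGen (the exact schedule in LiveSpeech may differ; `PAD` is a hypothetical placeholder id):

```python
import torch

PAD = -1  # hypothetical placeholder id for positions outside the schedule

def apply_delay_pattern(codes: torch.Tensor) -> torch.Tensor:
    """Shift codebook k right by k frames, so that decoding step t emits
    codes (c0[t], c1[t-1], ..., cK[t-K]) drawn from K+1 different frames."""
    n_q, T = codes.shape
    out = torch.full((n_q, T + n_q - 1), PAD, dtype=codes.dtype)
    for k in range(n_q):
        out[k, k:k + T] = codes[k]
    return out

codes = torch.arange(8 * 5).view(8, 5)  # 8 codebooks, 5 frames
delayed = apply_delay_pattern(codes)    # shape (8, 12)
```

Because codebook k lags k frames behind codebook 0, the first complete frame is available only after `n_q` decoding steps, which is what bounds the streaming delay.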
3. What are the challenges in efficiently predicting all codebooks within a single decoding step? With the delayed generation pattern, the model must distribute its capacity across all codebooks to produce one code from each of them at every step. This creates a content accuracy vs. voice quality trade-off: prioritizing some codebooks can degrade the predictions for the others.
4. How does LiveSpeech address the content accuracy vs. voice quality trade-off? LiveSpeech introduces two techniques to address this trade-off:
- Adaptive codebook loss weighting: the model assigns higher loss weights to the coarse, high-level codebooks early in training and gradually shifts the focus to the finer, lower-level codebooks as training progresses (see the sketch after this list).
- Parallel codebook group heads: the model partitions the codebooks into groups and processes them in parallel, allowing each group to attend to different parts of the memory and improving the modeling capacity of each decoding step (a second sketch also follows the list).
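A minimal sketch of adaptive codebook loss weighting; the linear schedule below is an illustrative assumption, not the paper's exact rule:

```python
import torch
import torch.nn.functional as F

def weighted_codebook_loss(logits, targets, step, total_steps):
    """Cross-entropy per codebook, with weights that start out favoring the
    coarse (early) codebooks and gradually shift toward the fine (later) ones.
    logits: (B, n_q, T, V); targets: (B, n_q, T)."""
    n_q = logits.shape[1]
    progress = min(step / total_steps, 1.0)               # 0 -> 1 over training
    ranks = torch.arange(n_q, dtype=torch.float32) / max(n_q - 1, 1)
    weights = (1.0 - progress) * (1.0 - ranks) + progress * ranks
    weights = weights * n_q / weights.sum()               # keep mean weight at 1
    losses = [
        F.cross_entropy(logits[:, k].transpose(1, 2), targets[:, k])
        for k in range(n_q)
    ]
    return sum(w * l for w, l in zip(weights, losses)) / n_q
```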
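And a sketch of parallel codebook group heads; in the paper each group can also attend to different parts of the memory, which this simplified version omits:

```python
import torch
import torch.nn as nn

class GroupedCodebookHeads(nn.Module):
    """Partition n_q codebooks into n_groups, each with its own projection
    head applied in parallel to the shared decoder hidden state."""
    def __init__(self, d_model: int, vocab: int, n_q: int, n_groups: int):
        super().__init__()
        assert n_q % n_groups == 0
        self.per_group = n_q // n_groups
        self.vocab = vocab
        self.heads = nn.ModuleList(
            nn.Linear(d_model, self.per_group * vocab) for _ in range(n_groups)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, d_model) -> logits: (B, n_q, T, vocab)
        outs = [
            head(h).view(h.shape[0], h.shape[1], self.per_group, self.vocab)
            for head in self.heads
        ]
        return torch.cat(outs, dim=2).permute(0, 2, 1, 3)

heads = GroupedCodebookHeads(d_model=512, vocab=1024, n_q=8, n_groups=2)
logits = heads(torch.randn(1, 10, 512))  # -> (1, 8, 10, 1024)
```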
[03] Experimental Evaluation
1. What metrics are used to evaluate the performance of the LiveSpeech model? The LiveSpeech model is evaluated using both objective and subjective metrics:
- Objective metrics: transcript error rates at the character, word, and phoneme level (CER, WER, PER), speaker similarity scores (SS), and objective perceptual speech quality scores (O-MOS).
- Subjective metric: a subjective Mean Opinion Score (S-MOS) based on human evaluation.
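A sketch of how the objective metrics can be computed, using the `jiwer` package for error rates and cosine similarity of speaker embeddings for SS (the embeddings here are random placeholders; in practice they come from a speaker-verification model):

```python
import torch
import torch.nn.functional as F
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"  # ASR output of the TTS audio

wer = jiwer.wer(reference, hypothesis)  # word error rate
cer = jiwer.cer(reference, hypothesis)  # character error rate
print(f"WER={wer:.3f} CER={cer:.3f}")

# Speaker similarity: cosine similarity between speaker embeddings of the
# generated audio and the enrollment audio (placeholders for illustration).
emb_generated, emb_reference = torch.randn(256), torch.randn(256)
ss = F.cosine_similarity(emb_generated, emb_reference, dim=0).item()
```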
2. How does the LiveSpeech model perform compared to the baselines? With the proposed techniques (adaptive codebook weights and parallel codebook groups), the LiveSpeech model performs competitively with or better than the state-of-the-art baselines in terms of content accuracy, speaker similarity, and audio quality.
3. What are the speed and latency characteristics of the LiveSpeech model? The LiveSpeech model has a real-time factor (RTF) comparable to the VALL-E baseline, despite being fully autoregressive. Adding parallel codebook groups increases the RTF by only 0.09 (about 10%), demonstrating the efficiency of the parallelization. The model also operates with a delay of 200 ms, making it suitable for low-latency applications.
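For reference, a sketch of how RTF is typically measured (wall-clock generation time divided by the duration of the generated audio; `synthesize` is a hypothetical stand-in for the model's API):

```python
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 24_000) -> float:
    """RTF = generation time / duration of the generated audio.
    RTF < 1 means the model generates faster than real time, a prerequisite
    for streaming playback without buffering stalls."""
    start = time.perf_counter()
    audio = synthesize(text)  # hypothetical: returns a 1-D array of samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)
```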