Multispeaker Glow-TTS demo



Reference papers

Glow-TTS

Kim, J., Kim, S., Kong, J., & Yoon, S. (2020). Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. arXiv preprint arXiv:2005.11129.

Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
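
The heart of Glow-TTS is monotonic alignment search (MAS), a Viterbi-style dynamic program that finds the most likely monotonic, non-skipping alignment between text tokens and mel frames. Below is a minimal NumPy sketch of that dynamic program, assuming T_mel >= T_text; function and variable names are mine, not the repo's.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Most-likely monotonic alignment between text tokens and mel frames.

    log_p: [T_text, T_mel] array; log_p[j, i] is the log-likelihood of
    mel frame i under the prior of text token j (in Glow-TTS, a diagonal
    Gaussian predicted by the text encoder).
    Returns a binary [T_text, T_mel] alignment matrix.
    """
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for i in range(1, T_mel):
        Q[0, i] = Q[0, i - 1] + log_p[0, i]        # stay on the first token
        for j in range(1, min(i + 1, T_text)):
            # either stay on token j or advance from token j-1
            Q[j, i] = max(Q[j, i - 1], Q[j - 1, i - 1]) + log_p[j, i]
    # backtrack from the forced endpoint (last token, last frame)
    align = np.zeros((T_text, T_mel), dtype=np.int64)
    j = T_text - 1
    for i in range(T_mel - 1, -1, -1):
        align[j, i] = 1
        if i > 0 and j > 0 and (j == i or Q[j - 1, i - 1] >= Q[j, i - 1]):
            j -= 1
    return align
```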

GE2E speaker embedding

Wan, L., Wang, Q., Papir, A., & Moreno, I. L. (2017). Generalized end-to-end loss for speaker verification. arXiv preprint arXiv:1710.10467.

Jia, Y., Zhang, Y., Weiss, R. J., Wang, Q., Shen, J., Ren, F., ... & Wu, Y. (2018). Transfer learning from speaker verification to multispeaker text-to-speech synthesis. arXiv preprint arXiv:1806.04558.

Qian, K., Zhang, Y., Chang, S., Yang, X., & Hasegawa-Johnson, M. (2019). AutoVC: Zero-shot voice style transfer with only autoencoder loss. arXiv preprint arXiv:1905.05879.
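
The GE2E loss trains a speaker encoder whose utterance embeddings cluster by speaker; Jia et al. then condition TTS on those embeddings, which is the role the encoder plays in this demo. A hedged PyTorch sketch of the softmax variant of the loss follows; the function name and batch layout are mine, following the paper's [speakers x utterances x dim] arrangement.

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(emb, w, b):
    """GE2E softmax loss (Wan et al., 2017), simplified.

    emb: [N, M, D] embeddings for N speakers x M utterances each.
    w, b: learnable scalar scale (kept positive) and bias tensors.
    """
    N, M, _ = emb.shape
    emb = F.normalize(emb, dim=-1)
    centroids = F.normalize(emb.mean(dim=1), dim=-1)              # [N, D]
    # leave-one-out centroid for the utterance's own speaker
    excl = F.normalize((emb.sum(dim=1, keepdim=True) - emb) / (M - 1), dim=-1)
    sim = torch.einsum('nmd,kd->nmk', emb, centroids)             # [N, M, N]
    own = (emb * excl).sum(dim=-1)                                # [N, M]
    idx = torch.arange(N, device=emb.device)
    sim[idx, :, idx] = own            # own-speaker column uses leave-one-out
    sim = torch.clamp(w, min=1e-6) * sim + b
    target = idx.repeat_interleave(M)                             # [N*M]
    return F.cross_entropy(sim.reshape(N * M, N), target)
```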

Prosody encoder (Global style token layer)

Wang, Y., Stanton, D., Zhang, Y., Skerry-Ryan, R. J., Battenberg, E., Shor, J., ... & Saurous, R. A. (2018). Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017.
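
The GST layer learns a bank of style tokens and attends over them with a query derived from a reference mel encoder. A compact PyTorch sketch is below; the original paper uses a 6-layer convolutional reference encoder and attention where tokens act as keys/values with a differently sized query, so the class and dimension choices here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GlobalStyleTokens(nn.Module):
    """Learnable style-token bank queried by a reference embedding."""
    def __init__(self, n_tokens=10, dim=256, n_heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim) * 0.5)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, ref):
        # ref: [B, dim] summary of a reference mel from a reference encoder
        q = ref.unsqueeze(1)                                   # [B, 1, dim]
        kv = torch.tanh(self.tokens).unsqueeze(0).expand(ref.size(0), -1, -1)
        style, _ = self.attn(q, kv, kv)                        # [B, 1, dim]
        return style.squeeze(1)                                # style embedding
```

At inference the style embedding can come from a reference utterance (as in the unseen-speaker rows below) or, in principle, from a manually selected token.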

Gradient reversal layer

Zhang, Y., Weiss, R. J., Zen, H., Wu, Y., Chen, Z., Skerry-Ryan, R. J., ... & Ramabhadran, B. (2019). Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. arXiv preprint arXiv:1907.04448.

Jung, S., & Kim, H. (2020). Pitchtron: Towards audiobook generation from ordinary people's voices. arXiv preprint arXiv:2005.10456.
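
A gradient reversal layer is the identity in the forward pass but negates (and optionally scales) gradients in the backward pass; in Zhang et al. it sits between the text encoder and an adversarial speaker classifier so the encoder learns speaker-independent representations. A minimal PyTorch sketch, with names of my own choosing:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scaled, negated gradient backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # reverse (and scale) the gradient flowing back into the encoder
        return -ctx.lamb * grad_out, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# e.g.: speaker_logits = classifier(grad_reverse(encoder_out, 0.5))
```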



GitHub code:
https://github.com/CODEJIN/Glow_TTS



Single speaker (LJSpeech dataset)

Structure

Training

Inference


Audio samples: Text1-Text4, each synthesized at length scales 0.8, 0.9, 1.0, 1.1, and 1.2.
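
The length-scale control above comes almost for free from Glow-TTS's explicit duration model: predicted log-durations are exponentiated, multiplied by the scale, and rounded up to integer frame counts. A minimal PyTorch sketch of the usual formulation (the function name is mine and the repo's exact code may differ):

```python
import torch

def durations_from_log(log_dur, length_scale=1.0):
    """Turn predicted log-durations into integer frame counts.

    length_scale < 1.0 gives faster speech, > 1.0 slower, matching
    the 0.8-1.2 columns above.
    """
    w = torch.exp(log_dur) * length_scale        # scaled frame counts
    # clamp to 1 so no token collapses to zero frames (a safety choice)
    return torch.clamp(torch.ceil(w), min=1.0).long()
```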



Multi-speaker (Prosody encoder-GST mode)

Structure

Training

Inference


Training dataset: LJSpeech + CMUA, trained for 100K steps

Audio samples: Text1-Text4, each synthesized at length scales 0.8, 0.9, 1.0, 1.1, and 1.2, for four reference speakers: CMUA-BDL (trained male), LJSpeech (trained female), VCTK-P226 (unseen male), and BC2013 (unseen female).

Training dataset: LJSpeech + VCTK, trained for 400K steps

Audio samples: Text1-Text4, each synthesized at length scales 0.8, 0.9, 1.0, 1.1, and 1.2, for four trained speakers: VCTK-P360 (trained male 1), LJSpeech (trained female 1), VCTK-P226 (trained male 2), and VCTK-P240 (trained female 2).

Multi-speaker (Speaker embedding lookup table mode)

Structure

Training

Inference
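
In lookup-table mode the speaker identity is a learned vector fetched by integer speaker ID rather than computed from reference audio; a minimal sketch (embedding dimension and conditioning point are assumptions):

```python
import torch
import torch.nn as nn

class SpeakerLookup(nn.Module):
    """Learned per-speaker embedding selected by integer speaker ID."""
    def __init__(self, n_speakers, dim=128):
        super().__init__()
        self.table = nn.Embedding(n_speakers, dim)

    def forward(self, speaker_ids):
        # speaker_ids: [B] int64 -> [B, dim] conditioning vectors
        return self.table(speaker_ids)
```

Unlike the GE2E/GST paths, a lookup table cannot represent unseen speakers, which is why the grid below contains trained speakers only.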


Audio samples: Text1-Text4, each synthesized at length scales 0.8, 0.9, 1.0, 1.1, and 1.2, for four trained speakers: CMUA-BDL (male 1), LJSpeech (female 1), CMUA-AWB (male 2), and CMUA-CLB (female 2).



Voice conversion (Speaker embedding lookup table mode) - Failed

Structure

Training

Inference


Conversion grid: each of LJSpeech (female 1), VCTK-P226 (male 1), VCTK-P360 (male 2), and VCTK-P240 (female 2) as the source, converted to each of the same four speakers as the target, at length scales 0.8, 1.0, and 1.2.
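
For context on the failed experiment: flow-based decoders are invertible, so voice conversion can be attempted by running the flow forward on source audio conditioned on the source speaker, then inverting conditioned on the target speaker. A sketch under that assumption; the decoder's g/reverse call signature is modeled on common Glow-TTS implementations, not confirmed for this repo.

```python
import torch

@torch.no_grad()
def convert_voice(decoder, mel, src_emb, tgt_emb):
    """Flow-based voice conversion sketch (assumed decoder interface)."""
    z, _ = decoder(mel, g=src_emb, reverse=False)      # mel -> latent z
    out, _ = decoder(z, g=tgt_emb, reverse=True)       # z -> converted mel
    return out
```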