EE-TTS: Emphatic Expressive TTS with Linguistic Information


arXiv:2305.12107

Authors

Yi Zhong, Chen Zhang, Xule Liu, Chenxi Sun, Weishan Deng, Haifeng Hu, Zhongqian Sun.

Tencent AI Lab, Zhejiang University

Abstract

While current TTS systems perform well in synthesizing high-quality speech, producing highly expressive speech remains a challenge. Emphasis, a critical factor in determining the expressiveness of speech, has attracted growing attention. Previous works usually enhance emphasis by adding intermediate features, but they cannot guarantee the overall expressiveness of the speech. To address this, we propose Emphatic Expressive TTS (EE-TTS), which leverages multi-level linguistic information from syntax and semantics. EE-TTS contains an emphasis predictor that identifies appropriate emphasis positions from text, and a conditioned acoustic model that synthesizes expressive speech with emphasis and linguistic information. Experimental results indicate that EE-TTS outperforms the baseline with MOS improvements of 0.49 and 0.67 in expressiveness and naturalness, respectively. EE-TTS also shows strong generalization across different datasets according to AB test results.

Audio Samples

Note: Bold characters followed by a "^" symbol in the texts mark the positions that should be emphasized.

Section I: Overall Performance of EE-TTS

Section 1.1. Comparison with GT and Baseline

Emphasized speech samples from Baseline (with GT labels), EE-TTS (with GT labels), EE-TTS (with predicted labels), and Ground Truth

Baseline (GT emphasis labels) | EE-TTS (GT emphasis labels) | EE-TTS (Pred emphasis labels) | Ground Truth
Mandarin text:      对面要难受^喽,怕是赶快投^了吧.
Mandarin pinyin: dui4 mian4 yao4 nan2 shou4^ lou5, pa4 shi4 gan3 kuai4 tou2^ le5 ba5.

Mandarin text:      这场比赛一^定^很精彩,期待你的英勇表现哦!
Mandarin pinyin: zhe4 chang3 bi3 sai4 yi2^ ding4^ hen3 jing1 cai3, qi1 dai4 ni3 de5 ying1 yong3 biao3 xian4 o5!

Section 1.2. Generalization Test

Emphasized speech samples of EE-TTS and Baseline on another dataset F2

EE-TTS | Baseline
Mandarin text:      你们的答案好^奇怪,为什么会有这^么长的答案?
Mandarin pinyin: ni3 men5 de5 da2 an4 hao3^ qi2 guai4, wei4 shen2 me5 hui4 you3 zhe4^ me5 chang2 de5 da2 an4?
Mandarin text:      那我们来继续吧,我们来第三^个灯谜,请过来吧。
Mandarin pinyin: na4 wo3 men5 lai2 ji4 xu4 ba5, wo3 men5 lai2 di4 san1^ ge5 deng1 mi2, qing3 guo4 lai2 ba5.

Section II: Controllability Test

We also conduct a controllability test for EE-TTS to demonstrate the effect of emphasis control at different positions.

Different emphasis positions:

1) 你真的太帅了
   ni3 zhen1 de5 tai4 shuai4 le5.
2) 你^真的太帅了
   ni3^ zhen1 de5 tai4 shuai4 le5.
3) 你真^的^太帅了
   ni3 zhen1^ de5^ tai4 shuai4 le5.
4) 你真的太^帅了
   ni3 zhen1 de5 tai4^ shuai4 le5.
5) 你真的太帅^了
   ni3 zhen1 de5 tai4 shuai4^ le5.
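
The "^" notation used in the pinyin lines on this page can be turned into per-syllable emphasis flags with a few lines of Python. The following is only a minimal sketch for reading the examples: the function name parse_emphasis is hypothetical and not part of EE-TTS, and it assumes lowercase pinyin syllables with a tone digit from 1 to 5, optionally followed by "^".

```python
import re


def parse_emphasis(pinyin_line):
    """Split a pinyin line into (syllable, emphasized) pairs.

    Hypothetical helper for illustration only. A syllable directly
    followed by "^" is treated as emphasized; punctuation is ignored.
    """
    pairs = []
    for token in pinyin_line.split():
        # Letters plus a tone digit, then an optional caret marker.
        match = re.match(r"([a-z]+[1-5])(\^?)", token)
        if match:
            pairs.append((match.group(1), match.group(2) == "^"))
    return pairs


# Controllability example 3 from this page.
labels = parse_emphasis("ni3 zhen1^ de5^ tai4 shuai4 le5.")
emphasized = [syllable for syllable, flag in labels if flag]
print(emphasized)
```

For example 3, the emphasized syllables are zhen1 and de5; moving the "^" marker to another syllable changes only the flag for that position.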

Section III: Ablation Studies

We provide emphasized speech samples from all experiments in the ablation studies:

Mandarin text:      咱们可以先抓一下对面经济比较低^的.
Mandarin pinyin: zan2 men5 ke2 yi3 xian1 zhua1 yi2 xia4 dui4 mian4 jing1 ji4 bi3 jiao4 di1^ de5.

A: EE-TTS
B: A with conformer encoder rather than transformer
C: B without BERT
D: C without Dependency Parsing
E: D without Part-of-Speech
F: A without unsupervised labels in pre-training

Mandarin text:      我会一直陪^着你,直到比赛胜利^.
Mandarin pinyin: wo3 hui4 yi4 zhi2 pei2^ zhe5 ni3, zhi2 dao4 bi3 sai4 sheng4 li4^.

A: EE-TTS
B: A with conformer encoder rather than transformer
C: B without BERT
D: C without Dependency Parsing
E: D without Part-of-Speech
F: A without unsupervised labels in pre-training

Thanks for your patience!