Graduate student (English name): Yang, Chung-Wen
Thesis title (English): A System for Modifying the Duration of Synthesized Speech from Text-To-Speech Models
Advisor (English name): Lai, Chin-Feng
Oral examination committee (English names): Lai, Ying-Hsun; Chen, Shih-Yeh; Tsai, Chia-Wei; Su, Yu-Sheng
Keywords (English): Text-To-Speech; Audio Time-Scale Modification; Montreal Forced Aligner
Over the past decades, synthesizing speech from text, known as text-to-speech (TTS), has drawn great attention from researchers because it is applicable to a wide variety of applications. One of the factors that affects the prosody of synthesized speech is the speed at which it is spoken. Most earlier TTS models are based on an autoregressive mechanism that generates speech frame by frame. However, these autoregressive TTS models have a major drawback: they lack the ability to control the duration of the synthesized speech. To address this, many non-autoregressive TTS models that explicitly model the duration of the synthesized speech have been proposed. However, compared with training an autoregressive TTS model, training a non-autoregressive TTS model takes a huge amount of data and computing power. Therefore, this thesis proposes a system for modifying the duration of speech synthesized by a TTS model. The proposed system consists of a forced aligner, a duration modifier, a neural network named ATM-Net, and a vocoder. The forced aligner is adopted from the Montreal Forced Aligner with a pretrained Mandarin model. Once the boundaries of each word or phoneme are determined, the duration modifier lengthens or shortens speech segments by inserting dummy frames into, or removing frames from, the mel spectrogram, respectively. The modified mel spectrogram is then fed into the ATM-Net, which fills in the audio content and smooths the discontinuities between frames. Finally, a vocoder synthesizes the audio signal from the output mel spectrogram. Experiments show that the proposed system can modify the duration of speech and synthesize speech with natural prosody close to that of a real human speaker.
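The duration-modifier step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes a mel spectrogram stored as a list of frames, segment boundaries given as frame indices (such as a forced aligner might provide), and a per-segment rate where values above 1 lengthen and values below 1 shorten. The names `modify_segment`, `modify_duration`, and `pad_value`, the even spacing of inserted or removed frames, and the returned mask of dummy frames are all assumptions made for illustration.

```python
def modify_segment(frames, rate, n_mels, pad_value=0.0):
    """Lengthen or shorten one segment of mel frames (hypothetical helper).

    Returns the new frames plus a mask that is True where a dummy frame
    was inserted and still needs to be filled in by a later model.
    """
    n = len(frames)
    if n == 0:
        return [], []
    target = max(1, round(n * rate))
    if target <= n:
        # Shorten: keep `target` evenly spaced original frames, drop the rest.
        keep = [i * n // target for i in range(target)]
        return [frames[i] for i in keep], [False] * target
    # Lengthen: spread the originals evenly and put dummy frames in the gaps.
    positions = {i * target // n: i for i in range(n)}
    out, mask = [], []
    for t in range(target):
        if t in positions:
            out.append(frames[positions[t]])
            mask.append(False)
        else:
            out.append([pad_value] * n_mels)
            mask.append(True)
    return out, mask


def modify_duration(mel, segments, rates, pad_value=0.0):
    """Apply a per-segment rate to a mel spectrogram (list of frames).

    `segments` holds (start, end) frame indices, with `end` exclusive;
    `rates` holds one rate per segment.
    """
    n_mels = len(mel[0]) if mel else 0
    out, mask = [], []
    for (start, end), rate in zip(segments, rates):
        seg_out, seg_mask = modify_segment(mel[start:end], rate, n_mels, pad_value)
        out.extend(seg_out)
        mask.extend(seg_mask)
    return out, mask


# Toy input: 6 frames of 3 mel bins; frame t holds the value t in every bin.
mel = [[float(t)] * 3 for t in range(6)]
# Halve the first segment (4 frames), double the second (2 frames).
new_mel, dummy_mask = modify_duration(mel, [(0, 4), (4, 6)], [0.5, 2.0])
```

In this toy input the first segment shrinks from four frames to two and the second grows from two frames to four, so the total length stays at six. The mask marks the two inserted dummy frames, which is where a network such as the ATM-Net would be expected to fill in content.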
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Section 1 Motivation
Section 2 The Proposed Method
Section 3 An Overview of the Thesis
Chapter 2 Related Works
Section 1 Non-Autoregressive TTS
Section 2 Audio Time-Scale Modification
Section 3 Speech Segmentation
Section 4 The Vocoder
Chapter 3 The Proposed System
Section 1 The Forced Aligner
Section 2 The Duration Modifier
Section 3 The ATM-Net
Subsection 1 The Model Architecture
Subsection 2 The Two-Staged Training Method
Section 4 The Vocoder
Chapter 4 Experimental Results
Section 1 Data Preparation
Section 2 Experimental Results
Subsection 1 Shortening the Duration of an Audio Signal
Subsection 2 Lengthening the Duration of an Audio Signal
Subsection 3 Audio Signals with Duration Partially Modified
Subsection 4 The Vocoders
Section 3 Discussions
Subsection 1 Discussions on the Masking Strategies
Subsection 2 A Brief History of the ATM-Net
Chapter 5 Conclusion and Future Works
Section 1 The Conclusion
Section 2 The Future Works
References
