Author (English): Bing-Jhih Huang
Thesis Title (English): Investigate the speech generation system
Advisor (English): Jia-Ching Wang
Keywords (English): text to speech; attention; tacotron; deep learning
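The term speech synthesis denotes producing, by artificial means, speech that is close to natural human conversation. Early approaches, such as statistical methods, were tedious and error-prone, and when the system was built as a pipeline, an error in an earlier stage would cascade through the later ones. In recent years, with the surge of interest in deep learning, techniques for building Text To Speech (TTS) systems on deep learning architectures have matured, and a wide range of TTS applications built on deep networks have begun to appear in everyday life. Now that TTS can synthesize realistic speech, people are no longer satisfied with realism alone: they want to synthesize the voice of a specified speaker, a specified accent, or speech that carries a particular emotion. Today's TTS systems must be able to generate speech according to the user's preferences.
Regarding the attention mechanism, besides the Location Sensitive Attention adopted in Google's implementation of Tacotron2 and Forward Attention, which is likewise often used in speech synthesis models, Monotonic Chunkwise Attention (MoChA) is also a kind of attention mechanism. MoChA narrows the scope of the original soft attention to a chunk of fixed length, in the hope of improving the accuracy of the network. So far, however, MoChA-related research has mostly been applied to speech recognition. This thesis experiments with the three attention mechanisms above on a multi-speaker Tacotron2 model. Comparing the results of MoChA with those of the other two shows that the audio produced with MoChA is not better, and that MoChA instead loses its advantage of being able to handle streaming.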
Speech synthesis refers to artificially generating speech that sounds close to natural human speech. Early approaches, such as statistical methods, were tedious and error-prone, and in a pipeline architecture an error at one stage propagates to the stages after it. In recent years, with the boom in deep learning, building Text To Speech (TTS) systems with deep learning architectures has become increasingly mature, and various TTS applications based on deep networks have begun to enter people's lives. Now that TTS can synthesize realistic speech, users are no longer satisfied with realism alone: they want to synthesize the voice of a specified speaker, a specific accent, or speech with a particular emotion. Today's TTS systems must be able to generate speech based on user preferences.
Tacotron2 is a classic deep learning architecture for this task. It consists of an encoder that encodes the input text, an attention mechanism that converts the encoder output into the decoder's input, a decoder that outputs a spectrogram, and a vocoder that converts the spectrogram into audio.
For the attention mechanism, besides the Location Sensitive Attention that Google used to implement Tacotron2 and Forward Attention, which is also commonly used in speech synthesis models, Monotonic Chunkwise Attention (MoChA) is another type of attention mechanism. MoChA reduces the scope of the original soft attention to a chunk of fixed length, with the aim of improving network accuracy. However, MoChA-related research has mostly been applied in the speech recognition field. This thesis experiments with these three attention mechanisms on a multi-speaker Tacotron2 model. Comparing the results of MoChA with those of the other two shows that the audio generated with MoChA is not better, and that MoChA instead loses its advantage in handling streaming.
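To make the chunkwise idea concrete, the following is a minimal NumPy sketch of soft attention restricted to a fixed-length chunk. It is an illustration only, not the implementation used in the thesis. It assumes the monotonic attention step has already chosen an endpoint frame `end`; the function name `chunkwise_attention` and its parameters are hypothetical.

```python
import numpy as np

def chunkwise_attention(memory, energies, end, chunk_size):
    """Soft attention over a fixed-length chunk that ends at frame `end`.

    memory:     (T, D) array of encoder outputs (assumed layout)
    energies:   (T,) unnormalized attention scores for one decoder step
    end:        encoder index assumed to be chosen by the monotonic attention
    chunk_size: number of encoder frames the soft attention may look at
    """
    start = max(0, end - chunk_size + 1)
    window = energies[start:end + 1]
    # Numerically stable softmax computed over the chunk only.
    exp = np.exp(window - window.max())
    weights = np.zeros_like(energies, dtype=float)
    weights[start:end + 1] = exp / exp.sum()
    # Context vector: weighted sum of the encoder frames inside the chunk.
    context = weights @ memory
    return weights, context

# Toy data: 10 encoder frames with 4 features each.
rng = np.random.default_rng(0)
memory = rng.normal(size=(10, 4))
energies = rng.normal(size=10)
weights, context = chunkwise_attention(memory, energies, end=6, chunk_size=3)
```

In this toy call only frames 4 to 6 receive non-zero weight, whereas ordinary soft attention would spread weight over all 10 frames.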
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1-1 Background
1-2 Research Motivation and Objectives
1-3 Research Method and Chapter Overview
Chapter 2 Related Work
2-1 Text to Speech
2-2 Autoregressive Models
2-3 Encoder-Decoder
2-4 Tacotron2
2-4-1 Encoder
2-4-2 Location Sensitive Attention
2-4-3 Decoder
2-4-4 Loss Function
2-5 Other Attention Mechanisms
2-5-1 Forward Attention
2-5-2 Hard Monotonic Attention
2-5-3 Monotonic Chunkwise Attention (MoChA)
2-5-4 Stable Monotonic Chunkwise Attention (sMoChA)
2-6 Speaker Information
2-6-1 Triplet Loss
2-6-2 Tuple-based End-To-End Loss
2-6-3 Generalized End-To-End Loss
2-7 Where to Input Speaker Features
2-7-1 Attention Input
2-7-2 Decoder Input
2-8 Temporal Convolutional Networks (TCN)
Chapter 3 Experimental Setup
3-1 Research Goals and Experimental Environment
3-2 TCN Speaker Encoder
3-3 Mel Spectrogram
3-4 Network Architecture
3-5 Training Dataset
3-6 Experimental Results and Analysis
Chapter 4 Conclusion
4-1 Conclusion and Future Directions
Chapter 5 References
