

Author: Bing-Jhih Huang
Title: Investigate the speech generation system
Advisor: Jia-Ching Wang
Keywords: text to speech, attention, tacotron, deep learning
The term speech synthesis refers to artificially producing speech that is nearly indistinguishable from that of a real speaker. Early approaches, such as statistical methods, were tedious and error-prone, and in a pipeline architecture an error in one stage propagates through all subsequent stages. In recent years, with the rise of deep learning, techniques for building text-to-speech (TTS) systems on deep learning architectures have matured considerably, and a wide variety of deep-network-based TTS applications have entered everyday life. Now that TTS can synthesize realistic speech, users are no longer satisfied with realism alone: they want to synthesize the voice of a specified speaker, a specified accent, or speech carrying a particular emotion. A modern TTS system must be able to generate speech according to user preferences.
Regarding attention mechanisms, besides the Location Sensitive Attention that Google adopted in its Tacotron2 implementation and the Forward Attention that is also commonly used in speech synthesis models, Monotonic Chunkwise Attention (MoChA) is another option. MoChA narrows the range of ordinary soft attention to a fixed-length chunk, aiming to improve the network's accuracy. To date, however, most MoChA-related research has been applied to speech recognition. This thesis experiments with these three attention mechanisms on a multi-speaker Tacotron2 model. Comparing MoChA with the other two shows that the audio produced via MoChA is no better, while MoChA's advantage of supporting streaming is lost.
The term speech synthesis refers to the artificial synthesis of speech that is almost indistinguishable from human talking. Early approaches, such as statistical methods, were tedious and error-prone, and in a pipeline architecture the errors of an earlier stage have a knock-on effect on later ones. In recent years, with the boom in deep learning, the technology for building Text-To-Speech (TTS) systems on deep learning architectures has become more and more mature, and various TTS applications using deep networks have begun to enter people's lives. Now that TTS can successfully synthesize realistic speech, people are no longer satisfied with realism alone, whether the goal is to synthesize the voice of a specified speaker, a specific accent, or speech with a particular emotion. Today's TTS systems must be able to generate speech based on user preferences.
Tacotron2 is a classic deep learning architecture, comprising an encoder that encodes the text, an attention mechanism that converts the encoding result into the input of the decoder, a decoder that outputs a spectrogram, and a vocoder that converts the spectrogram to audio.
As for the attention mechanism, besides the Location Sensitive Attention that Google used to implement Tacotron2 and the Forward Attention that is also commonly used in speech synthesis models, Monotonic Chunkwise Attention (MoChA) is another type of attention mechanism. MoChA reduces the scope of ordinary soft attention to a chunk with a fixed length, hoping to improve network accuracy. However, MoChA-related research has mostly been applied in the speech recognition field. This thesis experiments with these three attention mechanisms on a multi-speaker Tacotron2 model. After comparing the results of MoChA with the other two, we found that the audio generated via MoChA is not better, while MoChA's advantage of being able to process streaming input is lost.
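The fixed-length-chunk idea behind MoChA can be illustrated with a minimal sketch (plain Python; the function name and toy values are illustrative, and the monotonic stopping mechanism that learns where to place the chunk boundary is omitted here):

```python
import math

def chunkwise_attention(energies, boundary, chunk_size):
    """Soft attention restricted to a fixed-length chunk (MoChA-style).

    Instead of normalising the attention energies over the whole encoder
    sequence, the softmax is applied only to the `chunk_size` positions
    ending at the monotonically chosen `boundary` index; every position
    outside the chunk receives zero weight.
    """
    start = max(0, boundary - chunk_size + 1)
    window = energies[start:boundary + 1]
    peak = max(window)                    # subtract max for numerical stability
    exps = [math.exp(e - peak) for e in window]
    total = sum(exps)
    weights = [0.0] * len(energies)
    for i, v in enumerate(exps):
        weights[start + i] = v / total
    return weights

# Toy run: 8 encoder steps, monotonic boundary at step 5, chunk width 3.
w = chunkwise_attention([0.1, 0.3, 0.2, 0.9, 0.5, 0.7, 0.0, 0.0],
                        boundary=5, chunk_size=3)
# Only positions 3..5 receive nonzero weight, and those weights sum to 1.
```

Because the softmax runs over at most `chunk_size` positions that never move backwards, this style of attention can in principle operate on streaming input, which is the advantage the experiments above find is lost without any quality gain.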
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iv
List of Figures vi
List of Tables viii
Chapter 1 Introduction 1
1-1 Background 1
1-2 Motivation and Objectives 2
1-3 Methodology and Chapter Overview 2
Chapter 2 Related Work 3
2-1 Text-to-Speech 3
2-2 Autoregressive Models 4
2-3 Encoder-Decoder 5
2-4 Tacotron2 5
2-4-1 Encoder 5
2-4-2 Location Sensitive Attention 7
2-4-3 Decoder 9
2-4-4 Loss Function 10
2-5 Other Attention Mechanisms 12
2-5-1 Forward Attention 12
2-5-2 Hard Monotonic Attention 12
2-5-3 Monotonic Chunkwise Attention (MoChA) 14
2-5-4 Stable Monotonic Chunkwise Attention (sMoChA) 15
2-6 Speaker Information 15
2-6-1 Triplet Loss 16
2-6-2 Tuple-based End-To-End Loss 17
2-6-3 Generalized End-To-End Loss 19
2-7 Where Speaker Features Are Input 21
2-7-1 Attention Input 22
2-7-2 Decoder Input 23
2-8 Temporal Convolutional Networks (TCN) 25
Chapter 3 Experimental Setup 26
3-1 Research Goals and Experimental Environment 26
3-2 TCN Speaker Encoder 26
3-3 Mel Spectrogram 27
3-4 Network Architecture 27
3-5 Training Dataset 29
3-6 Experimental Results and Analysis 29
Chapter 4 Conclusion 31
4-1 Conclusions and Future Directions 31
Chapter 5 References 32
[1] S. Ö. Arık et al., "Deep Voice: Real-time Neural Text-to-Speech," arXiv:1702.07825 [cs.CL], 2017.
[2] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards End-to-End Speech Synthesis," arXiv:1703.10135v2 [cs.CL], 2017.
[3] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. C. Courville, and Y. Bengio, "Char2Wav: End-to-End Speech Synthesis," in ICLR (Workshop), 2017.
[4] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," arXiv:1712.05884 [cs.CL], 2017.
[5] A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499 [cs.SD], 2016.
[6] S. Ö. Arık et al., "Deep Voice 2: Multi-Speaker Neural Text-to-Speech," arXiv:1705.08947 [cs.CL], 2017.
[7] W. Ping et al., "Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning," arXiv:1710.07654 [cs.SD], 2017.
[8] Y. Wang et al., "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis," arXiv:1803.09017 [cs.CL], 2018.
[9] C.-C. Chiu and C. Raffel, "Monotonic Chunkwise Attention," arXiv:1712.05382v2 [cs.CL], 2018.
[10] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-Based Models for Speech Recognition," arXiv:1506.07503 [cs.CL], 2015.
[12] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, "Online and Linear-Time Attention by Enforcing Monotonic Alignments," arXiv:1704.00784v2 [cs.LG], 2017.
[13] H. Miao, G. Cheng, P. Zhang, T. Li, and Y. Yan, "Online Hybrid CTC/Attention Architecture for End-to-End Speech Recognition," in Interspeech 2019, 2019.
[14] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering," arXiv:1503.03832 [cs.CV], 2015.
[15] C. Li et al., "Deep Speaker: an End-to-End Neural Speaker Embedding System," arXiv:1705.02304 [cs.CL], 2017.
[16] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-End Text-Dependent Speaker Verification," arXiv:1509.08062 [cs.LG], 2015.
[17] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized End-to-End Loss for Speaker Verification," arXiv:1710.10467 [eess.AS], 2017.
[18] Y. Jia et al., "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis," arXiv:1806.04558 [cs.CL], 2018.
[19] A. Tjandra, S. Sakti, and S. Nakamura, "Machine Speech Chain with One-shot Speaker Adaptation," arXiv:1803.10525 [cs.CL], 2018.
[20] S. Bai, J. Z. Kolter, and V. Koltun, "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling," arXiv:1803.01271v2 [cs.LG], 2018.
[21] K. Ito, "The LJ Speech Dataset," 2017. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/.
[22] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[23] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner and M. Sonderegger, "Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi.," in Interspeech 2017, 2017.