
National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)


Detailed Record

Author: 吳克駿
Author (English): Ko-Chun Wu
Title: 利用跨語言情緒特徵遷移之台語情緒語音合成
Title (English): Emotional Taiwanese Speech Synthesis using Cross-Lingual Emotion Feature Transfer
Advisor: 鄭士康
Advisor (English): Shyh-Kang Jeng
Oral Defense Committee: 李宏毅、張智星
Oral Defense Committee (English): Hung-yi Lee, Jyh-Shing Roger Jang
Oral Defense Date: 2021-08-09
Degree: Master
Institution: National Taiwan University
Department: Graduate Institute of Electrical Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Document Type: Academic thesis
Publication Year: 2021
Graduation Academic Year: 109 (2020–2021)
Language: English
Pages: 46
Keywords (Chinese): 語音合成、轉移學習、深度學習、領域自適應
Keywords (English): Speech Synthesis, Transfer Learning, Deep Learning, Domain Adaptation
DOI:10.6342/NTU202102325
Usage statistics:
  • Cited by: 0
  • Views: 289
  • Downloads: 53
  • Bookmarked: 0
Abstract (translated from the Chinese):
Text-to-speech (TTS) is an important part of today's human-computer interaction, especially for elderly users who cannot read: if they can communicate with a machine by voice, their ability to operate it improves greatly. With the rapid development of deep learning in recent years, speech synthesis models such as Tacotron and FastSpeech can already produce speech approaching the quality of human speech, and they have found many applications.

This thesis aims to develop a Taiwanese speech synthesis system capable of emotional speech. A speech synthesis system, however, requires a certain amount of high-quality speech data, and few Taiwanese corpora suitable for synthesis exist today, let alone emotional ones. We therefore use publicly available Mandarin and English emotional corpora and transfer learning to build a cross-lingual, multi-speaker, multi-emotion synthesis system that applies the emotional expression of other languages to Taiwanese.
Abstract (English):
Text-to-speech (TTS) is an important part of today's human-computer interaction, especially for the illiterate elderly: if they can communicate through voice, their ability to operate machines improves greatly. With the rapid development of deep learning, speech synthesis models such as Tacotron and FastSpeech can already synthesize speech approaching human quality, and they have many applications.

This thesis aims to develop a Taiwanese speech synthesis system with emotional speech. A speech synthesis system requires a substantial amount of high-quality training data, but few Taiwanese corpora suitable for synthesis exist today, let alone emotional ones. We therefore use public English and Mandarin emotional corpora and transfer learning to build a cross-lingual, multi-speaker, multi-emotion TTS system that applies emotional expression from other languages to Taiwanese.
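The transfer-learning setup described above typically relies on a domain-adaptation component to keep shared representations language-independent; the thesis lists a "Domain Adaptation Module" (Section 3.5) and cites domain-adversarial training [48, 49]. As a rough illustration only, here is a minimal PyTorch sketch of the gradient reversal layer at the heart of that technique; the class names, layer sizes, and three-language setup are illustrative assumptions, not the thesis's actual implementation.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Gradient reversal layer in the style of Ganin et al. [48, 49]:
    identity in the forward pass, gradient multiplied by -lambda in the
    backward pass, so a domain classifier trained on top pushes the
    shared encoder toward domain-invariant features."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing back into the encoder;
        # lamb itself receives no gradient.
        return -ctx.lamb * grad_output, None


class DomainClassifier(torch.nn.Module):
    """Toy language classifier over reversed encoder features.
    feat_dim and n_domains are hypothetical (e.g. a 256-dim embedding
    and three source languages: English, Mandarin, Taiwanese)."""

    def __init__(self, feat_dim: int = 256, n_domains: int = 3):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(feat_dim, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, n_domains),
        )

    def forward(self, features, lamb=1.0):
        return self.net(GradReverse.apply(features, lamb))


# Minimal usage: the classifier loss is minimized as usual, while the
# reversed gradient penalizes language-separable encoder features.
clf = DomainClassifier()
feats = torch.randn(8, 256, requires_grad=True)  # stand-in encoder outputs
labels = torch.randint(0, 3, (8,))               # language IDs per utterance
loss = torch.nn.functional.cross_entropy(clf(feats, lamb=0.5), labels)
loss.backward()
```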
Oral Examination Committee Certification i
Acknowledgements ii
Abstract (Chinese) iii
Abstract iv
Contents v
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Problem Statement 2
1.3 Literature Survey 2
1.3.1 Overview of Text-to-Speech System 2
1.3.2 Multi-Style Speech Synthesis 4
1.4 Contributions 4
1.5 Chapter Outline 5
Chapter 2 Background Knowledge 6
2.1 Mel Spectrogram 6
2.2 Tacotron 7
2.2.1 Encoder Module 8
2.2.2 Decoder Module 9
2.2.3 Postprocessing Net 10
2.3 Tacotron 2 10
2.4 Multi-style Tacotron 12
2.5 Domain Adaptation 13
Chapter 3 System Design 14
3.1 Dataset 14
3.1.1 LJ Speech 14
3.1.2 Suisiann Dataset 14
3.1.3 Emotion Speech Dataset 15
3.2 System Overview 15
3.3 Speaker Embedding Network 16
3.4 Emotion Embedding Network 18
3.5 Domain Adaptation Module 19
3.6 Vocoder 20
Chapter 4 Experiment Setup 21
4.1 Preprocess 21
4.1.1 Audio Preprocess 21
4.1.2 Text Preprocess 21
4.1.3 Valence-Arousal Setting 22
4.2 Pretraining 23
4.2.1 TTS Pretraining 23
4.2.2 Other Model Pretraining 23
4.3 Fine Tuning 23
4.4 Inference 25
4.5 Subjective Test 25
4.6 Objective Test 26
Chapter 5 Result and Discussion 27
5.1 Result of Pretrained Model 27
5.1.1 Speaker Embedding 27
5.1.2 Emotion Embedding 28
5.2 Text Embedding Distribution 30
5.3 Subjective Evaluation 31
5.3.1 Different Language Embeddings 32
5.3.2 Ablation Studies 35
5.4 Objective Evaluation 36
5.5 Other Findings 37
Chapter 6 Conclusion 38
References 39
[1] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. Proc. Interspeech 2017, pages 4006–4010, 2017.
[2] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
[3] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, and Rif A Saurous. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. In International Conference on Machine Learning, pages 4693–4702. PMLR, 2018.
[4] James A Russell. A circumplex model of affect. Journal of personality and social psychology, 39(6):1161, 1980.
[5] I-Thuan Kho-ki Iu-han Kong-si. Suísiann dataset. https://suisiann-dataset.ithuan.tw/, 2019.
[6] Chao-Peng Liu. Empathetic generative-based chatbot with emotion understanding via reinforcement learning. Master’s thesis, Department of Electrical Engineering, National Taiwan University, 2020.
[7] Jiun-Hao Jhan. Empathetic and retrieval-based chatbot using deep reinforcement learning. Master’s thesis, Graduate Institute of Communication Engineering, National Taiwan University, 2020.
[8] Ch G Kratzenstein. Sur la formation et la naissance des voyelles. Journal de Physique, 21:358–380, 1782.
[9] John J Ohala. Christian Gottlieb Kratzenstein: Pioneer in speech synthesis. In ICPhS, pages 156–159, 2011.
[10] Homer Dudley. The carrier nature of speech. Bell System Technical Journal, 19(4):495–515, 1940.
[11] Andrew J Hunt and Alan W Black. Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 1, pages 373–376. IEEE, 1996.
[12] Heiga Zen, Keiichi Tokuda, and Alan W Black. Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064, 2009.
[13] Heiga Zen, Andrew Senior, and Mike Schuster. Statistical parametric speech synthesis using deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7962–7966. IEEE, 2013.
[14] Sercan Ö Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep voice: Real-time neural text-to-speech. In International Conference on Machine Learning, pages 195–204. PMLR, 2017.
[15] Jose Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron C Courville, and Yoshua Bengio. Char2wav: End-to-end speech synthesis. In ICLR (Workshop), 2017.
[16] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and TieYan Liu. Fastspeech: fast, robust and controllable text to speech. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 3171–3180, 2019.
[17] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
[18] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[19] Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning, pages 5180–5189. PMLR, 2018.
[20] WeiNing Hsu, Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, et al. Hierarchical generative modeling for controllable speech synthesis. In International Conference on Learning Representations, 2018.
[21] Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, and Tom Bagby. Semi-supervised generative modeling for controllable speech synthesis. In International Conference on Learning Representations, 2019.
[22] Ye Jia, Yu Zhang, Ron J Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, et al. Transfer learning from speaker verification to multi-speaker text-to-speech synthesis. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4485–4495, 2018.
[23] Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, and Junichi Yamagishi. Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6184–6188. IEEE, 2020.
[24] Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. Proc. Interspeech 2019, pages 2080–2084, 2019.
[25] Xuehao Zhou, Xiaohai Tian, Grandee Lee, Rohan Kumar Das, and Haizhou Li. End-to-end code-switching TTS with cross-lingual language model. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7614–7618. IEEE, 2020.
[26] Tao Li, Shan Yang, Liumeng Xue, and Lei Xie. Controllable emotion transfer for end-to-end speech synthesis. In 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), pages 1–5. IEEE, 2021.
[27] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883. IEEE, 2018.
[28] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5329–5333. IEEE, 2018.
[29] Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, and Hiroshi Saruwatari. Cross-lingual text-to-speech synthesis via domain adaptation and perceptual similarity regression in speaker space. In INTERSPEECH, pages 2947–2951, 2020.
[30] Ralph Beebe Blackman and John Wilder Tukey. The measurement of power spectra from the point of view of communications engineering—part i. Bell System Technical Journal, 37(1):185–282, 1958.
[31] Stanley Smith Stevens, John Volkmann, and Edwin Broomell Newman. A scale for the measurement of the psychological magnitude pitch. The journal of the acoustical society of america, 8(3):185–190, 1937.
[32] Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
[33] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
[34] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
[35] Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378, 2017.
[36] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
[37] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.
[38] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[39] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR, 2015.
[40] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[41] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.
[42] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[43] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pages 577–585, 2015.
[44] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In International Conference on Machine Learning, pages 2410–2419. PMLR, 2018.
[45] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.
[46] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron C Courville. Melgan: Generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems, pages 14881–14892, 2019.
[47] Pengfei Wu, Zhenhua Ling, Lijuan Liu, Yuan Jiang, Hongchuan Wu, and Lirong Dai. End-to-end emotional speech synthesis using style tokens and semi-supervised training. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 623–627. IEEE, 2019.
[48] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[49] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pages 1180–1189. PMLR, 2015.
[50] Keith Ito and Linda Johnson. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
[51] Kun Zhou, Berrak Sisman, Rui Liu, and Haizhou Li. Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 920–924. IEEE, 2021.
[52] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez. Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4052–4056. IEEE, 2014.
[53] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer. End-to-end text-dependent speaker verification. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5115–5119. IEEE, 2016.
[54] Manoj Kumar, Tae Jin Park, Somer Bishop, and Shrikanth Narayanan. Designing neural speaker embeddings with meta learning. arXiv preprint arXiv:2007.16196, 2020.
[55] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Voxceleb: A large-scale speaker identification dataset. Proc. Interspeech 2017, pages 2616–2620, 2017.
[56] Guillaume Chanel, Karim Ansari-Asl, and Thierry Pun. Valence-arousal evaluation using physiological signals in an emotion recall paradigm. In 2007 IEEE International Conference on Systems, Man and Cybernetics, pages 2662–2667. IEEE, 2007.
[57] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, volume 8, pages 18–25. Citeseer, 2015.
[58] Carnegie Mellon University. The CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict, 2014.
[59] Ministry of Education, Republic of China (Taiwan). 臺灣閩南語羅馬字拼音方案使用手冊 (User manual of the Taiwan Southern Min Romanization scheme). https://ws.moe.edu.tw/001/Upload/FileUpload/3677-15601/Documents/tshiutsheh.pdf, 2007.
[60] Wen-Chin Huang, Yi-Chiao Wu, and Tomoki Hayashi. Any-to-one sequence-to-sequence voice conversion using self-supervised discrete speech representations. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5944–5948. IEEE, 2021.
[61] Po-chun Hsu and Hung-yi Lee. WG-WaveNet: Real-time high-fidelity speech synthesis without GPU. Proc. Interspeech 2020, pages 210–214, 2020.