National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 楊明豪
Author (English): YANG, MING-HAO
Title: 基於生成對抗網路之語音合成
Title (English): Speech Synthesis based on Generative Adversarial Network
Advisor: 許正欣
Advisor (English): Sheu, Jeng-Shin
Committee Members: 周修平, 王緒翔, 張傳育
Committee Members (English): ZHOU, XIU-PING; Wang, Syu-Siang; Chang, Chuan-Yu
Oral Defense Date: 2019-07-26
Degree: Master's
Institution: National Yunlin University of Science and Technology
Department: Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Document Type: Academic thesis
Year of Publication: 2019
Graduation Academic Year: 107 (ROC calendar; 2018-2019)
Language: Chinese
Pages: 53
Keywords (Chinese): 深度類神經網路, 生成對抗網路, 語音合成, 支援向量機, 語者辨識
Keywords (English): deep neural network, generative adversarial network, speech synthesis, support vector machine, speaker recognition
Usage counts:
  • Times cited: 0
  • Views: 179
  • Downloads: 12
  • Bookmarked: 0
In recent years, mature hardware and the availability of big data have driven breakthrough progress in deep neural networks (DNNs), with successful applications across many fields. Among the most groundbreaking deep architectures is the generative adversarial network (GAN), which offers an innovative way to train generative models. Specifically, the model is split into two sub-models: a generator, which produces samples, and a discriminator, which tries to classify samples as real or fake. This thesis investigates GAN-based speech synthesis: unlike traditional speech synthesis techniques, it exploits the GAN's ability to learn the distribution of the training data, thereby generating more natural speech.
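To make the generator/discriminator interplay described above concrete, here is a minimal sketch of one adversarial training step, assuming PyTorch. The layer sizes, noise dimension, and feature dimension are illustrative placeholders, not the architecture actually used in the thesis.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the abstract does not specify network sizes.
# NOISE_DIM is the generator's input noise vector; FEAT_DIM stands in for an
# acoustic feature vector.
NOISE_DIM, FEAT_DIM = 100, 60

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, FEAT_DIM),
)
discriminator = nn.Sequential(
    nn.Linear(FEAT_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_batch):
    """One adversarial round: the discriminator learns real vs. fake,
    then the generator learns to fool the discriminator."""
    n = real_batch.size(0)
    real_label = torch.ones(n, 1)
    fake_label = torch.zeros(n, 1)

    # Discriminator update: push D(real) toward 1 and D(G(z)) toward 0.
    fake = generator(torch.randn(n, NOISE_DIM)).detach()
    d_loss = (bce(discriminator(real_batch), real_label)
              + bce(discriminator(fake), fake_label))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: push D(G(z)) toward 1, i.e. toward "real".
    g_loss = bce(discriminator(generator(torch.randn(n, NOISE_DIM))), real_label)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```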
This thesis covers both Mandarin Chinese and English speech synthesis. For English, models were trained for three male and three female speakers from the CSTR VCTK corpus; for Chinese, models were likewise trained for three male and three female speakers from the COSPRO corpus and toolkit platform (口語韻律語料庫暨工具平台). In the listening tests, the English models achieved an average Mean Opinion Score (MOS) of 3.18 out of 5 (3.52 for male voices, 2.83 for female), while the Chinese models averaged 1.91 (2.21 for male, 1.60 for female). In the speaker recognition experiments, the average pass rates for text-dependent synthesized speech were 80.5% for a DNN classifier (72% Chinese, 89% English) and 86% for a support vector machine (SVM) (100% Chinese, 72% English). For text-independent synthesized speech, the pass rate varied with utterance length: at 0.5 seconds, the DNN averaged 36% (44% Chinese, 28% English) and the SVM 44.5% (61% Chinese, 28% English); at 3 seconds, the DNN averaged 75% (78% Chinese, 72% English) and the SVM 80.5% (72% Chinese, 89% English); at 5 seconds, the DNN averaged 89% (78% Chinese, 100% English) and the SVM 97% (94% Chinese, 100% English).
Regarding the MOS results, English benefits from more mature front-end linguistic rules that yield more complete text features, allowing the model to generate more natural speech; English synthesis therefore outperforms Chinese. In the text-dependent speaker recognition experiments, the English pass rate is lower than the Chinese one because the English utterances are much shorter than the Chinese ones. In the text-independent case, the pass rate rises with utterance length, so the security of a speaker recognition system can be improved by shortening the accepted utterance length or by improving the model. Finally, since the system's discriminator learns to judge the authenticity of speech during training, it could be integrated into a speaker recognition system to effectively block attacks using synthesized speech.
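As a sketch of the anti-spoofing idea raised in the last sentence, the trained discriminator could act as a gate in front of the speaker recognition back end. This assumes the PyTorch discriminator from the earlier sketch and a hypothetical score threshold; it is an illustration of the proposal, not the thesis's implementation.

```python
import torch

# Hypothetical cutoff; in practice it would be tuned on held-out genuine
# and synthesized utterances.
REAL_THRESHOLD = 0.5

def screen_before_speaker_id(discriminator, acoustic_feats):
    """Forward an utterance to the speaker-recognition back end only if
    the GAN discriminator scores its acoustic features as 'real'."""
    with torch.no_grad():
        realness = discriminator(acoustic_feats).mean().item()
    return realness >= REAL_THRESHOLD
```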


1. Introduction
1.1 Overview
1.2 System Description
1.3 Thesis Organization
2. Speech Synthesis Architecture
2.1 Text Features
2.1.1 Tokenization
2.1.2 Normalization
2.1.3 Part-of-Speech (POS) Tagging
2.1.4 Letter-to-Sound (LTS)
2.1.5 Phrase Breaks
2.1.6 Intonation
2.2 Acoustic Features
2.2.1 Fundamental Frequency (F0)
2.2.2 Mel-Generalized Cepstrum (MGC)
2.2.3 Band Aperiodicity
2.2.4 Voiced/Unvoiced Decision
2.3 Deep Neural Networks
2.3.1 Generative Adversarial Network (GAN)
2.3.2 Wasserstein Generative Adversarial Network (WGAN)
2.3.3 Conditional Generative Adversarial Network (CGAN)
2.3.4 Autoencoder
2.3.5 Transformer
3. System Design
3.1 System Architecture
3.2 Frontend
3.3 Vocoder
3.4 GAN Architecture
3.5 Objective Functions
3.5.1 Discriminator Objective
3.5.2 Adversarial Objective
3.5.3 Generator Objective
3.5.4 Training Procedure
3.5.5 Synthesis Procedure
4. Experimental Results
4.1 Corpora
4.2 Experimental Methods
4.2.1 Mean Opinion Score (MOS)
4.2.2 Short-Time Objective Intelligibility (STOI)
4.2.3 Speaker Recognition Models
4.3 Experimental Results
4.3.1 MOS Results
4.3.2 Statistics for Natural Speech
4.3.3 STOI Results
4.3.4 Speaker Recognition Pass Rates
4.4 Discussion
4.4.1 English
4.4.2 Chinese
4.4.3 Discriminator Architecture
4.4.4 Convergence Criteria
4.4.5 Noise Standard Deviation
5. Conclusions and Future Work
References


