臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Detailed Record

Graduate Student: 王奕雯
Graduate Student (English): Yih-Wen Wang
Thesis Title: 整合潛藏語者風格資訊於多語言語碼轉換語音合成
Thesis Title (English): Integrating Hidden Speaker and Style Information to Multi-Lingual and Code-Switching Speech Synthesis
Advisor: 陳嘉平
Advisor (English): Chen, Chia-Ping
Degree: Master's
Institution: National Sun Yat-sen University
Department: Department of Computer Science and Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2021
Graduation Academic Year: 110
Language: Chinese
Number of Pages: 121
Keywords (Chinese): 語音合成, Tacotron-2, 參數產生器, 梯度反轉層, 語者編碼器, 全局風格標註層, WaveGlow
Keywords (English): Speech Synthesis, Tacotron2, Parameter Generator, Gradient Reversal Layer, Speaker Encoder, Global Style Token Layer, WaveGlow
Usage statistics:
  • Cited by: 0
  • Views: 117
  • Rating:
  • Downloads: 16
  • Bookmarked: 0
Abstract (translated from the Chinese original):
This thesis studies and builds a Mandarin-English speech synthesis system for arbitrary speakers and styles. By integrating hidden speaker and style information, the final system takes any speaker, any style, and Mandarin-English text as conditions and outputs synthetic speech with the specified speaker and style that matches the text content. On top of the attention mechanism and decoder of the Tacotron2 synthesizer, we build a generative convolutional encoder: language features are fed into parameter generators that produce the parameters of every layer of the text encoder, so that a single encoder can encode text in different languages. We then add an adversarial speaker classifier that uses the idea of a gradient reversal layer to push the text encoder toward speaker-independent text representations, allowing the model to transfer a speaker's voice across languages at inference time. We further integrate speaker and style information and improve speech quality with the following modules: an independently trainable speaker encoder that extracts speaker information from any reference audio, so that any speaker's voice can be cloned; an unsupervised global style token layer with batch-instance normalization that learns to model the style in the audio, so that at inference time the speaking style can be extracted from any reference audio or any learned style can be selected directly, matching the way humans freely control speaking style; and a speech discriminator that treats the synthesizer as a generator and, following the training scheme of generative adversarial networks, learns to tell real speech from synthesized speech, making the synthesized speech harder to distinguish from real speech and thereby improving its quality. In addition, we apply transfer learning to train a WaveGlow vocoder for real-time speech generation. In the end, our Mandarin-English speech synthesis system not only synthesizes high-quality bilingual speech but also clones any speaker's voice and transfers any speaking style from any reference audio.
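The adversarial speaker classifier described in the abstract relies on a gradient reversal layer. Below is a minimal sketch of that idea, assuming a PyTorch implementation; the module names, layer sizes, and the lambd scale are illustrative and not taken from the thesis code. The layer acts as the identity in the forward pass and flips (and scales) the gradient in the backward pass:

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd going back."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed (scaled) gradient flows back into the text encoder,
        # pushing it to remove speaker information from its output.
        return -ctx.lambd * grad_output, None


class AdversarialSpeakerClassifier(nn.Module):
    """Predicts the speaker from the text encoding through a reversed gradient."""

    def __init__(self, text_dim: int, n_speakers: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_speakers),
        )

    def forward(self, text_encoding: torch.Tensor) -> torch.Tensor:
        # text_encoding: (batch, time, text_dim) produced by the text encoder
        reversed_encoding = GradReverse.apply(text_encoding, self.lambd)
        return self.classifier(reversed_encoding)  # per-frame speaker logits
```

During training, a cross-entropy loss on these logits is added to the synthesizer loss; because the gradient is reversed, the classifier learns to identify the speaker while the text encoder is simultaneously pushed to make that identification fail, which is what yields a speaker-independent text representation.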
Abstract (English):
This thesis proposes a Mandarin-English speech synthesis system for any speaker and style by integrating hidden speaker and style information. The system can take any speaker, style, and bilingual text as conditions and output speech with the specified speaker and style characteristics. Based on the attention mechanism and decoder of the Tacotron2 synthesizer, we implement multiple parameter generators conditioned on a language embedding to output the parameters for each layer of the text encoder, which then encodes the text in different languages using these parameters. Moreover, an adversarial speaker classifier with a gradient reversal layer encourages the text encoder to learn a speaker-independent text embedding, which enables the system to transfer voices across languages. In addition, a trainable speaker encoder extracts a speaker embedding from an audio signal to clone any speaker's voice; a global style token layer with batch-instance normalization models the prosody of speech and controls the speaking style during inference; and a speech discriminator improves speech quality by applying the concept of generative adversarial networks. Through transfer learning, a WaveGlow vocoder is trained to generate the speech. Finally, our Mandarin-English speech synthesis system not only synthesizes high-quality bilingual speech but also clones and transfers any speaker's voice and style from any reference audio.
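The "parameter generators conditioned on a language embedding" can be pictured as a small network that emits the weights of each text-encoder layer. The sketch below is a minimal illustration under assumed shapes; GeneratedConv1d, the per-sample loop, and all sizes are hypothetical rather than the thesis's actual architecture. It generates the weights and biases of one 1-D convolution from a language embedding, so a single encoder can serve both languages:

```python
import torch
from torch import nn
import torch.nn.functional as F


class GeneratedConv1d(nn.Module):
    """1-D convolution whose weights and biases are produced by a generator
    network conditioned on a per-utterance language embedding."""

    def __init__(self, lang_dim: int, in_ch: int, out_ch: int, kernel: int):
        super().__init__()
        self.in_ch, self.out_ch, self.kernel = in_ch, out_ch, kernel
        self.n_weight = out_ch * in_ch * kernel
        self.generator = nn.Linear(lang_dim, self.n_weight + out_ch)

    def forward(self, x: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_ch, time); lang_emb: (batch, lang_dim)
        outputs = []
        for xi, li in zip(x, lang_emb):  # one generated weight set per sample
            params = self.generator(li)
            weight = params[: self.n_weight].view(self.out_ch, self.in_ch, self.kernel)
            bias = params[self.n_weight:]
            outputs.append(
                F.conv1d(xi.unsqueeze(0), weight, bias, padding=self.kernel // 2)
            )
        return torch.cat(outputs, dim=0)  # (batch, out_ch, time)


# Illustrative usage: a 5-wide convolution over 512-channel text features,
# with its parameters generated from a 64-dimensional language embedding.
conv = GeneratedConv1d(lang_dim=64, in_ch=512, out_ch=512, kernel=5)
y = conv(torch.randn(2, 512, 100), torch.randn(2, 64))
```

Stacking several such generated layers gives a text encoder whose every layer is parameterized by the language, as the abstract describes.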
Table of Contents

Thesis Certification  i
Acknowledgments  ii
List of Tables  ix
List of Figures  xi
Chapter 1  Introduction  1
1.1  Research Motivation  1
1.2  Research Background  2
1.3  Evaluation Metrics  5
1.3.1  Mean Opinion Score  5
1.3.2  Mel-Cepstral Distortion  5
1.4  Tools  7
1.5  Research Contributions  8
1.6  Thesis Organization  9
Chapter 2  Single-Speaker Taiwanese-Accented Mandarin Speech Synthesis System  10
2.1  Synthesizer  11
2.1.1  Dataset  11
2.1.2  Data Preprocessing  13
2.1.3  GST-Tacotron2  14
2.1.3.1  Tacotron2  15
2.1.3.2  Global Style Token  18
2.1.3.3  Loss Function  22
2.1.4  Training Methods  23
2.1.4.1  Standard Training of Tacotron2  23
2.1.4.2  Pre-training GST-Tacotron2, Then Freezing the Tacotron2 Parameters  23
2.1.4.3  Pre-training GST-Tacotron2, Then Freezing the GST Parameters  24
2.1.4.4  Standard Training of Two GST-Tacotron2 Models, Then Merging Their Parameters  24
2.1.4.5  Pre-training Tacotron2 Without Freezing Any Parameters  24
2.2  Vocoder  25
2.2.1  Griffin-Lim  25
2.2.2  WaveGlow  26
2.3  Experimental Setup and Results  28
2.3.1  Experimental Setup  29
2.3.2  Results  29
2.3.2.1  Synthesizer Comparison  29
2.3.2.2  Vocoder Comparison  30
2.3.3  Summary  32
Chapter 3  Speaker-Specific Mandarin-English Speech Synthesis System  33
3.1  Synthesizer  34
3.1.1  Dataset  34
3.1.2  Data Preprocessing  36
3.1.3  Generative Convolutional Encoder  36
3.1.3.1  Network Architecture  36
3.1.3.2  Inference Procedure  38
3.2  Adversarial Speaker Classifier  39
3.2.1  Network Architecture  40
3.2.2  Loss Function  41
3.3  Vocoder  41
3.3.1  Dataset  42
3.3.2  Training Methods  42
3.3.2.1  Standard Training on Single-Speaker, Single-Language Data  42
3.3.2.2  Standard Training on Multi-Speaker, Multi-Language Data  43
3.3.2.3  Transfer Learning  43
3.4  Experimental Setup and Results  43
3.4.1  Experimental Setup  43
3.4.2  Results  44
3.4.2.1  Vocoder Comparison  44
3.4.2.2  Synthesizer Comparison  45
3.4.3  Summary  50
Chapter 4  Any-Speaker, Any-Style Mandarin-English Speech Synthesis System  51
4.1  Voice Cloning for Arbitrary Speakers  52
4.1.1  Speaker Encoder  53
4.1.1.1  Dataset  53
4.1.1.2  Network Architecture  54
4.1.1.3  Loss Function  54
4.1.2  Adversarial Speaker Classifier  57
4.1.2.1  Network Architecture  57
4.1.2.2  Loss Function  58
4.2  Style Transfer for Arbitrary Speakers  58
4.2.1  BIN-Global Style Token  59
4.3  Experimental Setup and Results  61
4.3.1  Experimental Setup  62
4.3.2  Results  63
4.3.2.1  Naturalness  63
4.3.2.2  Speaker Similarity  64
4.3.2.3  Synthesis Distortion  64
4.3.2.4  Synthesis Stability  65
4.3.3  Summary  73
Chapter 5  Improving the Mandarin-English Speech Synthesis System with Generative Adversarial Networks  75
5.1  Generative Adversarial Networks  76
5.2  Speech Discriminator  80
5.3  Experimental Setup and Results  82
5.3.1  Experimental Setup  83
5.3.2  Results  84
5.3.2.1  Naturalness  84
5.3.2.2  Speaker Similarity  85
5.3.2.3  Synthesis Distortion  85
5.3.2.4  Synthesis Stability  86
5.3.3  Summary  96
Chapter 6  Conclusion and Future Work  97
References  99