臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

詳目顯示 (Detailed Record)
Author: 林厚安
Author (English): Hou-An Lin
Title: 利用語音合成和對抗性文本鑑別器對語音辨識進行訓練以改進單語言以及語碼轉換下的語音辨識系統
Title (English): Improving Speech Recognition System under Monolingual and Code-Switching by Training with Speech Synthesis and an Adversarial Text Discriminator
Advisor: 陳嘉平
Advisor (English): Chen, Chia-Ping
Degree: Master's
Institution: 國立中山大學 (National Sun Yat-sen University)
Department: 資訊工程學系研究所 (Department of Computer Science and Engineering)
Discipline: Engineering
Field: Electrical and Computer Engineering
Document type: Academic thesis
Year of publication: 2022
Graduation academic year: 111 (ROC calendar)
Language: Chinese
Number of pages: 85
Keywords (Chinese): 語音辨識、語音合成、對抗性文本鑑別器、上下文區塊處理、逐塊同步集束搜索、串流處理
Keywords (English): automatic speech recognition, text to speech, adversarial text discriminator, contextual block processing, blockwise synchronous beam search, streaming method
摘要

本論文中,我們以基於注意力機制的卷積增強變換器 (Convolution-augmented Transformer, Conformer) 架構結合連續時序性分類來建立我們的端到端自動語音辨識 (Automatic Speech Recognition, ASR) 系統,同時使用上下文區塊處理 (Contextual Block Processing) 以及逐塊同步集束搜索 (Blockwise Synchronous Beam Search) 的方法使系統可以達到串流 (Streaming) 的可能,並以此架構做為我們本文的基礎系統,後續基於此基礎系統我們採用三種方法來提高自動語音辨識系統的性能。同時我們分別在各個系統上使用單語言以及語碼轉換的資料集進行訓練以及利用遷移學習的方式微調系統,並使用單一語言以及語碼轉換測試資料分別測試系統,並觀察改進後的系統在單語言以及語碼轉換情況下的結果。首先,我們添加了一個對抗性文本鑑別器模塊對語音辨識模型進行訓練以糾正辨識結果中的拼寫錯誤。實驗結果表明,加入對抗性文本鑑別器的單語言以及語碼轉換語音辨識系統的字符錯誤率 (Character Error Rate, CER) 分別從 12.6% 以及 48.7% 下降至 12.3% 以及 45.1%,而單詞錯誤率 (Word Error Rate, WER) 分別從 31.7% 以及 65.7% 下降至 31.4% 以及 65.4%。其次,我們在語音辨識模型中加入了對應語言情境下的預訓練的語音合成 (Text to Speech, TTS) 模型。語音合成模型可以將語音辨識模型的輸出結果作為輸入以合成對應的梅爾頻譜圖 (Mel-spectrogram),並近似真實的 (Ground-truth) 梅爾頻譜圖。在加入了語音合成模型後,單語言及語碼轉換的字元錯誤率分別從 12.6% 以及 48.7% 下降至 10.0% 以及 43.4%,而單詞錯誤率分別從 31.7% 以及 65.7% 下降至 23.0% 以及 64.3%。這表明預訓練的語音合成系統可以幫助提升語音辨識系統的效能。最後,我們將對應語言情境下的預訓練語音合成模型和對抗性文本鑑別器合併並對語音辨識模型進行訓練。通過這樣做,不僅可以有效地糾正錯別字,而且可以繼承語音合成系統修正原始語音辨識系統的效能。實驗結果表明,單一語言及語碼轉換的字符錯誤率與單詞錯誤率分別達到了 9.6% 和 22.0% 以及 41.6% 和 62.1%。
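The abstract above (and its English version below) reports results as character error rate (CER) and word error rate (WER). As a minimal illustrative sketch, and not the evaluation code used in the thesis, the Python snippet below computes both metrics as Levenshtein edit distance divided by reference length; CER is taken over characters (natural for Chinese), while WER assumes a whitespace word segmentation.

```python
# Minimal sketch of CER/WER as edit distance over reference length.
# Illustrative only; the thesis's own scoring pipeline is not reproduced here.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (free if tokens match)
            prev, d[j] = d[j], cur
    return d[-1]

def cer(ref_text, hyp_text):
    """Character error rate: edit distance over reference characters."""
    ref, hyp = list(ref_text.replace(" ", "")), list(hyp_text.replace(" ", ""))
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref_text, hyp_text):
    """Word error rate: edit distance over whitespace-separated words."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

if __name__ == "__main__":
    print(cer("語音 辨識 系統", "語音 變是 系統"))                      # 2 substitutions / 6 chars ≈ 0.33
    print(wer("speech recognition system", "speech wreck ignition system"))
```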
Abstract

In this thesis, we implement our end-to-end automatic speech recognition (ASR) system with a Conformer architecture based on the attention mechanism combined with Connectionist Temporal Classification (CTC), and we employ Contextual Block Processing and Blockwise Synchronous Beam Search to enable streaming recognition; this architecture serves as the baseline for our work. We then improve the speech recognition system with three methods built on this baseline. We train our systems on monolingual and code-switching datasets, fine-tune them with transfer learning, and evaluate the improved systems on monolingual and code-switching test data to observe how well they perform. First, we add an adversarial text discriminator module to the training of the speech recognition model to correct typos in the recognition results. The experimental results show that the character error rates (CER) of the monolingual and code-switching speech recognition systems with the text discriminator drop from 12.6% and 48.7% to 12.3% and 45.1%, respectively, and the word error rates (WER) drop from 31.7% and 65.7% to 31.4% and 65.4%, respectively. Second, we add a pre-trained text-to-speech (TTS) model for the corresponding language to the ASR model. The TTS model takes the ASR output as input, synthesizes the corresponding mel-spectrogram, and approximates the ground-truth mel-spectrogram. With the TTS model added, the character error rates for monolingual and code-switching drop from 12.6% and 48.7% to 10.0% and 43.4%, respectively, while the word error rates drop from 31.7% and 65.7% to 23.0% and 64.3%. This shows that a pre-trained TTS system can help improve the performance of the speech recognition system. Finally, we combine the language-specific pre-trained TTS model and the adversarial text discriminator to train the speech recognition model. By doing so, not only can typos be corrected effectively, but the benefits of the pre-trained TTS model are also inherited. According to the experimental results, the monolingual character and word error rates reach 9.6% and 22.0%, and the code-switching character and word error rates reach 41.6% and 62.1%.
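To make the training recipe in the abstract concrete, here is a heavily simplified, PyTorch-style sketch of how a hybrid CTC/attention ASR loss could be combined with an adversarial text-discriminator term and a TTS mel-spectrogram consistency term. All module interfaces (asr_model, tts_model, text_discriminator) and loss weights are hypothetical placeholders; the thesis's actual architectures, Soft-DTW alignment, and weighting are not reproduced here.

```python
# Hypothetical sketch of the multi-task objective suggested by the abstract:
# a hybrid CTC/attention ASR loss, an adversarial text-discriminator term that
# pushes ASR hypotheses toward well-formed text, and a TTS consistency term
# comparing mel-spectrograms synthesized from the hypotheses with the
# ground-truth mel-spectrograms. Interfaces and weights are placeholders.
import torch
import torch.nn.functional as F

def training_step(asr_model, tts_model, text_discriminator,
                  speech, speech_lens, text, text_lens, mel_target,
                  w_ctc=0.3, w_adv=0.1, w_tts=0.1):
    # 1) Baseline ASR: assume asr_model returns the CTC loss, the attention
    #    decoder loss, and the (soft) token distributions of its hypotheses.
    ctc_loss, att_loss, hyp_tokens = asr_model(speech, speech_lens, text, text_lens)
    asr_loss = w_ctc * ctc_loss + (1.0 - w_ctc) * att_loss

    # 2) Adversarial text discriminator, generator side only: the ASR model is
    #    trained so its hypotheses look like real text to the discriminator
    #    (the discriminator's own update step is omitted here).
    d_fake = text_discriminator(hyp_tokens)
    adv_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

    # 3) TTS consistency: a pre-trained TTS model maps the hypothesis back to a
    #    mel-spectrogram, which should match the ground-truth spectrogram.
    #    The thesis aligns the two with Soft-DTW; plain L1 keeps this sketch short.
    mel_hyp = tts_model(hyp_tokens)
    tts_loss = F.l1_loss(mel_hyp, mel_target)

    return asr_loss + w_adv * adv_loss + w_tts * tts_loss
```

Per the abstract, the TTS model is pre-trained for the corresponding language (monolingual or code-switching) and is treated here only as a fixed consistency signal for the ASR model.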
目錄 (Table of Contents)

論文審定書
誌謝
摘要
Abstract
圖目錄
表目錄
第 1 章 緒論
1.1 研究動機與目標
1.2 研究貢獻
1.3 投稿論文
1.4 文章架構
第 2 章 基礎端到端語音辨識系統
2.1 文本前處理模塊
2.2 資料增強
2.3 連續時序性分類 (Connectionist Temporal Classification) 模型
2.4 注意力機制之介紹
2.4.1 變換器 (Transformer) 架構
2.4.2 卷積增強變換器 (Conformer) 編碼器
2.5 結合連續時序性分類之端到端語音辨識系統
2.5.1 訓練階段之損失函數
2.5.2 解碼階段之計分方法
2.6 實現串流處理之方法
2.6.1 應用於編碼器之上下文區塊處理方法
2.6.2 用於解碼階段之逐塊同步集束搜索方法
2.7 遷移學習方法
第 3 章 結合語音合成模型與對抗性文本鑑別器改進端到端語音辨識系統
3.1 對抗性文本鑑別器模塊
3.2 利用對抗性文本鑑別器對語音辨識模型訓練之作法
3.3 語音合成端到端模型
3.3.1 文本轉拼音模塊
3.3.2 可變資訊轉接器
3.3.3 長度調節器
3.3.4 後網路架構
3.3.5 語音合成模型之編碼器以及解碼器架構
3.4 平滑動態時間校正 (Soft-Dynamic Time Warping, Soft-DTW)
3.5 利用語音合成模型加強語音辨識模型訓練之作法
3.6 循環生成對抗網路架構
3.7 利用語音合成模型以及對抗性文本鑑別器對語音辨識模型進行訓練之作法
第 4 章 實驗設置
4.1 語音辨識模型使用之資料集
4.1.1 單語言資料集
4.1.2 語碼轉換資料集
4.1.3 FSR-2020 資料集
4.2 語音合成模型使用之資料集
4.2.1 單語言資料集
4.2.2 多語言資料集
4.3 語音辨識模型以及語音合成模型之實驗設置
第 5 章 實驗結果
5.1 語音辨識系統之評估方式
5.2 基礎語音辨識系統架構在單語言以及語碼轉換上的結果
5.2.1 單語言測試集在單語言基礎語音辨識系統的結果分析
5.2.2 語碼轉換測試集在基礎語音辨識系統經過微調後的結果分析
5.3 利用對抗性文本鑑別器對語音辨識模型進行訓練之系統架構在單語言以及語碼轉換上的結果
5.3.1 單語言測試集在單語言對抗性文本鑑別器對語音辨識模型進行訓練之系統的結果分析
5.3.2 語碼轉換測試集在對抗性文本鑑別器對語音辨識模型進行訓練之系統經過微調後的結果分析
5.4 利用語音合成模型加強語音辨識模型訓練之系統架構在單語言以及語碼轉換上的結果
5.4.1 單語言測試集在單語言語音合成模型加強語音辨識模型訓練之系統的結果分析
5.4.2 語碼轉換測試集在語音合成模型加強語音辨識模型訓練之系統經過微調後的結果分析
5.5 利用語音合成模型以及對抗性文本鑑別器對語音辨識模型進行訓練之系統架構在單語言以及語碼轉換上的結果
5.5.1 單語言測試集在單語言語音合成模型以及對抗性文本鑑別器對語音辨識模型進行訓練之系統的結果分析
5.5.2 語碼轉換測試集在利用語音合成模型以及對抗性文本鑑別器對語音辨識模型進行訓練之系統經過微調後的結果分析
5.6 在語碼轉換情況效能低落之分析
5.7 計算相似度的平滑動態時間校正與平均絕對誤差的比較
5.8 FSR-2020 台文漢字實驗結果
第 6 章 結論與未來展望
參考文獻