Author: 李岳庭
Author (English): Yueh-Ting Lee
Title: 使用基於發音方式與位置的多任務學習來改進華語大詞彙語音辨識
Title (English): Improving Mandarin LVCSR Using Place and Manner Based Multi-task Learning
Advisor: 張智星
Committee: 廖元甫, 王新民
Oral defense date: 2019-06-25
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Document type: Academic thesis
Publication year: 2019
Graduating academic year: 107 (2018-2019)
Language: Chinese
Pages: 52
Keywords: multi-task learning; articulatory features; time-delay neural networks; large vocabulary speech recognition
DOI: 10.6342/NTU201901599
Abstract (translated from the Chinese): In large vocabulary speech recognition, replacing GMM-HMM acoustic models with DNN-HMM has already yielded significant improvements. This thesis uses a multi-task learning neural network (MTL-DNN): in addition to the main senone classification task, articulatory features based on place and manner of articulation are trained jointly as subtasks, which improves recognition performance. Compared with previous work, we propose three improvements. First, the articulatory-feature labels are divided into four blocks whose features are mutually exclusive within each block, and these blocks replace the traditional multi-label scheme as the subtask output layers of the MTL-TDNN model. Second, time-delay neural networks (TDNN) replace conventional feed-forward networks; the TDNN structure incorporates more contextual information into training. Third, the subtask output layers are attached to lower hidden layers. The experiments use the Mandarin Chinese broadcast news corpus (MATBN), split into a small set (MATBN-20) and a large set (MATBN-200), and are evaluated by character error rate (CER). Compared with a conventional single-task TDNN model, the best model achieves relative improvements of 3.33% on MATBN-20 and 1% on MATBN-200.
Abstract (English): In large vocabulary continuous speech recognition (LVCSR), it is well known that recognition performance improves when DNN-HMM acoustic models replace GMM-HMM. In this thesis, we use a multi-task learning model (MTL-DNN), aiming to simultaneously minimize the cross-entropy losses over the output scores of senones and of articulatory attributes such as place and manner. The proposed framework has three novelties compared with previous studies. First, the subtasks designed for articulation classification ensure that the attributes within each block are mutually exclusive. Second, instead of fully-connected multilayer perceptrons, the well-known time-delay neural network (TDNN) structure is adopted to efficiently model long temporal contexts. Finally, in the proposed MTL-TDNN architecture, layer-wise neuron sharing with the subtasks occurs only in the first few layers. We performed experiments on the Mandarin Chinese broadcast news corpus (MATBN), including a small dataset (MATBN-20) and a large dataset (MATBN-200). Compared with the conventional single-task learning TDNN model, the experiments show that the proposed framework achieves relative character error rate (CER) reductions of 3.3% and 1% on the small and large datasets, respectively.
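To make the architecture described in the abstracts concrete, the following is a minimal PyTorch sketch, not the author's implementation. The layer sizes, context widths, the four attribute blocks and their class counts, the branch point, and the subtask weight `alpha` are all illustrative assumptions; only the overall shape follows the thesis's description: a shared TDNN trunk, a senone head at the top, mutually exclusive articulatory-attribute heads attached to a lower hidden layer, and a summed cross-entropy loss.

```python
# Sketch of the MTL-TDNN idea described above (hypothetical sizes throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTLTDNN(nn.Module):
    def __init__(self, feat_dim=40, num_senones=3000,
                 attr_blocks=(9, 6, 2, 2)):  # assumed block sizes
        super().__init__()
        # TDNN layers as dilated 1-D convolutions over time; dilation
        # widens the temporal context without adding parameters.
        self.tdnn1 = nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1)
        self.tdnn2 = nn.Conv1d(512, 512, kernel_size=3, dilation=2)
        self.tdnn3 = nn.Conv1d(512, 512, kernel_size=3, dilation=3)
        # Main task: senone logits from the top of the trunk.
        self.senone_head = nn.Conv1d(512, num_senones, kernel_size=1)
        # Subtasks: one softmax head per attribute block, attached to a
        # lower hidden layer (here, the tdnn2 output).
        self.attr_heads = nn.ModuleList(
            [nn.Conv1d(512, n, kernel_size=1) for n in attr_blocks])

    def forward(self, x):                  # x: (batch, feat_dim, frames)
        h1 = F.relu(self.tdnn1(x))
        h2 = F.relu(self.tdnn2(h1))        # branch point for the subtasks
        h3 = F.relu(self.tdnn3(h2))
        return self.senone_head(h3), [head(h2) for head in self.attr_heads]

def mtl_loss(senone_logits, attr_logits, senone_tgt, attr_tgts, alpha=0.3):
    """Senone cross-entropy plus weighted cross-entropy per attribute block.
    Because the features within each block are mutually exclusive, each
    block is an ordinary single-label softmax task, not multi-label.
    Targets must be aligned to each head's (shrunken) frame count."""
    loss = F.cross_entropy(senone_logits, senone_tgt)
    for logits, tgt in zip(attr_logits, attr_tgts):
        loss = loss + alpha * F.cross_entropy(logits, tgt)
    return loss

if __name__ == "__main__":
    model = MTLTDNN()
    senones, attrs = model(torch.randn(8, 40, 100))  # 100 input frames
    print(senones.shape)                   # torch.Size([8, 3000, 86])
    print([a.shape[1] for a in attrs])     # [9, 6, 2, 2]
```

On the reported metric: relative CER reduction is (CER_single - CER_multi) / CER_single. With purely illustrative numbers, a single-task CER of 12.0% falling to 11.6% under multi-task learning would be a relative reduction of 0.4 / 12.0, or about 3.3%.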
Acknowledgements iii
Abstract (Chinese) v
Abstract (English) vii
1 Introduction 1
1.1 Overview 1
1.2 Tools 2
1.3 Literature Review 2
1.4 Thesis Organization 3
2 Research Background 5
2.1 Problem Definition 5
2.2 Acoustic Features 6
2.2.1 Mel-Frequency Cepstral Coefficients 6
2.2.2 Factor Analysis and i-vectors 10
2.3 Acoustic Model Training 17
2.4 Articulatory Features: Place and Manner 18
2.5 Time-Delay Neural Networks 24
2.6 Multi-task Learning 27
3 Experiments 31
3.1 Corpora 31
3.1.1 MATBN: Mandarin Chinese Broadcast News Corpus 31
3.1.2 TCC-300 Microphone Speech Corpus 33
3.1.3 Pronunciation Lexicon and Language Model 34
3.2 Experimental Procedure 35
3.2.1 Acoustic Feature Extraction 35
3.2.2 Neural Network Architecture 36
3.2.3 Training Procedure and Parameter Settings 38
3.2.4 Evaluation Metrics 40
3.3 Experimental Results 41
3.4 Error Analysis 43
4 Conclusions and Future Work 47
4.1 Conclusions 47
4.2 Future Work 48
Bibliography 49