
National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)


Detailed Record

Author: 楊子毅 (Tzu-i Yang)
Title: 基於識別向量分群與深層類神經網路之語者調適
Title (English): Speaker Adaptation over Deep Neural Network by Clustering Identity Vectors
Advisor: 李琳山 (Lin-shan Lee)
Committee members: 王小川、陳信宏、李宏毅、簡仁宗
Oral defense date: 2015-07-07
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Communication Engineering (電信工程學研究所)
Discipline: Engineering
Field: Electrical engineering and computer science
Document type: Academic thesis
Year of publication: 2015
Academic year of graduation: 103 (2014–2015)
Language: Chinese
Pages: 69
Keywords (Chinese): 深層類神經網路、識別向量、語者調適、向量結合、語者分群
Keywords (English): deep neural network; i-vector; speaker adaptation; vector combination; speaker clustering
Metrics:
  • Cited by: 0
  • Views: 114
  • Downloads: 0
  • Bookmarked: 0
Abstract: Speech recognition is used ever more widely, appearing in a growing range of application environments, which makes speaker adaptation increasingly important. Deep neural networks have also become the mainstream acoustic model. This thesis clusters the mean identity vector (i-vector) of each speaker, trains a cluster-specific deep neural network model for each group of speakers, and then uses these pre-trained models for speaker adaptation. Two approaches are proposed: the first uses the test speaker's i-vector as a selection criterion to pick the most suitable cluster model; the second uses a supervised method to learn a combination vector that integrates the outputs of the individual models. Experiments on a highly colloquial, personalized, and bilingual corpus show that the proposed framework rapidly improves recognition accuracy when adaptation data are scarce, and also performs well as the amount of adaptation data grows.
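The two adaptation schemes described in the abstract can be sketched as follows. This is a minimal illustration only, assuming scikit-learn's KMeans for the clustering step; the function names, the nearest-centroid selection metric, and the weight normalization in the combination step are illustrative assumptions, not details taken from the thesis:

```python
import numpy as np
from sklearn.cluster import KMeans


def train_speaker_clusters(speaker_ivectors, n_clusters=4, seed=0):
    """Cluster the speakers' mean i-vectors. A cluster-specific DNN
    acoustic model would then be trained per cluster (not shown)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(speaker_ivectors)
    return km


def select_cluster_model(km, test_ivector):
    """Scheme 1: use the test speaker's i-vector as the selection
    criterion, picking the cluster whose centroid is closest."""
    dists = np.linalg.norm(km.cluster_centers_ - test_ivector, axis=1)
    return int(np.argmin(dists))


def combine_outputs(cluster_posteriors, combination_weights):
    """Scheme 2: merge the per-cluster model outputs with a learned
    combination vector (the supervised learning of the weights is
    omitted; simple normalization is assumed here)."""
    w = np.asarray(combination_weights, dtype=float)
    w = w / w.sum()
    # Weighted sum over the cluster axis of shape (n_clusters, n_states).
    return np.tensordot(w, np.asarray(cluster_posteriors), axes=1)
```

With scarce adaptation data, scheme 1 needs only a single i-vector estimate to route the test speaker to an existing model, which is consistent with the abstract's claim of rapid gains in that regime.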

Table of Contents

Acknowledgments
Chinese Abstract
Chapter 1: Introduction
  1.1 Motivation
  1.2 Related Work
  1.3 Research Direction
  1.4 Chapter Outline
Chapter 2: Background
  2.1 Deep Neural Networks
    2.1.1 Basics
    2.1.2 Training Methods
    2.1.3 Dropout
    2.1.4 Annealed Dropout
  2.2 Speaker Adaptation
    2.2.1 Overview
    2.2.2 Eigenvoices
    2.2.3 Singular Value Decomposition
    2.2.4 Feature-space Discriminative Linear Regression
    2.2.5 Speaker Codes
    2.2.6 Other Related Speaker Adaptation Methods
  2.3 Chapter Summary
Chapter 3: Baseline Experiments and Analysis
  3.1 DNN Training Procedure
    3.1.1 Restricted Boltzmann Machines
    3.1.2 Training Setup and Steps
  3.2 Baseline Experiments with DNN Acoustic Models
    3.2.1 Acoustic Model Architecture
    3.2.2 Dropout Experiments
  3.3 Chapter Summary
Chapter 4: Identity Vectors (i-vectors)
  4.1 Joint Factor Analysis
    4.1.1 Overview of Joint Factor Analysis
    4.1.2 Training the Eigenvoice Matrix
    4.1.3 Training the Eigenchannel Matrix
    4.1.4 Training the Residual Matrix
  4.2 Identity Vectors
  4.3 Chapter Summary
Chapter 5: Speaker Adaptation System
  5.1 Speaker Adaptation
    5.1.1 i-vector Adaptation
    5.1.2 Speaker-Cluster Adaptation
    5.1.3 DNN Ensembles
    5.1.4 Experimental Results and Analysis
    5.1.5 Overall Comparison
  5.2 Chapter Summary
Chapter 6: Conclusion and Future Work
  6.1 Conclusion and Outlook
    6.1.1 Main Contributions
    6.1.2 Future Research Directions
References
Appendix A: The Backpropagation Algorithm

