Author: 周郁馨
Author (English): Yu-Sin Jhou
Title: 基於長短期記憶網路和連結時序分類的喚醒詞辨識
Title (English): Wake-up Word Detection Using Long Short-Term Memory Network and Connectionist Temporal Classification
Advisor: 王家慶
Degree: Master's
Institution: National Central University
Department: Department of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Document type: Academic thesis
Year of publication: 2019
Graduation academic year: 107 (2018–2019)
Language: Chinese
Number of pages: 32
Keywords (Chinese): 喚醒詞、深度學習、長短期記憶網路、連結時序分類
Keywords (English): wake-up word; deep learning; long short-term memory; connectionist temporal classification
With the development of deep learning, applications of artificial intelligence have become increasingly widespread, and speech recognition has likewise improved markedly. Wake-up word detection, also called keyword spotting, is the task of locating a specific word within a continuous speech signal; deep learning approaches outperform traditional methods such as the hidden Markov model (HMM). Conventional deep-learning wake-up word systems, built on architectures such as deep neural networks (DNN) or recurrent neural networks (RNN), are typically trained on large amounts of audio of one specific keyword, so that the network learns the acoustic features of that keyword and then predicts whether it occurs in continuous audio. Such systems, however, can only recognize a fixed wake-up word: replacing it or adding a new one requires collecting new keyword data and retraining the model.
This thesis implements a wake-up word detection model using a long short-term memory network (LSTM) and connectionist temporal classification (CTC). Unlike models that directly predict whether a wake-up word is present in the audio, this model uses the LSTM to predict the phonemes in the audio, uses CTC to evaluate candidate phoneme sequences, and then checks whether the wake-up word appears in the phoneme sequence. Because the network is trained for phoneme-sequence prediction, non-keyword audio can be used as training data, allowing the network to predict phonemes more accurately. Moreover, when the wake-up word changes, the network does not need to be retrained; only a small amount of new wake-up word data is needed to adapt it.
(English abstract) With the development of deep learning, applications of artificial intelligence have become more and more popular, and the performance of speech recognition has also improved substantially. Wake-up word detection, also called keyword spotting, deals with identifying a keyword in an audio signal. Deep learning now achieves better performance than traditional approaches such as the hidden Markov model (HMM). To build a deep-learning wake-up word model (for example, a deep neural network or a recurrent neural network), a large amount of keyword-specific audio must be used to train the model, so that it can learn the features of the wake-up word audio and predict whether the wake-up word occurs in a continuous audio signal. However, such keyword detection systems can only detect a fixed keyword; to change the keyword or add a new one, new keyword-specific data must be collected and the model retrained.
In this thesis, we use a long short-term memory network (LSTM) and connectionist temporal classification (CTC) as the keyword detection model. It differs from general keyword detection in that the system uses the LSTM to predict phoneme posteriors and CTC to compute the probability of a phoneme sequence. Because the model predicts phoneme sequences, non-keyword data can be used as training data, allowing the model to predict sequences more accurately. Furthermore, when the wake-up word changes, the system does not have to be retrained; only a small amount of new wake-up word data is needed to adapt it.
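The abstract describes scoring a wake-up word's phoneme sequence against the per-frame phoneme posteriors produced by the LSTM. The thesis text here contains no code, so the following is only a minimal sketch of the standard CTC forward algorithm (Graves et al.'s method, cited as reference [16]) in plain Python; the variable names, the list-of-lists `posteriors` layout, and the convention that label 0 is the CTC blank are assumptions for illustration, not the thesis's actual implementation.

```python
def ctc_score(label_seq, posteriors, blank=0):
    """CTC forward algorithm: probability that the per-frame label
    posteriors collapse to `label_seq` after removing repeats and blanks.

    label_seq  -- list of phoneme label indices (e.g. the wake-up word)
    posteriors -- posteriors[t][k] = P(label k at frame t), rows sum to 1
    """
    # Extended label sequence with blanks interleaved: _ l1 _ l2 _ ...
    ext = [blank]
    for lab in label_seq:
        ext += [lab, blank]
    S, T = len(ext), len(posteriors)

    # alpha[t][s]: total probability of all prefixes of ext[:s+1] at frame t
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = posteriors[0][blank]
    if S > 1:
        alpha[0][1] = posteriors[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                      # stay on same symbol
            if s > 0:
                a += alpha[t - 1][s - 1]             # advance one symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]             # skip a blank
            alpha[t][s] = a * posteriors[t][ext[s]]

    # Valid paths end on the last label or the trailing blank.
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
```

A detector along these lines would convert a (possibly new) wake-up word into its phoneme sequence and fire when `ctc_score` exceeds a threshold, which is consistent with the abstract's claim that changing the wake-up word does not require retraining the network.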
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
  2.1 Wake-up Word Detection Based on Hidden Markov Models
  2.2 Wake-up Word Detection Based on Deep Learning Networks
    2.2.1 Convolutional Neural Networks
    2.2.2 Recurrent Neural Networks
    2.2.3 Long Short-Term Memory Networks
    2.2.4 Gated Recurrent Units
  2.3 Wake-up Word Detection Based on Connectionist Temporal Classification
Chapter 3 System Architecture
  3.1 System Architecture Design
  3.2 Feature Extraction
  3.3 LSTM-CTC
    3.3.1 LSTM
    3.3.2 CTC
  3.4 Comparison
Chapter 4 Experiments
  4.1 Dataset Description
  4.2 Experimental Environment, Parameters, and Network Settings
  4.3 Experimental Results
    4.3.1 Wake-up Word Evaluation
    4.3.2 Comparison of Results
    4.3.3 Wake-up Word Replacement
Chapter 5 Conclusions and Future Work
Chapter 6 References
[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proceedings of the IEEE, 77(2), pp. 257-286, February 1989.
[2] G. D. Forney, "The Viterbi algorithm", Proceedings of the IEEE, 1973.
[3] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning", Nature, 521(7553), pp. 436-444, May 2015.
[4] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition", Proceedings of the IEEE, 1998.
[5] T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting", Sixteenth Annual Conference of the International Speech Communication Association (Interspeech), 2015.
[6] L. C. Jain and L. R. Medsker, "Recurrent Neural Networks: Design and Applications", CRC Press, Boca Raton, FL, 1999.
[7] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks", Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010.
[8] J. Zhou, J. Liu, Y. Song and T. Yu, "Keyword spotting based on recurrent neural network", Proceedings of the IEEE, 1998.
[9] H. Sak, A. Senior and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition", arXiv:1402.1128 [cs.NE], February 2014.
[10] J. Zhang, L. Huang and J. Sun, "Keyword spotting with long short-term memory neural network architectures", International Conference on Computer, Electronics and Communication Engineering (CECE 2017), 2017.
[11] J. Chung, C. Gulcehre, K. Cho and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling", arXiv:1412.3555 [cs.NE], December 2014.
[12] K. Cho, B. van Merriënboer and D. Bahdanau, "On the properties of neural machine translation: encoder-decoder approaches", arXiv:1409.1259 [cs.CL], October 2014.
[13] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks", Proceedings of the 31st International Conference on Machine Learning (ICML), Vol. 32, pp. 1764-1772, 2014.
[14] G.-D. Wu and C.-T. Lin, "Word boundary detection with mel-scale frequency bank in noisy environment", IEEE Transactions on Speech and Audio Processing, 8(5), pp. 541-554, 2000.
[15] S. Gupta, J. Jaafar, W. F. W. Ahmad and A. Bansal, "Feature extraction using MFCC", Signal & Image Processing: An International Journal (SIPIJ), 4(4), August 2013.
[16] A. Graves, S. Fernández, F. Gomez and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks", Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, pp. 369-376, June 2006.
[17] Z. Wang, X. Li and J. Zhou, "Small-footprint keyword spotting using deep neural network and connectionist temporal classifier", arXiv:1709.03665 [cs.CL], September 2017.
[18] D. Cireşan et al., "Multi-column deep neural network for traffic sign classification", Neural Networks, 32, pp. 333-338, 2012.
[19] H. A. Rowley, S. Baluja and T. Kanade, "Neural network-based face detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), pp. 23-38, 1998.
[20] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.