
National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)


Detailed Record

Author: Chia-Wei Ao (敖家維)
Title: Query-by-Example Spoken Term Detection Based on Attention-Based Neural Network (基於專注式類神經網路之依例查詢口述語彙偵測)
Advisor: Hung-yi Lee (李宏毅)
Oral defense date: 2017-07-10
Degree: Master
Institution: National Taiwan University
Department: Graduate Institute of Communication Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2017
Academic year of graduation: 105
Language: Chinese
Pages: 75
Keywords (Chinese): 專注式模型; 依例查詢
Keywords (English): Attention-based Model; Query-by-example
Record statistics:
  • Cited: 0
  • Views: 429
  • Downloads: 0
  • In bookshelf lists: 0
This thesis focuses on spoken term detection in spoken multimedia content. With the rapid growth of the Internet, multimedia containing speech, such as online courses, movies, dramas, and meeting recordings, has increased steadily, and the retrieval of spoken content has accordingly attracted growing attention. The key component of spoken content retrieval is spoken term detection: locating the portions of a spoken document in which a query term occurs. In this thesis the query itself is a speech signal, not text. Conventional approaches first transcribe the spoken query into text with an automatic speech recognition (ASR) system; this thesis instead bypasses ASR and uses neural networks to learn acoustic representations from the training corpus, so that spoken term detection can be performed directly on the speech signal, avoiding the problem of ASR errors degrading the retrieval system.
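The direct-matching idea above can be illustrated with a minimal sketch: suppose a neural network has already encoded the spoken query and each sliding window of the spoken document into fixed-dimensional vectors; detection then reduces to ranking windows by similarity. The function names, the use of cosine similarity, and the 2-dimensional toy vectors below are illustrative assumptions, not the exact model of this thesis.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def best_match(query_vec, window_vecs):
    """Rank document windows by similarity to the query embedding.

    query_vec:   (d,) embedding of the spoken query
    window_vecs: (T, d) embeddings of sliding windows over the document
    Returns (best_score, best_window_index).
    """
    scores = [cosine(query_vec, w) for w in window_vecs]
    idx = int(np.argmax(scores))
    return scores[idx], idx

# Toy example: window 1 of the "document" matches the "query".
query = np.array([1.0, 0.0])
doc_windows = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
score, idx = best_match(query, doc_windows)
```

Because no ASR transcript is involved, the quality of detection depends entirely on how well the learned embeddings separate matching from non-matching audio segments.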
This thesis adopts an attention mechanism, which lets the model focus on a particular region of the spoken document and suppress the influence of irrelevant noise. A review mechanism lets the model attend to different parts of the spoken document conditioned on its previous inputs, so the model can attend to the document multiple times and locate the query term more precisely. The thesis also experiments with audio word vectors, encoding spoken documents into vectors that capture relations between words, and performing spoken term detection on these document vectors.
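The attention-and-review idea can be sketched as follows, under simplifying assumptions: frame-level document states and a query vector of matching dimension, dot-product energies normalized by softmax, and a hypothetical multi-pass variant that refines the attending vector with the previous context. This is a generic attention sketch, not the thesis's exact architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query_vec, doc_states):
    """One attention pass over per-frame document states.

    Dot products between the query vector and each frame state give
    energies; softmax turns them into weights; the attended document
    vector is the weighted sum of frame states.
    """
    energies = doc_states @ query_vec   # (T,) unnormalized scores
    weights = softmax(energies)         # attention distribution over frames
    context = weights @ doc_states      # (d,) attended document vector
    return context, weights

def multi_pass_attend(query_vec, doc_states, passes=3):
    """Hypothetical multi-pass (review) variant: add the previous
    context to the attending vector before attending again."""
    q = query_vec
    weights = None
    for _ in range(passes):
        context, weights = attend(q, doc_states)
        q = q + context
    return q, weights

# Toy example: frame 1 resembles the query, so it should dominate.
q = np.array([2.0, 0.0])
frames = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.5]])
ctx, w = attend(q, frames)
q_final, w_final = multi_pass_attend(q, frames, passes=2)
```

Each extra pass re-weights the frames with a vector already biased toward what was found before, which is one way to read the thesis's claim that the model can attend to the document multiple times and locate the query term more precisely.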
Acknowledgements
Abstract (Chinese)

Chapter 1: Introduction
1.1 Motivation
1.2 Research Directions
1.3 Thesis Organization

Chapter 2: Background
2.1 Information Retrieval and Spoken Content Retrieval
2.1.1 Information Retrieval
2.1.2 Spoken Content Retrieval
2.1.3 Segmental Dynamic Time Warping (Segmental DTW)
2.1.4 Evaluation Metrics for Information Retrieval
2.2 Deep Neural Networks (DNN)
2.2.1 Overview
2.2.2 Operating Principles
2.2.3 Training Neural Networks
2.2.4 Difficulties of Neural Networks
2.3 Recurrent Neural Networks (RNN)
2.3.1 Overview
2.3.2 Operating Principles
2.3.3 Backpropagation Through Time
2.3.4 Long Short-Term Memory Networks
2.4 Chapter Summary

Chapter 3: Query-by-Example Spoken Term Detection Based on Recurrent Neural Networks
3.1 Overview
3.2 Feature Representations from Recurrent Neural Networks
3.2.1 Acoustic Feature Extraction
3.2.2 Sequence-to-Sequence Model
3.3 System Architecture
3.3.1 System Overview
3.3.2 Recurrent Neural Network Model
3.3.3 Training
3.4 Experiments and Analysis
3.4.1 Experimental Setup
3.4.2 Baselines
3.4.3 Results and Analysis
3.5 Chapter Summary

Chapter 4: Query-by-Example Spoken Term Detection Based on Attention-Based Neural Networks
4.1 Overview
4.2 Attention Mechanism
4.3 Model Architecture
4.3.1 System Architecture Overview
4.3.2 Vector Representation of the Spoken Query
4.3.3 Attention Mechanism and Spoken Document Representation
4.3.4 Review Mechanism
4.3.5 Classifier
4.4 Unsupervised Training
4.5 Experiments and Analysis
4.5.1 Baselines and Experimental Setup
4.5.2 Results and Comparison
4.5.3 Combination with Baselines
4.5.4 Analysis of the Attention Mechanism
4.5.5 Unsupervised Training Results
4.6 Chapter Summary

Chapter 5: Query-by-Example Spoken Term Detection Based on Audio Word Vectors
5.1 Overview
5.2 Audio Word Vectors
5.3 Model Architecture
5.3.1 Model Overview
5.3.2 Inference Mechanism
5.3.3 Training
5.4 Experiments and Analysis
5.4.1 Experimental Setup and Baselines
5.4.2 Results and Analysis
5.5 Chapter Summary

Chapter 6: Query-by-Example Spoken Term Detection Based on Attention and Audio Word Vectors
6.1 Overview
6.2 Model Architecture
6.2.1 Model Overview
6.2.2 Training
6.3 Experiments and Analysis
6.3.1 Experimental Setup and Baselines
6.3.2 Results and Analysis
6.4 Chapter Summary

Chapter 7: Conclusion and Future Work
7.1 Conclusion
7.2 Future Work
