臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Detailed Record

Author: 蔡政昱
Author (English): Cheng-Yu Tsai
Title: 音型之交互增強及多層次音型深層類神經網路使用於非督導式語音特徵抽取與口述語彙發掘
Title (English): Mutual Reinforcement for Acoustic Tokens and Multi-level Acoustic Tokenizing Deep Neural Network for Unsupervised Speech Feature Extraction and Spoken Term Discovery
Advisor: 李琳山 (Lin-shan Lee)
Committee Members: 王小川, 陳信宏, 鄭秋豫, 簡仁宗, 李宏毅
Oral Defense Date: 2015-07-07
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Communication Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Year of Publication: 2015
Graduation Academic Year: 103 (AY 2014-2015)
Language: Chinese
Pages: 116
Keywords (Chinese): 非督導式語音特徵抽取; 非督導式口述語彙發掘
Keywords (English): Unsupervised Speech Feature Extraction; Unsupervised Spoken Term Discovery
Usage Statistics:
  • Cited by: 0
  • Views: 106
  • Downloads: 0
  • Bookmarked: 0
This thesis investigates two core unsupervised learning problems in speech processing: Unsupervised Speech Feature Extraction and Unsupervised Spoken Term Discovery. Today's successful speech recognition technologies are all built on highly supervised learning frameworks: they rely not only on extensive expert knowledge of the language to be recognized, but also on large amounts of manually annotated training data, both of which are costly to obtain. In the current era of Big Data, an endless stream of new speech signals is produced every day, and manually annotating each recording is plainly impractical. Unsupervised learning, which requires no manual annotation, has therefore attracted growing attention in recent years; it removes the cost of annotation, and it also more closely mirrors the way human infants acquire language.
For unsupervised spoken term discovery, this thesis builds on the acoustic tokens automatically learned by the Multi-level Acoustic Tokenizer (MAT). We train a Recurrent Neural Network Language Model (RNNLM) on the automatically learned tokens and their types, extract a word embedding for each token, and examine how well these embeddings can correct errors in the type assignment of tokens. We also propose Mutual Reinforcement for Acoustic Tokens, which integrates the acoustic and linguistic information carried by multiple independently learned token sets to produce a better initialization for training the tokenizer, so that better tokens can be learned.
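As a minimal sketch of this idea (not the thesis implementation; the toy corpus, token IDs, model sizes, and training schedule below are all illustrative assumptions), one can train a small recurrent language model to predict the next acoustic token in each utterance, then read the token embeddings off the model's input embedding matrix:

```python
# Illustrative sketch: a tiny RNN language model over discovered
# acoustic-token sequences; all data and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class TokenRNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=200):
        super().__init__()
        # The rows of this embedding matrix double as the token embeddings
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.out(h)                      # next-token logits per position

# Toy corpus: each utterance is a sequence of acoustic-token type IDs
corpus = [[3, 17, 5, 9], [17, 5, 2, 11, 9], [3, 9, 5, 17]]
VOCAB = 32
model = TokenRNNLM(VOCAB)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    for seq in corpus:
        x = torch.tensor([seq[:-1]])            # inputs: tokens 0..T-2
        y = torch.tensor([seq[1:]])             # targets: tokens 1..T-1
        loss = nn.functional.cross_entropy(model(x).view(-1, VOCAB), y.view(-1))
        opt.zero_grad(); loss.backward(); opt.step()

token_embeddings = model.embed.weight.detach()  # one vector per token type
```

The intuition is that tokens used interchangeably in similar contexts receive nearby embedding vectors, which is what makes such embeddings useful for spotting and correcting type-assignment errors among the discovered tokens.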
This thesis further proposes the Multi-level Acoustic Tokenizing Deep Neural Network (MAT-DNN), which couples a Multi-level Acoustic Tokenizer with a Multi-target Deep Neural Network (MDNN) and treats unsupervised speech feature extraction and unsupervised spoken term discovery as a joint problem: an Iterative Learning Framework feeds the output of each task into the training of the other, so that progress on either problem benefits both. Finally, we applied the complete framework to the Zero Resource Speech Challenge at Interspeech 2015, using its corpora and evaluation metrics, and obtained better results than the baseline JHU system on both the unsupervised speech feature extraction and the unsupervised spoken term discovery tracks.
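To make the data flow of the iterative framework concrete, the following runnable sketch substitutes k-means clustering for the MAT's unsupervised tokenization and a small multi-target network with a shared bottleneck layer for the MDNN; every component, size, and training step here is an illustrative assumption, not the thesis system:

```python
# Illustrative sketch of one MAT-DNN-style loop: pseudo-labels supervise a
# multi-target network, whose bottleneck features feed the next round.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

feats = torch.randn(500, 39)                    # stand-in for MFCC frames
granularities = (8, 32)                         # two token-set granularities

for iteration in range(2):
    # 1) "MAT" stage: one pseudo-label set per granularity
    label_sets = [torch.as_tensor(
                      KMeans(n_clusters=k, n_init=3, random_state=0)
                      .fit_predict(feats.numpy()), dtype=torch.long)
                  for k in granularities]
    # 2) "MDNN" stage: shared trunk ending in a bottleneck, one softmax
    #    head per token set, all targets trained jointly
    trunk = nn.Sequential(nn.Linear(feats.size(1), 64), nn.ReLU(),
                          nn.Linear(64, 16))    # 16-dim bottleneck
    heads = nn.ModuleList([nn.Linear(16, k) for k in granularities])
    opt = torch.optim.Adam([*trunk.parameters(), *heads.parameters()], lr=1e-2)
    for step in range(200):
        z = trunk(feats)                        # bottleneck activations
        loss = sum(nn.functional.cross_entropy(head(z), y)
                   for head, y in zip(heads, label_sets))
        opt.zero_grad(); loss.backward(); opt.step()
    # 3) Bottleneck features become the next iteration's input features
    feats = trunk(feats).detach()
```

In each round, the token labels from the discovery side supervise the network, and the network's bottleneck features become the representation on which the next round of tokenization runs; this is the mutual benefit between the two tasks described above.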

Table of Contents

Abstract (Chinese)
Chapter 1: Introduction
  1.1 Motivation
  1.2 Research Directions and Main Results
  1.3 Chapter Outline
Chapter 2: Background
  2.1 Zero Annotation Scenario
    2.1.1 Overview
    2.1.2 Unsupervised Speech Feature Extraction
    2.1.3 Unsupervised Spoken Term Discovery
  2.2 Multi-level Acoustic Tokenizer
    2.2.1 Introduction
    2.2.2 Gaussian Mixture Model / Hidden Markov Model (GMM-HMM)
    2.2.3 Model Granularity Space
Chapter 3: Spoken Term Discovery Aided by Word Embeddings
  3.1 Word Embedding
    3.1.1 Overview
    3.1.2 Recurrent Neural Network Language Model
    3.1.3 Skip-gram Model
  3.2 Extracting Word Embeddings with a Recurrent Neural Network Language Model
  3.3 Mandarin Syllable Simulation Experiments
    3.3.1 Word Embedding Simulations with Error-free Tokens
    3.3.2 Word Embedding Simulations with Recognition Errors
  3.4 Word Embeddings Extracted from Multi-level Acoustic Tokens
  3.5 Experimental Results and Analysis
  3.6 Chapter Summary
Chapter 4: The Zero Resource Speech Challenge
  4.1 Introduction
  4.2 Challenge Rules and Corpora
  4.3 Track 1: Unsupervised Subword Modeling
    4.3.1 Track Goal and Overview
    4.3.2 Evaluation Metrics
  4.4 Track 2: Spoken Term Discovery
    4.4.1 Track Goal and Overview
    4.4.2 Evaluation Metrics
  4.5 Chapter Summary
Chapter 5: Mutual Reinforcement for Acoustic Tokens
  5.1 Overview
  5.2 Token Boundary Fusion
  5.3 Re-initialization of Token Labels Based on Latent Dirichlet Allocation
  5.4 Experimental Results and Analysis
    5.4.1 Results and Analysis on the English Corpus
    5.4.2 Results and Analysis on the Xitsonga Corpus
  5.5 Chapter Summary
Chapter 6: Multi-level Acoustic Tokenizing Deep Neural Network
  6.1 System Overview
  6.2 Multi-target Deep Neural Network (Multi-target DNN)
  6.3 Iterative Learning Framework
  6.4 Experimental Results and Analysis on the Zero Resource Speech Challenge
    6.4.1 Track 1
    6.4.2 Track 2
  6.5 Chapter Summary
Chapter 7: Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work
References

