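Graduate student: Bo-han Chen (陳柏含)
Thesis title: Implementation of Embedded Mandarin Speech Recognition System in Travel Domain
Thesis title (Chinese): 基於旅遊對話運用嵌入式中文語音辨識系統之實作
Advisor: Chia-Ping Chen (陳嘉平)
Degree: Master's
Institution: National Sun Yat-sen University (國立中山大學)
Department: Department of Computer Science and Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2009
Graduation academic year: 97
Language: English
Pages: 65
Keywords: Weighted Finite State Transducer; Hidden Markov Model; Automatic Speech Recognition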
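Abstract (translated from the Chinese):
In this thesis, we develop a two-pass Mandarin automatic speech recognizer on a mobile device. The first pass recognizes Mandarin syllables, using discrete hidden Markov models as the base model and a time-synchronous, tree-lexicon Viterbi search. In the second pass, we use weighted finite-state transducers to represent the language model, the pronunciation model, and the N-best syllable hypotheses; the best word sequence is then obtained through composition and shortest-path operations on these transducers. The system targets the travel domain and splits the application of the acoustic model and the language model into independent passes. The experiments report recognition results obtained on an actual device, the ASUS P565 (hardware: 800MHz CPU, 128MB RAM; operating system: Windows Mobile 6.1). We use 26 hours of TCC-300 microphone speech as the training set for 151 acoustic models. To test syllable and character accuracy on the PC and PDA platforms, we use 3 minutes of self-recorded travel-domain speech as the test set. The second-pass language model is a word bigram model trained with 3,500 words from the BTEC corpus.
In the first pass, the best syllable accuracy is 38.8% (with the top 30 hypotheses). This result is obtained with continuous hidden Markov models. With the same syllable results, the second pass reaches a character accuracy of 27.6%.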
We build a two-pass Mandarin automatic speech recognition (ASR) decoder on a mobile device (PDA). The first pass, which recognizes base syllables, is implemented with discrete hidden Markov models (HMMs) and a time-synchronous, tree-lexicon Viterbi search. The second pass, which combines the language model, the pronunciation lexicon, and the N-best syllable hypotheses from the first pass, is implemented with weighted finite-state transducers (WFSTs). The best word sequence is obtained by running a shortest-path algorithm over the composition result. The system is restricted to the travel domain and decouples the application of the acoustic model from that of the language model, placing them in independent recognition passes. We report real-time recognition performance measured on an ASUS P565 with an 800MHz processor and 128MB RAM, running the Microsoft Windows Mobile 6 operating system.
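The second pass described above can be illustrated with a small sketch. It does not use a WFST library and is not the thesis's implementation: it emulates "compose, then take the shortest path" with dynamic programming in the tropical semiring, where costs add along a path and the minimum is kept across paths. The function name `best_word_sequence`, the `unseen_cost` penalty, and all syllables, words, and costs are hypothetical.

```python
import math

def best_word_sequence(nbest, lexicon, bigram_cost, unseen_cost=10.0):
    """Return (cost, words) for the cheapest word sequence covering all slots.

    nbest: list of dicts, one per syllable slot, mapping syllable -> acoustic
           cost (negative log probability, lower is better).
    lexicon: dict mapping word -> non-empty tuple of syllables.
    bigram_cost: dict mapping (previous_word, word) -> language model cost.
    unseen_cost: assumed flat penalty for word pairs missing from bigram_cost.
    """
    n = len(nbest)
    # State (syllables consumed, last word) -> (best cost so far, word list).
    best = {(0, "<s>"): (0.0, [])}
    for pos in range(n):
        # Snapshot the states: extensions always end after pos, so they
        # are picked up in a later iteration.
        for (p, prev), (cost, words) in list(best.items()):
            if p != pos:
                continue
            for word, sylls in lexicon.items():
                end = pos + len(sylls)
                if end > n:
                    continue
                # A word matches only if every syllable is a hypothesis
                # in the corresponding slot.
                if any(s not in nbest[pos + i] for i, s in enumerate(sylls)):
                    continue
                acoustic = sum(nbest[pos + i][s] for i, s in enumerate(sylls))
                lm = bigram_cost.get((prev, word), unseen_cost)
                total = cost + acoustic + lm  # costs add along a path
                key = (end, word)
                if key not in best or total < best[key][0]:  # keep the minimum
                    best[key] = (total, words + [word])
    finals = [v for (p, _), v in best.items() if p == n]
    return min(finals, key=lambda v: v[0]) if finals else (math.inf, [])

# Toy input: four syllable slots with competing hypotheses.
nbest = [{"tai": 0.2, "dai": 0.9}, {"bei": 0.3, "pei": 0.8},
         {"che": 0.4, "ce": 0.6}, {"zhan": 0.1}]
lexicon = {"Taipei": ("tai", "bei"), "station": ("che", "zhan"),
           "generation": ("dai",), "match": ("pei",), "stand": ("zhan",)}
bigram_cost = {("<s>", "Taipei"): 0.5, ("Taipei", "station"): 0.4}

cost, words = best_word_sequence(nbest, lexicon, bigram_cost)
print(words)  # ['Taipei', 'station']
```

Keeping the last word in the search state is what lets a bigram cost be applied at each word boundary; a real WFST decoder gets the same effect from the states of the language model transducer.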
The 26-hour TCC-300 speech corpus is used to train 151 acoustic models. A 3-minute set of speech, recorded by reading travel-domain transcriptions, serves as the test set for evaluating syllable and character accuracy and real-time factors on the PC and on the PDA. A word bigram model trained with 3,500 words from the BTEC corpus is used in the second pass.
In the first pass, the best syllable accuracy is 38.8%, obtained with 30-best syllable hypotheses, continuous HMMs, and 26-dimensional features. With these syllable hypotheses and this acoustic model, we obtain 27.6% character accuracy on the PC after the second pass.
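The abstract does not define how syllable and character accuracy are computed. A minimal sketch, assuming the common definition Accuracy = (N - S - D - I) / N, where N is the reference length and S, D, I are the substitutions, deletions, and insertions of a minimum edit-distance alignment, could look like this. The function names and example sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn the sequence ref into the sequence hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]

def accuracy(ref, hyp):
    """(N - S - D - I) / N; can be negative when there are many insertions."""
    if not ref:
        raise ValueError("reference must not be empty")
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)

# One substituted syllable out of four.
reference = "tai bei che zhan".split()
hypothesis = "tai pei che zhan".split()
print(accuracy(reference, hypothesis))  # 0.75
```

The same functions apply to characters by passing lists of characters instead of syllables.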
List of Tables
List of Figures
Acknowledgements
Chapter 1 Introduction
1.1 Motivation
1.2 Background
1.3 System Architecture
Chapter 2 Speech Recognition System
2.1 Feature Extraction
2.2 Decoding Problem
2.2.1 Acoustic Model
2.2.2 Output Probability
2.3 Language Model
2.4 Mandarin Pronunciation Model
2.5 Search Approach
2.5.1 Linearly-Structured Lexicon
2.5.2 Prefix Tree-Structured Lexicon
Chapter 3 Weighted Finite State Machines
3.1 Semi-ring
3.2 Definitions
3.3 Related Algorithms
3.3.1 Union
3.3.2 Kleene Closure
3.3.3 Composition Algorithm
3.3.3.1 Composition Filter
3.4 Transducers in ASR Decoder
3.4.1 N-best Transducer
3.4.2 Lexicon Transducer
3.4.3 Language Model Transducer
Chapter 4 Experiment
4.1 Evaluation Platforms
4.2 Performance Evaluation
4.3 Description of Speech Corpus
4.4 Description of Feature
4.5 Evaluation of Syllable Accuracy (1-best)
4.5.1 Mixture Number Reduction and Feature Dimension Reduction
4.5.2 Comparisons of Tree Lexicon and Linear Lexicon
4.6 Evaluation of Syllable Accuracy (Oracle)
4.7 Evaluation of Character Accuracy
4.7.1 BTEC Corpus
4.7.2 Application of XCIN
Chapter 5 Conclusion and Future Works
5.1 Conclusion
5.2 Future Works