National Digital Library of Theses and Dissertations in Taiwan

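Graduate student: 林哲光 (Che-Kuang Lin)
Thesis title: 中文自發性語音辨識中偵測修正性不流暢現象之新方法
Thesis title (English): New Approaches for Detecting Edit Disfluencies in Transcribing Spontaneous Mandarin Speech
Advisor: 李琳山 (Lin-Shan Lee)
Degree: Doctoral
Institution: National Taiwan University
Department: Graduate Institute of Communication Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2009
Academic year of graduation: 97
Language: English
Number of pages: 74
Keywords (Chinese): 修正性不流暢; 不流暢的中斷點; 抑揚頓挫; 語音辨識; 自發性語音
Keywords (English): edit disfluency; interruption point detection; prosody; speech recognition; spontaneous speech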
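An ideal speech recognition system must be able to handle naturally occurring human speech, that is, spontaneous speech. Compared with clearly read or prepared speech, spontaneous speech has characteristics that make it harder for a system to process. One important characteristic is the frequent and pervasive occurrence of edit disfluencies. To interpret what the speaker intends to convey correctly and without distortion, the system must be able to detect such edit disfluencies and handle them properly.
In this dissertation, we propose a framework for handling edit disfluencies in spontaneous speech. It locates disfluency interruption points (IPs) in the speech and compares the words spoken before and after them in order to recover the structure of the utterance and delete the edit words (including the reparandum and optional editing terms), that is, words that are redundant or that the speaker misspoke and wants to correct, which makes the meaning easier to understand. Within this framework, we propose an effective set of features and models for detecting the IPs of edit disfluencies, and we use the detection results to improve the accuracy and intelligibility of the recognition output. The features are carefully designed to take into account the linguistic characteristics specific to Mandarin speech. The models for detecting IPs are improved from two well-known machine learning methods: decision trees (DTs) and maximum entropy models (MEs). By combining the strengths of both, we obtain a model better suited to IP detection: the decision-tree-based maximum entropy model (DT-ME). In addition, we propose a method for analyzing the prosodic structure of speech: statistical latent prosodic modeling (LPM). By analyzing the prosody of a speaker's normal, fluent speech and comparing it with the prosody when the speech flow is interrupted, we can further refine the DT-ME model into a more accurate detection model. Furthermore, by using conditional random fields (CRFs), we analyze the relationships between the words before and after an IP, and identify and delete the edit words, so as to analyze the utterance structure and capture the meaning correctly.
Experiments on spoken Mandarin conversational speech show that the proposed framework effectively detects and handles edit disfluencies in spoken Mandarin and significantly reduces the detection error rate. A better grasp of utterance structure also yields better recognition results (higher recognition accuracy). In addition, we examine the prosody analyzed by the proposed latent prosodic modeling. We also analyze how the features that are effective for detecting different types of edit disfluencies differ, in order to better understand how these disfluencies differ in character.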
Detection of edit disfluencies is one of the keys to transcribing spontaneous utterances. In this dissertation, we present improved features and models to detect edit disfluencies and enhance transcription of spontaneous Mandarin speech using hypothesized disfluency interruption points (IPs) and edit word detection. A comprehensive set of prosodic features that takes into account the special characteristics of edit disfluencies in Mandarin is developed, and an improved model combining decision trees and maximum entropy is proposed to detect IPs. This model is further adapted to desired prosodic conditions by latent prosodic modeling, a probabilistic framework for analyzing speech prosody in terms of a set of latent prosodic states. These techniques contribute to higher recognition accuracy (by rescoring with the hypothesized IPs) and better edit word detection (using conditional random fields defined on Chinese characters) in the final transcription, as verified by experiments on a spontaneous Mandarin speech corpus. The output latent states of the proposed latent prosodic modeling are analyzed in detail. The relevance of the proposed prosodic features to each type of edit disfluency is also analyzed to give insight into the characteristics of the various disfluency categories.
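The edit word removal described in the abstract can be illustrated with a small sketch. It assumes that each token has already been labeled as fluent, reparandum, or editing term. The label names, the `Token` class, and the English toy utterance are hypothetical and chosen for illustration only. The dissertation itself works on Mandarin and detects edit words with conditional random fields defined on Chinese characters, which this sketch does not implement.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label set for illustration; not the tagging scheme of the dissertation.
FLUENT = "O"
REPARANDUM = "REP"
EDITING_TERM = "ET"


@dataclass
class Token:
    text: str
    label: str  # one of FLUENT, REPARANDUM, EDITING_TERM


def interruption_points(tokens: List[Token]) -> List[int]:
    """Return each index i such that an IP lies right after tokens[i].

    Here an IP is taken to be the end of a run of reparandum tokens.
    """
    ips = []
    for i, tok in enumerate(tokens):
        next_label = tokens[i + 1].label if i + 1 < len(tokens) else FLUENT
        if tok.label == REPARANDUM and next_label != REPARANDUM:
            ips.append(i)
    return ips


def remove_edit_words(tokens: List[Token]) -> List[str]:
    """Drop reparandum and editing-term tokens, keeping the fluent words."""
    return [tok.text for tok in tokens if tok.label == FLUENT]


# Invented toy utterance: "to Boston" is the reparandum, "uh I mean" the editing term.
utterance = [
    Token("I", FLUENT), Token("want", FLUENT), Token("a", FLUENT),
    Token("flight", FLUENT), Token("to", REPARANDUM), Token("Boston", REPARANDUM),
    Token("uh", EDITING_TERM), Token("I", EDITING_TERM), Token("mean", EDITING_TERM),
    Token("to", FLUENT), Token("Denver", FLUENT),
]

print(interruption_points(utterance))  # [5]
print(" ".join(remove_edit_words(utterance)))  # I want a flight to Denver
```

In this toy case the single IP falls after "Boston", and deleting the edit words leaves the cleaned transcript that a downstream component would read.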
ABSTRACT
CHINESE ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
1.1 BACKGROUND
1.2 PRIMARY ACHIEVEMENTS OF THIS DISSERTATION
1.3 CHAPTER OUTLINE
2 BACKGROUND REVIEW AND EXPERIMENTAL ENVIRONMENTS
2.1 INTRODUCTION
2.2 REVIEW OF EXISTING APPROACHES FOR HANDLING DISFLUENCY IN SPEECH
2.2.1 Spontaneous speech processing
2.2.2 Decision Trees (DTs) for Classification
2.2.3 Maximum Entropy Models (MEs) for Classification
2.3 EXPERIMENTAL ENVIRONMENTS
2.3.1 Speech Corpora
2.3.2 Baseline System
2.4 SUMMARY
3 OVERVIEW OF THE PROPOSED FRAMEWORK FOR EDIT DISFLUENCY DETECTION
4 INTERRUPTION POINT DETECTION
4.1 INTRODUCTION
4.2 PROSODIC FEATURE EXTRACTION
4.2.1 Pitch-related feature
4.2.2 Duration-related features
4.2.3 Energy-related features
4.3 INITIAL INTERRUPTION POINT DETECTION MODELS
4.3.1 Integration of DT and ME (DT-ME)
4.4 LATENT PROSODIC MODELING (LPM)
4.5 USING AN LPM-ADAPTED MODEL FOR INTERRUPTION POINT DETECTION
4.6 EXPERIMENTAL RESULTS
4.6.1 Analysis of LPM Latent Prosodic States
4.6.2 Prosodic Feature Set Comparison in IP Detection
4.6.3 Initial IP Detection Model Comparison
4.6.4 Refined LPM-adapted DT-ME models for IP Detection
4.6.5 Comparison Between Lexical and Prosodic Information
4.6.6 Feature Analysis
4.7 SUMMARY
5 ENHANCED TRANSCRIPTION WITH EDIT WORD DETECTION
5.1 INTRODUCTION
5.2 SECOND-PASS RECOGNITION USING HYPOTHESIZED IPS
5.3 EDIT WORD DETECTION
5.4 EXPERIMENTAL RESULTS
5.4.1 Speech Recognition with IP Detection
5.4.2 Edit Word Detection
5.5 SUMMARY
6 CONCLUSIONS AND FUTURE WORKS
6.1 CONCLUSIONS
6.2 FUTURE WORKS
APPENDIX A. LIST OF THE ACOUSTIC MODELS FOR INITIALS/FINALS
BIBLIOGRAPHY