National Digital Library of Theses and Dissertations in Taiwan

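Student: 黃朝欽
Student (English): Chaug-Ching Huang
Title: 在低噪訊比環境下之音訊切割與拜氏類神經網路為依據之中文語音辨認
Title (English): Audio Segmentation under Low SNR Noisy Environment and Bayesian Neural Network Based Mandarin Speech Recognition
Advisor: 王駿發
Advisor (English): Jhing-Fa Wang
Degree: Doctoral
Institution: National Cheng Kung University
Department: Department of Electrical Engineering (Master's and Doctoral Program)
Discipline: Engineering
Field: Electrical and Information Engineering
Thesis type: Academic thesis
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: English
Pages: 100
Keywords (translated from Chinese): likelihood ratio crossing rate; audio activity detection algorithm; speaker change detection algorithm; speech/music segmentation; Bayesian neural network
Keywords (English): Speech/Music Segmentation; Audio Activity Detection Algorithm; Likelihood Ratio Crossing Rate; Speaker Change Detection; Bayesian Neural Network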
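Abstract (translated from Chinese):
As digital technology becomes cheaper and more open, multimedia data is being used ever more widely. To handle this large volume of information effectively, there is an urgent need for efficient methods that make audiovisual information easier for people to exchange and use. The main goal of this dissertation is to develop a Mandarin speech recognition system that works in real-world environments, so the system must account for both environmental noise and other sounds. The dissertation is therefore divided into two parts: audio signal processing and Mandarin speech recognition.
In the audio signal processing part, a new speech/music segmentation algorithm divides the input audio into speech segments and music segments while also taking environmental noise into account. A statistical-model-based audio activity detection (AAD) algorithm first splits the input audio into pure-noise segments and noisy audio segments. For the noisy audio segments, a new robust feature, the likelihood ratio crossing rate (LRCR), is used to distinguish speech from music. Then, for each speech segment, a speaker change detection algorithm based on a modified Bayesian Information Criterion (BIC) locates the boundaries between different speakers within the segment. Using these boundaries, each speaker's speech can be recognized in the speech recognition part. Experiments show that the proposed methods are effective.
In the Mandarin speech recognition part, this dissertation presents a series of studies that apply Bayesian neural networks with different training methods to acoustic matching for Mandarin speech recognition. The proposed networks both speed up recognition and raise the recognition rate. To validate them, a monosyllable Mandarin speech recognition system, a multisyllable Mandarin word recognition system, and a spoken Mandarin dictation system were built and tested. The experimental results show that the proposed neural networks are indeed effective.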
Abstract (English):
As digital technology becomes inexpensive and popular, the availability of multimedia data has increased tremendously. To make this huge amount of data accessible, there is a pressing need for efficient methods that ease the dissemination of audiovisual information to humans. The goal of this dissertation is to develop a Mandarin speech recognition system for real-world applications. Therefore, environmental noise and other audio signals must be considered simultaneously. The dissertation is divided into two parts: audio signal processing and Mandarin speech recognition.
In the audio signal processing part, a novel speech/music segmentation algorithm is proposed to segment the audio signal into speech segments and music segments. The effects of a noisy environment are also considered in this algorithm. A statistical model-based audio activity detection (AAD) algorithm first segments the noisy audio signal into noise-only segments and noisy audio segments. A new feature, the likelihood ratio crossing rate (LRCR), is proposed and extracted from the noisy audio segments to make the speech/music classifier more robust. For each speech segment, a modified Bayesian Information Criterion (BIC) based speaker change detection algorithm is then used to find the speaker segment boundaries within the speech segment. Using these boundaries, each speaker segment can be recognized in the speech recognition part. The experimental results demonstrate the effectiveness of the proposed algorithms.
In the Mandarin speech recognition part, a series of studies on Bayesian neural networks with different training algorithms for acoustic matching in Mandarin speech recognition is presented. These networks not only increase the recognition speed but also improve the recognition rate. In addition, three systems covering syllable recognition, word recognition, and dictation are constructed to test and verify the performance of the proposed neural networks. Experimental results show that all of them are effective for Mandarin speech recognition.
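The BIC-based speaker change detection mentioned in the abstract can be illustrated with a minimal sketch. The record does not describe the dissertation's modified BIC, so the code below implements only the standard ΔBIC test with full-covariance Gaussians, which is an assumption made here for illustration. The function names (`delta_bic`, `find_change_point`), the `margin` and `lam` parameters, and the small ridge added to the covariance are all hypothetical choices, not taken from the thesis.

```python
import numpy as np


def _logdet_cov(x):
    """Log-determinant of the maximum-likelihood covariance of the rows of x."""
    d = x.shape[1]
    cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
    cov = cov + 1e-6 * np.eye(d)  # small ridge (assumption) for numerical stability
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def delta_bic(x, i, lam=1.0):
    """Standard Delta-BIC for splitting the window x (n frames x d features) at frame i.

    A positive value favours two Gaussians (a change at i) over a single Gaussian.
    """
    n, d = x.shape
    # Penalty for the extra mean vector and covariance matrix of the two-model hypothesis.
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * _logdet_cov(x)
            - 0.5 * i * _logdet_cov(x[:i])
            - 0.5 * (n - i) * _logdet_cov(x[i:])
            - lam * penalty)


def find_change_point(x, margin=20, lam=1.0):
    """Return (index, score) of the best change point, or (None, score) if none is accepted.

    `margin` keeps both sides long enough to estimate a covariance matrix.
    """
    n = x.shape[0]
    candidates = list(range(margin, n - margin))
    scores = [delta_bic(x, i, lam) for i in candidates]
    best = int(np.argmax(scores))
    if scores[best] > 0:
        return candidates[best], float(scores[best])
    return None, float(scores[best])
```

In practice `x` would hold per-frame acoustic features (for example cepstral coefficients) for one speech segment, and the search would be repeated over sliding or growing windows to find several boundaries. The penalty weight `lam` trades missed changes against false alarms.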
ABSTRACT (Chinese)
ABSTRACT (English)
ACKNOWLEDGEMENT
CONTENTS
FIGURE CAPTIONS
TABLE CAPTIONS

CHAPTER 1   INTRODUCTION
 1.1 Motivations
 1.2 Objective of Dissertation
 1.3 Outline of the Dissertation
CHAPTER 2   SPEECH/MUSIC SEGMENTATION UNDER LOW SNR NOISY ENVIRONMENT
 2.1 Introduction
 2.2 Statistical Model-based AAD Algorithm
  2.2.1 Likelihood Ratio Calculation and Threshold Estimation Procedure
  2.2.2 Merging Process
 2.3 LRCR/KNN based Speech/Music Classification
  2.3.1 Extraction of the LRCR Feature
  2.3.2 LRCR/KNN based Speech/Music Classifier
 2.4 Experimental Results
 2.5 Conclusions
CHAPTER 3   SPEAKER CHANGE DETECTION UNDER LOW SNR NOISY ENVIRONMENT
 3.1 Introduction
 3.2 The Bayesian Information Criterion Algorithm
 3.3 Modified BIC based Speaker Change Detection
 3.4 Experimental Results
 3.5 Conclusions
CHAPTER 4   MANDARIN SYLLABLE RECOGNIZER BASED ON BAYESIAN NEURAL NETWORK
 4.1 Introduction
 4.2 Bayesian Neural Network
  4.2.1 Bayesian Neural Network Architecture and Operation
  4.2.2 Training the Bayesian Neural Network
  4.2.3 Probability Estimation of the Bayesian Neural Network
 4.3 Incremental Learning LVQ (ILVQ) Algorithm
 4.4 The Experimental Results of Syllable Recognition
 4.5 Conclusions
CHAPTER 5   MANDARIN WORD RECOGNITION SYSTEM BASED ON TWO-PASS BAYESIAN NEURAL NETWORK
 5.1 Introduction
 5.2 Two-pass Bayesian Neural Network
 5.3 Implementing and Searching of Lexicon
 5.4 Experimental Results
 5.5 Conclusions
CHAPTER 6   MANDARIN SPEECH DICTATION SYSTEM BASED ON DYNAMIC PROGRAMMING BAYESIAN NEURAL NETWORK
 6.1 Introduction
 6.2 Syllable Recognizer Based on DPBNN
  6.2.1 DPBNN Architecture
  6.2.2 The Training Algorithm of DPBNN
 6.3 Language Processing Model
  6.3.1 The Statistical Language Model
  6.3.2 The Hierarchical Syntactical Analysis
 6.4 Experimental Results
 6.5 Conclusions
CHAPTER 7   CONCLUSIONS AND FUTURE WORK
 7.1 Summary
 7.2 Summary of Contributions
 7.3 Future Work
REFERENCES
PUBLICATIONS
BIOGRAPHY