Author: 邱聖權 (Sheng-chiuan Chiou)
Title (Chinese): 強健性自動語音辨識之基於聽覺模型的梅爾倒頻譜參數擷取調整
Title (English): Auditory Based Modification of MFCC Feature Extraction for Robust Automatic Speech Recognition
Advisor: 陳嘉平 (Chia-ping Chen)
Degree: Master's
Institution: National Sun Yat-sen University
Department: Department of Computer Science and Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Document type: Academic thesis
Year of publication: 2009
Graduation academic year: 97 (2008–2009)
Language: English
Pages: 80
Keywords (Chinese, translated): noise robustness; auditory model; forward masking; automatic speech recognition
Keywords (English): forward masking; auditory model; synaptic adaptation; temporal integration; noise robustness; ASR; automatic speech recognition
The human auditory perception system is far more noise-robust than any state-of-the-art automatic speech recognition (ASR) system. It is expected that the noise robustness of speech feature vectors can be improved by incorporating more human auditory functions into the feature extraction procedure.
Forward masking is a phenomenon of human auditory perception in which a weaker sound is masked by a preceding, stronger sound (the masker). In this work, two human auditory mechanisms, synaptic adaptation and temporal integration, are implemented as filter functions and incorporated into MFCC feature extraction to model forward masking. A filter optimization algorithm is proposed to optimize the filter parameters.
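The approach described above, per-channel filters applied to the time trajectories of the log mel energies inside MFCC extraction, can be sketched roughly as follows. The filter forms and the constants `alpha` and `beta` are illustrative first-order stand-ins chosen for this sketch; they are not the actual filter shapes or parameters derived in the thesis:

```python
import numpy as np

def synaptic_adaptation(logmel, alpha=0.95):
    """Highpass-like adaptation along time, per mel channel.

    Subtracting a running average of the past emphasizes onsets and
    attenuates energy that lingers after a strong masker. A rough
    stand-in for an adaptation filter; alpha is a hypothetical value.
    """
    smoothed = np.zeros_like(logmel)
    for t in range(1, logmel.shape[0]):
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * logmel[t - 1]
    return logmel - smoothed

def temporal_integration(logmel, beta=0.8):
    """Lowpass smoothing (leaky integrator) along time, per mel channel."""
    out = np.zeros_like(logmel)
    out[0] = logmel[0]
    for t in range(1, logmel.shape[0]):
        out[t] = beta * out[t - 1] + (1 - beta) * logmel[t]
    return out

# Toy log-mel-energy matrix: 100 frames x 23 mel channels.
rng = np.random.default_rng(0)
logmel = rng.standard_normal((100, 23))
features = temporal_integration(synaptic_adaptation(logmel))
print(features.shape)  # (100, 23)
```

Both filters operate independently on each channel's time trajectory, which is why simple one-pole recursions suffice for a sketch; in the thesis the filter shapes are motivated by auditory measurements and their parameters are subsequently optimized.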
The performance of the proposed method is evaluated on the Aurora 3 corpus, and the training/testing procedure follows the standard setting provided by the Aurora 3 task. The synaptic adaptation filter achieves a relative improvement of 16.6% over the baseline. The temporal integration and modified temporal integration filters achieve relative improvements of 21.6% and 22.5%, respectively. Combining synaptic adaptation with each of the temporal integration filters yields further improvements of 26.3% and 25.5%. Applying the filter optimization to the synaptic adaptation filter and the two temporal integration filters raises their improvements to 18.4%, 25.2%, and 22.6%, respectively; the combined-filter models also improve, reaching relative improvements of 26.9% and 26.3%.
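For reference, a "relative improvement" over a baseline recognizer is presumably the relative reduction in error rate, (baseline − new) / baseline. A minimal sketch with made-up error-rate numbers (not values from the thesis):

```python
def relative_improvement(baseline_err, new_err):
    """Relative error-rate reduction over a baseline system."""
    return (baseline_err - new_err) / baseline_err

# Hypothetical: baseline at 30% error, new front end at 25% error.
print(round(relative_improvement(0.30, 0.25), 4))  # 0.1667, i.e. ~16.7%
```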
List of Tables iii
List of Figures iv
Acknowledgments vi
Chapter 1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Chapter 2 Related Works 4
2.1 MFCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Common Methods of Noise Robustness . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Wiener Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Spectral Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Temporal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Cepstral Mean Subtraction . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 RelAtive SpecTrAl (RASTA) . . . . . . . . . . . . . . . . . . . . . 11
2.3.3 TempoRAl Patterns (TRAPs) . . . . . . . . . . . . . . . . . . . . . . 11
2.3.4 MVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.5 Temporal Structure Normalization . . . . . . . . . . . . . . . . . . . 13
2.3.6 Data Driven Temporal Filter . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Human Auditory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 Two-Tone Suppression . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 Zero-Crossings with Peak Amplitudes (ZCPA) . . . . . . . . . . . . 17
2.4.3 A Dynamic Forward Masking Model . . . . . . . . . . . . . . . . . 18
Chapter 3 The Proposed Algorithm 20
3.1 Forward Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 Synaptic Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Temporal Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2.1 A Modification of Temporal Integration Filter . . . . . . . 30
3.1.3 Synaptic Adaptation with Temporal Integration . . . . . . . . . . . . 33
3.1.3.1 Synaptic Adaptation with Modified Temporal Integration . 34
3.2 Filter Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 Synaptic Adaptation Filter . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.3 Temporal Integration Filter . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.3.1 Modified Temporal Integration Filter . . . . . . . . . . . . 43
3.2.4 Synaptic Adaptation with Temporal Integration . . . . . . . . . . . . 45
3.2.4.1 Synaptic adaptation with Modified Temporal Integration . . 47
3.2.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 4 Experimental Results 51
4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.2 Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Significance Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 5 Conclusion and Future Works 61
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Bibliography 63
Appendix A The Training Data for Filter Optimization 67