National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Detailed Record

Author: 林冠良
Author (English): Lin, Guan-Liang
Title: 一個基於MFCC的語者識別系統
Title (English): An MFCC-based Speaker Identification System
Advisor: 呂芳懌
Advisor (English): Leu, Fang-Yie
Committee members: 陳金鈴, 楊伏夷, 余心淳, 劉榮春
Oral defense date: 2017-01-05
Degree: Master's
Institution: Tunghai University (東海大學)
Department: Computer Science and Information Engineering (資訊工程學系)
Discipline: Engineering
Field: Electrical and Computer Engineering
Document type: Academic thesis
Year of publication: 2017
Graduation academic year: 105 (2016–2017)
Language: English
Pages: 53
Keywords (Chinese): 語者辨識; 傅立葉轉換; 梅爾頻率倒譜係數; 高斯混合模型; 聲學模型
Keywords (English): speaker identification; Fourier transformation; Mel-frequency cepstral coefficients; Gaussian mixture model; acoustic model
Metrics:
  • Cited by: 1
  • Views: 556
  • Rating:
  • Downloads: 32
  • Bookmarked: 1
Abstract (Chinese, translated): Speech recognition already has many practical everyday applications, such as Siri, the iPhone's voice assistant, Google's speech-input recognition system, and voice-operated mobile phones; speaker identification, by contrast, is still relatively immature. This study therefore focuses on speaker identification. First, the raw voice signal of a person, e.g., Xiao-Ming, is converted from the time domain to the frequency domain by the Fourier transform. Next, a model of human auditory perception filters the spectrum and extracts the energy in each frequency band, turning the speech into feature data. The probability density function of a Gaussian mixture model is then used to describe the distribution of these features, which becomes Xiao-Ming's acoustic model. When the system receives speech data from an unknown person, it processes the data in the same way and compares it for similarity against the acoustic models of the people collected in the database (including Xiao-Ming) to identify who the unknown person is likely to be.
Abstract (English): Nowadays, speech recognition has many practical applications in everyday use; typical examples are Apple's Siri on the iPhone, the Google speech recognition system, and mobile phones operated by voice. Speaker identification, by contrast, is still relatively immature at its current stage. Therefore, in this thesis we study a speaker identification technique. It first takes the original voice signals of a person, e.g., Bob, and converts them from the time domain to the frequency domain with the Fourier transformation. An MFCC-based human auditory filtering model is then applied to weight the energy levels of the different frequencies as the quantified characteristics of Bob's voice, and the energies are normalized to a logarithmic scale to form the features of the voice signals. Next, the probability density function of a Gaussian mixture model is fitted to the distribution of these logarithmic features, yielding Bob's specific acoustic model. When the system receives the voice of an unknown person, e.g., x, it processes the voice with the same procedure and compares the result, x's acoustic model, with the acoustic models of known people, collected beforehand in an acoustic-model database, to identify the most likely speaker.
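The pipeline the abstract describes maps onto a few short functions. Below is a minimal sketch in Python, assuming NumPy, SciPy, and scikit-learn (the thesis lists Python and SciPy among its tools; scikit-learn, every function name, and all parameter values here are this sketch's assumptions, not the author's implementation). One deliberate substitution: the thesis compares acoustic models with the Bhattacharyya distance (Section 4.3), while this sketch scores candidates by average log-likelihood, a common alternative.

```python
import numpy as np
from scipy.fft import dct
from sklearn.mixture import GaussianMixture

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, n_fft=512, frame_len=0.025, frame_step=0.010,
         n_filters=26, n_ceps=13):
    """Frame -> Hamming window -> FFT power spectrum -> mel filterbank
    -> log energies -> DCT: the MFCC steps named in the abstract."""
    flen, fstep = int(frame_len * sr), int(frame_step * sr)
    n_frames = 1 + (len(signal) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(flen)
    # Time domain -> frequency domain (Fourier transform), power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filters approximating human auditory resolution.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    # Log filterbank energies, then DCT to decorrelate into cepstra.
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]

def train_speaker_model(features, n_components=16):
    # GMM fitted by EM with k-means initialisation (scikit-learn's
    # default), mirroring the k-means + EM steps of Sections 4.2.1-4.2.2.
    return GaussianMixture(n_components, covariance_type='diag').fit(features)

def identify(features, models):
    # Score the unknown utterance against every enrolled acoustic model;
    # average log-likelihood stands in for the thesis's Bhattacharyya
    # distance comparison (an assumption of this sketch).
    return max(models, key=lambda name: models[name].score(features))
```

Enrollment would run mfcc() on each known speaker's recordings and store the GMM returned by train_speaker_model() under that speaker's name; identification runs the same mfcc() on the unknown utterance and passes the resulting features, together with the stored models, to identify().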
1. Introduction…………………………………………………………………… 1
2. Related Work…………………………………………………………………… 3
2.1 Voice Recognition System…………………………………………… 3
2.2 The Environmental Noise…………………………………………… 4
2.3 Operational Efficiency……………………………………………… 5
3. Background of This Study………………………………………………………… 7
3.1 Feature Extraction………………………………………………… 7
3.2 Building Speaker Model…………………………………………… 9
3.2.1 Gaussian Mixture Model………………………………… 9
3.2.2 Training Phase……………………………………………… 11
3.3 The Speaker Identification Method………………………………… 11
4. The System Architecture……………………………………………………… 13
4.1 MFCC Process……………………………………………………… 13
4.2 Establishment of Gaussian Mixture Model………………………… 18
4.2.1 K-means Clustering………………………………………… 18
4.2.2 EM Algorithm……………………………………………… 20
4.3 Bhattacharyya Distance……………………………………………… 21
5. System Implementation and Evaluation……………………………………… 23
5.1 Experiment 1………………………………………………………… 25
5.2 Experiment 2………………………………………………………… 28
5.3 Experiment 3………………………………………………………… 30
5.4 Experiment 4………………………………………………………… 33
5.5 Experiment 5………………………………………………………… 35
6. Conclusion and Future Studies………………………………………………… 40
References………………………………………………………………………… 42