
National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 羅玉雯 (Lo, Yu-Wen)
Title (Chinese): 基於兩階段之聽覺感知模型之類神經網路應用於語者識別
Title (English): Two-stage attentional auditory model inspired neural network and its application to speaker identification
Advisor: 冀泰石 (Chi, Tai-Shih)
Degree: Master's
Institution: National Chiao Tung University
Department: Institute of Communications Engineering
Discipline: Engineering
Academic Field: Electrical and Information Engineering
Type of Thesis: Academic thesis
Year of Publication: 2017
Academic Year of Graduation: 106 (2017-2018)
Language: Chinese
Number of Pages: 50
Keywords (Chinese): 聽覺感知模型、語者識別、類神經網路
Keywords (English): auditory model, speaker identification, neural network
Usage statistics:
  • Cited: 1
  • Views: 260
  • Downloads: 0
  • Bookmarked: 0
Abstract (Chinese, translated):
In this thesis, drawing on neurobiological studies, we note that once a sound signal enters the ear it is decomposed by frequency, producing an auditory spectrogram; studies of auditory attention and physiological hearing experiments have further revealed the patterns of neural activity in the auditory cortex. Combining these findings with today's widely used neural network learning, we devise a distinctive neural network model and apply it to speaker identification, aiming to solve an engineering problem with knowledge from auditory neurophysiology. The proposed model uses two convolutional neural network (CNN) layers of different dimensionalities to simulate the early cochlear stage and the cortical stage, respectively. By designing the initial values of the convolution kernels, namely a bank of one-dimensional frequency-analysis filters for the cochlear stage and two-dimensional filters that jointly resolve spectro-temporal information for the cortical stage, the model reaches convergence quickly. Through training, the model automatically adjusts its parameters according to the task and the environmental conditions, mapping the input data to the target form. We also compare several variants of the proposed architecture and find that, given these initial kernels, the model converges to a better state even when training is insufficient.
Abstract (English):
As revealed by psychophysical and neurophysiological studies, the cochlea analyzes incoming sound in the time and logarithmic-frequency domains. The resulting neural activities then pass through the auditory pathway to the primary auditory cortex (A1) for further analysis. From a functional point of view, the cochlea produces a 2-D auditory spectrogram and A1 analyzes that spectrogram. In this thesis, we propose a neural network (NN) to simulate an attentional auditory model and apply it to speaker identification. The proposed NN consists of 1-D and 2-D convolutional neural networks, which mimic the functions of the cochlea and the cortex, respectively. By deriving the initial kernels of the convolutional layers from the neurophysiological auditory model, we demonstrate that the proposed NN quickly reaches convergence with high performance. In addition, even without training, the proposed system with auditory-model-based kernels outperforms a randomly initialized NN in speaker identification.
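The record contains no code, but as a rough sketch of the architecture both abstracts describe, the following example shows how such a two-stage, auditory-model-inspired CNN could be assembled. It assumes PyTorch; the gammatone and Gabor-like filters here merely stand in for the auditory-model-derived kernels of the thesis, and every function name, filter size, and hyperparameter is an illustrative assumption rather than the author's actual configuration.

```python
# Minimal sketch (PyTorch assumed) of a two-stage auditory-model-inspired CNN for
# speaker identification. Filter designs (gammatone / Gabor-like), kernel shapes,
# and hyperparameters are illustrative assumptions, not the thesis's exact settings.
import math
import torch
import torch.nn as nn


def gammatone_kernels(n_filters=32, kernel_len=401, sr=16000, fmin=100.0, fmax=6000.0):
    """1-D cochlea-like filterbank: gammatone impulse responses on a log-frequency axis."""
    t = torch.arange(kernel_len) / sr
    cfs = torch.logspace(math.log10(fmin), math.log10(fmax), n_filters)
    kernels = []
    for cf in cfs:
        erb = 24.7 * (4.37 * cf / 1000.0 + 1.0)        # equivalent rectangular bandwidth
        b = 1.019 * 2.0 * math.pi * erb
        g = t**3 * torch.exp(-b * t) * torch.cos(2.0 * math.pi * cf * t)
        kernels.append(g / g.abs().max())              # normalize each filter
    return torch.stack(kernels).unsqueeze(1)           # (n_filters, 1, kernel_len)


def gabor_kernels(n_filters=16, size=15):
    """2-D cortex-like kernels: Gabor patches tuned to spectro-temporal modulations."""
    ax = torch.linspace(-1.0, 1.0, size)
    yy, xx = torch.meshgrid(ax, ax, indexing="ij")
    kernels = []
    for i in range(n_filters):
        theta = math.pi * i / n_filters                # orientation sweeps the rate/scale plane
        freq = 2.0 + 2.0 * (i % 4)
        rot = xx * math.cos(theta) + yy * math.sin(theta)
        g = torch.exp(-(xx**2 + yy**2) / 0.5) * torch.cos(2.0 * math.pi * freq * rot)
        kernels.append(g)
    return torch.stack(kernels).unsqueeze(1)           # (n_filters, 1, size, size)


class TwoStageAuditoryNet(nn.Module):
    def __init__(self, n_speakers, n_channels=32, n_cortical=16):
        super().__init__()
        # Stage 1: 1-D convolution over the raw waveform, mimicking cochlear filtering.
        self.cochlea = nn.Conv1d(1, n_channels, kernel_size=401, stride=160,
                                 padding=200, bias=False)
        self.cochlea.weight.data.copy_(gammatone_kernels(n_channels, 401))
        # Stage 2: 2-D convolution over the resulting "auditory spectrogram",
        # mimicking cortical spectro-temporal analysis.
        self.cortex = nn.Conv2d(1, n_cortical, kernel_size=15, padding=7, bias=False)
        self.cortex.weight.data.copy_(gabor_kernels(n_cortical, 15))
        self.pool = nn.AdaptiveAvgPool2d((8, 8))
        self.classifier = nn.Linear(n_cortical * 8 * 8, n_speakers)

    def forward(self, waveform):                        # waveform: (batch, samples)
        x = waveform.unsqueeze(1)                       # (batch, 1, samples)
        spec = torch.log1p(torch.abs(self.cochlea(x)))  # (batch, channels, frames)
        spec = spec.unsqueeze(1)                        # treat as a 1-channel image
        feat = torch.relu(self.cortex(spec))
        feat = self.pool(feat).flatten(1)
        return self.classifier(feat)                    # speaker logits


if __name__ == "__main__":
    model = TwoStageAuditoryNet(n_speakers=20)
    logits = model(torch.randn(4, 16000))               # four 1-second clips at 16 kHz
    print(logits.shape)                                  # torch.Size([4, 20])
```

The design point, as the abstracts report, is that initializing both convolutional stages from a meaningful auditory representation rather than from random weights lets the network converge quickly and perform reasonably even before training is complete.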
Table of Contents
Abstract (Chinese)   iii
Abstract (English)   iv
Table of Contents   v
List of Figures   vii
List of Tables   ix
Chapter 1  Introduction   1
1.1  Research Background   1
1.2  Research Direction and Goals   4
1.3  Thesis Organization   5
Chapter 2  Perceptual Signal Processing   6
2.1  Physiological Auditory Phenomena and Properties   6
2.2  Auditory Perception Model   9
2.2.1  Early Cochlear Stage   10
2.2.2  Cortical Stage   12
Chapter 3  Neural Network Architecture and Parameter Settings   15
3.1  Overview of Convolutional Neural Networks   15
3.2  Model Architecture   17
3.3  Speech Corpus   18
3.4  Speech Processing Background and Parameter Settings   19
3.5  Kernel Initialization   21
3.5.1  1-D Kernels   21
3.5.2  2-D Kernels   22
Chapter 4  Experiments and Discussion   24
4.1  Comparison Systems   24
4.2  Experimental Results   26
4.2.1  Results for a Single SNR and a Single Noise Type   26
4.2.2  Results for Multiple SNRs and Multiple Noise Types   36
Chapter 5  Conclusions and Future Work   46
References   47