National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

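Graduate student: 黃肇元
Graduate student (English): Huang, Chao-Yuan
Thesis title: 改善語音品質之強化學習語音增強演算法
Thesis title (English): Reinforcement Learning Based Speech Enhancement for Improving Speech Quality
Advisor: 冀泰石
Advisor (English): Chi, Tai-Shih
Oral defense committee: 曹昱; 王逸如
Oral defense committee (English): Tsao, Yu; Wang, Yih-Ru
Oral defense date: 2018-07-20
Degree: Master's
Institution: National Chiao Tung University
Department: Department of Electrical Engineering
Discipline: Engineering
Field: Electrical and Information Engineering
Thesis type: Academic thesis
Publication year: 2018
Graduation academic year: 106 (ROC calendar)
Language: English
Number of pages: 51
Keywords (Chinese): 強化學習; 語音增強; 語音品質
Keywords (English): Reinforcement Learning; Speech Enhancement; Speech Quality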
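Removing environmental noise has always been a very important topic in speech signal processing. Koizumi's team applied deep reinforcement learning to enhance speech with respect to a speech quality metric; the method makes effective use of limited training data, and its results surpass those of the mapping method. We therefore propose three optimizations to further improve its results. First, we propose two other algorithms for defining the actions and additionally examine and compare two settings for the number of actions. Next, we extend the one-level deep reinforcement learning speech enhancement algorithm with the higher speech quality score to a two-level algorithm that enhances the high- and low-frequency regions separately. Finally, we apply the Monte Carlo method to the proposed methods. The experiments measure changes in speech intelligibility and speech quality under different noise types and levels to evaluate the three optimizations. The results show that the proposed action-definition algorithms do outperform the original definition on the speech quality metric, whereas the number of actions has little effect. In addition, the two-level algorithm yields a further improvement in the speech quality metric, and combining deep reinforcement learning speech enhancement with the Monte Carlo method indeed helps improve speech quality. In future work, this reinforcement learning approach could be applied to regression-based speech enhancement methods in the hope of better results.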
Removing environmental noise through speech enhancement has long been an important topic in speech signal processing. Koizumi's research group proposed deep-neural-network-based reinforcement learning (DNN-RL) to enhance speech with respect to a speech quality measure. Their method reportedly uses limited training data efficiently and outperforms the DNN-mapping method. We therefore propose three optimization techniques to further improve its performance. First, we propose two alternative procedures for defining the actions and compare results for different numbers of templates. Second, we extend the one-level DNN-RL that yields the best speech quality to a two-level DNN-RL that enhances the high-frequency and low-frequency regions separately. Third, we combine the Monte Carlo method with the proposed DNN-RLs to ensure the stability of the algorithm. To evaluate these three techniques, experiments measure the changes in speech intelligibility and speech quality under different noise conditions. The experimental results show that the proposed procedures for defining actions achieve higher speech quality scores than the original procedure, while the number of actions has little influence on speech quality. The two-level DNN-RL also produces better speech quality than the one-level DNN-RL, and combining DNN-RL with the Monte Carlo method further benefits speech quality. As future work, applying the optimized DNN-RL method to regression-based speech enhancement is expected to produce better results.
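The abstract describes actions that correspond to time-frequency (T-F) mask templates and a Q-learning-style agent that chooses among them. The following is only a minimal sketch of that general idea, not the thesis implementation: the template values, the reward, the discount factor, and the function names `select_template`, `apply_template`, and `q_target` are all hypothetical and chosen purely for illustration.

```python
import numpy as np

def select_template(q_values, epsilon, rng):
    """Epsilon-greedy choice among K mask templates (the actions)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: random template
    return int(np.argmax(q_values))              # exploit: best-valued template

def apply_template(noisy_mag, templates, action):
    """Multiply one noisy magnitude frame by the chosen T-F mask template."""
    return noisy_mag * templates[action]

def q_target(reward, next_q_values, gamma=0.9):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * float(np.max(next_q_values))

# Toy data (assumed): 2 templates over 3 frequency bins.
rng = np.random.default_rng(0)
templates = np.array([[1.0, 0.5, 0.1],
                      [0.2, 0.8, 1.0]])
noisy_frame = np.array([2.0, 4.0, 10.0])
q_values = np.array([0.3, 0.7])  # stand-in for a DNN's output for this frame

action = select_template(q_values, epsilon=0.0, rng=rng)
enhanced = apply_template(noisy_frame, templates, action)
# The reward here is an arbitrary placeholder for a quality-based score.
target = q_target(reward=1.0, next_q_values=np.array([0.5, 0.2]), gamma=0.9)
```

With `epsilon=0.0` the choice is purely greedy, so the second template is selected. In a real system the Q-values would come from a trained network and the reward from a speech quality measure, neither of which is specified in this record.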
Abstract (in Chinese) i
Abstract ii
Acknowledgements iii
Contents iv
List of Tables vi
List of Figures vii
Chapter 1 Introduction 1
1.1 Study Background 1
1.2 Related Work 3
1.3 Motivation 4
1.4 Thesis Organization 4
Chapter 2 Fundamentals of Research Method 5
2.1 Signal Model 5
2.2 Usual Ways of Audio Signal Processing 6
2.2.1 Short-time Fourier Transform 6
2.2.2 Mel-Scale Filters Analysis 10
2.2.3 Ideal Ratio Mask (IRM) 12
2.3 Deep Q-Network of Reinforcement Learning 14
2.3.1 Reinforcement Learning 14
2.3.2 Q-Learning 15
2.3.3 Deep Q-Network 17
Chapter 3 System Architecture 19
3.1 Overview 19
3.2 Framework of Reinforcement Learning in Speech Enhancement 20
3.3 Refinement of T-F Mask Templates 23
3.3.1 Templates Created by Different Methods 23
3.3.2 Templates Created with Different Numbers of Clusters 25
3.4 Two-level DNN-RL 26
3.5 DNN-RL with the Monte Carlo Method 28
Chapter 4 Experiment Design 29
4.1 Experiment Setting and DNN Structure 29
4.2 Perceptual Evaluation of Speech Quality (PESQ) Measure 30
4.3 Short-Time Objective Intelligibility (STOI) Measure 32
4.4 Comparison between Systems 33
Chapter 5 Experiment Result and Discussion 35
5.1 Refinement of T-F Mask Templates 35
5.2 Two-level DNN-RL 37
5.3 DNN-RL with the Monte Carlo Method 38
5.4 Comparison between Systems 42
5.5 Subjective Evaluation 43
Chapter 6 Conclusion and Future Work 44
References 46
Appendix A 50
Appendix B 51
[1] J. Y. Li, L. Deng, Y. F. Gong, and R. Haeb-Umbach, "An Overview of Noise-Robust Automatic Speech Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745-777, Apr 2014.
[2] J. Ming, T. J. Hazen, J. R. Glass, and D. A. Reynolds, "Robust speaker recognition in noisy conditions," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1711-1723, Jul 2007.
[3] L. P. Yang and Q. J. Fu, "Spectral subtraction-based speech enhancement for cochlear implant patients in background noise," Journal of the Acoustical Society of America, vol. 117, no. 3, pp. 1001-1004, Mar 2005.
[4] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP'79., 1979, vol. 4, pp. 208-211: IEEE.
[5] J. S. Lim and A. V. Oppenheim, "All-Pole Modeling of Degraded Speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 3, pp. 197-210, 1978.
[6] Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1121, 1984.
[7] K. W. Wilson, B. Raj, P. Smaragdis, and A. Divakaran, "Speech denoising using nonnegative matrix factorization with priors," 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, Vols 1-12, pp. 4029-+, 2008.
[8] J. Tchorz and B. Kollmeier, "SNR estimation based on amplitude modulation analysis with applications to noise suppression," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 3, pp. 184-192, May 2003.
[9] X. G. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech Enhancement Based on Deep Denoising Autoencoder," 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), Vols 1-5, pp. 436-440, 2013.
[10] Y. Xu, J. Du, L. R. Dai, and C. H. Lee, "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, Jan 2015.
[11] S. W. Fu, Y. Tsao, and X. G. Lu, "SNR-Aware Convolutional Neural Network Modeling for Speech Enhancement," 17th Annual Conference of the International Speech Communication Association (Interspeech 2016), Vols 1-5, pp. 3768-3772, 2016.
[12] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent Neural Networks for Noise Reduction in Robust ASR," 13th Annual Conference of the International Speech Communication Association 2012 (Interspeech 2012), Vols 1-3, pp. 22-25, 2012.
[13] F. Weninger et al., "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in International Conference on Latent Variable Analysis and Signal Separation, 2015, pp. 91-99: Springer.
[14] P. G. Shivakumar and P. G. Georgiou, "Perception Optimized Deep Denoising AutoEncoders for Speech Enhancement," in INTERSPEECH, 2016, pp. 3743-3747.
[15] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 9, pp. 1570-1584, 2018.
[16] Y. Zhao, B. Xu, R. Giri, and T. Zhang, "Perceptually guided speech enhancement using deep neural networks," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5074-5078, 2018.
[17] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Monaural speech enhancement using deep neural networks by maximizing a short-time objective intelligibility measure," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5059-5063, 2018.
[18] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136, 2011.
[19] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229-238, Jan 2008.
[20] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, "DNN-Based Source Enhancement Self-Optimized by Reinforcement Learning Using Sound Quality Measurements," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 81-85, 2017.
[21] P. C. Loizou, Speech enhancement: theory and practice. CRC press, 2007.
[22] G. Heinzel, A. Rüdiger, and R. Schilling, "Spectrum and spectral density estimation by the Discrete Fourier transform (DFT), including a comprehensive list of window functions and some new flat-top windows," 2002.
[23] S. S. Stevens, J. Volkmann, and E. B. Newman, "A scale for the measurement of the psychological magnitude pitch," The Journal of the Acoustical Society of America, vol. 8, no. 3, pp. 185-190, 1937.
[24] E. Zwicker, "Subdivision of the audible frequency range into critical bands (Frequenzgruppen)," The Journal of the Acoustical Society of America, vol. 33, no. 2, pp. 248-248, 1961.
[25] D. Baby, T. Virtanen, J. F. Gemmeke, and H. van Hamme, "Coupled Dictionaries for Exemplar-Based Speech Enhancement and Automatic Speech Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 1788-1799, Nov 2015.
[26] F. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, "Discriminatively Trained Recurrent Neural Networks for Single-Channel Speech Separation," 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 577-581, 2014.
[27] D. L. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," Speech Separation by Humans and Machines, pp. 181-197, 2005.
[28] A. Narayanan and D. L. Wang, "Ideal Ratio Mask Estimation Using Deep Neural Networks for Robust Speech Recognition," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7092-7096, 2013.
[29] D. S. Williamson, Y. X. Wang, and D. L. Wang, "Complex Ratio Masking for Joint Enhancement of Magnitude and Phase," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, pp. 5220-5224, 2016.
[30] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-+, Jan 28 2016.
[31] V. Mnih et al., "Playing atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[32] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279-292, May 1992.
[33] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 1998.
[34] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, Feb 26 2015.
[35] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, 2011, pp. 24-29: IEEE.
[36] Y. Xu, J. Du, L. R. Dai, and C. H. Lee, "An Experimental Study on Speech Enhancement Based on Deep Neural Networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65-68, Jan 2014.
[37] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon technical report n, vol. 93, 1993.
[38] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on, 2015, pp. 504-511: IEEE.
[39] ITU-T Recommendation P.800, "Methods for subjective determination of transmission quality," International Telecommunication Union, Geneva, 1996.
[40] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - A new method for speech quality assessment of telephone networks and codecs," 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vols I-VI, Proceedings, pp. 749-752, 2001.
[41] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, 2010, pp. 4214-4217: IEEE.
[42] A. Irpan, "Deep Reinforcement Learning Doesn't Work Yet," 2018.