National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Author: 丁文淵
Author (English): Wen-Yuan Ting
Title (Chinese): 具有預測理解度之預訓練模型和零點控制波束成形技術的雙通道語音增強系統
Title (English): A Two-channel Speech Enhancement System with a Pre-trained Intelligibility Prediction Model and Null-steering Beamforming
Advisor: 蘇柏青
Advisor (English): Borching Su
Committee Members: 曹昱, 劉俊麟, 彭盛裕
Committee Members (English): Yu Tsao, Chun-Lin Liu, Sheng-Yu Peng
Defense Date: 2023-07-24
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Communication Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Document Type: Academic thesis
Year of Publication: 2023
Academic Year of Graduation: 111
Language: Chinese
Number of Pages: 61
Keywords (Chinese): 波束成形, 零點控制, 短時客觀理解度, STOI-Net
Keywords (English): beamforming, null-steering, STOI, STOI-Net
DOI: 10.6342/NTU202302638
Abstract:
Beamforming technology is commonly used in many multi-channel speech enhancement systems to suppress directional interfering signals that degrade speech intelligibility. Traditional beamformers are usually optimized based on accurate estimates of parameters such as the direction of arrival (DOA), power spectral density (PSD), relative transfer function (RTF), and covariance matrices. However, accurately estimating these parameters can be a challenging task. In this thesis, a novel beamforming framework is proposed to enhance the intelligibility of noisy speech signals, based on a pre-trained short-time objective intelligibility (STOI) prediction model, STOI-Net. This framework is referred to as intelligibility-aware null-steering beamforming (IANS). The noisy speech signal is first fed into a set of null-steering beamformers to generate a sequence of candidate signals. These candidates are then passed to STOI-Net, which determines the signal with the highest predicted intelligibility. Experimental results show that the proposed method, using a two-channel microphone array, can generate intelligibility-enhanced speech signals in multiple scenarios, with STOI scores similar to those produced by beamforming methods that are given the DOAs of the target and interfering signals.
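The two-stage pipeline the abstract describes (a bank of null-steering beamformers, then a pre-trained intelligibility predictor that selects the best candidate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the frequency-domain delay-and-subtract beamformer, the microphone spacing, the 10° angle grid, and the `intelligibility_score` callback (a stand-in for STOI-Net, whose model is not reproduced here) are all simplifying assumptions.

```python
import numpy as np

def null_steering_beamformer(x, fs, null_angle_deg, mic_spacing=0.05, c=343.0):
    """Delay-and-subtract beamformer for a 2-mic array (sketch).

    Places a spatial null toward `null_angle_deg` (angle measured from
    the array axis, so 90 degrees is broadside) by delaying channel 2 to
    align a plane wave from that direction, then subtracting it.
    x: array of shape (2, T), two time-aligned microphone channels.
    """
    n = x.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    X = np.fft.rfft(x, axis=1)
    # Inter-microphone delay for a plane wave arriving from the null direction.
    tau = mic_spacing * np.cos(np.deg2rad(null_angle_deg)) / c
    # A source at the null direction cancels; other directions are kept
    # (only partially, since a 2-mic differential beam is not ideal).
    Y = X[0] - X[1] * np.exp(-1j * 2 * np.pi * freqs * tau)
    return np.fft.irfft(Y, n=n)

def ians(x, fs, intelligibility_score, angles=range(0, 181, 10)):
    """IANS-style search (sketch).

    Stage 1 (NSBF): generate one candidate per null angle.
    Stage 2 (scoring): keep the candidate the predictor rates highest;
    `intelligibility_score(signal, fs)` stands in for STOI-Net.
    """
    candidates = [null_steering_beamformer(x, fs, a) for a in angles]
    scores = [intelligibility_score(y, fs) for y in candidates]
    best = int(np.argmax(scores))
    return candidates[best], list(angles)[best]
```

In this sketch the null sweep plays the role of DOA knowledge: rather than estimating the interferer's direction explicitly, the search places a null at each grid angle and lets the non-intrusive intelligibility predictor decide which null placement helped most.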
Thesis Committee Certification i
Acknowledgements iii
Abstract (Chinese) v
Abstract vii
Table of Contents ix
List of Figures xiii
List of Tables xv
Chapter 1  Introduction 1
Chapter 2  Related Work 7
2.1 Signal Model 7
2.2 Filter-and-Sum Beamforming 9
2.3 Conventional MVDR/MPDR Beamforming 11
2.4 Limitations of Conventional MVDR/MPDR Techniques 13
2.4.1 Conventional DOA Estimation Algorithms 14
2.4.2 Conventional Covariance Matrix Estimation 15
2.4.3 Estimation of Rxx[n, k] 15
2.4.4 Estimation of Rii[n, k] 16
2.4.5 Estimation of Rss[n, k] 16
2.5 Null-Steering Beamforming 16
2.5.1 Definition 16
2.5.2 Example 18
2.5.3 Practical Limitations 23
2.5.3.1 Effects of Finite-Length Signals 23
2.5.3.2 Effects of Non-Free-Field Environments 23
2.6 STOI-Net 23
Chapter 3  IANS Optimization 25
3.1 The Optimization Problem 25
3.2 The Optimization Algorithm 27
3.2.1 Stage 1: The NSBF Stage 27
3.2.2 Stage 2: The STOI-Net Stage 28
3.3 Sidelobe Signal Enhancement 29
Chapter 4  Experimental Setup and Analysis of Results 31
4.1 Experimental Settings 31
4.1.1 Scenario Settings 31
4.1.2 Signal Settings 32
4.1.3 IANS Parameter Settings 33
4.1.4 Baselines for Comparison 33
4.2 Experimental Results (I) 34
4.2.1 θs = 45° (free field) 34
4.2.2 θs = 45° (RT60 = 150 ms) 37
4.2.3 θs = 90° (free field) 38
4.2.4 θs = 90° (RT60 = 150 ms) 40
4.3 Experimental Results (II) 42
4.4 Experimental Results (III) 45
4.5 Experimental Results (IV) 48
Chapter 5  Conclusion 53
References 55