National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 林璟旭
Author (English): LIN, CHING-HSU
Title: 探討自監督式學習模型之續預訓練對自動語音辨識的影響:以低資源之語言為例
Title (English): Exploring the Impact of Continual Pre-training of Self-supervised Learning Models on Automatic Speech Recognition: A Case Study on Low-resource Languages
Advisors: 蔡偉和, 李鴻欣
Advisors (English): TSAI, WEI-HO; LEE, HUNG-SHIN
Committee Members: 王新民, 李鴻欣, 蔡偉和
Committee Members (English): WANG, HSIN-MIN; LEE, HUNG-SHIN; TSAI, WEI-HO
Defense Date: 2023-11-24
Degree: Master's
Institution: National Taipei University of Technology (國立臺北科技大學)
Department: Electronic Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2023
Graduation Academic Year: 112
Language: Chinese
Pages: 73
Keywords (Chinese): 自動語音辨識; 語言識別; 低資源語言; 自監督式學習
Keywords (English): automatic speech recognition; language identification; low-resource language; self-supervised learning
Usage statistics:
  • Cited by: 0
  • Views: 92
  • Downloads: 24
  • Bookmarked: 0
Recently, pre-trained neural models have shown great potential for various downstream tasks in low-resource languages. These models are generated via self-supervised learning (SSL) on large-scale speech data. In this study, we take two endangered Austronesian languages, Amis and Seediq, as targets and explore the impact of data volume on the continual pre-training of SSL models for automatic speech recognition (ASR). We propose a data-selection scheme, wherein the utterances phonetically and phonologically closest to the target language are selected from a large multilingual corpus, as a contribution to the SSL-ASR pipeline. To achieve this, we use a language recognizer to extract an embedding from each utterance. We then train and employ three one-class classifiers for the target language, and the utterances in the multilingual corpus are ranked and selected according to their decision scores. Our results show that this scheme is feasible for adapting SSL models to a new language with a small amount of data, leading to promising ASR results for that language.
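The selection pipeline described in the abstract lends itself to a compact sketch. What follows is a minimal, hypothetical Python illustration, not the thesis's actual code: it assumes language embeddings have already been extracted with a language recognizer (e.g., an AmberNet-style model), uses scikit-learn's OneClassSVM and IsolationForest as two of the thesis's three one-class classifiers (Deep SVDD is omitted for brevity), and ranks the multilingual pool by averaged, normalized decision scores. The .npy file names, the score-fusion rule, and the top-N cutoff are placeholders, not values taken from the thesis.

# A minimal sketch (not the thesis code) of the data-selection scheme:
# rank utterances from a large multilingual pool by how close their
# language embeddings are to a small target-language set, using
# one-class classifiers trained on the target language only.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

# Rows are utterances; columns are language-embedding dimensions.
# File names are placeholders; embeddings are assumed precomputed.
target = np.load("amis_embeddings.npy")        # small target-language set
pool = np.load("multilingual_embeddings.npy")  # large multilingual corpus

# Fit one-class classifiers on the target language only.
# (Deep SVDD, the thesis's third classifier, is omitted here.)
ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(target)
iforest = IsolationForest(random_state=0).fit(target)

def zscore(s):
    # Normalize each classifier's scores so neither dominates the average.
    return (s - s.mean()) / (s.std() + 1e-8)

# Higher decision score = more similar to the target language.
# Averaging z-scored scores is one plausible fusion, not necessarily
# the thesis's filtering algorithm (Section 4.5.4).
scores = (zscore(ocsvm.decision_function(pool))
          + zscore(iforest.decision_function(pool))) / 2

# Keep the N highest-scoring pool utterances for continual pre-training.
N = 10_000  # illustrative cutoff
selected = np.argsort(scores)[::-1][:N]
print(f"selected {len(selected)} utterances for continual pre-training")

The selected subset would then be used to continue pre-training an SSL model (e.g., wav2vec 2.0 or HuBERT) before fine-tuning it for ASR on the target language.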
Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
Chapter 3 Corpus Description
3.1 Corpus Sources
3.2 Corpus Selection
3.3 Corpus Characteristics
3.3.1 Amis
3.3.2 Seediq
3.4 Corpus Preprocessing
Chapter 4 Methodology
4.1 Language-Identification-Based Approach
4.2 Language Recognizer
4.3 Language Embeddings
4.4 AmberNet Model Architecture
4.4.1 Encoder
4.4.2 Prologue Block
4.4.3 Mega Blocks
4.4.4 1D Depthwise Channel-Separable Convolutions
4.4.5 Decoder
4.4.6 Statistics Pooling Layer
4.5 Anomaly Detection Methods
4.5.1 One-Class Support Vector Machine
4.5.2 Isolation Forest
4.5.3 Deep Support Vector Data Description
4.5.4 Filtering Algorithm
Chapter 5 Experimental Results and Discussion
5.1 Experimental Procedure
5.2 Tools
5.2.1 Fairseq
5.2.2 S3PRL
5.2.3 ESPnet
5.3 Models Recommended by ML-SUPERB
5.3.1 wav2vec2-base
5.3.2 wav2vec2-large
5.3.3 robust wav2vec2-large
5.3.4 wav2vec2-base-23
5.3.5 wav2vec2-large-23
5.3.6 XLSR-53
5.3.7 XLSR-128
5.3.8 HuBERT-base
5.3.9 HuBERT-large
5.3.10 HuBERT-base-cmn
5.3.11 HuBERT-large-cmn
5.3.12 mHuBERT-base
5.4 One-Class Classification
5.5 Continual Pre-training
5.6 Fine-tuning
5.7 Results
Chapter 6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
