
臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)


詳目顯示 (Thesis Record)

Author: 簡義
Author (English): I Chien
Title: 採用知識蒸餾與模型壓縮之低功耗可變關鍵字的喚醒詞辨識系統
Title (English): Small-footprint Open-vocabulary Keyword Spotting Using Knowledge Distillation and Model Quantization
Advisor: 張智星
Advisor (English): Jyh-Shing Jang
Oral examination committee: 王新民、廖元甫
Oral examination committee (English): Hsin-Min Wang, Yuan-Fu Liao
Date of oral defense: 2021-06-19
Degree: Master's
Institution: 國立臺灣大學 (National Taiwan University)
Department: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)
Discipline: Computing
Subfield: Networking
Document type: Academic thesis
Year of publication: 2021
Academic year of graduation: 109 (ROC calendar)
Language: Chinese
Number of pages: 64
Keywords (Chinese): 喚醒詞辨識、連結時序分類、知識蒸餾、模型量化、Mobvoi Hotwords
Keywords (English): keyword spotting, connectionist temporal classification, knowledge distillation, model quantization, Mobvoi Hotwords
DOI: 10.6342/NTU202101258
Usage statistics:
  • Cited: 0
  • Views: 160
  • Downloads: 26
Abstract (translated from the Chinese): With the proliferation of smart devices, voice wake-up technology has become increasingly important. Voice wake-up is realized mainly through keyword spotting, whose goal is to determine whether a specific keyword occurs in a stream of continuous speech. Thanks to the rapid development of deep neural networks, DNN-based keyword spotting has achieved large gains in accuracy. However, traditional DNN-based keyword spotting systems require a large amount of speech containing the target keyword as training data; they can therefore recognize only a fixed keyword, and the keyword is difficult to replace once training is finished. Changing the keyword requires collecting a new corpus for the target keyword and retraining the model. This thesis focuses on implementing an open-vocabulary keyword spotting system: an acoustic model is trained with connectionist temporal classification (CTC), a confidence score is computed from the model's output, and the system decides whether to wake up based on that score. For convenience of use, the keyword spotting system must be deployed on an edge device; to achieve this, the thesis also applies knowledge distillation and model quantization, which substantially speed up inference without degrading accuracy. In experiments on Mobvoi Hotwords, compared with the baseline, the proposed method improves inference speed by a relative 40% while reducing the false rejection rate at one false alarm per hour by a relative 15.54%.
Abstract (English): With the widespread adoption of smart devices, wake-up word detection is becoming more and more important. Wake-up word detection is based on keyword spotting (KWS), whose goal is to identify whether a specific keyword occurs in continuous speech. Traditional deep-learning-based KWS approaches require a large amount of keyword audio to train a keyword-specific network, so the keyword is hard to change once training is done: changing the keyword means collecting a new keyword corpus and retraining the network. In this thesis, we focus on the implementation of an open-vocabulary keyword spotting system. We use connectionist temporal classification (CTC) to train the acoustic model, and decide whether to wake up the system based on the confidence score of the target keyword computed from the CTC output. For convenience of use, the keyword spotting system needs to be deployed on an edge device. To achieve this, we use knowledge distillation and model quantization to reduce system latency without performance degradation. Experiments on the Mobvoi Hotwords database show that, compared with the baseline model, the relative reduction in latency reaches 40% and the relative reduction in false rejection rate reaches 15.54% at one false alarm per hour.
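The wake-up decision described in the abstract hinges on a keyword confidence score computed from the output of the CTC-trained acoustic model. The thesis's own scoring rule is given in its Section 3.4 and is not reproduced here; the following is only a minimal sketch of one common way to score a keyword from frame-level CTC posteriors, assuming NumPy, a hypothetical `keyword_confidence` helper, and a greedy monotonic search combined with a geometric mean.

```python
# Minimal sketch, NOT the thesis's exact scoring rule (see its Section 3.4).
# Assumes the CTC acoustic model has already produced frame-level posteriors.
import numpy as np

def keyword_confidence(posteriors: np.ndarray, keyword_units: list) -> float:
    """posteriors: (T, V) softmax outputs per frame; keyword_units: acoustic-unit
    indices of the keyword, in order. Returns a confidence in [0, 1]."""
    t, T, best = 0, posteriors.shape[0], []
    for unit in keyword_units:
        if t >= T:                      # ran out of frames before matching all units
            return 0.0
        scores = posteriors[t:, unit]   # only frames after the previous unit's frame
        idx = int(np.argmax(scores))    # greedy: best remaining frame for this unit
        best.append(float(scores[idx]))
        t = t + idx + 1                 # enforce monotonic (left-to-right) order
    # Geometric mean keeps scores comparable across keywords of different lengths.
    return float(np.exp(np.mean(np.log(np.maximum(best, 1e-10)))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    post = rng.dirichlet(np.ones(100), size=200)   # fake (T=200, V=100) posteriors
    conf = keyword_confidence(post, keyword_units=[17, 42, 63])
    # Wake up when the confidence exceeds a threshold tuned so that false alarms
    # stay at roughly one per hour on keyword-free audio.
    print("confidence:", conf, "-> wake" if conf > 0.5 else "-> ignore")
```

In a deployed system such a score would be computed over a sliding window of the audio stream, and the decision threshold sets the trade-off between the false rejection rate and false alarms per hour reported in the abstract.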
誌謝 (Acknowledgements) ii
摘要 (Abstract in Chinese) iii
Abstract iv
1 Introduction 1
1.1 Research Motivation 1
1.2 Contributions 2
1.3 Thesis Organization 2
2 Literature Review 4
2.1 Fixed-Keyword Wake-Word Detection 4
2.2 Open-Vocabulary Wake-Word Detection 6
2.2.1 Query-by-Example Approaches 6
2.2.2 LVCSR-Based Approaches 7
2.2.3 Acoustic-Model-Based Approaches 9
3 Methodology 12
3.1 Connectionist Temporal Classification 12
3.1.1 Overview of CTC 13
3.1.2 CTC Decoding 14
3.2 Knowledge Distillation 16
3.2.1 Overview of Knowledge Distillation 16
3.2.2 Knowledge Distillation for CTC 19
3.3 Model Quantization 21
3.3.1 Quantization Methods 22
3.3.2 PyTorch Implementation 23 (see the quantization sketch after this table of contents)
3.4 Keyword Search Method 26
4 Corpora 28
4.1 Speech Recognition Corpora 28
4.1.1 Aidatatang_200zh 29
4.1.2 Aishell1 29
4.1.3 MagicData 29
4.1.4 Primewords 29
4.1.5 STCMDS 30
4.1.6 THCHS30 30
4.2 Wake-Word Corpora 30
4.2.1 Mobvoi Hotwords 30
4.2.2 Foxconn Wake-Word Dataset 32
5 Experimental Design and Results 33
5.1 Experimental Procedure 33
5.1.1 Acoustic Feature Extraction 33
5.1.2 Data Augmentation 34
5.1.3 Training Label Generation 35
5.1.4 Neural Network Architecture 35
5.1.5 Training Procedure and Hyperparameters 37
5.1.6 Edge Device 39
5.2 Evaluation Metrics 39
5.2.1 False Rejection Rate 39
5.2.2 False Alarms per Hour 40
5.2.3 Real-Time Factor 40
5.3 Results and Discussion 40
5.3.1 Experiment 1: Effects of Different Acoustic Features and Acoustic Units 40
5.3.2 Experiment 2: Comparison of Open-Vocabulary Keyword Search Methods 43
5.3.3 Experiment 3: Effect of Knowledge Distillation for CTC 44
5.3.4 Experiment 4: Effect of Model Quantization 49
5.3.5 Experiment 5: Results on the Foxconn Wake-Word Dataset 53
5.3.6 Error Analysis 54
6 Conclusion and Future Work 57
6.1 Conclusion 57
6.2 Future Work 58
Bibliography 59
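Section 3.3.2 in the table of contents above refers to a PyTorch implementation of model quantization. As a hedged illustration only, the sketch below applies PyTorch's post-training dynamic quantization to a toy recurrent acoustic model; `TinyAcousticModel`, its layer sizes, and the set of quantized module types are assumptions for the example, not the network the thesis actually trains (its Section 5.1.4).

```python
# Illustrative sketch of PyTorch dynamic quantization; the model below is a
# stand-in, not the thesis's acoustic model.
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy CTC-style acoustic model: feature frames in, per-frame log-posteriors out."""
    def __init__(self, feat_dim=40, hidden=128, num_units=100):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, num_units)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)

model = TinyAcousticModel().eval()

# Dynamic quantization: weights of the listed module types are stored as int8
# and activations are quantized on the fly at inference time; no calibration
# data is needed, and the speed-up mainly helps CPU inference on edge devices.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.LSTM}, dtype=torch.qint8
)

frames = torch.randn(1, 98, 40)   # (batch, T, feat_dim) dummy feature frames
print(quantized(frames).shape)    # torch.Size([1, 98, 100])
```

Because only the weights are stored in int8 while activation scales are computed at run time, dynamic quantization trades a small amount of accuracy for a smaller model and faster matrix multiplications, which matches the edge-device deployment constraint described in the abstract.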