National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 洪瑛秀
Author (English): Ying-Hsiu Hung
Title: 用於構音障礙輔助系統之訓練資料擴增技術
Title (English): Training Data Augmentation for AI-Based Dysarthria Assistive Systems
Advisor: 賴穎暉
Advisor (English): Ying-Hui Lai
Degree: Master's
Institution: National Yang-Ming University
Department: Department of Biomedical Engineering
Discipline: Engineering
Field: Biomedical Engineering
Thesis Type: Academic thesis
Year of Publication: 2020
Graduation Academic Year: 108 (2019-2020)
Language: Chinese
Number of Pages: 60
Keywords (Chinese): 構音障礙、溝通輔助系統、資料增量、深度學習
Keywords (English): dysarthria, communication assistance system, data augmentation, deep learning
Statistics:
  • Cited by: 1
  • Views: 168
  • Downloads: 0
  • Bookmarked: 0
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Speech Processing Techniques
2.1.1 Signal-Processing-Based Conversion Methods
2.1.2 Speech-Recognition-Based Conversion Methods
2.2 Data Augmentation
Chapter 3 Research Methods
3.1 Architecture and Methods
3.1.1 Experimental Corpus
3.1.2 DSG Method for Generating Large Amounts of Dysarthric Speech Data
3.2 Experimental Design
3.3 Evaluation of Synthesized-Speech Quality and Validation of ASR Recognition Gains
3.3.1 Quality and Similarity Evaluation of DSG-Synthesized Speech
3.3.2 Validation of ASR Recognition Gains
Chapter 4 Results and Discussion
4.1 Experiment 1: Quality and Similarity Evaluation of DSG-Synthesized Speech
4.2 Experiment 2: Preliminary Validation of ASR Recognition Improvement from Synthesized Speech
4.3 Experiment 3: Comparison of Synthesized Speech with Other Data Augmentation Methods
4.4 Experiment 4: Free-Talk Validation
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References

List of Figures
Figure 2-1 Training and conversion workflow of a typical VC system [14].
Figure 2-2 VC architecture with added features [16].
Figure 2-3 t-SNE evaluation on LPS features of the differences between the target features and the features converted by the proposed method (J_SBDNN) and by other methods [16].
Figure 2-4 Unsupervised network-based voice conversion model [19].
Figure 2-5 Kaldi ASR architecture [15].
Figure 2-6 PPG grayscale image [36].
Figure 2-7 CTC schematic [37].
Figure 2-8 Average WER of Google ASR in 2019 [22].
Figure 2-9 Effect of increasing training data on the network [22].
Figure 2-10 Severity analysis of dysarthric speakers: (a) UA dysarthric speech corpus; (b) TORGO dysarthric speech corpus [21].
Figure 2-11 ASR results for data augmentation by speech-rate adjustment [21].
Figure 2-12 Voice conversion architecture [27].
Figure 2-13 DCGAN structure [27].
Figure 2-14 Objective and subjective tests [27].
Figure 2-15 Difference between two augmentation methods: adding converted speech versus adding noise [27].
Figure 2-16 CycleGAN architecture [51].
Figure 3-1 Proposed system architecture.
Figure 3-2 Conversion stage.
Figure 3-3 Experimental procedure.
Figure 3-4 TDNN schematic.
Figure 3-5 Schematic of the speech differences between normal speakers and dysarthric speakers [25].
Figure 4-1 Spectral envelopes of normal and synthesized speech; the words spoken in each panel are: (a), (b) "這"; (c), (d) "學"; (e), (f) "期". The five curves in each panel are the spectral envelopes of the same word spoken five times.
Figure 4-2 Test results for D_1 in Experiment 2: solid lines are the Half outside test; dashed lines are the Duplicate test.
Figure 4-3 Test results for D_2 in Experiment 2: solid lines are the Half outside test; dashed lines are the Duplicate test.
Figure 4-4 PPG phoneme distribution: the blue line is the phoneme distribution of the 288 TMHINT sentences; the orange line is the phoneme distribution of the speech synthesized from news text.
Figure 4-5 Comparison of augmentation with D_1 synthesized speech versus normal speech: solid lines are synthesized speech; dashed lines are normal speech.
Figure 4-6 Comparison of augmentation with D_2 synthesized speech versus normal speech: solid lines are the Half outside test; dashed lines are the Duplicate test.
Figure 4-7 Results of augmenting D_2 with a large amount of TTS data: solid lines are Not Join; dashed lines are the Duplicate test.

List of Tables
Table 3.1 Data usage.
Table 3.2 Definitions of the test corpora.
Table 3.3 Amount of data used in Experiments 2 and 3.
Table 3.4 Amount of data used in Experiment 4.
Table 4.1 MCD evaluation of speech synthesized by CycleGAN and StarGAN.
Table 4.2 ASR test results with 260× TTS augmentation.
References
[1] Wikipedia, "言語障礙," https://zh.wikipedia.org/wiki/%E8%A8%80%E8%AA%9E%E9%9A%9C%E7%A4%99.
[2] ASHA, "Quick Facts Speech & Language Disorders," https://www.asha.org/about/news/quick-facts/.
[3] CDC, "Stroke Statistics," https://www.cdc.gov/stroke/.
[4] ALS Association, "About ALS," http://www.alsa.org/about-als/facts-you-should-know.html.
[5] Cerebral Palsy Guidance, "Cerebral Palsy Prevalence and Incidence," https://www.cerebralpalsyguidance.com/cerebral-palsy/research/prevalence-and-incidence/.
[6] D. Sullivan, "Multiple Sclerosis: Facts, Statistics, and You," https://www.healthline.com/health/multiple-sclerosis/facts-statistics-infographic#1.
[7] C. Marras et al., "Prevalence of Parkinson’s disease across North America," npj Parkinson's Disease, vol. 4, no. 1, p. 21, 2018.
[8] American Brain Tumor Association, "Brain Tumor Education," https://www.abta.org/about-brain-tumors/brain-tumor-education/.
[9] Wikipedia, "Augmentative and alternative communication," https://en.wikipedia.org/wiki/Augmentative_and_alternative_communication.
[10] S. Calculator and C. D. A. Luchko, "Evaluating the effectiveness of a communication board training program," Journal of Speech and Hearing Disorders, vol. 48, no. 2, pp. 185-191, 1983.
[11] C.-S. Lin, C.-W. Ho, W.-C. Chen, C.-C. Chiu, and M.-S. Yeh, "Powered wheelchair controlled by eye-tracking system," Optica Applicata, vol. 36, 2006.
[12] "UW Augcomm: AACFeatures - Rate Enhancement," https://depts.washington.edu/augcomm/02_features/04d_rateenhance.htm.
[13] B. E. Murdoch and D. G. Theodoros, Traumatic brain injury: Associated speech, language, and swallowing disorders, Cengage Learning, 2001.
[14] S. H. Mohammadi and A. Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65-82, 2017.
[15] M. Ravanelli, T. Parcollet, and Y. Bengio, "The pytorch-kaldi speech recognition toolkit," in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6465-6469, 2019.
[16] K.-C. Chen, H.-W. Yeh, J.-Y. Hang, S.-H. Jhang, W.-Z. Zheng, and Y.-H. Lai, "A joint-feature learning-based voice conversion system for dysarthric user based on deep learning technology," in Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 1838-1841, 2019.
[17] J. Du and Q. Huo, "A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions," in Ninth Annual Conference of the International Speech Communication Association, 2008.
[18] L. Muda, M. Begam, and I. Elamvazuthi, "Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques," arXiv preprint arXiv:1003.4083, 2010.
[19] S. H. Yang and M. Chung, "Improving Dysarthric Speech Intelligibility Using Cycle-consistent Adversarial Training," arXiv preprint arXiv:2001.04260, 2020.
[20] D. Wang et al., "End-To-End Voice Conversion Via Cross-Modal Knowledge Distillation for Dysarthric Speech Reconstruction," in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7744-7748, 2020.
[21] B. Vachhani, C. Bhat, and S. K. Kopparapu, "Data Augmentation Using Healthy Speech for Dysarthric Speech Recognition," in Proc. of the Interspeech, pp. 471-475, 2018.
[22] J. Shor et al., "Personalizing ASR for Dysarthric and Accented Speech with Limited Data," arXiv preprint arXiv:1907.13511, 2019.
[23] Y. Takashima, R. Takashima, T. Takiguchi, and Y. Ariki, "Knowledge Transferability Between the Speech Data of Persons With Dysarthria Speaking Different Languages for Dysarthric Speech Recognition," IEEE Access, vol. 7, pp. 164320-164326, 2019.
[24] J. R. Duffy, Motor speech disorders-e-book: Substrates, differential diagnosis, and management, Elsevier Health Sciences, 2013.
[25] G. Weismer, K. Tjaden, and R. D. Kent, "Can articulatory behavior in motor speech disorders be accounted for by theories of normal speech production?," Journal of Phonetics, vol. 23, no. 1-2, pp. 149-164, 1995.
[26] M. Huang, "Development of Taiwan Mandarin hearing in noise test," Department of Speech Language Pathology and Audiology, National Taipei University of Nursing and Health Sciences, 2005.
[27] Y. Jiao, M. Tu, V. Berisha, and J. Liss, "Simulating dysarthric speech for training data augmentation in clinical speech applications," in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6009-6013, 2018.
[28] H. Kawahara and M. Morise, "Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework," Sadhana, vol. 36, no. 5, pp. 713-727, 2011.
[29] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.
[30] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," Journal of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71-76, 1990.
[31] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 285-288, 1998.
[32] J. Han and C. Moraga, "The influence of the sigmoid function parameters on the speed of backpropagation learning," in International Workshop on Artificial Neural Networks, pp. 195-201, 1995.
[33] E. L. Lehmann and G. Casella, Theory of point estimation. Springer Science & Business Media, 2006.
[34] P. J. Werbos, The roots of backpropagation: from ordered derivatives to neural networks and political forecasting. John Wiley & Sons, 1994.
[35] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.
[36] T. J. Hazen, W. Shen, and C. White, "Query-by-example spoken term detection using phonetic posteriorgram templates," in IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 421-426, 2009.
[37] A. Hannun, "Sequence modeling with CTC," Distill, vol. 2, no. 11, pp. e8, 2017.
[38] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, "A Comparison of Sequence-to-Sequence Models for Speech Recognition," in Proc. of the Interspeech, pp. 939-943, 2017.
[39] C.-C. Chiu et al., "State-of-the-art speech recognition with sequence-to-sequence models," in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4774-4778, 2018.
[40] E. Yilmaz, M. Ganzeboom, C. Cucchiarini, and H. Strik, "Multi-stage DNN training for automatic recognition of dysarthric speech," 2017.
[41] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[42] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5329-5333, 2018.
[43] C. Cannam, "Rubber band: A library and utility program for changing tempo and pitch of an audio recording," http://breakfastquay.com/rubberband/.
[44] SPTK Working Group et al., "Speech signal processing toolkit (sptk)," http://sp-tk.sourceforge.net/.
[45] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[46] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[47] S. Shao, P. Wang, and R. Yan, "Generative adversarial networks for data augmentation in machine fault diagnosis," Computers in Industry, vol. 106, pp. 85-93, 2019.
[48] A. Antoniou, A. Storkey, and H. Edwards, "Data augmentation generative adversarial networks," arXiv preprint arXiv:1711.04340, 2017.
[49] T. Kaneko and H. Kameoka, "Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks," in European Signal Processing Conference, pp. 2100-2104, 2018.
[50] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. of the IEEE international conference on computer vision, pp. 2223-2232, 2017.
[51] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in IEEE Spoken Language Technology Workshop, pp. 266-273, 2018.
[52] "Balabolka," http://www.cross-plus-a.com/balabolka.htm, 2020.
[53] "Microsoft Speech API," https://en.wikipedia.org/wiki/Microsoft_Speech_API, 2019.
[54] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline," in Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, pp. 1-5, 2017.
[55] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," in Proc. of the International Conference on Machine Learning-Volume 70, pp. 933-941, 2017.
[56] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, "Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks," in Proc. of the Interspeech, pp. 1283-1287, 2017.
[57] D. Povey et al., "The Kaldi speech recognition toolkit," in IEEE workshop on automatic speech recognition and understanding, 2011.
[58] N. Jaitly and G. E. Hinton, "Vocal tract length perturbation (VTLP) improves speech recognition," in Proc. of the ICML Workshop on Deep Learning for Audio, Speech and Language, vol. 117, 2013.
[59] J. Kominek, T. Schultz, and A. W. Black, "Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion," in Spoken Languages Technologies for Under-Resourced Languages, 2008.
[60] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion," in Proc. of the Interspeech, 2019.
[61] S. Luo and M. Sun, "Two-character Chinese word extraction based on hybrid of internal and contextual measures," in Proc. of the second SIGHAN workshop on Chinese language processing, pp. 24-30, 2003.