National Digital Library of Theses and Dissertations in Taiwan

Author: 張文瀚 (Chang, Wen-Han)
Title: 基於語音模型之語音轉換探討與分析 (Study and Analysis of Speech Model-Based Voice Conversion)
Advisor: 林泰吉 (Lin, Tay-Jyi)
Committee members: 王進賢 (Wang, Jinn-Shyan), 葉經緯 (Yeh, Ching-wei), 林泰吉 (Lin, Tay-Jyi), 賴穎暉 (Lai, Ying-Hui)
Oral defense date: 2019-01-22
Degree: Master's
Institution: 國立中正大學 (National Chung Cheng University)
Department: 資訊工程研究所 (Graduate Institute of Computer Science and Information Engineering)
Discipline: Engineering
Field: Electrical and Computer Engineering
Document type: Academic thesis
Publication year: 2019
Graduation academic year: 107 (2018-2019)
Language: Chinese
Pages: 50
Chinese keywords: 語音轉換 (voice conversion), 語音模型 (speech model), 語音特徵 (speech features)
English keywords: Voice Conversion Challenge, Sprocket
Speech-model-based voice conversion transforms a source voice into a target voice. A training process on source and target speech first estimates the parameters of a conversion function related to the target voice. The source speech is then passed through acoustic feature analysis (covering the fundamental frequency, aperiodicity, and mel-frequency cepstrum), the extracted features are mapped and corrected against the target speaker's feature statistics, and finally a waveform is produced by speech synthesis, so that the voice provided by one person (the source voice) is converted into the voice of another person (the target voice).
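The analysis-mapping-synthesis flow described above can be summarized by the following structural sketch in C (the implementation language used later in this thesis). Every type and function here is a hypothetical placeholder chosen for illustration; it is not the actual Sprocket or WORLD interface.

    /* A minimal structural sketch of the analysis-conversion-synthesis flow.
       All types and functions are hypothetical placeholders for illustration. */
    #include <stddef.h>

    typedef struct {
        double *f0;       /* fundamental frequency per frame (Hz) */
        double *aperiod;  /* band aperiodicity per frame          */
        double *mcep;     /* mel-cepstral coefficients per frame  */
        size_t  n_frames;
        size_t  mcep_dim;
    } Features;           /* hypothetical feature container */

    /* hypothetical stage interfaces */
    Features analyze(const double *wave, size_t n_samples, int fs);
    void     map_to_target(Features *feat, const void *trained_params);
    size_t   synthesize(const Features *feat, int fs, double *wave_out);

    /* end-to-end conversion: source waveform in, target-like waveform out */
    size_t voice_convert(const double *src_wave, size_t n_samples, int fs,
                         const void *trained_params, double *wave_out)
    {
        Features feat = analyze(src_wave, n_samples, fs); /* F0, aperiodicity, mel-cepstrum */
        map_to_target(&feat, trained_params);             /* correct features toward target */
        return synthesize(&feat, fs, wave_out);           /* vocoder resynthesis            */
    }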
This thesis analyzes how a speech model achieves voice conversion by examining the conversion part of Sprocket, the voice conversion software used as the baseline system of the Voice Conversion Challenge (VCC) 2018. The conversion part of Sprocket consists of three steps: acoustic feature analysis first extracts the main factors that make each speaker's voice distinct; the extracted features are then processed and mapped to the target speaker's acoustic features; finally, the converted features are fed to a speech synthesizer to generate the target speaker's speech. This thesis focuses on the main algorithms, showing how the individual features and other important information are obtained from the speech, analyzes the feasibility of optimizations that differ from Sprocket's original algorithms, and compares the execution time of the individual modules to check whether the results and data change; a small illustration of the feature-mapping step is shown below.
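As a concrete, self-contained illustration of the feature-mapping step, the short C program below converts an F0 contour with a linear transformation in the log domain, a technique commonly used for F0 conversion in statistical voice conversion systems. The source and target statistics are made-up values; a real system would estimate them from training data.

    /* Minimal example: map a source F0 contour to target F0 statistics with a
       linear transformation in the log domain.  Statistics are illustrative only. */
    #include <math.h>
    #include <stdio.h>

    /* convert one F0 value (Hz); unvoiced frames (f0 <= 0) are passed through */
    static double convert_f0(double f0, double src_mean, double src_std,
                             double tgt_mean, double tgt_std)
    {
        if (f0 <= 0.0)
            return 0.0;
        double z = (log(f0) - src_mean) / src_std;  /* normalize in log domain */
        return exp(z * tgt_std + tgt_mean);         /* rescale to target stats */
    }

    int main(void)
    {
        /* hypothetical log-F0 statistics (mean, standard deviation) */
        const double src_mean = log(120.0), src_std = 0.15;
        const double tgt_mean = log(220.0), tgt_std = 0.20;

        const double src_f0[] = { 110.0, 118.0, 0.0, 125.0, 130.0 };
        const size_t n = sizeof src_f0 / sizeof src_f0[0];
        for (size_t i = 0; i < n; ++i)
            printf("%6.1f Hz -> %6.1f Hz\n", src_f0[i],
                   convert_f0(src_f0[i], src_mean, src_std, tgt_mean, tgt_std));
        return 0;
    }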
After a careful study of the Sprocket architecture and the algorithms and functions it uses, experiments were designed to try different algorithms and functions for the analysis, conversion, and synthesis of acoustic features, with the goal of reducing execution time while keeping a similar conversion quality. Since the original Sprocket is written in Python, this work re-develops and redesigns Sprocket in C and then replaces and optimizes algorithms on top of the C implementation. After many rounds of optimization and tuning, the experiments improve the overall execution time of the Sprocket system by a factor of 2.785, with no audible impact on the converted speech or its quality, achieving the goal of faster voice conversion.
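The per-module execution times behind a speed-up figure such as 2.785x can be obtained with ordinary wall-clock measurements. The sketch below shows one possible way to time a module in C on POSIX systems; the workload is a placeholder, not actual Sprocket code.

    /* Wall-clock timing of a single module with clock_gettime (POSIX). */
    #include <stdio.h>
    #include <time.h>

    static double elapsed_sec(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    static void module_under_test(void)
    {
        /* placeholder workload standing in for, e.g., feature analysis */
        volatile double acc = 0.0;
        for (long i = 1; i < 10000000L; ++i)
            acc += 1.0 / (double)i;
    }

    int main(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        module_under_test();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("module time: %.3f s\n", elapsed_sec(t0, t1));
        /* speedup = baseline (Python) time / optimized (C) time */
        return 0;
    }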

The goal of speech-model-based voice conversion is to transform a source voice into a target voice. Through the training process, the parameters of a conversion function are estimated. The converted speech is then made to sound like the target voice by performing acoustic feature analysis, mapping, and synthesis in sequence.
This thesis analyzes how the speech-model-based method achieves voice conversion by studying Sprocket, the baseline system of the Voice Conversion Challenge 2018. The voice conversion part of Sprocket can be divided into three stages. First, acoustic feature analysis extracts the main factors that distinguish each speaker's voice. Second, these acoustic features are mapped to the target speaker's acoustic features through dedicated processing. Finally, the converted features are passed through a synthesizer to generate the target speaker's speech. The thesis focuses on the main algorithms: through them we examine how the acoustic features and other important information of the speech are obtained, and we analyze the execution time of the individual modules.
After carefully studying the Sprocket architecture and its algorithms and functions, we design experiments on the analysis, conversion, and synthesis stages, choosing alternative algorithms that reduce execution time without degrading speech quality. Since the original Sprocket is written in Python, this work reimplements it in C and then replaces and optimizes algorithms and functions on top of the C implementation. After many rounds of optimization and tuning, the experiments achieve a 2.785x improvement in the overall execution time of Sprocket. We thus reach the goal of speeding up voice conversion without a perceptible difference in conversion quality under subjective listening.

Chapter 1 Introduction
1.1 Voice Conversion
1.2 Motivation and Objectives
1.3 Voice Conversion Implementation in C
1.4 Thesis Organization
Chapter 2 Speech-Model-Based Voice Conversion
2.1 The Sprocket Voice Conversion System
2.2 The Sprocket Voice Conversion Flow
2.3 C Implementation of Sprocket Voice Conversion
Chapter 3 Time Complexity Analysis
3.1 Time Complexity Analysis of Sprocket Voice Conversion
3.2 Algorithm Optimization and Replacement in Sprocket Voice Conversion
Chapter 4 Experimental Design and Results
4.1 Experimental Environment
4.2 Experimental Results
Chapter 5 Conclusion
References


[1] K. Kobayashi and T. Toda, “Sprocket: open-source voice conversion software,” Proc. Odyssey, pp. 203-210, June 2018.
[2] M. Morise, “Harvest: a high-performance fundamental frequency estimator from speech signals,” Proc. INTERSPEECH, pp. 2321–2325, Aug. 2017.
[3] M. Morise, “D4C, a band-aperiodicity estimator for high-quality speech synthesis,” Speech Communication, vol. 84, pp. 57–65, Sept. 2016.
[4] M. Morise, “CheapTrick, a spectral envelope estimator for high-quality speech synthesis,” Speech Communication, vol. 67, pp. 1–7, Sept. 2015.
[5] J. Demmel, J. R. Gilbert, and X. S. Li, “SuperLU Users' Guide,” http://crd.lbl.gov/~xiaoye/SuperLU/ug.pdf.
[6] M. Morise and Y. Watanabe, “Sound quality comparison among high-quality vocoders by using re-synthesized speech,” Acoust. Sci. & Tech., vol. 39, no. 3, pp. 263–265, May 2018.
[7] J. Flanagan and R. Golden, “Phase vocoder,” The Bell System Technical Journal, vol. 45, no. 9, pp. 1493–1509, Nov. 1966.
[8] Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Trans. SAP, vol. 6, no. 2, pp. 131–142, Mar. 1998.
[9] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum likelihood estimation of spectral parameter trajectory,” IEEE Trans. ASLP, vol. 15, no. 8, pp. 2222–2235, Nov. 2007.
[10] T. Toda, T. Muramatsu, and H. Banno, “Implementation of computationally efficient real-time voice conversion,” Proc. INTERSPEECH, Sept. 2012.
[11] N. Pilkington, H. Zen, and M. Gales, “Gaussian process experts for voice conversion,” Proc. INTERSPEECH, pp. 2761–2764, Aug. 2011.
[12] N. Xu, Y. Tang, J. Bao, A. Jiang, X. Liu, and Z. Yang, “Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data,” Speech Communication, vol. 58, pp. 124–138, Mar. 2014.
[13] R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion using sparse representation in noisy environments,” IEICE Trans. Inf. and Syst., vol. E96-A, no. 10, pp. 1946–1953, Oct. 2013.
[14] Z. Wu, T. Virtanen, E. Chng, and H. Li, “Exemplar-based sparse representation with residual compensation for voice conversion,” IEEE/ACM Trans. ASLP, vol. 22, no. 10, pp. 1506–1521, June 2014.
[15] T. Nakashika, R. Takashima, T. Takiguchi, and Y. Ariki, “Voice conversion in high-order eigen space using deep belief nets,” Proc. INTERSPEECH, pp. 369–372, Aug. 2013.
[16] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Trans. ASLP, vol. 22, no. 12, pp. 1859–1872, Dec. 2014.
[17] L. Sun, S. Kang, K. Li, and H. Meng, “Voice conversion using deep bidirectional long short-term memory based recurrent neural networks,” Proc. ICASSP, pp. 4869–4873, Apr. 2015.
[18] D. Erro, A. Moreno, and A. Bonafonte, “INCA algorithm for training voice conversion systems from nonparallel corpora,” IEEE Trans. ASLP, vol. 18, no. 5, pp. 944–953, 2010.
[19] T. Hashimoto, D. Saito, and N. Minematsu, “Arbitrary speaker conversion based on speaker space bases constructed by deep neural networks,” Proc. APSIPA, pp. 1–4, Dec. 2016.
[20] Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, “Many-to-many eigenvoice conversion with reference voice,” Proc. INTERSPEECH, pp. 1623–1626, Sept. 2009.
[21] Y. Chen, M. Chu, E. Chang, J. Liu, and R. Liu, “Voice conversion with smoothed GMM and MAP adaptation,” Proc. INTERSPEECH, pp. 1–4, Sept. 2003.
[22] H. Kawahara, J. Estill, and O. Fujimura, “Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited,” Proc. ICASSP, pp. 1303–1306, 1997.
[23] M. Morise, H. Kawahara, and T. Nishiura, “Rapid F0 estimation for high-SNR speech based on fundamental component extraction,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol. J93-D, no. 2, pp. 109–117, 2010.

Electronic full text (publicly available online from 2024-03-22)