National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Student: 黃子賢
Student (English): Tzu-Hsien Huang
Thesis Title: 類神經網路語音轉換模型之強健性分析與任意對任意非平行語料序列到序列語音轉換模型
Thesis Title (English): Robustness Analysis for Neural Voice Conversion Models and An Any-to-Any Non-Parallel Sequence-to-Sequence Voice Conversion Model
Advisor: 李琳山
Advisor (English): Lin-shan Lee
Committee Members: 李宏毅、鄭秋豫、王小川、簡仁宗、陳信宏
Committee Members (English): Hung-yi Lee, Chiu-yu Tseng, Hsiao-Chuan Wang, Jen-Tzung Chien, Sin-Horng Chen
Oral Defense Date: 2021-06-30
Degree: Master
Institution: National Taiwan University
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Academic Field: Electrical and Information Engineering
Thesis Type: Academic thesis
Year of Publication: 2021
Academic Year of Graduation: 109
Language: Chinese
Number of Pages: 85
Chinese Keywords: 語音轉換
English Keywords: Voice conversion
DOI: 10.6342/NTU202101226
Usage Statistics:
  • Cited by: 0
  • Views: 182
  • Downloads: 21
  • Bookmarked: 0
Abstract (Chinese):
The goal of the voice conversion task is to transform an input utterance into another utterance without changing its linguistic content; the property converted may be accent, prosody, emotion, or speaker characteristics. Many studies have already used deep learning to perform voice conversion, some with publicly released audio samples that sound very good at first listen. However, because voice conversion is a signal generation task, it can usually only be evaluated subjectively, and with too few samples such evaluations easily fail to reflect a model's true performance. Moreover, almost all papers to date test only on their training datasets and rarely report performance in real-world conditions, even though deep learning models inevitably suffer severe degradation when the statistical distribution of the data shifts.
The first part of this thesis examines the quality of speech generated by various deep-learning-based voice conversion models when the training and testing data distributions differ. The models tested include FragmentVC, AutoVC, AdaIN-VC, VQVC+, BLOW, DGAN-VC, and WAStarGAN-VC, compared as fairly as possible. The test scenarios cover different recording environments, different languages, conversion between different genders, and speech with added noise. The experiments also show that a voice conversion model trained on data in a single language can convert well across different languages.
The second part of this thesis attempts to build a sequence-to-sequence voice conversion model without parallel training data and without text information. Based on the FragmentVC architecture, an additional prosody module is added to learn the target prosody, and various data augmentation methods are tried. Although the experiments were not completed due to time constraints, some findings may still be useful to future researchers, so the experimental results and conjectures are faithfully recorded.
Abstract (English):
The goal of the voice conversion task is to convert some property (e.g., accent, rhythm, emotion, or speaker characteristics) of the source audio without changing its linguistic content.

At present, many studies have used deep learning to perform voice conversion, and the publicly released audio samples sound very good. However, voice conversion is a signal generation task, so its actual performance requires human evaluation over a sufficient number of audio samples. In particular, results are very often reported only on the training dataset rather than on real-world data, and it is well known that the performance of deep learning models may drop severely on out-of-distribution data.

In the first part of this thesis, we analyzed the performance of several of the most recent and popular deep-learning-based voice conversion models when they were tested on very different datasets. The models analyzed include FragmentVC, AutoVC, AdaIN-VC, VQVC+, BLOW, DGAN-VC, and WAStarGAN-VC, in scenarios covering different recording environments, different languages, and different genders. For example, we showed that voice conversion models trained on a single language can generalize well to other languages.
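The thesis's table of contents lists automatic speech recognition and neural speaker verification models among its evaluation methods (Sections 4.2.1–4.2.2). As a rough, non-authoritative sketch of how such objective robustness scoring could look, the Python below measures content preservation with a character error rate against the source transcript and speaker similarity with cosine similarity between speaker embeddings; `transcribe` and `embed_speaker` are hypothetical placeholders for whichever pretrained ASR and speaker-verification models are available, not the thesis's exact tools.

```python
# Hedged sketch of objective voice-conversion evaluation (not the thesis's code).
# `transcribe` and `embed_speaker` are placeholders for pretrained ASR and
# speaker-verification models supplied by the caller.
from typing import Callable, Dict

import numpy as np


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between character sequences, normalized by reference length."""
    ref, hyp = list(reference), list(hypothesis)
    dist = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=np.int32)
    dist[:, 0] = np.arange(len(ref) + 1)
    dist[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i, j] = min(dist[i - 1, j] + 1,         # deletion
                             dist[i, j - 1] + 1,         # insertion
                             dist[i - 1, j - 1] + cost)  # substitution
    return dist[len(ref), len(hyp)] / max(len(ref), 1)


def speaker_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (e.g. d-vectors)."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))


def evaluate_conversion(converted_wav: np.ndarray,
                        source_transcript: str,
                        target_wav: np.ndarray,
                        transcribe: Callable[[np.ndarray], str],
                        embed_speaker: Callable[[np.ndarray], np.ndarray]) -> Dict[str, float]:
    """Score one converted utterance: a low CER suggests the linguistic content was
    preserved; a high speaker similarity suggests the converted voice matches the target."""
    cer = character_error_rate(source_transcript, transcribe(converted_wav))
    sim = speaker_similarity(embed_speaker(converted_wav),
                             embed_speaker(target_wav))
    return {"cer": cer, "speaker_similarity": sim}
```

Aggregating such scores over utterances drawn from datasets other than the training corpus is one way to quantify the kind of performance degradation under distribution mismatch discussed above.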

In the second part of this thesis, we attempted to achieve sequence-to-sequence, any-to-any voice conversion with non-parallel training data. We modified the FragmentVC framework by adding a prosody module and tried several data augmentation methods. Although the experiments were not completed due to time limitations, some interesting findings are reported here for future researchers to refer to.
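The abstract mentions trying several data augmentation methods without listing them, so the sketch below shows two common waveform-level augmentations purely as illustrative assumptions: additive white noise at a chosen signal-to-noise ratio, and naive speed perturbation by resampling.

```python
# Illustrative waveform augmentations (assumed examples, not necessarily the
# methods used in the thesis): additive noise at a target SNR and naive speed
# perturbation by index resampling.
import numpy as np


def add_noise(wav: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise so that the resulting signal-to-noise ratio is snr_db."""
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wav.shape)
    return wav + noise


def speed_perturb(wav: np.ndarray, rate: float) -> np.ndarray:
    """Resample by `rate` (e.g. 0.9 or 1.1); played back at the original sample
    rate, this changes both speed and pitch."""
    old_idx = np.arange(len(wav))
    new_idx = np.arange(0, len(wav), rate)
    return np.interp(new_idx, old_idx, wav)


# Example: augment a one-second 220 Hz tone sampled at 16 kHz.
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
augmented = speed_perturb(add_noise(tone, snr_db=20.0, rng=rng), rate=1.1)
```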
Table of Contents:
Chinese Abstract
English Abstract
Chapter 1: Introduction
1.1 Research Motivation
1.2 Research Directions
1.3 Contributions
1.4 Thesis Organization
Chapter 2: Background
2.1 Deep Neural Networks
2.2 Autoregressive Models
2.2.1 Sequence-to-Sequence Models
2.2.2 Teacher Forcing
2.3 Transformer Networks
2.3.1 Attention Mechanism
2.3.2 Multi-Head Self-Attention
2.3.3 Positional Encoding
2.4 Autoencoders
2.4.1 Variational Autoencoders
2.4.2 Vector-Quantized Variational Autoencoders
2.5 Generative Adversarial Networks
2.5.1 Cycle-Consistent Generative Adversarial Networks
2.5.2 Star Generative Adversarial Networks
2.6 Flow-Based Models
2.7 Self-Supervised Representations
2.8 Chapter Summary
Chapter 3: Voice Conversion Models
3.1 Comparison by Data Usage Scenario
3.1.1 Parallel-Corpus Voice Conversion
3.1.2 Non-Parallel-Corpus Voice Conversion
3.2 Comparison by Number of Input and Output Types
3.2.1 Any-to-Any Voice Conversion Models
3.2.2 Any-to-One Voice Conversion Models
3.2.3 Many-to-Many Voice Conversion Models
3.2.4 One-to-One Voice Conversion Models
3.3 Comparison by Generation Behavior
3.3.1 Frame-to-Frame
3.3.2 Sequence-to-Sequence
3.4 Related Techniques
3.4.1 Representation-Disentanglement-Based Approaches
3.4.2 Exemplar-Based Approaches
3.4.3 GAN-Based Approaches
3.4.4 Flow-Based Approaches
3.5 Chapter Summary
Chapter 4: Robustness Comparison of Voice Conversion Models under Different Audio Conditions
4.1 Datasets
4.2 Evaluation Methods
4.2.1 Automatic Speech Recognition
4.2.2 Neural Speaker Verification Models
4.2.3 Neural Mean Opinion Score Prediction Models
4.3 Robustness Comparison of Any-to-Any Voice Conversion Models
4.3.1 Models
4.3.2 Experimental Setup
4.3.3 Cross-Dataset Results and Analysis for Any-to-Any Models
4.3.4 Language-Dependency Results and Analysis for Any-to-Any Models
4.3.5 Gender-Conversion Difficulty Results and Analysis for Any-to-Any Models
4.4 Robustness Comparison of Many-to-Many Voice Conversion Models
4.4.1 Models
4.4.2 Experimental Setup
4.4.3 Results and Analysis
4.5 Chapter Summary
Chapter 5: Toward an Any-to-Any Non-Parallel Sequence-to-Sequence Voice Conversion Model
5.1 Related Work
5.2 Model Architecture
5.3 Experimental Setup
5.4 Results and Analysis
5.5 Chapter Summary
Chapter 6: Conclusion and Future Work
6.1 Contributions and Discussion
6.2 Future Work
References