National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Author: 鄭俊祥
Author (English): CHUN-HSIANG CHENG
Title: 基於語者特徵領域泛化之零資源語音轉換系統
Title (English): Zero-shot Voice Conversion Based on Speaker Embedding Domain Generalization
Advisor: 王家慶
Advisor (English): Jia-Ching Wang
Degree: Master's
Institution: National Central University (國立中央大學)
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Discipline: Engineering
Field: Electrical and Computer Engineering
Document Type: Academic thesis
Publication Year: 2022
Graduation Academic Year: 110
Language: Chinese
Number of Pages: 65
Keywords (Chinese): 語音轉換、語者編碼、語音合成、領域泛化、元學習
Keywords (English): voice conversion, speaker embedding, text-to-speech, domain generalization, meta-learning
Usage statistics:
  • Cited: 0
  • Views: 100
  • Downloads: 0
  • Bookmarked: 0
In recent years, the development of deep learning has made it possible to realize ideas that once seemed far-fetched. Through voice conversion, the speech of any source speaker can be transformed so that only its semantic information (such as the text) is retained, while the speaker information (such as pitch, speaking rate, and energy) is converted into that of another, target speaker. Achieving good conversion, however, requires enough data to train the model adequately, and the model's generalization ability must be strong enough for it to infer well in any domain. Voice conversion therefore usually performs well on seen speakers (speakers whose data was used in training) and worse on unseen speakers (speakers whose data was not). Although recent research has targeted voice conversion for unseen speakers, the synthesized quality still falls short of that for seen speakers. This thesis therefore aims to construct a zero-shot Chinese voice conversion system that improves speech quality for unseen speakers.
The proposed system achieves zero-shot voice conversion mainly by effectively disentangling the semantic information and the speaker information in speech. Semantic information is extracted from the source speaker with the pretrained speech recognition model Wav2vec 2.0, and speaker information is extracted from the target speaker with WavLM. The target speaker's features are then mapped by a Robust MAML model into a domain-generalized space, so that they can be applied directly to any unseen speaker domain. Finally, through transfer learning, the semantic information and the domain-generalized speaker information are passed to the speech synthesis model FastSpeech2, which synthesizes speech in the target speaker's voice, completing the zero-shot voice conversion system.
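To make the data flow concrete, the following is a minimal PyTorch sketch of the inference pipeline described above. It assumes the HuggingFace `transformers` checkpoints `facebook/wav2vec2-base` and `microsoft/wavlm-base-plus` for the two pretrained encoders; the `SpeakerProjection` and `ToyAcousticModel` modules are hypothetical placeholders for the Robust-MAML-trained projection and the multi-speaker FastSpeech2 model, not the thesis's actual implementation.

```python
# Minimal sketch of the inference pipeline described in the abstract, assuming
# PyTorch and HuggingFace `transformers`. Only the two encoder calls reflect real
# library APIs; SpeakerProjection and ToyAcousticModel are hypothetical stand-ins.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, WavLMModel


class SpeakerProjection(nn.Module):
    """Stand-in for the Robust MAML projection that maps WavLM features
    into a domain-generalized speaker-embedding space."""

    def __init__(self, in_dim: int = 768, out_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, in_dim) -> utterance-level embedding (batch, out_dim)
        return self.proj(x.mean(dim=1))


class ToyAcousticModel(nn.Module):
    """Stand-in for the multi-speaker FastSpeech2 acoustic model."""

    def __init__(self, content_dim: int = 768, spk_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.fuse = nn.Linear(content_dim + spk_dim, n_mels)

    def forward(self, content: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # content: (batch, frames, content_dim), spk: (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.fuse(torch.cat([content, spk], dim=-1))  # (batch, frames, n_mels)


@torch.no_grad()
def convert(source_wav: torch.Tensor, target_wav: torch.Tensor) -> torch.Tensor:
    """source_wav / target_wav: 16 kHz mono waveforms of shape (1, num_samples).
    Returns a mel-spectrogram-like tensor in the target speaker's voice; a vocoder
    such as HiFi-GAN would render it to audio. Feature-extractor normalization is
    omitted for brevity."""
    content_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
    speaker_encoder = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()
    project = SpeakerProjection().eval()     # would be the trained Robust MAML head
    synthesize = ToyAcousticModel().eval()   # would be the trained FastSpeech2 model

    content = content_encoder(source_wav).last_hidden_state   # semantic features (source)
    spk_feat = speaker_encoder(target_wav).last_hidden_state  # speaker features (target)
    spk_emb = project(spk_feat)                               # domain-generalized embedding
    return synthesize(content, spk_emb)
```

For example, `convert(torch.randn(1, 16000), torch.randn(1, 16000))` returns a tensor of shape roughly (1, 49, 80) with these base checkpoints; in the actual system the two stand-in modules would be the trained components described in Chapter 3.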
Table of Contents
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iv
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
1.1 Research Background and Motivation 1
1.2 Research Objectives 2
1.3 Research Methods and Chapter Overview 2
Chapter 2 Introduction to Voice Conversion and Related Work 4
2.1 Speaker Features in Speech 4
2.1.1 One-hot Vector 4
2.1.2 D-vector 5
2.1.3 X-vector 6
2.2 Transformer 7
2.2.1 Self-Attention 8
2.2.2 Multi-head Attention 9
2.2.3 Positional Encoding (PE) 10
2.3 Speech Recognition Models 11
2.3.1 Contrastive Predictive Coding (CPC) 11
2.3.2 Wav2vec 2.0 12
2.3.3 HuBERT 14
2.4 Speech Generation Models 15
2.4.1 FastSpeech 15
2.5 Vocoder 17
2.5.1 HiFi-GAN 17
2.6 Related Work on Voice Conversion 18
Chapter 3 Zero-shot Voice Conversion Based on Speaker Embedding Domain Generalization 20
3.1 Speaker Feature Extraction Model 21
3.1.1 WavLM 21
3.1.2 WavLM Model Architecture 21
3.1.3 Gated Relative Position Bias 23
3.1.4 Masked Speech Denoising and Prediction 24
3.1.5 WavLM Speaker Feature Extraction Model 24
3.2 Speaker Feature Generalization Model 25
3.2.1 Model-Agnostic Meta-Learning (MAML) 26
3.2.2 MAML Data Setup 26
3.2.3 MAML Training Procedure 27
3.2.4 Robust MAML 29
3.3 Multi-speaker Speech Synthesis Model 31
3.3.1 FastSpeech2 31
3.3.2 FastSpeech2 Model Architecture 32
3.3.3 Variance Adaptor 32
3.3.4 Multi-speaker FastSpeech2 Model 33
3.4 Voice Conversion Model 34
3.4.1 Voice Conversion Model Architecture 35
3.4.2 Inference Procedure 36
Chapter 4 Experiments 37
4.1 Datasets 37
4.1.1 AISHELL1 37
4.1.2 AISHELL3 38
4.2 Experimental Setup 39
4.2.1 Hardware and Environment 39
4.2.2 Speaker Feature Extraction Model Settings 40
4.2.3 Training the X-vector Speaker Feature Extraction Model 41
4.2.4 Training the WavLM Speaker Feature Extraction Model 41
4.2.5 Training the WavLM + Robust MAML Speaker Feature Extraction Model 43
4.2.6 Training the Speech Synthesis Model 43
4.3 Experimental Results and Analysis 44
4.3.1 Evaluation Methods 44
4.3.2 Speaker Feature Extraction Model Performance 45
4.3.3 Voice Conversion Performance for Seen Speakers 46
4.3.4 Speaker Features of Unseen Speakers 47
4.3.5 Voice Conversion Performance for Unseen Speakers 47
Chapter 5 Conclusion and Future Work 49
Chapter 6 References 50
[1] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 2014, pp. 4052–4056. doi: 10.1109/ICASSP.2014.6854363.
[2] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Apr. 2018, pp. 5329–5333. doi: 10.1109/ICASSP.2018.8461375.
[3] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Interspeech 2015, Sep. 2015, pp. 3214–3218. doi: 10.21437/Interspeech.2015-647.
[4] A. Vaswani et al., “Attention Is All You Need.” arXiv, Dec. 05, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1706.03762
[5] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv, May 19, 2016. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1409.0473
[6] L. Dong, S. Xu, and B. Xu, “Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 5884–5888. doi: 10.1109/ICASSP.2018.8462506.
[7] A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding.” arXiv, Jan. 22, 2019. Accessed: Jul. 16, 2022. [Online]. Available: http://arxiv.org/abs/1807.03748
[8] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Mar. 2010, pp. 297–304. Accessed: Jul. 17, 2022. [Online]. Available: https://proceedings.mlr.press/v9/gutmann10a.html
[9] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” arXiv, Oct. 22, 2020. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/2006.11477
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv, May 24, 2019. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1810.04805
[11] E. Jang, S. Gu, and B. Poole, “Categorical Reparameterization with Gumbel-Softmax.” arXiv, Aug. 05, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1611.01144
[12] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units.” arXiv, Jun. 14, 2021. Accessed: Jun. 19, 2022. [Online]. Available: http://arxiv.org/abs/2106.07447
[13] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982, doi: 10.1109/TIT.1982.1056489.
[14] Y. Ren et al., “FastSpeech: Fast, Robust and Controllable Text to Speech.” arXiv, Nov. 20, 2019. Accessed: Jul. 01, 2022. [Online]. Available: http://arxiv.org/abs/1905.09263
[15] Y. Wang et al., “Tacotron: Towards End-to-End Speech Synthesis.” arXiv, Apr. 06, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1703.10135
[16] J. Shen et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.” arXiv, Feb. 15, 2018. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1712.05884
[17] W. Ping et al., “Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning.” arXiv, Feb. 22, 2018. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1710.07654
[18] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.” arXiv, Oct. 23, 2020. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/2010.05646
[19] K. Kumar et al., “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” in Advances in Neural Information Processing Systems, 2019, vol. 32. Accessed: Jul. 17, 2022. [Online]. Available: https://papers.nips.cc/paper/2019/hash/6804c9bca0a615bdb9374d00a9fcba59-Abstract.html
[20] L.-W. Chen, H.-Y. Lee, and Y. Tsao, “Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech.” arXiv, Aug. 22, 2019. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1810.12656
[21] T. Kaneko and H. Kameoka, “CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks,” in 2018 26th European Signal Processing Conference (EUSIPCO), Sep. 2018, pp. 2100–2104. doi: 10.23919/EUSIPCO.2018.8553236.
[22] J. Serrà, S. Pascual, and C. Segura, “Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion.” arXiv, Sep. 05, 2019. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1906.00794
[23] D.-Y. Wu, Y.-H. Chen, and H. Lee, “VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net Architecture,” in Interspeech 2020, Oct. 2020, pp. 4691–4695. doi: 10.21437/Interspeech.2020-1443.
[24] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder.” arXiv, Oct. 13, 2016. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1610.04019
[25] I. J. Goodfellow et al., “Generative Adversarial Networks.” arXiv, Jun. 10, 2014. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1406.2661
[26] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes.” arXiv, May 01, 2014. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1312.6114
[27] J. Chou, C. Yeh, H. Lee, and L. Lee, “Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations.” arXiv, Jun. 24, 2018. Accessed: Jul. 14, 2022. [Online]. Available: http://arxiv.org/abs/1804.02812
[28] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks.” arXiv, Jun. 29, 2018. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1806.02169
[29] J.-C. Chou, C. Yeh, and H. Lee, “One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization,” 2019. doi: 10.21437/interspeech.2019-2663.
[30] X. Huang and S. Belongie, “Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization.” arXiv, Jul. 30, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1703.06868
[31] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss.” arXiv, Jun. 06, 2019. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1905.05879
[32] A. Polyak, L. Wolf, and Y. Taigman, “TTS Skins: Speaker Conversion via ASR.” arXiv, Jul. 26, 2020. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1904.08983
[33] S. Chen et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing.” arXiv, Jan. 24, 2022. Accessed: Jun. 19, 2022. [Online]. Available: http://arxiv.org/abs/2110.13900
[34] J. Kahn et al., “Libri-Light: A Benchmark for ASR with Limited or No Supervision,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 7669–7673. doi: 10.1109/ICASSP40776.2020.9052942.
[35] Z. Chi et al., “XLM-E: Cross-lingual Language Model Pre-training via ELECTRA.” arXiv, Apr. 19, 2022. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/2106.16138
[36] J. Kang, R. Liu, L. Li, Y. Cai, D. Wang, and T. F. Zheng, “Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning,” in Interspeech 2020, Oct. 2020, pp. 3825–3829. doi: 10.21437/Interspeech.2020-2562.
[37] C. Finn, P. Abbeel, and S. Levine, “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.” arXiv, Jul. 18, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1703.03400
[38] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals, “Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML.” arXiv, Feb. 12, 2020. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1909.09157
[39] Q. Qian, S. Zhu, J. Tang, R. Jin, B. Sun, and H. Li, “Robust Optimization over Multiple Domains.” arXiv, Nov. 14, 2018. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1805.07588
[40] Q. Dou, D. C. Castro, K. Kamnitsas, and B. Glocker, “Domain Generalization via Model-Agnostic Learning of Semantic Features.” arXiv, Oct. 29, 2019. doi: 10.48550/arXiv.1910.13580.
[41] Y. Ren et al., “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.” arXiv, Mar. 04, 2021. Accessed: Jul. 01, 2022. [Online]. Available: http://arxiv.org/abs/2006.04558
[42] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi,” 2017. doi: 10.21437/INTERSPEECH.2017-1386.
[43] A. Suni, D. Aalto, T. Raitio, P. Alku, and M. Vainio, “Wavelets for intonation modeling in HMM speech synthesis,” in 8th ISCA Speech Synthesis Workshop (SSW8), 2013.
[44] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline.” arXiv, Sep. 16, 2017. doi: 10.48550/arXiv.1709.05522.
[45] Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines.” arXiv, Apr. 22, 2021. doi: 10.48550/arXiv.2010.11567.
[46] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization.” arXiv, Jan. 29, 2017. doi: 10.48550/arXiv.1412.6980.
[47] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[48] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization.” arXiv, Jul. 21, 2016. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1607.06450
[49] D. Hendrycks and K. Gimpel, “Gaussian Error Linear Units (GELUs).” arXiv, Jul. 08, 2020. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1606.08415
[50] L. van der Maaten and G. Hinton, “Visualizing Data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008. Accessed: Jul. 15, 2022. [Online]. Available: http://www.cs.toronto.edu/~hinton/absps/tsne.pdf