
National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Student: 林均憲
Student (English): LIN, CHUN-HSIEN
Thesis Title: 應用於樂器音色轉換之生成對抗網路設計
Thesis Title (English): Generative Adversarial Network Design for Instrument Timbre Transfer Application
Advisor: 許明華
Advisor (English): SHEU, MING-HWA
Committee Members: 蔡宗漢、江正雄、吳亦超
Committee Members (English): TSAI, TSUNG-HAN; CHIANG, JEN-SHIUN; WU, YI-CHAO
Oral Defense Date: 2024-07-11
Degree: Master's
Institution: 國立雲林科技大學 (National Yunlin University of Science and Technology)
Department: 電子工程系 (Department of Electronic Engineering)
Discipline: Engineering
Academic Field: Electrical and Computer Engineering
Document Type: Academic thesis
Publication Year: 2024
Graduation Academic Year: 112 (2023–2024)
Language: Chinese
Number of Pages: 62
Keywords (Chinese): 生成對抗網路; 聲音合成; 音色轉換; 注意力機制
Keywords (English): Generative Adversarial Network; Sound Synthesis; Timbre Transfer; Attention Mechanism
Statistics:
  • Cited by: 0
  • Views: 17
  • Downloads: 1
  • Bookmarked: 0
This thesis develops an instrument timbre transfer system based on a Generative Adversarial Network (GAN), aiming to convert timbre between different instruments using deep learning techniques. The research combines the modular design of Differentiable Digital Signal Processing (DDSP) with Any-to-Any voice conversion techniques to propose a new timbre transfer and synthesis model architecture. The model comprises an F0 encoder, a loudness encoder, an energy encoder, and a timbre encoder, which process pitch, loudness, energy, and timbre respectively, and it performs precise timbre transfer and synthesis through a timbre transfer network decoder. In addition, a non-integer harmonic generation technique introduced into the model effectively covers the portion of the timbre space that integer harmonics alone cannot reach.
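To make the harmonic-synthesis idea concrete, the following is a minimal NumPy sketch of DDSP-style additive synthesis, where each partial oscillates at F0 multiplied by a frequency ratio: integer ratios yield the ordinary harmonic series, while non-integer ratios cover the inharmonic partials mentioned above. The function name, array shapes, and example ratios are illustrative assumptions, not the implementation used in the thesis.

    import numpy as np

    def additive_synth(f0, amps, ratios, sr=16000):
        """Sum-of-sinusoids synthesizer (sketch of a DDSP-style harmonic oscillator)."""
        # f0:     [T] per-sample fundamental frequency in Hz
        # amps:   [K, T] per-partial amplitude envelopes
        # ratios: [K] frequency ratios; 1, 2, 3, ... give integer harmonics,
        #         values such as 2.76 give non-integer (inharmonic) partials
        freqs = ratios[:, None] * f0[None, :]               # [K, T] instantaneous Hz
        phases = 2 * np.pi * np.cumsum(freqs, axis=1) / sr  # integrate frequency to phase
        partials = amps * np.sin(phases)                    # [K, T] individual sinusoids
        partials[freqs >= sr / 2] = 0.0                     # mute partials above Nyquist
        return partials.sum(axis=0)

    # Hypothetical usage: a 2-second A4 tone with 8 integer harmonics plus two
    # non-integer partials (ratios borrowed from bell-like spectra, purely illustrative).
    sr = 16000
    n = 2 * sr
    f0 = np.full(n, 440.0)
    ratios = np.array([1, 2, 3, 4, 5, 6, 7, 8, 2.76, 5.40])
    amps = np.repeat((1.0 / ratios)[:, None], n, axis=1)    # simple 1/k amplitude decay
    audio = additive_synth(f0, amps, ratios, sr)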
Training and testing on the NSynth-subset dataset show that the model is highly effective at timbre transfer. In the resynthesis experiments, the proposed model matched DDSP closely on several metrics, achieving a loudness L1 distance of 0.08 and a fundamental frequency (F0) L1 distance of 0.02, demonstrating high accuracy and stability in timbre synthesis. Moreover, with only a slight increase in parameter count, the model also achieves Any-to-Any timbre transfer. Mean Opinion Score (MOS) tests were also conducted, in which listeners rated the audio generated by the model highly for both sound quality and timbre similarity. This thesis contributes an innovative high-fidelity timbre transfer model to the field of instrument timbre transfer and opens up possibilities for future improvements and applications.
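For context on how the two resynthesis metrics can be computed, here is a hedged sketch: both are mean absolute (L1) errors between feature tracks extracted from the source audio and the resynthesized audio. The feature-extraction details below are assumptions; pipelines in this line of work typically obtain F0 with a pitch tracker such as CREPE [13] and loudness as a frame-wise A-weighted log-power measure.

    import numpy as np

    def l1_distance(ref, est):
        """Mean absolute (L1) error between two equal-length feature tracks."""
        ref = np.asarray(ref, dtype=float)
        est = np.asarray(est, dtype=float)
        return float(np.mean(np.abs(ref - est)))

    # Hypothetical frame-wise feature tracks for source vs. resynthesized audio.
    # Real tracks would come from a pitch tracker (F0) and a loudness extractor.
    f0_src = np.array([440.0, 441.0, 439.5])
    f0_syn = np.array([440.2, 440.5, 439.0])
    loud_src = np.array([-30.0, -28.5, -29.0])
    loud_syn = np.array([-30.1, -28.4, -29.2])

    f0_l1 = l1_distance(f0_src, f0_syn)        # thesis reports 0.02 on its feature scale
    loud_l1 = l1_distance(loud_src, loud_syn)  # thesis reports 0.08 on its feature scale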
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
1.1 Research Background and Objectives
1.2 Thesis Organization
1.3 Research Contributions
Chapter 2: Related Work
2.1 Deep Learning Sound Generation Methods
2.1.1 WaveNet
2.1.2 WaveRNN
2.1.3 Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
2.1.4 Fast and Flexible Neural Audio Synthesis
2.1.5 HiFi-GAN
2.1.6 DDSP
2.2 Deep Learning Timbre Transfer Methods
2.2.1 A Universal Music Translation Network
2.2.2 Any-to-Any Voice Conversion with F0 and Timbre Disentanglement and Novel Timbre Conditioning
Chapter 3: GAN Design and Architecture for Instrument Timbre Transfer
3.1 GAN Design Method for Instrument Timbre Transfer
3.2 Source Encoder Method
3.2.1 F0 Encoder Design
3.2.2 Loudness Encoder Design
3.2.3 Energy Encoder Design
3.3 Timbre Encoder Method
3.4 Timbre Transfer Network Decoder Method
3.4.1 Linear Normalization Block and MLP Design
3.4.2 Timbre Z Modifier Design
3.4.3 Timbre Transformer Design
3.4.4 Integer Harmonic Head Design
3.4.5 Non-Integer Harmonic Head Design
3.4.6 Noise Head Design
3.5 Timbre Transfer Network Sound Synthesizer Method
3.5.1 Harmonic Oscillator Additive Synthesizer
3.5.2 Envelopes Design
3.5.3 Filter Design: Frequency Sampling Method
3.5.4 Filtered Noise Subtractive Synthesizer
3.6 Timbre Transfer Network Discriminator Architecture
3.7 Loss Functions
3.7.1 Adversarial Loss Function
3.7.2 Multiscale FFT Loss Function
3.7.3 Mel Spectrum Loss Function
3.7.4 Feature Matching Loss Function
3.7.5 KL Loss Function
3.7.6 Final Loss Function
Chapter 4: Sound Generation and Timbre Transfer Test Results
4.1 Datasets
4.1.1 NSynth Dataset
4.1.2 NSynth-subset Dataset
4.2 Evaluation Methods
4.2.1 F0 L1 Distance Evaluation Method
4.2.2 Loudness L1 Distance Evaluation Method
4.2.3 MOS Evaluation Method
4.3 Test Results and Analysis
4.3.1 Resynthesis Test Results
4.3.2 Non-Integer Harmonic Test Results
4.3.3 MOS Test Results
Chapter 5: Conclusion and Future Work
References
Appendix
Appendix 1: Oral Defense Committee Q&A


[1] A. van den Oord et al., “WaveNet: A Generative Model for Raw Audio.” arXiv, Sep. 19, 2016.
[2] N. Kalchbrenner et al., “Efficient Neural Audio Synthesis.” arXiv, Jun. 25, 2018.
[3] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.” arXiv, Oct. 23, 2020.
[4] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: Differentiable Digital Signal Processing.” arXiv, Jan. 14, 2020.
[5] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, “GANSynth: Adversarial Neural Audio Synthesis,” in International Conference on Learning Representations (ICLR), 2019.
[6] J. Engel et al., “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders.” arXiv, Apr. 05, 2017.
[7] L. Hantrakul, J. Engel, A. Roberts, and C. Gu, “Fast and Flexible Neural Audio Synthesis,” in International Society for Music Information Retrieval Conference (ISMIR), 2019.
[8] A. van den Oord et al., “Parallel WaveNet: Fast High-Fidelity Speech Synthesis.” arXiv, Nov. 28, 2017.
[9] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A Flow-based Generative Network for Speech Synthesis.” arXiv, Oct. 30, 2018.
[10] J.-M. Valin and J. Skoglund, “LPCNet: Improving Neural Speech Synthesis Through Linear Prediction.” arXiv, Feb. 19, 2019.
[11] N. Mor, L. Wolf, A. Polyak, and Y. Taigman, “A Universal Music Translation Network.” arXiv, May 23, 2018.
[12] S. Kovela, R. Valle, A. Dantrey, and B. Catanzaro, “Any-to-Any Voice Conversion with F0 and Timbre Disentanglement and Novel Timbre Conditioning,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece: IEEE, Jun. 2023, pp. 1–5.
[13] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “CREPE: A Convolutional Representation for Pitch Estimation.” arXiv, Feb. 16, 2018.
[14] B. Nguyen and F. Cardinaux, “NVC-Net: End-to-End Adversarial Voice Conversion.” arXiv, Jun. 02, 2021.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition.” arXiv, Dec. 10, 2015.
[16] A. Vaswani et al., “Attention Is All You Need.” arXiv, Aug. 01, 2023.
[17] R. Dey and F. M. Salem, “Gate-variants of Gated Recurrent Unit (GRU) neural networks,” in 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA: IEEE, Aug. 2017, pp. 1597–1600.
[18] K. Kumar et al., “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis.” arXiv, Dec. 08, 2019.
[19] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, “Least Squares Generative Adversarial Networks.” arXiv, Apr. 05, 2017.
[20] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in International Conference on Learning Representations (ICLR), 2014.
[21] I. J. Goodfellow et al., “Generative Adversarial Nets,” in Advances in Neural Information Processing Systems (NeurIPS), 2014.

