臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Detailed Record

Author: 楊崇文
Author (English): Yang, Chung-Wen
Title: 一個調整文字轉語音模型所產生之語音語速之系統
Title (English): A System for Modifying the Duration of Synthesized Speech from Text-To-Speech Models
Advisor: 賴槿峰
Advisor (English): Lai, Chin-Feng
Oral Defense Committee: 賴盈勳、陳世曄、蔡家緯、蘇育生
Oral Defense Committee (English): Lai, Ying-Hsun; Chen, Shih-Yeh; Tsai, Chia-Wei; Su, Yu-Sheng
Oral Defense Date: 2023-07-26
Degree: Master's
Institution: 國立成功大學 (National Cheng Kung University)
Department: 工程科學系 (Department of Engineering Science)
Discipline: Engineering
Field: General Engineering
Thesis Type: Academic thesis
Year of Publication: 2023
Graduation Academic Year: 111 (2022-2023)
Language: English
Number of Pages: 61
Keywords (Chinese): 文字轉語音、音訊時長控制、蒙特婁文字對齊器
Keywords (English): Text-To-Speech, Audio Time-Scale Modification, Montreal Forced Aligner
Statistics:
  • Cited: 0
  • Views: 81
  • Downloads: 15
  • Bookmarked: 0
Abstract (Chinese, translated): Over the past few years, text-to-speech (TTS) has attracted considerable research attention because of its wide range of applications. As TTS technology has developed, people expect the synthesized speech not only to be correct in content but also to sound highly natural, and one of the key factors affecting naturalness is the speaking rate. Most early TTS models were autoregressive: each frame of speech is generated conditioned on the previous frame. The major drawback of this approach is the lack of control over the duration of the synthesized speech. To gain such control, later TTS models adopted non-autoregressive architectures instead. However, non-autoregressive TTS models require a large amount of training data, which in turn demands substantial hardware resources and long training times, making them difficult to train. This thesis therefore proposes a system for modifying the duration of speech produced by a TTS model. The system consists of a forced aligner, a duration modifier, a duration-modification network, and a vocoder. The forced aligner locates the word boundaries in the speech, and the duration modifier converts the speech to the frequency domain and then adjusts the duration of the speech corresponding to each word according to those boundaries: blank frames are inserted to lengthen a segment, and frames are deleted to shorten it. The modified spectrogram is fed into the duration-modification network, which fills the blank frames with appropriate speech content and smooths out the frame-to-frame discontinuities caused by insertion and deletion. Finally, the vocoder converts the modified spectrogram back to the time domain and outputs the speech waveform. Experimental results show that speech whose duration is modified by the proposed system is comparable in quality to speech generated by non-autoregressive TTS models and close to the quality of speech recorded by a real human.
Abstract (English): In the past decades, synthesizing speech from text, also known as Text-to-Speech (TTS), has drawn great attention from researchers because it is applicable to a wide variety of applications. One of the factors that affects the prosody of synthesized speech is the speed at which it is spoken. Most earlier TTS models are based on an autoregressive mechanism that generates speech frame by frame. However, these autoregressive TTS models have a major drawback: they lack the ability to control the duration of the synthesized speech. To give TTS models this control, many non-autoregressive TTS models have been proposed that explicitly model the duration of the synthesized speech. However, compared with training an autoregressive TTS model, training a non-autoregressive TTS model requires a huge amount of data and computing power. Therefore, in this thesis, a system for modifying the duration of speech synthesized by a TTS model is proposed. The proposed system consists of a forced aligner, a duration modifier, a neural network named ATM-Net, and a vocoder. The forced aligner in the proposed system is adopted from the Montreal Forced Aligner with a pretrained Mandarin model. Once the boundary of each word/phoneme is determined, the duration modifier lengthens or shortens speech segments by inserting dummy frames into, or removing frames from, the mel spectrogram. The modified mel spectrogram is then fed into ATM-Net to fill in audio content and to smooth the discontinuities between frames. Finally, the output mel spectrogram is converted into an audio signal by a vocoder. Experiments show that the proposed system can modify the duration of speech and synthesize speech with natural prosody that is close to that of a real human.
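As a rough illustration of the duration-modification step described in the abstracts, the following Python/NumPy sketch lengthens a word's span in a mel spectrogram by appending blank (zero) frames and shortens it by dropping evenly spaced frames. This is not the author's implementation: the helper `modify_durations` and its interface are hypothetical, the word boundaries are assumed to already be available as mel-frame indices derived from the forced aligner's output, and exactly where frames are inserted or removed within a word is a simplification. In the proposed system, the blank frames would subsequently be filled in by ATM-Net before vocoding.

```python
import numpy as np

def modify_durations(mel, word_bounds, scales):
    """Stretch or compress per-word spans of a mel spectrogram.

    mel         : (n_mels, n_frames) array produced by a TTS model
    word_bounds : list of (start_frame, end_frame) pairs from the aligner
    scales      : per-word duration factors (>1 lengthens, <1 shortens)
    """
    n_mels = mel.shape[0]
    pieces, cursor = [], 0
    for (start, end), scale in zip(word_bounds, scales):
        pieces.append(mel[:, cursor:start])        # keep frames between words untouched
        seg = mel[:, start:end]
        target = max(1, int(round(seg.shape[1] * scale)))
        if target > seg.shape[1]:
            # lengthen: append blank (zero) frames for the network to fill in later
            pad = np.zeros((n_mels, target - seg.shape[1]), dtype=mel.dtype)
            seg = np.concatenate([seg, pad], axis=1)
        elif target < seg.shape[1]:
            # shorten: keep evenly spaced frames within the word, drop the rest
            keep = np.round(np.linspace(0, seg.shape[1] - 1, target)).astype(int)
            seg = seg[:, keep]
        pieces.append(seg)
        cursor = end
    pieces.append(mel[:, cursor:])                 # trailing frames after the last word
    return np.concatenate(pieces, axis=1)

# Toy usage: an 80-bin spectrogram with two "words"; slow the first to 1.5x, speed the second to 0.8x.
mel = np.random.rand(80, 200).astype(np.float32)
out = modify_durations(mel, [(20, 80), (90, 160)], [1.5, 0.8])
print(mel.shape, "->", out.shape)
```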
Abstract I
Acknowledgements III
Table of Contents IV
List of Tables VI
List of Figures VII
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 The Proposed Method 3
1.3 An Overview of the Thesis 5
Chapter 2 Related Works 6
2.1 Non-Autoregressive TTS 6
2.2 Audio Time-Scale Modification 10
2.3 Speech Segmentation 12
2.4 The Vocoder 14
Chapter 3 The Proposed System 19
3.1 The Forced Aligner 21
3.2 The Duration Modifier 23
3.3 The ATM-Net 26
3.3.1 The Model Architecture 26
3.3.2 The Two-Staged Training Method 31
3.4 The Vocoder 34
Chapter 4 Experimental Results 35
4.1 Data Preparation 35
4.2 Experimental Results 37
4.2.1 Shortening the Duration of an Audio Signal 39
4.2.2 Lengthening the Duration of an Audio Signal 41
4.2.3 Audio Signals with Duration Partially Modified 43
4.2.4 The Vocoders 44
4.3 Discussions 45
4.3.1 Discussions on the Masking Strategies 45
4.3.2 A Brief History of the ATM-Net 48
Chapter 5 Conclusion and Future Works 52
5.1 The Conclusion 52
5.2 The Future Works 54
References 55