



Author: Jung-yun Wu
Thesis title: Pitch Prediction Using Prosody Hierarchy and Dynamic Features for HMM-based Mandarin Speech Synthesis
Advisor: Chung-Hsien Wu
Keywords: Pitch; Prosody Hierarchy; Dynamic Feature; Speech Synthesis
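In this thesis, the pitch prediction model based on a hierarchical prosodic structure and dynamic features has four research focuses: (1) prediction and generation of the hierarchical prosodic structure; (2) introduction of the dynamic-feature parameter generation algorithm at each prosodic layer; (3) construction of the prosody model for each layer using classification and regression trees (CART) and hidden Markov models (HMMs); (4) parameter extraction using STRAIGHT (Speech Transformation and Representation based on Adaptive Interpolation of weiGHTed spectrogram).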
Prosody is a principal determinant of the naturalness of speech, and pitch is the key factor known to carry prosodic information. In recent years, speech synthesis based on hidden Markov models (HMMs) has been developed; it can synthesize smooth speech and has the advantages of flexibility and a small footprint. Nevertheless, there is still room for improvement in the naturalness of the synthesized speech. In our research, we take the prosody hierarchy structure as the basis of the pitch prediction model and apply dynamic features to the units of each hierarchical layer. Prosodic units are supra-segmental units that occur in a hierarchical structure and reflect how the brain processes speech; dynamic features preserve the temporal correlation between adjacent units and yield more natural transitions at each junction point. Applying this framework to an HMM-based speech synthesis system produces more natural-sounding speech.
The purpose of this thesis is to develop a pitch prediction model using the prosody hierarchy structure and dynamic features, and to investigate how much it improves the naturalness of synthesized speech. More specifically, the research covers: (1) prediction and generation of the prosody hierarchy structure; (2) dynamic features for each hierarchical layer; (3) a pitch prediction model for each layer, using CART for the prosodic-word and syllable levels and HMMs for the frame level; (4) feature analysis using STRAIGHT (Speech Transformation and Representation based on Adaptive Interpolation of weiGHTed spectrogram).
Subjective and objective tests comparing the proposed approach with other systems show that our scheme outperforms them and generates more natural-sounding speech.
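The idea of dynamic features in the abstract can be illustrated with a small sketch. It assumes the delta and delta-delta windows commonly used in HMM-based synthesis, (-0.5, 0, 0.5) and (1, -2, 1); the abstract does not state which windows the thesis uses, and the function name `dynamic_features` and the sample values are hypothetical. Each unit's static pitch value is paired with its local slope and curvature, which is how correlation with neighbouring units enters the model.

```python
def dynamic_features(values):
    """Return (static, delta, delta-delta) triples for a sequence of pitch values.

    Assumed windows (not taken from the thesis):
      delta       = 0.5 * (x[t+1] - x[t-1])
      delta-delta = x[t-1] - 2 * x[t] + x[t+1]
    Edges are handled by repeating the first and last value.
    """
    if not values:
        return []
    padded = [values[0]] + list(values) + [values[-1]]
    features = []
    for t in range(len(values)):
        prev, cur, nxt = padded[t], padded[t + 1], padded[t + 2]
        features.append((cur, 0.5 * (nxt - prev), prev - 2 * cur + nxt))
    return features


# Hypothetical per-syllable pitch values (arbitrary units).
syllable_pitch = [1, 3, 6, 10]
for static, delta, delta2 in dynamic_features(syllable_pitch):
    print(static, delta, delta2)
```

The same function could be applied to the values of any layer (prosodic word, syllable, or frame), since it only needs an ordered sequence of units.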
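Abstract
1 Introduction
1.1 Preface
1.1.1 Research background
1.1.2 Research motivation and objectives
1.1.3 Literature review
1.2 Overview of the research method
1.2.1 System architecture
1.3 Chapter overview
2 HMM-based Mandarin speech synthesizer
2.1 HMM-based speech synthesis system
2.2 Building the Mandarin HMM models
2.2.1 Mandarin phone models
2.2.2 Text analysis preprocessor
2.2.3 Question set for the state merging/splitting tree (decision tree)
2.3 Parameter extraction: STRAIGHT
3 Prosody hierarchy structure
3.1 Mandarin prosodic structure
3.2 Generation of the prosodic structure
3.2.1 Prosodic structure prediction model
3.2.2 Building the prediction model
4 Hierarchical pitch model
4.1 Hierarchical structure of pitch
4.2 Pitch quantization at each layer
4.2.1 Quantization model
4.2.2 Prosodic word layer
4.2.3 Syllable layer
4.2.4 Frame layer
4.3 Building the pitch model for each layer
4.3.1 Dynamic features
4.3.2 Prosodic word / syllable layers
4.3.3 Frame layer
4.4 Parameter generation algorithm
5 Experimental results and analysis
5.1 Experimental corpus
5.1.1 Tsinghua University (Beijing) corpus
5.1.2 Corpus configuration
5.2 Experiments and evaluation
5.2.1 Evaluation of the prosodic break prediction model
5.2.2 Evaluation of the pitch prediction model
    Weight settings
    Pitch contour comparison
    Objective/subjective evaluation
    Analysis and discussion
6 Conclusions and future work
6.1 Conclusions
6.2 Future work
References
Appendix
Author's biography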