National Digital Library of Theses and Dissertations in Taiwan

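Graduate student: 吳仲耘
Graduate student (English): Jung-yun Wu
Thesis title (Chinese): 應用韻律階層及動態參數之音高預測在基於HMM之中文語音合成器
Thesis title (English): Pitch Prediction Using Prosody Hierarchy and Dynamic Features for HMM-based Mandarin Speech Synthesis
Advisor: 吳宗憲
Advisor (English): Chung-Hsien Wu
Degree: Master's
Institution: National Cheng Kung University
Department: Department of Computer Science and Information Engineering (Master's and Doctoral Program)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2008
Academic year of graduation: 96 (ROC academic year)
Language: Chinese
Pages: 76
Keywords (translated from Chinese): speech synthesis, pitch, dynamic features, prosody hierarchy
Keywords (English): Pitch, Prosody Hierarchy, Dynamic Feature, Speech Synthesis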
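Prosody is a major factor in the naturalness of speech, and pitch carries a wealth of prosodic information. In recent years, speech synthesizers based on hidden Markov models (HMMs) have become able to produce fluent, intelligible speech, and the portability and adaptability of such systems are further advantages, but the naturalness of the speech still needs improvement. This study therefore takes a "hierarchical prosodic structure" as the basis for pitch prediction, with the prosodic units at each layer taking "dynamic feature characteristics" into account. The aim is to remedy the shortcomings of conventional prosody models, which synthesize from small units, and to use dynamic features to preserve temporal correlation so that units join more naturally, thereby improving the naturalness of HMM-based synthesized speech.
For the pitch prediction model using a hierarchical prosodic structure and dynamic features, this thesis has four research focuses: (1) prediction and generation of the hierarchical prosodic structure; (2) introduction of the dynamic-feature parameter generation algorithm at each prosodic layer; (3) construction of the prosody model for each layer using classification and regression trees and hidden Markov models; (4) feature extraction using STRAIGHT (Speech Transformation and Representation based on Adaptive Interpolation of weiGHTed spectrogram).
In the experiments, the prosodic break prediction model is first evaluated for prediction accuracy, and the pitch prediction model is then evaluated both subjectively and objectively. The results show that the proposed method performs well and improves the naturalness of the synthesized speech.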
Prosody is a principal determinant of the naturalness of speech, and pitch is a key carrier of prosodic information. In recent years, speech synthesis based on hidden Markov models (HMMs) has been developed; it can synthesize smooth speech and has the advantages of flexibility and a small footprint. Nevertheless, there is still room for improvement in the naturalness of the synthesized speech. In our research, we take the "prosody hierarchy structure" as the basis of the pitch prediction model and apply "dynamic features" to the units of each hierarchical layer. The prosody hierarchy describes prosodic units as supra-segmental units that occur in a hierarchical structure and reflect how the brain processes speech; the dynamic features preserve the time correlation between adjacent units and yield more natural connections at each junction point. Applying this framework to an HMM-based speech synthesis system produces more natural-sounding speech.
The purpose of this thesis is to develop a pitch prediction model using the prosody hierarchy structure and dynamic features and to investigate the resulting improvement in the naturalness of synthesized speech. More specifically, the research addresses: (1) prediction and generation of the prosody hierarchy structure; (2) dynamic features for each hierarchical layer; (3) construction of the pitch prediction model for each layer, using CART for the prosodic-word and syllable levels and HMMs for the frame level; (4) feature analysis using STRAIGHT (Speech Transformation and Representation based on Adaptive Interpolation of weiGHTed spectrogram).
Experimental results from both subjective and objective tests, comparing the proposed approach with other systems, show that our scheme outperforms the comparison systems and generates more natural-sounding speech.
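The role of dynamic features described in the abstract can be illustrated with a small sketch. The code below is not taken from the thesis: it assumes a single first-order delta window `[-0.5, 0, 0.5]` and diagonal variances, as is common in HMM-based synthesis, and solves the maximum-likelihood static trajectory in closed form. The names `WINDOWS`, `build_window_matrix`, and `mlpg`, and the example log-F0 values, are hypothetical; the windows, layers, and weights actually used in the thesis may differ.

```python
import numpy as np

# Assumed windows (not from the thesis): static, and a first-order delta.
WINDOWS = [np.array([1.0]), np.array([-0.5, 0.0, 0.5])]


def build_window_matrix(num_frames, windows):
    """Build W so that W @ c lists, frame by frame, the static value of
    trajectory c followed by its delta."""
    rows = []
    for t in range(num_frames):
        for w in windows:
            row = np.zeros(num_frames)
            half = len(w) // 2
            for k, coef in enumerate(w):
                # Clamp at the edges by reusing the first/last frame.
                idx = min(max(t + k - half, 0), num_frames - 1)
                row[idx] += coef
            rows.append(row)
    return np.array(rows)


def mlpg(means, variances, windows):
    """Return the static trajectory c that maximizes the diagonal-Gaussian
    likelihood of W @ c.

    means, variances: arrays of shape (num_frames, len(windows)).
    """
    num_frames = means.shape[0]
    W = build_window_matrix(num_frames, windows)
    precision = 1.0 / variances.reshape(-1)
    # Normal equations: (W^T P W) c = W^T P mu, with P = diag(precision).
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * means.reshape(-1))
    return np.linalg.solve(A, b)


if __name__ == "__main__":
    # Hypothetical log-F0 targets with a step between two units.
    static_mean = np.array([5.0, 5.0, 5.0, 5.3, 5.3, 5.3])
    means = np.stack([static_mean, np.zeros_like(static_mean)], axis=1)
    variances = np.stack([np.full(6, 0.01), np.full(6, 0.001)], axis=1)
    print(mlpg(means, variances, WINDOWS))
```

With a small delta variance, the solved contour should change gradually across the step rather than jump, which is the kind of smoother connection between units that the dynamic features are meant to provide.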
Chinese Abstract
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
1.1 Foreword
1.1.1 Research Background
1.1.2 Motivation and Objectives
1.1.3 Literature Review
1.2 Overview of the Research Method
1.2.1 System Architecture
1.3 Chapter Outline
2 HMM-based Mandarin Speech Synthesizer
2.1 HMM-based Speech Synthesis System
2.2 Building the Mandarin HMM Models
2.2.1 Mandarin Phone Models
2.2.2 Text Analysis Preprocessor
2.2.3 Question Set for the State Merging/Splitting Tree (Decision Tree)
2.3 Feature Extraction: STRAIGHT
3 Prosody Hierarchy Structure
3.1 Mandarin Prosodic Structure
3.2 Generation of the Prosodic Structure
3.2.1 Prosodic Structure Prediction Model
3.2.2 Building the Prediction Model
4 Hierarchical Pitch Model
4.1 Hierarchical Structure of Pitch
4.2 Pitch Quantization at Each Layer
4.2.1 Quantization Model
4.2.2 Prosodic Word Layer
4.2.3 Syllable Layer
4.2.4 Frame Layer
4.3 Building the Pitch Model for Each Layer
4.3.1 Dynamic Features
4.3.2 Prosodic Word / Syllable Layers
4.3.3 Frame Layer
4.4 Parameter Generation Algorithm
5 Experimental Results and Analysis
5.1 Experimental Corpus
5.1.1 Tsinghua University (Beijing) Corpus
5.1.2 Corpus Settings
5.2 Experiments and Evaluation
5.2.1 Evaluation of the Prosodic Break Prediction Model
5.2.2 Evaluation of the Pitch Prediction Model
5.2.2.1 Weight Settings
5.2.2.2 Comparison of Pitch Contours
5.2.2.3 Objective/Subjective Evaluation
5.2.2.4 Analysis and Discussion
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
Appendix
Author Biography