|
This dissertation proposes approaches to synthesis unit selection and prosodic information generation for Chinese text-to-speech (TTS) conversion. Monosyllables are adopted as the basic synthesis units. A set of synthesis units is selected from a large continuous speech database using two cost functions that minimize inter- and intra-syllable distortion. The same speech database is used to build a word-prosody-based template tree organized by four linguistic features: tone combination, word length, part of speech (POS) of the word, and word position in the sentence. The template tree stores the prosodic features of a word (pitch contour, average energy, and syllable duration) for the possible combinations of these linguistic features. Two modules, one for sentence intonation and one for template selection, are proposed to generate the target prosodic templates. In addition, a Bayesian network is used to model the relationship between linguistic features and prosodic information. Finally, a Speech Activated Telephony Email Reader (SATER) is proposed. SATER is an integrated system combining speaker verification, networking, and text-to-speech conversion. A registered user can activate the system and listen to his or her email over a wired or wireless telephone. The speaker verification subsystem adopts time-varying verification phrases, which are generated from the speaker's password. Each verification phrase is modeled by a hidden Markov model (HMM) with a variable number of states. Experimental results for the TTS conversion system showed that the synthesized prosodic features closely resembled their original counterparts for most syllables in the inside test. Subjective evaluation also confirmed the satisfactory performance of these approaches.
|