Author (English): Shih-Cheng Chang
Thesis title (English): Emotional Voice Conversion Using Prosodic and Spectral Features
Advisor (English): Hong-Yen Gu
Oral defense committee (English): Hsin-Min Wang, Ming-Shing Yu, Bor-Shen Lin
Keywords (English): emotional speech; voice conversion; segmental prosodic features; spectral GMM; F0 GMM; dynamic speech duration adjusting
In this thesis, conversion methods for three prosodic features (pitch contour, duration, and intensity) are studied, and an emotional voice conversion system is constructed that converts neutral input speech into speech with an angry, happy, or sad emotion. In the training stage, an F0 GMM and a spectrum GMM are trained for each of the three target emotions, using the corresponding parallel corpus of 120 sentences. Following sentence segmentation rules, the mean and standard deviation of each prosodic feature are measured across sentences for each of three segments, and this measurement is made separately for each target emotion's training sentences. In the conversion stage, the pitch contour and discrete cepstrum coefficients (DCC) of the neutral input speech are mapped to those of a specified target emotion by the corresponding F0 and spectrum GMMs. We find that the pitch contour obtained from the F0 GMM contains fluctuations, so we study how to reduce them with median smoothing and moving-average processing. Next, using the segmental tables of statistical parameters obtained in the training stage, the three prosodic features (pitch contour, duration, and intensity) are converted by segmental standard deviation matching (SSDM). To bring the emotion expressed in the converted speech closer to the target emotion, we propose a dynamic speech duration adjusting method in which the duration of each frame is determined dynamically according to its energy ratio.
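The smoothing and statistics-matching steps described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, window widths, and the exact matching formula (shifting and scaling a value by source and target mean and standard deviation) are assumptions made for illustration, and the handling of unvoiced frames and the three-segment lookup are left out.

```python
import numpy as np


def median_smooth(f0, width=5):
    """Sliding-window median over a pitch contour (width assumed odd).

    The contour is edge-padded so the output has the same length as the input.
    """
    half = width // 2
    padded = np.pad(f0, half, mode="edge")
    return np.array([np.median(padded[i:i + width]) for i in range(len(f0))])


def moving_average(f0, width=5):
    """Sliding-window mean over a pitch contour (width assumed odd)."""
    half = width // 2
    padded = np.pad(f0, half, mode="edge")
    kernel = np.ones(width) / width
    return np.convolve(padded, kernel, mode="valid")


def match_mean_std(x, src_mean, src_std, tgt_mean, tgt_std):
    """Map values from source statistics to target statistics.

    Hypothetical stand-in for a standard-deviation-matching step: each value
    keeps its z-score but is re-expressed with the target mean and deviation.
    """
    return tgt_mean + (x - src_mean) * (tgt_std / src_std)


# Illustrative values only (Hz); not data from the thesis.
contour = np.array([200.0, 204.0, 260.0, 201.0, 198.0, 150.0, 199.0])
smoothed = moving_average(median_smooth(contour, width=3), width=3)
converted = match_mean_std(smoothed, src_mean=200.0, src_std=10.0,
                           tgt_mean=240.0, tgt_std=25.0)
```

Applying the median filter before the moving average is one plausible ordering: the median removes isolated spikes first, so the average does not smear them into neighboring frames.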
To evaluate the performance of our emotional voice conversion system, we conducted two subjective listening tests. The first test compares the emotional expression of speech converted by two conversion methods. Our method received 95% of the votes for the angry emotion, 65% for the happy emotion, and 67.5% for the sad emotion. In the second test, each participant was asked to identify the emotion expressed in the speech played to them. The recognition rates obtained by our conversion method are 87.5% for the angry emotion, 61.3% for the happy emotion, and 77.5% for the sad emotion. These results indicate that the emotional voice conversion system using the studied conversion method is effective in converting neutral speech into speech of a specified target emotion.
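Abstract
Acknowledgments
Table of Contents
List of Figures and Tables
Chapter 1 Introduction
1.1 Research Motivation
1.2 Literature Review
1.2.1 Spectral Feature Coefficients
1.2.2 Conversion Methods for Prosodic Parameters
1.2.3 Methods for Emotional Voice Conversion
1.3 Research Method
1.3.1 Training Stage of the Emotional Voice Conversion System
1.3.2 Conversion Stage of the Emotional Voice Conversion System
1.4 Thesis Organization
Chapter 2 Corpus Preparation and Emotion Parameter Estimation
2.1 Corpus Recording
2.2 Labeling
2.3 Discrete Cepstrum Coefficient Estimation
2.4 Estimation of Frame Pitch, Intensity, and Duration Coefficients
2.5 DTW Frame Alignment
Chapter 3 Training of the Emotional Voice Conversion Models
3.1 Gaussian Mixture Model (GMM)
3.2 GMM Training Method
3.3 Training of the Pitch GMM
3.4 Training of the Spectral GMM
3.5 Segmental Statistics Tables of Prosodic Parameters
Chapter 4 Emotional Voice Conversion Methods
4.1 Spectral Coefficient Conversion Method
4.2 Prosodic Parameter Conversion Methods
4.2.1 Pitch Conversion Method
4.2.2 Intensity Conversion Method
4.2.3 Duration Conversion Method
Chapter 5 System Integration and Experiments
5.1 Program Implementation and System Interface
5.2 HNM Signal Synthesis
5.3 Pitch Conversion Experiment
5.4 Intensity Conversion Experiment
5.5 Duration Conversion Experiment
5.6 Spectral Distance Measurement Experiment
5.7 Listening Tests
5.7.1 Listening Test 1: Emotional Speech Comparison
5.7.2 Listening Test 2: Emotional Speech Identification
Chapter 6 Conclusion
References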
