臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)


Detailed Record

Author: 黃姿云
Author (English): Huang, Tzu-Yun
Title (Chinese): 雙重互補聲學嵌入網絡:從原始波形挖掘與特徵集相異的語音情感識別特徵
Title (English): A Dual Complementary Acoustic Embedding Network: Mining Discriminative Characteristics from Raw-waveform for Speech Emotion Recognition
Advisor: 李祈均
Advisor (English): Lee, Chi-Chun
Committee Members: 曹昱、李宏毅、陳冠宇
Committee Members (English): Tsao, Yu; Lee, Hung-Yi; Chen, Kuan-Yu
Defense Date: 2019-02-15
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2019
Graduating Academic Year: 107
Language: English
Number of Pages: 43
Keywords (Chinese): 語音情緒辨識、原始波型、端對端學習、聲學空間擴大
Keywords (English): speech emotion recognition, raw waveform, end-to-end learning, acoustic space augmentation
Statistics:
  • Cited: 0
  • Views: 319
  • Rating:
  • Downloads: 0
  • Bookmarked: 0
Speech emotion recognition has recently become a trend across a wide range of fields and has achieved impressive performance with deep learning. However, end-to-end learning on the complex and time-varying raw waveform still struggles to surpass finely designed hand-crafted feature sets. A feature-space augmentation approach driven by complementary feature elicitation can leverage the discriminative power of both sides simultaneously. In this study, we propose a Dual Complementary Acoustic Embedding Network (DCaEN) that jointly models hand-crafted features and the raw waveform to improve emotion recognition. Starting from an expert-designed acoustic feature set for affective computing, we constrain the cosine similarity between the two embeddings to be optimized toward negative values; this complementary constraint drives the network to mine additional information from the raw waveform that the feature set does not capture.
Experiments on categorical emotion recognition with the IEMOCAP and MSP-IMPROV databases show that the proposed model reaches accuracies of 59.31% and 46.22%, respectively, outperforming networks that use either the raw waveform or the feature set alone. Moreover, a visualization analysis of the learned complementary space further illustrates the effect of the complementary constraint.
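To make the complementary constraint concrete, the sketch below shows one way the loss described in the abstract could be written. It is a minimal, hypothetical PyTorch-style example, not the thesis's actual implementation: the class name ComplementaryLoss, the weighting factor lambda_comp, the embedding dimension, and the number of emotion classes are all illustrative assumptions. It combines a cross-entropy term on the emotion labels with a term that pushes the cosine similarity between the hand-crafted-feature embedding and the raw-waveform embedding toward negative values.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ComplementaryLoss(nn.Module):
    # Cross-entropy on the emotion labels plus a penalty that pushes the
    # cosine similarity between the feature-set embedding and the
    # raw-waveform embedding toward negative values (the complementary
    # constraint described in the abstract).
    def __init__(self, lambda_comp: float = 0.5):
        super().__init__()
        self.lambda_comp = lambda_comp  # illustrative weighting, not from the thesis

    def forward(self, logits, labels, feat_embed, raw_embed):
        ce = F.cross_entropy(logits, labels)
        # cosine_similarity lies in [-1, 1]; adding it to the loss means
        # minimizing it, i.e. driving the two embeddings to disagree.
        cos = F.cosine_similarity(feat_embed, raw_embed, dim=-1).mean()
        return ce + self.lambda_comp * cos

# Toy usage with dummy tensors: 4 emotion classes, 128-dimensional embeddings.
criterion = ComplementaryLoss(lambda_comp=0.5)
logits = torch.randn(8, 4, requires_grad=True)        # classifier outputs
labels = torch.randint(0, 4, (8,))                     # emotion labels
feat_embed = torch.randn(8, 128, requires_grad=True)   # hand-crafted feature branch (e.g. eGeMAPS)
raw_embed = torch.randn(8, 128, requires_grad=True)    # raw-waveform branch
loss = criterion(logits, labels, feat_embed, raw_embed)
loss.backward()

As the abstract states, the goal is negative rather than merely zero similarity between the two embeddings; in practice the balance between the classification term and the complementary term would need to be tuned.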
ACKNOWLEDGEMENTS I
ABSTRACT (CHINESE) II
ABSTRACT III
CONTENTS IV
CHAPTER 1 INTRODUCTION 1
CHAPTER 2 DATABASE AND FEATURES 5
2.1 DATABASE 5
2.1.1 IEMOCAP 5
2.1.2 MSP-IMPROV 6
2.2 FEATURES 8
2.2.1 eGeMAPS 8
2.2.2 emobase 2010 9
2.2.3 Preprocessing on raw waveform 9
CHAPTER 3 RESEARCH METHODOLOGY 10
3.1 NEURAL NETWORK 10
3.1.1 Deep Neural Network (DNN) 12
3.1.2 Convolutional Neural Network (CNN) 12
3.1.3 Long Short-Term Memory (LSTM) 15
3.1.4 Attention Mechanism 18
3.2 DUAL COMPLEMENTARY ACOUSTIC EMBEDDING NETWORK (DCAEN) 20
3.2.1 Feature Network (Stage 1) 21
3.2.2 Raw Waveform Complementary Network (Stage 2) 21
CHAPTER 4 EXPERIMENTS AND RESULTS 24
4.1 EXPERIMENT SETTING 24
4.2 NETWORK CONFIGURATION 24
4.3 RECOGNITION RESULTS ON IEMOCAP 26
4.3.1 eGeMAPS 26
4.3.1.1 Comparison Models 26
4.3.1.2 Results 27
4.3.2 emobase 2010 29
4.4 RECOGNITION RESULTS ON MSP-IMPROV 30
4.4.1 eGeMAPS 30
4.4.2 emobase 2010 31
4.5 COMPLEMENTARY LEVEL ANALYSIS 32
4.5.1 Analysis on IEMOCAP 33
4.5.2 Analysis on MSP-IMPROV 34
CHAPTER 5 CONCLUSION 37
REFERENCES 38