National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)


Detailed Record

Author: 吳博元
Author (English): Po-Yuan Wu
Title (Chinese): 透過文本內文、語音和說話者身分之三模態線索生成用於健康照護機器人的對話手勢
Title (English): Generation of Co-Speech Gestures of a Health Care Robot from Trimodal Cues: Contents of Text, Speech, and Speaker Identity
Advisor: 范欽雄
Advisor (English): Chin-Shyurng Fahn
Committee Members: 馮輝文, 王榮華, 鄭為民
Oral Defense Date: 2023-01-17
Degree: Master's
Institution: National Taiwan University of Science and Technology
Department: Department of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2023
Graduation Academic Year: 111
Language: English
Pages: 47
Keywords (Chinese): 深度學習、三模態線索生成手勢、TED手勢資料集、生成對抗網路、照護機器人
Keywords (English): deep learning; trimodal cues generating gestures; TED gesture dataset; generative adversarial networks; care robots
Statistics:
  • Cited by: 0
  • Views: 28
  • Downloads: 0
  • Bookmarked: 0
Abstract (translated from Chinese):
In recent years, population aging has become an issue faced by countries around the world, and policies for elderly care have gradually gained attention. This is especially true in Taiwan, where life expectancy exceeds the global average, which makes elderly care all the more important. Because the caregiving workforce cannot keep up with the proportion of elderly people who require long-term care, the development of care robots has been pushed forward. The purpose and motivation of this thesis is to use deep learning to train a model on the speech audio, content, and motions of international speakers, and to use it to generate the body gestures of a care robot while it speaks. Given audio and text content as input, the trained model generates the positions of the corresponding robot joints, so that the robot can express what it says more vividly when interacting with the elderly.
Existing studies that generate gestures with deep learning methods are few. Some process audio, extracting features with traditional methods or convolutional neural networks; others extract features through semantic analysis of the speech content and then generate gestures with a long short-term memory architecture; very few use generative adversarial networks to generate gestures. We instead generate gestures from three modal cues: audio, speech content, and speaker identity. We train three different neural networks to extract the features of each modality, and then build a generative adversarial network in which the generator produces gestures from the extracted features and the discriminator judges whether a gesture is generated or real. Through this adversarial training, the generator learns to produce near-realistic gestures.
For the experimental results, we observe and analyze the generated gestures against the real gestures using three evaluation metrics, namely the mean absolute error (MAE) of joint positions, the mean acceleration distance (MAD), and the accelerated Fréchet gesture distance (FGD), and compare our model with several existing outstanding gesture generation models. On the TED gesture dataset, the experiments show that, compared with the state-of-the-art gesture generation model Sp2AG, our generative model reduces the position error, the MAE of the joint points, by 7.88%, the distance offset, the MAD of the gestures, by 10.23%, and the accelerated Fréchet gesture distance (FGD) by 8.75%.
Abstract (English):
In recent years, population aging has become an issue in many countries around the world, and elderly care policies have received growing attention. In particular, life expectancy in Taiwan is higher than that of most countries. The development of care robots has been driven by the proportion of the elderly who need long-term care. The purpose and motivation of this thesis is to build a deep learning model and train it using trimodal cues: the contents of text, speech, and speaker identity. The trained model can generate corresponding gestures from these inputs, so that a care robot can express gestures more vividly during human-robot interaction.
So far, there are not many studies on generating robot gestures with deep learning methods. In data preprocessing, one group of studies only extracts audio features, while another only analyzes the semantics of the text content, and the model architecture is usually built on long short-term memory networks. In this thesis, we propose a generative model based on generative adversarial networks that generates gestures from the trimodal cues. We train three different neural networks to extract the features of each modality separately, and then build a generative adversarial network in which the generator produces gestures from the extracted features and the discriminator identifies whether a gesture is generated or real. After training, the generator produces near-realistic gestures.
From the experimental results, we observe and analyze the differences between the generated gestures and the actual gestures using three evaluation metrics: the mean absolute error (MAE) of joint coordinates, the mean acceleration distance (MAD) of gestures, and the accelerated Fréchet gesture distance (FGD), and then compare the performance with several existing outstanding generative models. Compared with Sp2AG, the state-of-the-art gesture generation model, our model reduces the position error, the MAE of the joint coordinates, by 7.88%, the distance offset, the MAD of the gestures, by 10.23%, and the accelerated Fréchet gesture distance (FGD) by 8.75%.
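
As a rough sketch of the kind of pipeline the abstract describes (three modality encoders whose fused features drive a generator, while a discriminator judges real versus generated pose sequences), the following PyTorch code may help. All layer choices, dimensions, class names, and the joint count are assumptions for illustration, and text and audio are assumed to be pre-aligned to the same number of time steps; this is not the network used in the thesis.

    # Illustrative sketch only (PyTorch); dimensions and names are assumptions, not the thesis's model.
    import torch
    import torch.nn as nn

    class TrimodalEncoder(nn.Module):
        """Fuses text, audio, and speaker-identity features into one sequence per time step."""
        def __init__(self, vocab_size=20000, n_speakers=1000, d_model=128):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, d_model)              # text tokens -> vectors
            self.text_rnn = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
            self.audio_net = nn.Sequential(                                # 80-bin mel frames -> features
                nn.Conv1d(80, d_model, kernel_size=3, padding=1), nn.ReLU())
            self.speaker_emb = nn.Embedding(n_speakers, d_model)           # identity -> style vector

        def forward(self, tokens, mel, speaker_id):
            # tokens: (B, T), mel: (B, 80, T), speaker_id: (B,)
            text_feat, _ = self.text_rnn(self.word_emb(tokens))            # (B, T, 2*d_model)
            audio_feat = self.audio_net(mel).transpose(1, 2)               # (B, T, d_model)
            style = self.speaker_emb(speaker_id).unsqueeze(1)              # (B, 1, d_model)
            style = style.expand(-1, tokens.size(1), -1)                   # repeat over time
            return torch.cat([text_feat, audio_feat, style], dim=-1)       # (B, T, 4*d_model)

    class Generator(nn.Module):
        """Maps the fused trimodal features to a sequence of joint coordinates."""
        def __init__(self, d_in=512, d_hidden=256, n_joints=10):
            super().__init__()
            self.rnn = nn.GRU(d_in, d_hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * d_hidden, n_joints * 3)               # (x, y, z) per joint

        def forward(self, feats):
            h, _ = self.rnn(feats)
            return self.out(h)                                             # (B, T, n_joints*3)

    class Discriminator(nn.Module):
        """Scores a pose sequence as real or generated."""
        def __init__(self, n_joints=10, d_hidden=256):
            super().__init__()
            self.rnn = nn.GRU(n_joints * 3, d_hidden, batch_first=True)
            self.out = nn.Linear(d_hidden, 1)

        def forward(self, poses):
            _, last = self.rnn(poses)                                      # last hidden state: (1, B, d_hidden)
            return self.out(last.squeeze(0))                               # (B, 1) real/fake logit

In an adversarial training loop of this sort, the generator would be updated both to fool the discriminator and to stay close to the ground-truth poses, which is the standard way near-realistic gestures are obtained from a GAN.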
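
On the evaluation side, MAE and MAD can be computed directly from pose sequences, while FGD additionally requires a pretrained feature encoder and is not shown here. Below is a minimal NumPy sketch, assuming poses are stored as (frames, joints, 3) arrays and using a hypothetical frame rate; the function names are illustrative and the thesis may define these metrics slightly differently.

    # Minimal metric sketch (NumPy); function names and frame rate are illustrative assumptions.
    import numpy as np

    def joint_mae(pred, real):
        """Mean absolute error between predicted and ground-truth joint coordinates."""
        return float(np.mean(np.abs(pred - real)))

    def mean_acceleration_distance(pred, real, fps=15):
        """Distance between the second-order differences (accelerations) of two pose sequences."""
        dt = 1.0 / fps
        acc_pred = np.diff(pred, n=2, axis=0) / dt ** 2      # (T-2, n_joints, 3)
        acc_real = np.diff(real, n=2, axis=0) / dt ** 2
        return float(np.mean(np.linalg.norm(acc_pred - acc_real, axis=-1)))

    # Toy usage with random 34-frame, 10-joint sequences.
    pred = np.random.randn(34, 10, 3)
    real = np.random.randn(34, 10, 3)
    print(joint_mae(pred, real), mean_acceleration_distance(pred, real))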
Contents
摘要 i
Abstract ii
中文致謝 iv
Contents v
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 System Description 3
1.4 Thesis Organization 5
Chapter 2 Related Work 6
2.1 Literature Review 6
2.2 Deep Neural Networks 7
2.2.1 Multi-Layer Perceptron 8
2.2.2 Recurrent Neural Networks 9
2.2.3 Self-Attention Mechanism 10
2.2.4 Transformer 11
2.2.5 Squeeze and Excitation Networks 13
Chapter 3 Deep-Learning-Based Gesture Generation Method 15
3.1 Data Preprocessing 15
3.1.1 Text Processing 15
3.1.2 Audio Processing 17
3.1.3 Speaker Identity Style Sampling 17
3.2 Generative Adversarial Networks Model 19
3.2.1 Network Architecture 19
3.2.2 Bi-GRU Block Architecture 20
3.2.3 Searching for Weighted Coefficients in Losses 21
Chapter 4 Experimental Results and Discussion 23
4.1 Datasets 23
4.2 Experimental Environment and Training Detail 24
4.3 Data Visualization 26
4.4 Training and Validating Result and Analysis 27
4.5 Testing Result and Analysis 35
4.6 Comparison with Baseline Methods 41
Chapter 5 Conclusions and Future Work 43
5.1 Conclusions and Contributions 43
5.2 Future Work 44
References 45