National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Researcher: 徐宇霆
Researcher (English): Yu-Ting Hsu
Thesis title: 多模態知識圖像描述系統於服務型機器人之應用
Thesis title (English): Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications
Advisor: 羅仁權
Oral defense committee: 張帆人, 顏炳郎
Defense date: 2019-07-29
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Electrical Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Publication year: 2019
Graduation academic year: 107
Language: English
Pages: 80
Keywords: image caption, identity recognition, emotion recognition, object detection, deep learning
DOI: 10.6342/NTU201902530
Statistics:
  • Cited by: 0
  • Views: 222
  • Downloads: 0
  • Bookmarked: 1
Abstract (Chinese):
Service robots are the future direction of robotics and artificial intelligence. In the past, industrial robots combined with artificial intelligence have helped factories automate many tasks. Before robots truly enter people's daily lives, however, the level of intelligence a service robot possesses remains a major challenge. Deep learning is an indispensable technique for bringing robot intelligence up to that level. It has flourished in recent years, including CNNs (convolutional neural networks) for vision and RNNs (recurrent neural networks) for language processing.
Image captioning, which integrates these two techniques, is a representative step toward this kind of intelligence. Given an input image, an image caption model outputs a sentence describing its content, much like a person continuously narrating the scene in front of them. Although this capability has applications in other areas, such as image retrieval and image indexing, it has not yet been truly realized on service robots, for two reasons. First, the scope covered by current image caption models is too broad: well-known caption datasets such as MSCOCO, Flickr8k, and Flickr30k contain many natural landscapes, hand-drawn pictures, abstract paintings, and other scenes rarely encountered in daily life, so a service robot may say the wrong thing because of the dataset it was trained on. Second, because these datasets cover such broad content, their labels can only use very common, generic wording, whereas a service robot must possess knowledge specific to its service environment, such as the people, events, and objects in it. A general image caption model therefore cannot incorporate that domain-specific knowledge.
The goal of this thesis is to bring image captioning to a real service robot, focusing on robots used for patrol. That is, the robot describes the images it sees and reports the descriptions to an administrator, making supervision easier. To achieve this, we identify the information the administrator needs, such as the objects contained in an image; if the image contains a person, the administrator also needs that person's identity and state, where the state comprises emotion and behavior. This thesis therefore proposes three methods to integrate an image caption model with object recognition for a specific environment. The caption model is further integrated with face recognition and emotion recognition, so that the patrol robot can report a person's identity and emotional state. The robot is also equipped with semantic localization, giving the administrator more comprehensive information.
Experiments show that our informative image caption system achieves higher object recognition accuracy than an MSCOCO pre-trained image caption model, and higher face recognition and emotion recognition accuracy than an image caption model that is only fine-tuned.
Abstract (English):
Service robots are a trend in the field of robotics and artificial intelligence. In the past, industrial robots equipped with artificial intelligence helped automate factory work. As robots gradually come into daily life, it is a big challenge to equip them with sufficient intelligence. To make this possible, deep learning is an essential technique. Deep learning has become popular in recent years, including CNNs (Convolutional Neural Networks) for image processing and RNNs (Recurrent Neural Networks) for natural language processing.
A more advanced function, image captioning, combines the techniques of CNN and RNN: given an image, the model generates a sentence describing it, much as a person would. Although image captioning can be used in image retrieval, image indexing, and similar applications, it cannot be applied directly on a service robot for two main reasons.
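As a minimal, illustrative sketch of this encoder-decoder idea (not the thesis's implementation), a pretrained CNN can encode the image into a feature vector that is fed as the first input to an LSTM decoder, which then emits the caption word by word. The ResNet-18 backbone, layer sizes, and vocabulary size below are assumptions made for the example.

```python
# Minimal CNN-encoder / RNN-decoder caption sketch (PyTorch); sizes are illustrative.
import torch
import torch.nn as nn
import torchvision.models as models


class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: a ResNet with its classifier removed, giving one feature vector per image.
        resnet = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.feat_proj = nn.Linear(resnet.fc.in_features, embed_dim)
        # RNN (LSTM) decoder: generates the caption one word at a time.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Encode the image and prepend it as the first "token" of the word sequence.
        feats = self.encoder(images).flatten(1)         # (B, 512)
        feats = self.feat_proj(feats).unsqueeze(1)      # (B, 1, embed_dim)
        words = self.embed(captions)                    # (B, T, embed_dim)
        inputs = torch.cat([feats, words], dim=1)       # (B, T+1, embed_dim)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                         # per-step word logits


# Usage: logits = CaptionModel(vocab_size=5000)(images, caption_token_ids)
```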
First, the image caption models proposed in recent works are trained on well-known public datasets such as MSCOCO or Flickr. These datasets gather images from a broad variety of domains, including hand-drawn pictures, natural scenery, and paintings that are rarely seen in daily life. A robot equipped with such a model may therefore sometimes generate these unusual sentences even when it does not see the corresponding scenes. Second, a service robot usually serves in a specific environment, so it should be equipped with specific knowledge about the objects and people in that environment. Unfortunately, public, general-purpose datasets do not contain that knowledge.
The purpose of this work is to ground image captioning on a real service robot, focusing on robots for patrol. In other words, the robot should generate a caption about what it sees and read that sentence to the guard in the remote control room. For this purpose, we need to know what information the guard wants. For example, if there is a person in the image, the guard may want to know the person's identity and state, where the state includes emotion and behavior. In this work, the author proposes three methodologies to combine an image caption model with recognition of specific objects, so that the output sentence contains knowledge about those objects. This image caption model is then combined with a face recognition model and an emotion classification model so that the robot can also report the person's identity and emotion. Furthermore, the robot is equipped with semantic localization to give the guard more comprehensive information about the scene it sees.
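As a rough illustration of how identity, emotion, and semantic location can be folded into the generated sentence, the sketch below uses a simple template-style substitution. The data class, recognizer outputs, and replacement rule are hypothetical placeholders and do not reproduce the thesis's region-based injection or template-based appending methods in detail.

```python
# Hypothetical sketch of human-aware context generation: merge face-recognition,
# emotion-classification, and semantic-localization outputs into a base caption.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonObservation:
    identity: str   # e.g. predicted by a face recognition model
    emotion: str    # e.g. predicted by a facial expression classifier


def humanize_caption(base_caption: str,
                     person: Optional[PersonObservation] = None,
                     location: Optional[str] = None) -> str:
    """Inject identity, emotion, and semantic location into a base caption."""
    caption = base_caption
    if person is not None:
        # Replace the generic subject with the person's identity and emotional state.
        caption = caption.replace(
            "a person", f"{person.identity}, who looks {person.emotion},", 1)
    if location is not None:
        # Append the semantic location reported by the localization module.
        caption += f" near the {location}"
    return caption


# A report the patrol robot might read to the guard:
obs = PersonObservation(identity="Alice", emotion="happy")
print(humanize_caption("a person is walking in the hallway", obs, "main entrance"))
# -> Alice, who looks happy, is walking in the hallway near the main entrance
```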
From the experiments, we conclude that our informative image caption system outperforms the MSCOCO-pretrained image caption model, with higher object recognition accuracy. Our model also achieves higher face recognition and emotion recognition rates than the fine-tuned model.
Acknowledgements i
Chinese Abstract ii
ABSTRACT iv
LIST OF FIGURES ix
LIST OF TABLES xii
Chapter 1 Introduction 1
1.1 Background 1
1.1.1 Service Robot 1
1.1.2 Image Caption on Service Robotics 2
1.2 Literature Review on Image Caption 3
1.2.1 Early Work 3
1.2.2 Deep Learning Based Method 4
1.2.3 Attention Mechanism 5
1.3 Motivation and Objectives 8
Chapter 2 Multi-Modal Knowledge Image Caption 12
2.1 Image Caption with object recognition 13
2.1.1 Encoder-Decoder Formulation 13
2.1.2 Fine-Tuning Based Image Caption 16
2.1.3 Region-Based Object Injection Method 19
2.1.4 Template-Based Object Appending Method 23
2.2 Face Recognition 25
2.2.1 Introduction to Identity Recognition 25
2.2.2 Vanilla Face Recognition 26
2.2.3 Side-View Face Recognition 30
2.3 Facial Expression Recognition 33
2.3.1 Introduction of Facial Expression Recognition 33
2.3.2 Vanilla Implementation 33
2.3.3 Unbiased Facial Expression Recognition 35
2.4 Human-Aware Context Generation 37
2.4.1 Combining Image Caption with Face Recognition and Emotion Recognition 38
2.5 Semantic Localization 41
2.5.1 SLAM 42
2.5.2 Localization 43
2.5.3 Topological Semantic Map 43
2.6 Implementation Detail 44
2.6.1 ROS 45
2.6.2 Robot Hardware 47
Chapter 3 Experiments 50
3.1 Image Caption with Specific Objects 50
3.1.1 Implementation Caption Model 50
3.1.2 Pre-Training Dataset 50
3.1.3 Fine-Tuning Dataset 51
3.1.4 Testing Dataset 52
3.1.5 Experimental Results 52
3.2 Image Caption with Region-Based Injection and Template-Based Appending 55
3.2.1 Implementation Caption Model 55
3.2.2 Dataset 57
3.2.3 Quantitative Results 57
3.2.4 Qualitative Results 58
3.3 Face Recognition 60
3.3.1 Frontal Face Dataset 60
3.3.2 Side-View Face Dataset 61
3.3.3 Evaluation Metrics 63
3.3.4 Frontal Face Recognition Results 63
3.3.5 Side-View Face Recognition Results 64
3.4 Facial Expression Recognition 65
3.4.1 Public Dataset 65
3.4.2 Emotion-Identity Dataset 66
3.4.3 Quantitative Results 66
3.4.4 Qualitative Results 68
3.5 Human-Aware Context Generation 69
3.5.1 Fine-Tune Dataset 69
3.5.2 Evaluation Metrics 69
3.5.3 Experimental Setups 70
3.5.4 Quantitative Results 70
3.5.5 Qualitative Results 71
Chapter 4 Conclusions and Future Works 74
4.1 Conclusions 74
4.2 Future Works 74
REFERENCE 77