臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Detailed Record

Author: 葉煥駿 (Huan-Chun Yeh)
Thesis title (Chinese): 多模型生成對抗網路人機互動圖像描述系統於智慧服務機器人之應用
Thesis title (English): Multi-Modal Generative Adversarial Network Based Image Caption System for Intelligent Human-Robot Interactive Service Robotics Applications
Advisor: 羅仁權 (Ren C. Luo)
Oral defense committee: 王富正 (Fu-Zheng Wang), 張帆人 (Fan-Ren Chang)
Oral defense date: 2020-07-28
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Electrical Engineering
Discipline: Engineering
Academic field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2020
Graduation academic year: 108 (2019-2020)
Language: English
Number of pages: 102
Keywords (Chinese): 圖像描述 (image caption), 人臉辨識 (face recognition), 文字辨識 (text recognition), 物體偵測 (object detection), 深度學習 (deep learning)
Keywords (English): image caption, face recognition, emotion recognition, object detection, deep learning
DOI: 10.6342/NTU202003900
Usage statistics:
  • Cited by: 0
  • Views: 187
  • Downloads: 0
  • Bookmarked: 1
Chinese abstract (translated):

Service robots are a major trend in the future market. With human labor expensive and in short supply, introducing robots into daily life is an effective way to make everyday living more convenient. Many disadvantaged groups around the world need assistance, and as technology advances, traditional guide devices for the blind can no longer cope with rapidly changing environments; guide robots have therefore emerged and are attracting growing attention from industry. The purpose of this thesis is to help people whose vision is degraded or impaired, or who cannot read text. We propose a complete visual service system so that these groups can live conveniently in a technologically advanced society; to make service robots more flexible and versatile, combining them with artificial intelligence is an inevitable trend.

To serve as the eyes of the visually impaired, image captioning is the core concept of this thesis. Image captioning takes an image as input and, based on its content, outputs a sentence describing the image, much as a person continuously describes the scene in front of them. The technique can also be applied in other fields such as image retrieval and image indexing, but its application to service robots still has much room for improvement. Conventional image captioning has two shortcomings in this setting. First, after training with the conventional cross-entropy method, the model tends to answer with template sentences from the MSCOCO training set; the responses are rigid and fixed rather than diverse and natural like a human's. Second, because the training set covers a very broad range of content, its sentences are mostly rough and general, whereas in practical applications what people really want are meaningful and informative sentences.
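
For readers unfamiliar with the cross-entropy training referred to above, the standard maximum-likelihood objective used by encoder-decoder caption models can be sketched as follows (generic notation, not an excerpt from the thesis): given an image I and a ground-truth caption w_1^*, ..., w_T^*, training minimizes

    L_{\mathrm{XE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(w_t^{*} \mid w_1^{*}, \ldots, w_{t-1}^{*}, I\right),

which rewards reproducing frequent training phrases word by word and thus encourages the template-like sentences described above.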

Academically, unlike conventional training methods, this thesis adopts GAN and reinforcement-learning techniques to improve the diversity and naturalness of the generated sentences, achieving good scores on the various evaluation metrics. In terms of application, the goal of this thesis is to bring image captioning to real service robots and to help visually impaired users; the target application is a service robot that assists blind users. The robot describes the images it sees and reports them to the user, and with a flexible voice system the user can ask for any information of interest; if the requested information is outside the database, a web crawler fetches it from the Internet. In daily life, what users want to know is usually fine-grained information such as the identity, actions, gender, and hairstyle of nearby people. We use object detection to locate regions of interest and then collect the desired information through models such as face recognition, text recognition, age recognition, and object recognition. In the multi-modal informative caption system, we integrate information from as many as six models: identity recognition, facial expression recognition, age recognition, image captioning, dense captioning, and image segmentation.
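
As a hedged sketch of the reinforcement-learning component (the exact formulation used in the thesis is given in Chapter 3), self-critical policy-gradient training with a sentence-level reward r(·) such as CIDEr compares a sampled caption w^s against the greedily decoded baseline \hat{w}:

    \nabla_\theta L(\theta) \approx -\left(r(w^{s}) - r(\hat{w})\right)\, \nabla_\theta \log p_\theta(w^{s} \mid I),

so captions that score above the model's own greedy output are reinforced, optimizing the sentence-level metric directly instead of word-level likelihood.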

Experimental comparisons show that our reinforcement-learning and GAN-based approach scores higher than caption models pre-trained on MSCOCO, and that our GAN-based caption model produces more diverse and more accurate captions than a caption model obtained by simple fine-tuning.
English abstract:

Service robots are a major trend in the future market. With human labor costly and in short supply, introducing robots into daily life is an effective way to improve the quality of life. The purpose of this thesis is to develop the technologies needed to help people who are visually degraded, visually impaired, or unable to read text. We propose an assistive visual service system that enables vulnerable groups to live conveniently in this technologically advanced society.

In order to be the eyes of the visually impaired, image captioning is one of the most important core assistive technologies. Image captioning takes an image as input and, according to the content of the image, outputs a sentence describing it, just as a person constantly describes the scene he or she observes. This technology can also be applied in other fields, such as image retrieval and image indexing, but many aspects still need improvement when it is applied to service robots. Traditional image captioning has two shortcomings in this application. First, after training with the traditional cross-entropy method, the model tends to answer with template sentences from the MSCOCO training data; the responses are rigid and fixed, and not as diverse, vivid, and natural as human answers. Second, because the dataset covers very broad content, the sentences in the training data are mostly rough and general answers; in practice, however, what people really want are meaningful and informative sentences. In the multi-modal informative caption system, we integrate information from six models: identity recognition, facial expression recognition, age recognition, image captioning, dense captioning, and image segmentation.
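
To make the multi-model integration concrete, the following is a minimal, hypothetical Python sketch of how per-model outputs might be merged into one informative caption; the data types, field names, and fusion rule are illustrative assumptions rather than the thesis implementation.

# Hypothetical fusion of outputs from several perception models into one
# informative caption. All values below are placeholder model outputs.

from dataclasses import dataclass
from typing import List

@dataclass
class PersonInfo:
    identity: str    # from identity (face) recognition
    expression: str  # from facial expression recognition
    age: int         # from age recognition

def fuse_informative_caption(global_caption: str,
                             dense_captions: List[str],
                             people: List[PersonInfo]) -> str:
    """Merge the global caption, region-level dense captions, and per-person
    attributes into a single informative description for the user."""
    parts = [global_caption]
    for p in people:
        parts.append(f"{p.identity}, about {p.age} years old, looks {p.expression}")
    parts.extend(dense_captions)
    return "; ".join(parts)

# Example with placeholder outputs for a single frame:
print(fuse_informative_caption(
    global_caption="a man is sitting at a desk with a laptop",
    dense_captions=["a white coffee cup on the desk"],
    people=[PersonInfo(identity="Alex", expression="happy", age=30)],
))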

Academically, unlike traditional training methods, this thesis uses Generative Adversarial Network (GAN) and reinforcement-learning techniques to improve the diversity and naturalness of the generated sentences and obtains good scores on various evaluation metrics. In terms of application, the purpose of this thesis is to truly implement image captioning on service robots and to help visually impaired people. The target application emphasized in this thesis is a service robot that assists the visually impaired: the robot describes the images it sees and informs the user. With a flexible virtual assistant, users can obtain key information such as the identity, movements, gender, and hairstyle of a nearby person. We use object detection to find regions of interest, and then collect useful information through models such as face recognition, text recognition, age recognition, and object recognition.
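
The detect-then-recognize flow described above can be sketched as follows; the detector and the per-region recognizers are stubbed with placeholder functions and outputs, since the thesis's actual models are not shown here.

# Hypothetical sketch: an object detector proposes regions of interest, and
# each region is routed to the recognizer that matches its label.

from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)

def detect_regions(image) -> List[Tuple[str, Box]]:
    """Placeholder object detector returning (label, bounding box) pairs."""
    return [("person", (10, 20, 100, 200)), ("sign", (150, 40, 80, 30))]

def describe_region(image, label: str, box: Box) -> Dict[str, str]:
    """Route a region to the matching recognizer (all stubbed)."""
    if label == "person":
        # face / age / gender recognition would run on this crop
        return {"identity": "unknown", "age": "adult", "gender": "male"}
    if label == "sign":
        # OCR would run on this crop
        return {"text": "EXIT"}
    return {}

def describe_scene(image) -> List[Dict[str, str]]:
    return [describe_region(image, label, box) for label, box in detect_regions(image)]

print(describe_scene(image=None))  # stand-in for a camera frame
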
Acknowledgements i
Chinese Abstract ii
ABSTRACT iv
CONTENTS vi
LIST OF FIGURES ix
LIST OF TABLES xii
Chapter 1 Introduction 1
1.1 Background 1
1.1.1 Service Robot 1
1.1.2 Image Caption on Service Robotics 2
1.2 Previous Research Works on Image Caption 4
1.2.1 Retrieval-based method 5
1.2.2 Neural network-based method (Cross entropy) 6
1.2.3 Neural network-based method (Reinforcement learning) 7
1.2.4 Attention Mechanism 8
1.3 Multi-task Learning 9
1.3.1 Implicit Data Augmentation 10
1.3.2 Temporal Domain 10
1.4 Motivation and Objectives 11
Chapter 2 Multi-model GAN (Generative Adversarial Network) Based Image Caption 14
2.1 Image Caption with object recognition 15
2.1.1 Encoder-Decoder Formulation 15
2.1.2 GAN (Generative Adversarial Network)-Based Image Caption 19
2.1.3 Region-Based Object Injection Method 21
2.1.4 Dense caption 23
2.2 Face Recognition 25
2.2.1 Introduction to Identity Recognition 25
2.2.2 OpenCV Face Recognition 25
2.2.3 Triplet Loss 28
2.3 Facial Expression 31
2.3.1 Introduction of Facial Expression Recognition 31
2.3.2 Multi-task Cascaded Convolutional Networks (MTCNN) Algorithm 32
2.4 OCR (Optical Character Recognition) 35
2.4.1 Data Preprocessing 36
2.4.2 Extract the Feature 36
2.5 Multi-task Learning 38
2.5.1 Image Peripherals 39
2.5.2 Text Peripherals 39
2.5.3 Core Mechanism 39
2.6 Virtual assistant 42
2.6.1 Artificial Intelligence Markup Language (AIML) 42
2.6.2 ROS (Robot Operating System) Topic 42
2.6.3 ROS (Robot Operating System) SMACH 43
2.6.4 Web crawler 44
2.7 Implementation and Experimental Results 46
2.7.1 ROS (Robot Operating System) 47
2.7.2 Robot Hardware 49
2.7.3 Simultaneous Localization and Mapping (SLAM) 51
Chapter 3 Implementation and Experimental Results 54
3.1 GAN (Generative Adversarial Network)-based Image Caption 54
3.1.1 Implementation of Caption Model 54
3.1.2 Pre-Training Dataset & Testing Dataset 54
3.1.3 Diversity & Naturalness 56
3.1.4 Experimental Results 58
3.2 Image Caption with Bottom-up attention mechanism 70
3.2.1 Implementation of Caption Model 70
3.2.2 Dataset 72
3.2.3 Quantitative Results 73
3.2.4 Experimental Results 73
3.3 Reinforcement Learning 74
3.3.1 Policy gradient 74
3.3.2 CIDEr as Reward 76
3.3.3 Evaluation Metrics 79
3.3.4 Quantitative result analysis 81
3.4 Informative Caption 83
3.4.1 Public Dataset (Visual Genome) 83
3.4.2 Object detection 84
3.4.3 Dense caption 87
3.5 Optical Character Recognition (OCR) 90
Chapter 4 Contributions, Conclusions and Future Works 92
4.1 Contributions 92
4.2 Conclusions 92
4.3 Future Works 94
REFERENCE 95
VITA 102