National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 沈柏瑋
Author (English): Po-Wei Shen
Title: 電影與動畫橋段中的隱喻理解與偵測
Title (English): Trope Understanding in Movies and Animations
Advisor: 徐宏民
Advisor (English): Winston H. Hsu
Committee Members: 陳文進, 陳奕廷, 葉梅珍, 余能豪
Committee Members (English): Wen-Chin Chen, Yi-Ting Chen, Mei-Chen Yeh, Neng-Hao Yu
Oral Defense Date: 2021-07-28
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2021
Academic Year of Graduation: 109
Language: English
Number of Pages: 28
Keywords (Chinese): 資料集, 隱喻理解, 多模態學習, 深度感知, 影片推理
Keywords (English): dataset, trope understanding, multi-modal learning, deep cognition, video reasoning
DOI: 10.6342/NTU202102121
Usage Statistics:
  • Cited: 0
  • Views: 184
  • Downloads: 21
  • Bookmarked: 1
Abstract:
Understanding and comprehending video content is crucial for many real-world applications such as search and recommendation systems. While recent progress in deep learning has boosted performance on various tasks using visual cues, deep cognition that reasons about intentions, motivation, or causality remains challenging. Existing datasets that claim to examine video reasoning capability focus on surface-level visual signals such as actions, objects, and relations, or can be solved by exploiting textual bias. Observing this, we propose a novel task along with a new dataset, Trope Understanding in Movies and Animations (TrUMAn), intended to develop and evaluate deep learning systems that reason about video beyond surface-level visual cues. Tropes are storytelling devices frequently used by creators to convey ideas and concepts in their works. By tackling the trope understanding task and strengthening the deep cognition skills of machines, we are optimistic that data mining applications and algorithms can be taken to the next level. To tackle the challenging TrUMAn dataset, we present a Trope Understanding and Storytelling (TrUSt) model with a new Conceptual Storyteller module, which guides the video encoder by performing video storytelling in a latent space. The generated story embedding is then fed into the trope understanding module to provide further signals. Experimental results show that state-of-the-art models from existing tasks reach only 12.01% accuracy with raw video input signals. Even in the oracle case with human-annotated descriptions, a BERT contextual-embedding model achieves at most 28% accuracy. Our proposed TrUSt improves performance using video input alone and reaches 13.94% accuracy. We also provide a detailed analysis to pave the way for future research. TrUMAn is publicly available at: https://www.cmlab.csie.ntu.edu.tw/project/trope
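The abstract describes the TrUSt data flow: a video encoder produces clip-level representations, the Conceptual Storyteller turns them into a latent story embedding, and the trope understanding head consumes both. The following is a minimal PyTorch sketch of that flow only, under assumptions of my own: the class names, feature dimensions, attention pooling, and the placeholder trope count are illustrative and do not reproduce the authors' implementation.

```python
# Minimal sketch of a TrUSt-style pipeline (illustrative assumptions only):
# video encoder -> Conceptual Storyteller (latent story embedding) -> trope classifier.
import torch
import torch.nn as nn


class VideoEncoder(nn.Module):
    """Encodes a sequence of pre-extracted clip features (e.g., CNN outputs)."""

    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_clips, feat_dim) -> (batch, num_clips, hidden_dim)
        hidden, _ = self.temporal(self.proj(clip_feats))
        return hidden


class ConceptualStoryteller(nn.Module):
    """Pools the encoded clips into a single latent 'story' embedding."""

    def __init__(self, hidden_dim: int = 512, story_dim: int = 512):
        super().__init__()
        self.attn = nn.Linear(hidden_dim, 1)
        self.to_story = nn.Linear(hidden_dim, story_dim)

    def forward(self, video_states: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.attn(video_states), dim=1)  # (B, T, 1)
        pooled = (weights * video_states).sum(dim=1)             # (B, hidden_dim)
        return self.to_story(pooled)                             # (B, story_dim)


class TropeClassifier(nn.Module):
    """Scores trope categories from the video summary plus the story embedding."""

    def __init__(self, hidden_dim: int = 512, story_dim: int = 512, num_tropes: int = 100):
        super().__init__()  # num_tropes=100 is a placeholder, not the dataset's actual count
        self.head = nn.Sequential(
            nn.Linear(hidden_dim + story_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_tropes),
        )

    def forward(self, video_states: torch.Tensor, story: torch.Tensor) -> torch.Tensor:
        video_summary = video_states.mean(dim=1)                 # (B, hidden_dim)
        return self.head(torch.cat([video_summary, story], dim=-1))


if __name__ == "__main__":
    encoder, storyteller, classifier = VideoEncoder(), ConceptualStoryteller(), TropeClassifier()
    clips = torch.randn(2, 16, 2048)    # 2 videos, 16 clips of pre-extracted features each
    states = encoder(clips)
    story = storyteller(states)         # latent story embedding guiding the encoder
    logits = classifier(states, story)  # per-trope scores
    print(logits.shape)                 # torch.Size([2, 100])
```

A full implementation would also attach a storytelling objective to the story embedding so the latent narration actually supervises the video encoder, which this sketch leaves out.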
Acknowledgements i
Chinese Abstract ii
Abstract iii
1 Introduction 1
2 Impact and Potential Extensions 4
3 Related Work 6
4 TrUMAn Dataset 8
4.1 Overview 8
4.2 Data Collection 10
4.3 Data Analysis 10
4.4 Human Evaluation on TrUMAn 11
4.5 Data Availability 13
5 Trope Understanding and Storytelling (TrUSt) Model 14
5.1 Video Encoder 14
5.2 Conceptual Storyteller 16
5.3 Trope Understanding 17
6 Experiments 18
6.1 Modality 18
6.2 Compared Methods 19
6.3 Results and Discussion 20
7 Conclusion 25
Bibliography 26
Yoshua Bengio. From system 1 deep learning to system 2 deep learning. NeurIPS, 2019.
Chen-Hsi Chang, Hung-Ting Su, Juiheng Hsu, Yu-Siang Wang, Yu-Cheng Chang, Zhe Yu Liu, Ya-Liang Chang, Wen-Feng Cheng, Ke-Jyun Wang, and Winston H. Hsu. Situation and behavior understanding by trope detection on films. In WWW, 2021.
Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, and Min Sun. Leveraging video descriptions to learn video question answering. In AAAI, 2017.
Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017.
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A dataset for understanding complex web videos via question answering. In AAAI, 2019.
Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. MovieQA: Understanding stories in movies through question-answering. In CVPR, 2016.
Kyung-Min Kim, Min-Oh Heo, Seong-Ho Choi, and Byoung-Tak Zhang. DeepStory: Video story QA by deep embedded memory networks. In IJCAI, 2017.
Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. TVQA: Localized, compositional video question answering. In EMNLP, 2018.
Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, and Jingjing Liu. Violin: A large-scale dataset for video-and-language inference. In CVPR, 2020.
Thomas Winterbottom, Sarah Xiao, Alistair McLean, and Noura Al Moubayed. On modality bias in the tvqa dataset. In BMVC, 2020.
B. Jasani, R. Girdhar, and D. Ramanan. Are we asking the right questions in MovieQA? In ICCV Workshops, 2019.
Jianing Yang, Yuying Zhu, Yongxin Wang, Ruitao Yi, Amir Zadeh, and Louis-Philippe Morency. What gives the answer away? Question answering bias analysis on video QA datasets. In Human Multimodal Language Workshop, 2020.
Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, and Chuang Gan. Location-aware graph convolutional networks for video question answering. In AAAI, 2020.
Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, and Du Tran. Self-supervised learning by cross-modal audio-video clustering. In NeurIPS, 2020.
David Chen and William Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011.
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
Michael Heilman and Noah A. Smith. Good question! statistical ranking for question generation. In HLT-NAACL, 2010.
John R. Smith, Dhiraj Joshi, Benoit Huet, Winston Hsu, and Jozef Cota. Harnessing A.I. for augmenting creativity: Application to movie trailer creation. In ACM MM, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.
Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In NIPS, 2016.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.