Graduate Student: 商資穎
Graduate Student (English): SHANG, ZI-YING
Thesis Title: 結合靜態及動態特徵的注意力網路進行關鍵影像擷取之研究
Thesis Title (English): Extraction of Video Key-Frames Based on Static and Motion Features Using Attention-Based Networks
Advisor: 王正豪
Advisor (English): WANG, JENQ-HAUR
Committee Members: 王正豪, 楊凱翔, 張嘉惠
Committee Members (English): WANG, JENQ-HAUR; YANG, KAI-HSIANG; CHANG, CHIA-HUI
Oral Defense Date: 2020-07-09
Degree: Master's
Institution: 國立臺北科技大學 (National Taipei University of Technology)
Department: 資訊工程系 (Computer Science and Information Engineering)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2020
Graduation Academic Year (ROC): 108
Language: Chinese
Pages: 30
Keywords (Chinese): 影片摘要, 注意力機制, 相對位置特徵
Keywords (English): Video Summarization, Attention Mechanism, Relative Position Representations
With the huge number of videos now being uploaded to the Internet, managing this data efficiently (e.g., for video retrieval) has become increasingly important. To help users grasp the content of a video more quickly and to improve video search, video summarization can be performed by extracting key frames.
Previous methods model a video using only the static features of each frame obtained from a 2D CNN, ignoring the dependencies between frames. This thesis uses a 3D CNN to obtain motion features over sequences of frames, combines them with the static features above as input, and performs the video summarization task with an attention-based neural network.
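As a rough illustration of this two-stream feature extraction, the sketch below pairs a 2D CNN applied to individual frames with a 3D CNN applied to short clips. The specific backbones (torchvision's GoogLeNet and R3D-18), input sizes, and clip length are assumptions made for the example, not necessarily the ones used in the thesis.

```python
import torch
import torch.nn as nn
from torchvision.models import googlenet
from torchvision.models.video import r3d_18

class TwoStreamFeatureExtractor(nn.Module):
    """Static features from a 2D CNN (per frame) plus motion features from a 3D CNN (per clip)."""
    def __init__(self):
        super().__init__()
        self.static = googlenet(pretrained=True)   # assumed 2D backbone
        self.static.fc = nn.Identity()             # keep the 1024-d pooled feature
        self.motion = r3d_18(pretrained=True)      # assumed 3D backbone
        self.motion.fc = nn.Identity()             # keep the 512-d pooled feature
        self.eval()                                # used as fixed feature extractors

    @torch.no_grad()
    def forward(self, frames, clips):
        # frames: (T, 3, 224, 224)     -- sampled frames of one video
        # clips:  (T, 3, 16, 112, 112) -- a short clip centered on each sampled frame
        f_static = self.static(frames)             # (T, 1024) static descriptors
        f_motion = self.motion(clips)              # (T, 512)  motion descriptors
        return f_static, f_motion
```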
Furthermore, this thesis adopts the concept of relation-aware attention. Unlike vanilla attention, which cannot encode positional information, relation-aware attention takes the relative position of every pair of shots in the video into account while modeling the interaction between the static and motion features, assigning different weights so as to identify the important parts of each feature type. The two are then concatenated, a regression network computes a score for each shot, and these scores are used to solve a 0-1 knapsack problem to produce the video summary.
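A minimal sketch of relation-aware self-attention is given below, assuming a single head and one learned embedding per clipped pairwise distance, in the spirit of relative position representations. The parameter names and the clipping distance `max_rel_dist` are illustrative assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareSelfAttention(nn.Module):
    """Single-head self-attention augmented with learned relative position embeddings:
    the score and the output both receive a term that depends on the distance j - i."""
    def __init__(self, d_model, max_rel_dist=32):
        super().__init__()
        self.max_rel_dist = max_rel_dist
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One embedding per clipped relative distance, for keys and for values.
        self.rel_k = nn.Embedding(2 * max_rel_dist + 1, d_model)
        self.rel_v = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x):
        # x: (T, d_model) -- one feature vector per shot
        T, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Clipped relative distances j - i, shifted into [0, 2 * max_rel_dist].
        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist) + self.max_rel_dist
        a_k = self.rel_k(rel)                      # (T, T, d)
        a_v = self.rel_v(rel)                      # (T, T, d)

        # e_ij = q_i . (k_j + a^K_ij) / sqrt(d)
        scores = (q @ k.t() + torch.einsum('id,ijd->ij', q, a_k)) / d ** 0.5
        attn = F.softmax(scores, dim=-1)           # (T, T) attention weights

        # z_i = sum_j attn_ij * (v_j + a^V_ij)
        out = attn @ v + torch.einsum('ij,ijd->id', attn, a_v)
        return out, attn
```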
The SumMe and TVSum video summarization datasets are used as experimental data, and performance is evaluated with the F-measure. The experimental results show that combining motion features and relative position information improves performance on the video summarization task. On the TVSum dataset, accuracy improves by 0.7% over using only static features with vanilla attention, reaching 62.08%, which verifies the effectiveness of the proposed method.
Given the increasing amount of video data generated in recent years, effective video processing methods have received significant attention. By summarizing videos as sequences of key-frames, further video processing algorithms can be applied while saving time and cost.
The proposed framework learns attention over different deep CNN features (static and motion) interactively, and generates a representation for each type of feature.
Furthermore, the concept of relation-aware attention is applied to the proposed framework: it encodes relative position information into vanilla attention, takes the relative relationship between any two shots in the input video into consideration, and assigns different attention weights accordingly. The two types of features are concatenated and fed to regression networks to generate an importance score for each shot, and the predicted video summary is produced by solving a 0-1 knapsack problem.
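The key-shot selection step can be sketched as a standard 0-1 knapsack dynamic program: each shot contributes its predicted importance score if selected and consumes its length against a fixed summary budget. The budget (often around 15% of the video length in this literature) and the integer frame lengths below are assumptions for illustration.

```python
def select_key_shots(scores, lengths, budget):
    """0-1 knapsack by dynamic programming.
    scores[i]  -- predicted importance of shot i
    lengths[i] -- length of shot i in frames (integer)
    budget     -- maximum total length of the summary in frames
    Returns the indices of the selected shots."""
    n = len(scores)
    # dp[i][c] = best total score using the first i shots with capacity c
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, s = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c and dp[i - 1][c - w] + s > dp[i][c]:
                dp[i][c] = dp[i - 1][c - w] + s
    # Backtrack to recover which shots were chosen.
    selected, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(i - 1)
            c -= lengths[i - 1]
    return sorted(selected)

# Example: shots of length 30/45/20 frames with a 60-frame budget.
print(select_key_shots([0.9, 0.4, 0.7], [30, 45, 20], 60))  # -> [0, 2]
```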
The proposed framework is evaluated on the publicly available SumMe and TVSum datasets, using F-measure as the evaluation metric. The results show that adding both motion features and relative position information improves performance on the video summarization task. Compared with using only static features and vanilla attention, the proposed framework improves the F1 score by 0.7% and achieves results comparable to state-of-the-art methods.
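A minimal sketch of the frame-level F-measure computation is shown below; the rule for aggregating over multiple user annotations (average vs. maximum) is a common convention assumed here rather than taken from the thesis.

```python
def f_measure(pred, gt):
    """Frame-level F-score between two binary 0/1 summaries of equal length."""
    overlap = sum(p and g for p, g in zip(pred, gt))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred)
    recall = overlap / sum(gt)
    return 2 * precision * recall / (precision + recall)

def evaluate(pred, user_summaries, aggregate="avg"):
    """Compare one predicted summary against multiple user-annotated summaries."""
    scores = [f_measure(pred, gt) for gt in user_summaries]
    return max(scores) if aggregate == "max" else sum(scores) / len(scores)
```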
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
1.1 Background and Motivation
1.2 Research Objectives
1.3 Contributions
1.4 Thesis Organization
Chapter 2: Related Work
2.1 Video Summarization
2.2 Seq2Seq
2.3 Attention Mechanism
Chapter 3: Methodology
3.1 Feature Extraction
3.2 Model Architecture
3.2.1 Relation-Aware Attention
3.2.2 Feed-Forward Networks
3.2.3 Residual Dropout and Layer Normalization
3.2.4 Concatenation and Regression Layers
3.3 Key-Shot Selection
Chapter 4: Experiments and Discussion
4.1 Experimental Environment and Hyperparameters
4.1.1 CNN Features
4.2 Evaluation Metrics and Formulas
4.3 Experimental Results
4.3.1 Overall Model Comparison
4.3.2 Positional Encoding Experiments
4.3.3 Number of Encoder Layers
4.3.4 Motion Feature Experiments
4.4 Qualitative Results
4.4.1 Visualization
4.4.2 Visualization of Relation-Aware Attention Weights
Chapter 5: Conclusion
References