
臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)


Detailed Record

Student: 黃晨
Student (English): Huang, Cheng
Title (Chinese): 基於語意混和資料增強、影片模態對比損失和語意邊界偵測之多目標影片片段檢索
Title (English): Multi-Target Video Moment Retrieval by Semantic Fusion Augmentation With Intra-Video Contrastive Loss and Semantic Boundary Detection
Advisor: 帥宏翰
Advisor (English): Shuai, Hong-Han
Committee Members: 帥宏翰、鄭文皇、黃敬群
Committee Members (English): Shuai, Hong-Han; Cheng, Wen-Huang; Huang, Ching-Chun
Oral Defense Date: 2023-08-29
Degree: Master's
Institution: 國立陽明交通大學 (National Yang Ming Chiao Tung University)
Department: 電機工程學系 (Electrical Engineering)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2023
Graduation Academic Year: 112 (ROC calendar, i.e., 2023–2024)
Language: Chinese
Pages: 43
Keywords (Chinese): 視覺語言任務、多模態學習、資訊檢索、影片片段檢索、跨模態檢索
Keywords (English): Vision-Language Task, Multimodal Learning, Information Retrieval, Video Moment Retrieval, Cross-Modal Retrieval
Given an untrimmed video and a natural language query, video moment retrieval (VMR) aims to retrieve the video moments described by the query. However, most existing VMR methods assume a one-to-one mapping between the input query and the target video moment (single-target VMR), disregarding the possibility that a video may contain multiple target moments matching the query description (multi-target VMR). In this thesis, we propose to tackle multi-target VMR with Semantic Fusion Augmentation, an Intra-Video Contrastive Loss, and Semantic Boundary Detection (SFABD). Specifically, we use feature-level fusion augmentation to generate augmented target moments, together with an intra-video contrastive loss that enforces semantic consistency among their features. Meanwhile, we perform semantic boundary detection to adaptively remove false negatives from the negative set of the contrastive loss, thereby avoiding semantic confusion. Extensive experiments on Charades-STA, ActivityNet Captions, and QVHighlights show that our method achieves state-of-the-art performance on both multi-target and single-target metrics.
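To make the pipeline described in the abstract concrete, the sketch below illustrates the three ideas in simplified PyTorch: feature-level fusion of target-moment features, an intra-video InfoNCE-style contrastive loss, and removal of suspected false negatives from the negative set. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function names (`fuse_augment`, `intra_video_contrastive_loss`), tensor shapes, and hyperparameters (`alpha`, `temperature`, `fn_threshold`) are illustrative, and a plain similarity threshold stands in for the thesis's semantic boundary detection.

```python
import torch
import torch.nn.functional as F


def fuse_augment(moment_feats: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Feature-level fusion augmentation: convexly mix pairs of target-moment
    features matching the same query to create extra positive samples."""
    perm = torch.randperm(moment_feats.size(0), device=moment_feats.device)
    return alpha * moment_feats + (1.0 - alpha) * moment_feats[perm]


def intra_video_contrastive_loss(
    query_feat: torch.Tensor,     # (d,)   query/sentence embedding
    moment_feats: torch.Tensor,   # (n, d) candidate moment embeddings from one video
    positive_mask: torch.Tensor,  # (n,)   bool, True for annotated or augmented targets
    temperature: float = 0.07,
    fn_threshold: float = 0.8,
) -> torch.Tensor:
    """InfoNCE over the moments of a single video. Unlabelled candidates that
    are highly similar to the query are treated as suspected false negatives
    and dropped from the denominator instead of being pushed away."""
    sims = F.cosine_similarity(moment_feats, query_feat.unsqueeze(0), dim=-1)  # (n,)
    logits = sims / temperature

    # Simplified false-negative detection: a similarity threshold stands in
    # here for the semantic boundary detection described in the abstract.
    suspected_false_neg = (~positive_mask) & (sims > fn_threshold)
    keep = ~suspected_false_neg

    exp_logits = torch.exp(logits) * keep.float()
    denom = exp_logits.sum()
    # One InfoNCE term per positive moment, averaged.
    return -torch.log(exp_logits[positive_mask] / denom).mean()


if __name__ == "__main__":
    # Toy usage with random features: two annotated targets plus their
    # fused augmentations among 16 candidate moments of a single video.
    torch.manual_seed(0)
    query = torch.randn(256)
    moments = F.normalize(torch.randn(16, 256), dim=-1)
    pos = torch.zeros(16, dtype=torch.bool)
    pos[:2] = True
    moments = torch.cat([moments, fuse_augment(moments[pos])])
    pos = torch.cat([pos, torch.ones(2, dtype=torch.bool)])
    print(intra_video_contrastive_loss(query, moments, pos).item())
```

The key design point the sketch tries to convey is that suspected false negatives are excluded from the denominator rather than being treated as extra positives, so the loss neither pulls them toward the query nor pushes them away.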
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Synopsis
2 Related Work
2.1 Traditional Single-Target Video Moment Retrieval
2.2 Multi-Target Video Moment Retrieval
2.3 False Negative Detection
3 Method
3.1 Problem Formulation
3.2 Method Overview
3.3 Semantic Fusion Augmentation (SFA)
3.4 Intra-Video Contrastive Loss
3.5 Semantic Boundary Detection (SBD)
3.6 Loss Functions
4 Experiments
4.1 Datasets
4.2 Evaluation Metrics
4.3 Implementation Details
4.4 Evaluation Results
4.5 Ablation Studies
4.6 Visualizations
5 Conclusion
References
Appendix A