|
[1] Z. Wang, L. Wang, T. Wu, T. Li, and G. Wu, “Negative sample matters: A renaissance of metric learning for temporal grounding,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, pp. 2613–2623. [2] L. Wang, Y. Qiao, X. Tang et al., “Action recognition and detection by combining motion and appearance features,” THUMOS14 Action Recog- nition Challenge, vol. 1, no. 2, p. 2, 2014. [3] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei, “End-to-end learn- ing of action detection from frame glimpses in videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 2678–2687. [4] J. Yuan, B. Ni, X. Yang, and A. A. Kassim, “Temporal action localization with pyramid of score distribution features,” in Proceedings of the IEEE/ CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 3093–3102. [5] A. Karbalaie, F. Abtahi, and M. Sjöström, “Event detection in surveillance videos: a review,” Multimedia Tools and Applications, vol. 81, no. 24, pp. 35 463–35 501, 2022. [Online]. Available: https: //doi.org/10.1007/s11042-021-11864-2 [6] S. Ghosh, A. Agarwal, Z. Parekh, and A. Hauptmann, “ExCL: Extractive Clip Localization Using Natural Language Descriptions,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 1984–1990. [Online]. Available: https://aclanthology.org/N19-1198 [7] C. Rodriguez, E. Marrese-Taylor, F. S. Saleh, H. Li, and S. Gould, “Proposal-free temporal moment localization of a natural-language query in video using guided attention,” in Proceedings of the IEEE/CVF Winter conference on Applications of Computer Vision, 2020, pp. 2464–2473. [8] J. Mun, M. Cho, and B. Han, “Local-global video-text interactions for tem- poral grounding,” in Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2020, pp. 10 810–10 819. [9] H. Zhang, A. Sun, W. Jing, L. Zhen, J. T. Zhou, and R. S. M. Goh, “Natural language video localization: A revisit in span-based question answering framework,” IEEE transactions on pattern analysis and machine intelli- gence, vol. 44, no. 8, pp. 4252–4266, 2021. [10] H. Tang, J. Zhu, M. Liu, Z. Gao, and Z. Cheng, “Frame-wise cross-modal matching for video moment retrieval,” IEEE Transactions on Multimedia, vol. 24, pp. 1338–1349, 2021. [11] K. Li, D. Guo, and M. Wang, “Proposal-free video grounding with contex- tual pyramid network,” in Proceedings of the AAAI Conference on Artifi- cial Intelligence, vol. 35, 2021, pp. 1902–1910. [12] O. Mayu, N. Yuta, R. Esa, and H. Janne, “Uncovering hidden challenges in query-based video moment retrieval,” in The British Machine Vision Conference (BMVC), 2020. [13] J. Gao, C. Sun, Z. Yang, and R. Nevatia, “Tall: Temporal activity localiza- tion via language query,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 5267–5275. [14] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, “Dense- captioning events in videos,” in Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, 2017, pp. 706–715. [15] H. Zhou, C. Zhang, Y. Luo, C. Hu, and W. Zhang, “Thinking inside uncer- tainty: Interest moment perception for diverse temporal grounding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 10, pp. 7190–7203, 2022. [16] H. Zhou, C. Zhang, Y. Chen, and C. Hu, “Towards diverse temporal grounding under single positive labels,” arXiv preprint arXiv:2303.06545, 2023. [17] T. Huynh, S. Kornblith, M. R. Walter, M. Maire, and M. Khademi, “Boost- ing contrastive self-supervised learning with false negative cancellation,” in Proceedings of the IEEE/CVF Winter conference on Applications of Computer Vision, 2022, pp. 2785–2795. [18] T.-S. Chen, W.-C. Hung, H.-Y. Tseng, S.-Y. Chien, and M.-H. Yang, “Incremental false negative detection for contrastive learning,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=dDjSKKA5TP1 [19] J. Lei, T. L. Berg, and M. Bansal, “Detecting moments and highlights in videos via natural language queries,” Advances in Neural Information Pro- cessing Systems, vol. 34, pp. 11 846–11 858, 2021. [20] R. Zeng, H. Xu, W. Huang, P. Chen, M. Tan, and C. Gan, “Dense regres- sion network for video grounding,” in Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2020, pp. 10 287– 10 296. [21] M. Zhang, Y. Yang, X. Chen, Y. Ji, X. Xu, J. Li, and H. T. Shen, “Multi- stage aggregated transformer network for temporal language localization in videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 669–12 678. [22] H. Zhou, C. Zhang, Y. Luo, Y. Chen, and C. Hu, “Embracing uncertainty: Decoupling and de-bias for robust temporal grounding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8445–8454. [23] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell, “Localizing moments in video with natural language,” in Pro- ceedings of the IEEE/CVF International Conference on Computer Vision, 2017, pp. 5803–5812. [24] J. Chen, X. Chen, L. Ma, Z. Jie, and T.-S. Chua, “Temporally grounding natural sentence in video,” in Proceedings of the 2018 Conference on Em- pirical Methods in Natural Language Processing, 2018, pp. 162–171. [25] S. Zhang, H. Peng, J. Fu, and J. Luo, “Learning 2d temporal adjacent net- works for moment localization with natural language,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 12 870–12 877. [26] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio, “Manifold mixup: Better representations by interpolating hidden states,” in International Conference on Machine Learning. PMLR, 2019, pp. 6438–6447. [27] S. Yun, S. J. Oh, B. Heo, D. Han, and J. Kim, “Videomix: Rethinking data augmentation for video classification,” arXiv preprint arXiv:2012.03457, 2020. [28] H. Wu, C. Song, S. Yue, Z. Wang, J. Xiao, and Y. Liu, “Dynamic video mix-up for cross-domain action recognition,” Neurocomputing, vol. 471, pp. 358–368, 2022. [29] A. Falcon, G. Serra, and O. Lanz, “A feature-space multimodal data aug- mentation technique for text-video retrieval,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 4385–4394. [30] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, “Hollywood in homes: Crowdsourcing data collection for activity under- standing,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer, 2016, pp. 510–526. [31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778. [32] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” in 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS 2019, 2019. [Online]. Available: http://arxiv.org/abs/1910.01108 [33] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, Oct. 2020, pp. 38–45. [Online]. Available: https://aclanthology.org/2020.emnlp-demos.6 [34] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7 [35] Y. Liu, S. Li, Y. Wu, C.-W. Chen, Y. Shan, and X. Qie, “Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection,” in Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, 2022, pp. 3042–3051. [36] W. Moon, S. Hyun, S. Park, D. Park, and J.-P. Heo, “Query-dependent video representation for moment retrieval and highlight detection,” in Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23 023–23 033. [37] M. Seol, J. Kim, and J. Moon, “Bmrn: Boundary matching and refinement network for temporal moment localization with natural language,” in Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5570–5578. [38] S. Zhang, H. Peng, J. Fu, Y. Lu, and J. Luo, “Multi-scale 2d temporal adjacency networks for moment localization with natural language,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 9073–9087, 2021. [39] Y. Yuan, L. Ma, J. Wang, W. Liu, and W. Zhu, “Semantic conditioned dy- namic modulation for temporal sentence grounding in videos,” Advances in Neural Information Processing Systems, vol. 32, 2019. [40] D. Liu, X. Qu, J. Dong, P. Zhou, Y. Cheng, W. Wei, Z. Xu, and Y. Xie, “Context-aware biaffine localizing network for temporal sentence ground- ing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 235–11 244. [41] J. Gao and C. Xu, “Fast video moment retrieval,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1523– 1532. [42] H. Xu, K. He, B. A. Plummer, L. Sigal, S. Sclaroff, and K. Saenko, “Mul- tilevel language and vision integration for text-to-clip retrieval,” in Pro- ceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9062–9069. [43] J. Shin and J. Moon, “Learning to combine the modalities of language and video for temporal moment localization,” Computer Vision and Image Un- derstanding, vol. 217, p. 103375, 2022. [44] Z. Jia, M. Dong, J. Ru, L. Xue, S. Yang, and C. Li, “Stcm-net: A sym- metrical one-stage network for temporal language localization in videos,” Neurocomputing, vol. 471, pp. 194–207, 2022.
|