[1] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[2] T. Suzuki, H. Kataoka, Y. Aoki, and Y. Satoh, “Anticipating traffic accidents with adaptive loss and large-scale incident DB,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3521–3529, 2018.
[3] F.-H. Chan, Y.-T. Chen, Y. Xiang, and M. Sun, “Anticipating accidents in dashcam videos,” in Proceedings of the Asian Conference on Computer Vision, pp. 136–153, 2016.
[4] B. Zhou, A. Andonian, A. Oliva, and A. Torralba, “Temporal relational reasoning in videos,” in Proceedings of the European Conference on Computer Vision, pp. 803–818, 2018.
[5] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[6] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649, 2013.
[7] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2014.
[8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[9] A. P. Shah, J.-B. Lamare, T. Nguyen-Anh, and A. Hauptmann, “CADP: A novel dataset for CCTV traffic camera based accident analysis,” in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 1–9, 2018.
[10] K.-H. Zeng, S.-H. Chou, F.-H. Chan, J. Carlos Niebles, and M. Sun, “Agent-centric risk assessment: Accident anticipation and risky region localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2222–2230, 2017.
[11] Y. Takimoto, Y. Tanaka, T. Kurashima, S. Yamamoto, M. Okawa, and H. Toda, “Predicting traffic accidents with event recorder data,” in Proceedings of the International Workshop on Prediction of Human Mobility, pp. 11–14, 2019.
[12] Y. Yao, M. Xu, Y. Wang, D. J. Crandall, and E. M. Atkins, “Unsupervised traffic accident detection in first-person videos,” arXiv preprint arXiv:1903.00618, 2019.
[13] K. C. Ng, Y. Murata, and M. Atsumi, “Traffic risk estimation from on-vehicle video by region-based spatio-temporal DNN trained using comparative loss,” in Proceedings of the Annual Conference of the Japanese Society for Artificial Intelligence, p. 3Rin201, 2019.
[14] J.-C. Chen, Z.-Y. Lian, C.-L. Huang, and C.-H. Chuang, “Automatic recognition of driving events based on dashcam videos,” in Proceedings of the International Conference on Image and Graphics Processing, pp. 22–25, 2020.
[15] L. Taccari, F. Sambo, L. Bravi, S. Salti, L. Sarti, M. Simoncini, and A. Lori, “Classification of crash and near-crash events from dashcam videos and telematics,” in Proceedings of the International Conference on Intelligent Transportation Systems, pp. 2460–2465, 2018.
[16] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in Proceedings of the European Conference on Computer Vision, pp. 21–37, 2016.
[17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988, 2017.
[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016.
[19] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271, 2017.
[20] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[21] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-stage object detection,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 9627–9636, 2019.
[22] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
[23] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 91–99, 2015.
[24] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 379–387, 2016.
[25] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun, “Light-Head R-CNN: In defense of two-stage object detector,” arXiv preprint arXiv:1711.07264, 2017.
[26] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969, 2017.
[27] J. Butepage, M. J. Black, D. Kragic, and H. Kjellstrom, “Deep representation learning for human motion prediction and classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6158–6166, 2017.
[28] S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin, “Deep learning for detecting multiple space-time action tubes in videos,” arXiv preprint arXiv:1608.01529, 2016.
[29] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in Proceedings of the European Conference on Computer Vision, pp. 20–36, 2016.
[30] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308, 2017.
[31] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[32] P. Moeskops, J. M. Wolterink, B. H. van der Velden, K. G. Gilhuijs, T. Leiner, M. A. Viergever, and I. Išgum, “Deep learning for multi-task medical image segmentation in multiple modalities,” in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 478–486, 2016.
[33] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez, “A review on deep learning techniques applied to semantic segmentation,” arXiv preprint arXiv:1704.06857, 2017.
[34] M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal, “The importance of skip connections in biomedical image segmentation,” in Proceedings of the Deep Learning and Data Labeling for Medical Applications, pp. 179–187, 2016.
[35] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 568–576, 2014.
[36] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941, 2016.
[37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497, 2015.
[38] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3D residual networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541, 2017.
[39] D. Purwanto, R. R. A. Pramono, Y.-T. Chen, and W.-H. Fang, “Three-stream network with bidirectional self-attention for action recognition in extreme low resolution videos,” IEEE Signal Processing Letters, vol. 26, no. 8, pp. 1187–1191, 2019.
[40] N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid, “MARS: Motion-augmented RGB stream for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7882–7891, 2019.
[41] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164, 2015.
[42] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164, 2017.
[43] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[44] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[45] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generative image inpainting with contextual attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[46] W. Du, Y. Wang, and Y. Qiao, “RPAN: An end-to-end recurrent pose-attention network for action recognition in videos,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 3725–3734, 2017.
[47] R. Girdhar and D. Ramanan, “Attentional pooling for action recognition,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 34–45, 2017.
[48] W. Du, Y. Wang, and Y. Qiao, “Recurrent spatial-temporal attention network for action recognition in videos,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1347–1360, 2017.
[49] Y. Wang, S. Wang, J. Tang, N. O’Hare, Y. Chang, and B. Li, “Hierarchical attention network for action recognition in videos,” arXiv preprint arXiv:1607.06416, 2016.
[50] Y. Rao, J. Lu, and J. Zhou, “Attention-aware deep reinforcement learning for video face recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2017.
[51] S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using visual attention,” arXiv preprint arXiv:1511.04119, 2015.
[52] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.
[53] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[54] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[55] S. Gao, A. Ramanathan, and G. Tourassi, “Hierarchical convolutional attention networks for text classification,” in Proceedings of the Workshop on Representation Learning for Natural Language Processing, pp. 11–23, 2018.
[56] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.