[1] Y. Ushiku, T. Harada, and Y. Kuniyoshi, "Efficient image annotation for automatic sentence generation," in Proc. 20th ACM International Conference on Multimedia, 2012.
[2] M. Mitchell et al., "Generating image descriptions from computer vision detections," in Proc. 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012.
[3] R. C. Luo, C. C. Chang, and C. C. Lai, "Multisensor Fusion and Integration: Theories, Applications, and its Perspectives," IEEE Sensors Journal, vol. 11, no. 12, pp. 3122–3138, 2011.
[4] A. Farhadi et al., "Every picture tells a story: Generating sentences from images," in European Conference on Computer Vision, 2010, pp. 15–29.
[5] A. Gupta, Y. Verma, and C. V. Jawahar, "Choosing Linguistics over Vision to Describe Images," in Twenty-Sixth AAAI Conference on Artificial Intelligence, Jul. 2012.
[6] G. Kulkarni et al., "Babytalk: Understanding and generating simple image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.
[7] D. Elliott and F. Keller, "Image description using visual dependency representations," in Proc. 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1292–1302.
[8] Y. Bengio, A. Courville, and P. Vincent, "Representation Learning: A Review and New Perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[9] A. Frome et al., "DeViSE: A deep visual-semantic embedding model," in Proc. Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
[10] A. Karpathy, A. Joulin, and F. F. Li, "Deep fragment embeddings for bidirectional image sentence mapping," in Proc. Advances in Neural Information Processing Systems, 2014, pp. 1889–1897.
[11] F. Yan and K. Mikolajczyk, "Deep correlation for matching images and text," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2015, pp. 3441–3450.
[12] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[13] K. Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," in Proc. 32nd International Conference on Machine Learning, 2015, pp. 2048–2057.
[14] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. International Conference on Learning Representations, 2015.
[15] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4651–4659.
[16] P. Anderson et al., "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6077–6086.
[17] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing when to look: Adaptive attention via a visual sentinel for image captioning," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2017, pp. 375–383.
[18] Q. Wu, C. Shen, P. Wang, A. Dick, and A. van den Hengel, "Image Captioning and Visual Question Answering Based on Attributes and External Knowledge," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1367–1381, Jun. 2018.
[19] C. Szegedy et al., "Going deeper with convolutions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[20] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer, "DenseNet: Implementing efficient ConvNet descriptor pyramids," arXiv:1404.1869, 2014.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[22] X. Chen et al., "Microsoft COCO Captions: Data Collection and Evaluation Server," arXiv:1504.00325, 2015.
[23] M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013.
[24] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese, "Taskonomy: Disentangling Task Transfer Learning," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 3712–3722.
[25] J. Lu, J. Yang, D. Batra, and D. Parikh, "Neural Baby Talk," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7219–7228.
[26] M. M. A. Baig, M. I. Shah, M. A. Wajahat, N. Zafar, and O. Arif, "Image Caption Generator with Novel Object Injection," in 2018 Digital Image Computing: Techniques and Applications (DICTA), Dec. 2018, pp. 1–8.
[27] C. Liu, J. Mao, F. Sha, and A. Yuille, "Attention Correctness in Neural Image Captioning," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[28] O. M. Nezami, M. Dras, P. Anderson, and L. Hamey, "Face-Cap: Image Captioning Using Facial Expression Analysis," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2018, pp. 226–240.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
[31] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
[32] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
[33] W. Liu et al., "SSD: Single Shot MultiBox Detector," arXiv:1512.02325, 2015.
[34] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2147–2154.
[35] R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
[36] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, "Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments," in Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
[37] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proc. 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[38] M. Denkowski and A. Lavie, "Meteor Universal: Language specific translation evaluation for any target language," in Proc. Ninth Workshop on Statistical Machine Translation, 2014, pp. 376–380.
[39] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4566–4575.
[40] P. Anderson, B. Fernando, M. Johnson, and S. Gould, "SPICE: Semantic propositional image caption evaluation," in European Conference on Computer Vision, 2016, pp. 382–398.
[41] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in Proc. 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60.