Author (English name): Chun-Yen Yeh
Thesis title (English): SG2P: Image Paragraphing with Scene Graph
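In recent years, image paragraphing has attracted growing attention in computer vision. However, because images and text differ fundamentally in structure, it is hard to find a suitable way to map image information to text, so the paragraphs produced by existing methods are still full of semantic errors. In this thesis, we propose SG2P, a two-stage method for generating image paragraphs, to address this problem. Rather than converting the image directly into text, as previous methods do, we first transform the image into another semantically structured representation, the scene graph, expecting that paragraphs generated from the scene graph will be more semantically correct. In addition, we use a hierarchical recurrent language model with skip connections to mitigate the vanishing-gradient problem when generating long text.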
Automatically describing an image with a paragraph has recently gained popularity in computer vision. However, the results of existing methods are full of semantic errors, because the features they extract directly from the raw image have difficulty bridging visual semantic information to language. In this thesis, we propose SG2P, a two-stage network, to address this issue. Instead of the raw image, the proposed method leverages features encoded from the scene graph, an intermediate semantic structure of the image, aiming to generate more semantically correct paragraphs. We hypothesize that, with this explicit semantic representation, features from the scene graph retain more semantic information than features taken directly from the raw image. In addition, SG2P uses a hierarchical recurrent language model with skip connections to reduce the effect of vanishing gradients during long generation.
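To make the idea of a scene graph as an intermediate semantic structure concrete, the following is a minimal Python sketch of one possible representation: objects as nodes, with attributes and relations referring to them by index. The class name `SceneGraph`, its fields, and the example image content are illustrative assumptions, not the data structures used in the thesis.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical layout, not the thesis's implementation)."""
    # Object names act as graph nodes.
    objects: List[str] = field(default_factory=list)
    # Each attribute is (object index, attribute word).
    attributes: List[Tuple[int, str]] = field(default_factory=list)
    # Each relation is (subject index, predicate, object index).
    relations: List[Tuple[int, str, int]] = field(default_factory=list)

    def attribute_pairs(self) -> List[Tuple[str, str]]:
        """Resolve attribute indices into (object, attribute) pairs."""
        return [(self.objects[i], attr) for i, attr in self.attributes]

    def relation_triples(self) -> List[Tuple[str, str, str]]:
        """Resolve relation indices into (subject, predicate, object) triples."""
        return [(self.objects[s], pred, self.objects[o])
                for s, pred, o in self.relations]


# Hypothetical graph for an image of a man wearing a red hat.
graph = SceneGraph(
    objects=["man", "hat"],
    attributes=[(1, "red")],
    relations=[(0, "wearing", 1)],
)

print(graph.attribute_pairs())
print(graph.relation_triples())
```

In a two-stage pipeline of the kind the abstract describes, a structure like this would sit between the image and the language model: the first stage would produce it from the image, and the second would encode it into features for paragraph generation.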
To evaluate the results, we propose a new evaluation metric, c-SPICE, which automatically computes the semantic correctness of generated paragraphs through a graph-based comparison. Experiments show that methods using features from the scene graph outperform those using features taken directly from the raw image on c-SPICE.
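This record does not define c-SPICE, so the sketch below only illustrates the general idea of a graph-based comparison in the style of SPICE: both the generated and the reference text are reduced to sets of semantic tuples (objects, object-attribute pairs, subject-predicate-object triples), and an F1 score is computed over the overlap. The function name `tuple_f1`, the exact-match rule, and the example tuples are assumptions for illustration. SPICE itself also matches synonyms, which this sketch omits.

```python
from typing import Iterable, Tuple

SemanticTuple = Tuple[str, ...]


def tuple_f1(candidate: Iterable[SemanticTuple],
             reference: Iterable[SemanticTuple]) -> float:
    """F1 over exactly matching semantic tuples (illustrative, not c-SPICE)."""
    cand, ref = set(candidate), set(reference)
    matched = len(cand & ref)
    # No overlap (or an empty side) gives a score of zero.
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tuples: the reference says the hat is red, the candidate says blue.
reference = [("man",), ("hat",), ("hat", "red"), ("man", "wearing", "hat")]
candidate = [("man",), ("hat",), ("hat", "blue"), ("man", "wearing", "hat")]

score = tuple_f1(candidate, reference)
print(f"{score:.2f}")  # prints 0.75
```

Here three of the four tuples on each side match, so precision and recall are both 0.75. The wrong attribute is penalized even though the sentence could still read fluently, which is the kind of semantic error a graph-based metric is meant to expose.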
Acknowledgments
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Research Objective
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Image Captioning
2.2 Image Paragraphing
2.3 Scene Graph
Chapter 3 Problem Definition
3.1 Symbol Table
Chapter 4 Methodology
4.1 Scene Graph
4.2 Scene Graph Construction
4.2.1 Generation
4.2.2 Generation
4.3 Graph Convolution Network
4.4 Paragraph Generator
4.4.1 SentenceRNN with Semantic Node Attention
4.4.2 WordRNN with Shared Semantic Context
4.5 Network Architecture
4.6 Implementation Detail
Chapter 5 Experiment
5.1 Experimental Setup
5.1.1 Data Sets
5.1.2 Evaluation Metrics
5.2 c-SPICE
5.3 Preprocessing
5.4 Fully Convolutional Localization Networks
5.5 Experiment Results
5.5.1 The Effectiveness of Scene Graph
5.5.2 Ablation Study
5.5.3 Merging Multi-modal Features
5.5.4 Qualitative Study
5.5.5 The Effect Image Scene Graph Has on the Results
Chapter 6 Conclusion
6.1 Summary of Contributions
6.2 Future Work
Bibliography