
臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Detailed Record

Author: 劉鎔瑄
Author (English): LIU, RONG-XUAN
Title (Chinese): 不使用最大似然估計之對抗式圖片描述
Title (English): Adversarial Image Description without Maximum Likelihood Estimation
Advisor: 朱元三
Advisor (English): CHU, YUAN-SUN
Committee: 劉宗憲、黃敬群、陳昱仁
Committee (English): LIU, TSUNG-HSIEN; HUANG, CHING-CHUN; CHEN, YU-JEN
Oral defense date: 2017-12-25
Degree: Master's
Institution: 國立中正大學 (National Chung Cheng University)
Department: 電機工程研究所 (Graduate Institute of Electrical Engineering)
Discipline: Engineering
Field: Electrical and Computer Engineering
Document type: Academic thesis
Publication year: 2018
Graduation academic year: 106 (2017-2018)
Language: English
Pages: 32
Chinese keywords: 圖片描述、圖片註解、生成對抗網路、文本生成、推土機距離、瓦瑟斯坦、甘貝爾
English keywords: image description, image captioning, generative adversarial network, text generation, earth mover’s distance, Wasserstein, Gumbel
Record statistics:
  • Cited by: 0
  • Views: 892
  • Rating:
  • Downloads: 42
  • Bookmarked: 0
Training a fully visible belief network (FVBN) with maximum likelihood estimation is the usual way to learn a language model. However, to predict the next word the model is given the preceding ground-truth tokens as hints during training, while no such hints are available at inference time. Because the model behaves differently at the two stages, this approach suffers from exposure bias, and its predictions tend to degrade on longer sentences. In contrast, we design an image-description model within the generative adversarial network (GAN) framework: the language model is trained from random initialization without any maximum likelihood estimation, which removes the exposure-bias problem. The generator's objective is to minimize the earth mover's distance (EMD) so that its distribution approaches the training-data distribution as closely as possible. To overcome the non-differentiable sampling step, we use the Gumbel-max trick to approximate one-hot word vectors. This design lets us optimize the parameters with ordinary back-propagation and gradient descent. Experimental results show that our method improves image-description scores on several evaluation metrics.
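The earth mover's distance objective described above presumably takes the standard Kantorovich-Rubinstein dual form (compare Sections 2.4 and 4.1 in the table of contents below); a minimal statement, where P_r is the empirical data distribution, P_g the generator's distribution, and f a 1-Lipschitz critic:

W(P_r, P_g) = \sup_{\|f\|_L \le 1} \left( \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] \right)

The critic is trained to maximize this difference while the generator is trained to minimize it.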
A fully visible belief network trained with maximum likelihood is a typical strategy for learning a language model. However, such an approach suffers from exposure bias caused by the mismatch between training and inference: to predict the next symbol, the model is provided with the preceding ground-truth tokens during training but not at inference time, where errors accumulate and predictions degrade as sentences grow longer. In contrast, we train a neural model for image description in an adversarial fashion from scratch, without any maximum-likelihood objective, which removes exposure bias. The generator's learning objective is to minimize the earth mover's distance so that its distribution becomes indistinguishable from the empirical data distribution. We also employ the Gumbel-max trick as a continuous approximation of the one-hot word encoding, overcoming the non-differentiable sampling problem. As a result, training both the discriminator and the generator requires only generic end-to-end back-propagation and gradient-based optimization. Experimental results show that our adversarial approach improves performance on several evaluation metrics of the image captioning task.
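A minimal sketch of the continuous one-hot approximation the abstracts describe, assuming PyTorch; the function names, the temperature parameter tau, and the straight-through variant are illustrative assumptions rather than details taken from the thesis:

# Sketch: Gumbel-softmax approximation of a one-hot word sample (assumed PyTorch).
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    # logits: unnormalized log-probabilities over the vocabulary, shape (batch, vocab).
    # Gumbel(0, 1) noise via the inverse-CDF trick: g = -log(-log(u)).
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    # Taking argmax of (logits + g) would give an exact categorical sample
    # (the Gumbel-max trick); a tempered softmax keeps the sample differentiable.
    return F.softmax((logits + g) / tau, dim=-1)

def straight_through_sample(logits, tau=1.0):
    # Forward pass uses the hard one-hot vector; backward pass uses the soft gradient.
    y_soft = gumbel_softmax_sample(logits, tau)
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    return (y_hard - y_soft).detach() + y_soft

Feeding such approximately one-hot word vectors from the generator to the discriminator lets gradients flow end to end, so ordinary back-propagation can replace REINFORCE-style estimators.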
Acknowledgements ..................................................................................................... i
Abstract (Chinese) ...................................................................................................... ii
Abstract ...................................................................................................................... iii
List of Abbreviations ................................................................................................... vi
List of Figures ............................................................................................................. vii
Chapter 1. Introduction ............................................................................................... 1
Chapter 2. Background .............................................................................................. 4
2.1. Maximum likelihood estimation ........................................................................... 4
2.2. Fully visible belief networks ................................................................................. 5
2.3. Generative adversarial networks ........................................................................ 7
2.4. Farkas’ lemma and Kantorovich-Rubinstein duality ............................................... 8
Chapter 3. Related Works ......................................................................................... 10
3.1. Image captioning ................................................................................................ 10
3.2. Generative adversarial nets for natural language processing ............................ 11
Chapter 4. Problem Formulation ............................................................................... 14
4.1. Vanishing gradients and earth mover’s distance ................................................ 14
4.2. Sentences decoded from latent space and Gumbel-max trick ............................ 17
Chapter 5. Architectures ............................................................................................ 19
5.1. Maximum-likelihood baseline .............................................................................. 20
5.2. Proposed adversarial image description ............................................................. 21
5.3. REINFORCE baseline ......................................................................................... 25
Chapter 6. Experiments ............................................................................................. 27
6.1. Settings ................................................................................................................ 27
6.2. Results ................................................................................................................. 28
Conclusion and Future Work ...................................................................................... 30
References ................................................................................................................. 31