National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Detailed Record

Author: 王濛
Author (English): Wang, Meng
Title (Chinese): 基於局部模板匹配和自注意機制的像素卷積神經網絡的深度影像預測
Title (English): Deep Video Prediction Using Local Template Matching and Self-attention PixelCNN
Advisor: 彭文孝
Advisor (English): Peng, Wen-Hsiao
Committee Members: 杭學鳴、蕭旭峰
Committee Members (English): Hang, Hsueh-Ming; Hsiao, Hsu-Feng
Oral Defense Date: 2019-01-30
Degree: Master's
Institution: National Chiao Tung University (國立交通大學)
Department: Institute of Multimedia Engineering
Discipline: Computer Science
Field of Study: Software Development
Document Type: Academic thesis
Year of Publication: 2019
Graduation Academic Year: 107 (ROC calendar)
Language: English
Number of Pages: 31
Keywords (Chinese): video prediction; deep learning; machine learning
Keywords (English): Video Prediction; Self-attention; PixelCNN; Local Template Matching; machine learning
Usage statistics:
  • Cited by: 0
  • Views: 421
  • Downloads: 75
  • Bookmarked: 0
Predicting the current frame from a sequence of preceding frames is a challenging problem. Although most prior methods perform well at predicting simple video data, they rarely succeed on high-resolution or complex natural videos: the results are typically blurry and lose object detail, most likely because the structure and capacity of these models are not powerful enough.
To address this problem, we depart from the conventional black-box approach and propose a generative model based on local template matching and a self-attention PixelCNN. Our work has two parts: (1) motion search and (2) improving the auto-regressive prediction model. Following the practice of conventional video compression frameworks, we divide each frame into non-overlapping blocks, and then use an attention-based template matching model to search all blocks of the previous frame for a prediction signal. This prediction signal serves as a conditioning input to the PixelCNN, and the improved PixelCNN generates the target block pixel by pixel, with the improvement achieved by introducing a self-attention model.

The development of this model is still at an early stage; we present a series of findings made while exploring it.
Predicting a future video frame from a few past video frames, also known as one-step video prediction/extrapolation, has been a challenging task. Although recent deep learning-based models demonstrate good performance on simple datasets, such as Moving MNIST, they fail to generalize to high-resolution, complex natural videos. Often the predicted frames are blurry and lack detail. The reasons may be attributed to limited model capacity and network architecture.

To address this problem, we deviate from the pure black-box approach and introduce a generative video prediction model based on local template matching and a self-attention PixelCNN. Our work divides the task into two parts: (1) motion search and (2) auto-regressive prediction refinement. Following the conventional video compression framework, we first divide a video frame into non-overlapping blocks. We then find a prediction signal for each of these blocks from the previous frame using an attention-based template matching model. This prediction signal is further used in PixelCNN as a conditioning signal to synthesize the target block pixel by pixel for prediction refinement. In particular, the generation process is improved by a self-attention mechanism.

The development of this model is still in its early stage. We present findings encountered along the way.
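The pipeline described in the abstract maps naturally onto a few small components. Below is a minimal, self-contained sketch (not the authors' code): frames are split into non-overlapping blocks, a soft attention over all blocks of the previous frame yields a per-block prediction signal, and a PixelCNN-style masked convolution generates the target block causally while an unmasked 1x1 path injects the conditioning signal, as in conditional PixelCNN. The 8x8 block size, the use of co-located previous-frame blocks as matching templates, the regression-style output head, and all names below are illustrative assumptions; the self-attention layers inside the thesis's PixelCNN are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def blockify(frame, b=8):
    """Split a (C, H, W) frame into non-overlapping (N, C, b, b) blocks."""
    C, H, W = frame.shape
    patches = frame.unfold(1, b, b).unfold(2, b, b)   # (C, H//b, W//b, b, b)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, C, b, b)

def attention_template_match(templates, prev_blocks):
    """Soft motion search: each template attends over every block of the
    previous frame; the attention-weighted sum is its prediction signal."""
    q = templates.flatten(1)                          # (N, d) queries
    k = prev_blocks.flatten(1)                        # (M, d) keys = values
    scores = q @ k.t() / (q.shape[1] ** 0.5)          # scaled dot products
    attn = F.softmax(scores, dim=-1)                  # soft block selection
    return (attn @ k).view_as(templates)

class MaskedConv2d(torch.nn.Conv2d):
    """PixelCNN masked convolution: pixel (i, j) sees only pixels above it
    and to its left, so a block can be generated in raster-scan order."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0  # same row
        mask[:, :, kH // 2 + 1:] = 0                            # rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask                 # enforce causality
        return super().forward(x)

# Illustrative wiring: condition the masked path on the matched signal
# through an unmasked 1x1 convolution.
prev, cur = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
prev_blocks = blockify(prev)
pred_signal = attention_template_match(prev_blocks, prev_blocks)  # co-located templates
conv_a = MaskedConv2d("A", 3, 32, 7, padding=3)   # causal path over the block
cond = torch.nn.Conv2d(3, 32, 1)                  # unmasked conditioning path
conv_b = MaskedConv2d("B", 32, 3, 5, padding=2)   # simplified output head
h = F.relu(conv_a(blockify(cur)) + cond(pred_signal))
refined = conv_b(h)                               # (N, 3, 8, 8) per-pixel output
```

The softmax here acts as a differentiable stand-in for the hard block search of classical video codecs, which is what would let the motion-search stage train end to end with the PixelCNN.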
Abstract (Chinese) i
Abstract (English) ii
Acknowledgements iii
Contents iv
List of Tables vi
List of Figures vii
1 Introduction 1
1.1 Contributions 2
1.2 Organization 2
2 Related Work 3
2.1 PixelCNN 3
2.2 Self-attention Models 6
3 Proposed Method 8
3.1 Attention-based Template Matching 8
3.1.1 Motion Estimation 8
3.1.2 Template Matching 9
3.1.3 Attention-based Template Matching 11
3.2 Self-attention PixelCNN 13
4 Experiments 16
4.1 Datasets, Evaluation Methodology and Details 16
4.2 Template Matching Results 17
4.3 Self-Attention PixelCNN 22
5 Conclusion 28
References 29
[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[2] R. Brunelli. Template matching techniques in computer vision: theory and practice. John Wiley & Sons, 2009.
[3] B. Chen, W. Wang, and J. Wang. Video imagination from a single image with transformation generation. In Proceedings of the Thematic Workshops of ACM Multimedia 2017, pages 358–366. ACM, 2017.
[4] S. Chiappa, S. Racaniere, D. Wierstra, and S. Mohamed. Recurrent environment simulators. arXiv preprint arXiv:1704.02254, 2017.
[5] E. Denton and R. Fergus. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.
[6] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64–72, 2016.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[8] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[10] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
[11] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
[12] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[13] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pages 843–852, 2015.
[14] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[16] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via hierarchical prediction. In Proceedings of the 34th International Conference on Machine Learning, pages 3560–3569. JMLR.org, 2017.
[17] C. Vondrick and A. Torralba. Generating the future with adversarial transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1020–1028, 2017.
[18] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pages 613–621, 2016.
[19] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[20] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
[21] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.