National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 林八林
Author (English): Ba-Lin Lin
Title: 應用於電腦視覺之影像人物凝視注意力偵測模型 (A Gaze Visual Attention Detection Model for Persons in Images, for Computer Vision)
Title (English): GazeVAE: Gaze Visual Attention Estimator
Advisor: 花凱龍
Advisor (English): Kai-Lung Hua
Committee Members: 陳永耀、陳駿丞、楊傳凱、陸敬互
Committee Members (English): Yung-Yao Chen, Jun-Cheng Chen, Chuan-Kai Yang, Ching-Hu Lu
Oral Defense Date: 2022-01-18
Degree: Master's
Institution: National Taiwan University of Science and Technology
Department: Department of Computer Science and Information Engineering
Discipline: Engineering
Academic Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2022
Graduation Academic Year: 110
Language: Chinese
Number of Pages: 43
Keywords (Chinese): 凝視跟隨、凝視目標偵測、凝視預估、視線注意、前景物
Keywords (English): Gaze following, Gaze target detection, Gaze estimation, Visual attention, Saliency
Usage statistics:
  • Cited: 0
  • Views: 152
  • Rating:
  • Downloads: 24
  • Bookmarked: 0
Abstract (Chinese): Detecting where a person in an image is looking provides a wealth of information for fields such as human social interaction and action analysis. Given a full scene image and a crop of the target person's head, an image-based gaze visual attention model aims to predict, through deep learning, the location in the image that the person is looking at. Recent studies on this problem have shown that supplying depth information and generating angular masks helps the model make its prediction, but these models rely on many additional pre-trained models to reach better performance. We therefore propose a two-stage architecture. In the first stage, we use depth-based pseudo labels to train a 3D gaze direction on the dataset and decompose it into a 2D image-plane mask and a 1D depth mask. In the second stage, we use the original image, the head position, and the outputs of the first stage to predict whether the person's gaze target lies inside or outside the image; if the target is inside the image, we predict its location. Apart from an existing depth-estimation model, our architecture uses no other pre-trained models, and we propose a novel equivalent-angle loss function that improves 2D angular accuracy. Our experiments show that even without a pre-trained backbone, our model outperforms several state-of-the-art baselines in area under the curve (AUC) and achieves very close results on other metrics such as distance.
Abstract (English): A person's gaze can reveal where their interest or attention lies in a social scenario. Detecting a person's gaze is essential in multiple domains (e.g., security, psychology, or medical diagnosis). Visual attention models therefore aim to automate this task and determine where the gaze of each person in a scene lies. Most existing works in this field depend on multiple pre-trained models. We propose a two-stage framework, the Gaze Visual Attention Estimator (GazeVAE). In the first stage, we train a 3D gaze direction on the GazeFollow dataset with pseudo labels to produce the field of view. We then decompose the 3D direction into a 2D image plane and a depth-channel gaze to obtain the depth mask image. In the second stage, we concatenate the scene image, the output of stage one, and the head position to predict the gaze target's location. We also propose a novel equivalent loss to further reduce angle error. We train the model from scratch except for the off-the-shelf depth network. Our model outperforms the baseline models in AUC and achieves competitive results on the GazeFollow and VideoAttentionTarget datasets without pretraining.
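The two-stage data flow described in the abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the module names (StageOne, StageTwo), layer sizes, and the cosine-based field-of-view mask are illustrative choices, not the thesis's actual implementation. It only shows how a pseudo-label-supervised 3D gaze direction is split into an image-plane mask and a depth mask, and how the second stage fuses them with the scene image and head position to regress a gaze heatmap plus an in-frame/out-of-frame probability.

# Illustrative sketch of a two-stage gaze-target pipeline (assumed names and shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageOne(nn.Module):
    """Predict a 3D gaze direction from the head crop (supervised with
    depth-derived pseudo labels), then split it into an image-plane
    field-of-view mask and a crude depth-direction mask."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.direction = nn.Linear(32, 3)  # unit 3D gaze vector (x, y, z)

    def forward(self, head_crop, head_center, grid):
        # head_center: (B, 2) normalized head position; grid: (B, 2, H, W) pixel coordinates
        d3 = F.normalize(self.direction(self.encoder(head_crop)), dim=1)
        d2 = F.normalize(d3[:, :2], dim=1)                    # image-plane direction
        offset = grid - head_center[:, :, None, None]         # vectors from head to each pixel
        cosine = (offset * d2[:, :, None, None]).sum(1, keepdim=True) / \
                 (offset.norm(dim=1, keepdim=True) + 1e-6)
        fov_mask = cosine.clamp(min=0)                        # 2D field-of-view mask
        depth_mask = d3[:, 2:, None, None].expand_as(fov_mask)  # depth-channel component
        return d3, fov_mask, depth_mask

class StageTwo(nn.Module):
    """Fuse the scene image, head-position map, and stage-one masks to regress
    a gaze-target heatmap and an in-frame/out-of-frame probability."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(3 + 1 + 1 + 1, 16, 3, padding=1)
        self.heatmap = nn.Conv2d(16, 1, 1)
        self.inout = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, scene, head_pos_map, fov_mask, depth_mask):
        x = F.relu(self.fuse(torch.cat([scene, head_pos_map, fov_mask, depth_mask], 1)))
        return self.heatmap(x), torch.sigmoid(self.inout(x))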
Contents
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
  2.1 Gaze Target Prediction
  2.2 Gaze Direction Estimation
  2.3 Visual Saliency
3 Method
  3.1 Overview
  3.2 Visual Attention Module
  3.3 Heatmap Regression Module
  3.4 Objective Function
4 Experiments
  4.1 GazeFollow
  4.2 VideoAttentionTarget
  4.3 Implementation Details
  4.4 Experimental Results
  4.5 Ablation Study
  4.6 Examples of Failure
5 Conclusions
  5.1 Future works
References
Letter of Authority