Graduate student (English name): Ba-Lin Lin
Thesis title (English): GazeVAE: Gaze Visual Attention Estimator
Advisor (English name): Kai-Lung Hua
Oral defense committee (English names): Yung-Yao Chen, Jun-Cheng Chen, Chuan-Kai Yang, Ching-Hu Lu
Keywords (English): Gaze following, Gaze target detection, Gaze estimation, Visual attention, Saliency
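Detecting where people in an image are looking provides rich information for fields such as human social interaction and action analysis. Given a full scene image and the head crop of a target person, a gaze attention estimation model uses deep learning to predict where that person is looking in the image. Recent studies have proposed, and shown, that supplying depth information and generating angle masks helps the model make this prediction, but these models rely on many other pre-trained models to reach better performance. We therefore propose a two-stage architecture. In the first stage, we use depth-derived pseudo labels to train a 3D gaze direction on the dataset and decompose it into masks for the 2D image plane and the 1D depth. In the second stage, we use the original image, the head position, and the first-stage output to predict whether the person's gaze target lies inside or outside the image and, if it lies inside, its location in the image. Apart from an existing depth estimation model, our architecture uses no other pre-trained models, and we propose a novel angle equivalent loss function to improve 2D angle accuracy. Our experiments show that, even without a pre-trained backbone, our model outperforms several state-of-the-art baselines in area under the curve (AUC) and achieves very close results on other metrics such as distance.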
A person's gaze can reveal where their interest or attention lies in a social scenario. Detecting a person's gaze is essential in multiple domains (e.g., security, psychology, and medical diagnosis). Visual attention models therefore aim to automate this task and determine where the people in a scene are looking. Most existing works in this field depend on multiple pre-trained models. We propose a two-stage framework, Gaze Visual Attention Estimator (GazeVAE). In the first stage, we train a 3D gaze direction on the GazeFollow dataset with pseudo labels to produce the field of view. We then decompose the 3D direction into a 2D image-plane component and a depth-channel component to obtain the depth mask image. In the second stage, we concatenate the scene image, the output of stage one, and the head position to predict the location of the gaze target. We also propose a novel equivalent loss to further reduce the angle error. We train the model from scratch, except for the off-the-shelf depth network. Without pretraining, our model outperforms the baseline model in AUC and achieves competitive results on the GazeFollow and VideoAttentionTarget datasets.
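The decomposition step described above can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the function names (`decompose_gaze`, `fov_mask`, `depth_mask`), the cosine-based field-of-view score, and the depth convention (larger depth means farther from the camera, positive z means looking away from the camera) are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def decompose_gaze(direction_3d):
    """Split a 3D gaze vector into a unit 2D image-plane direction and a depth component."""
    d = np.asarray(direction_3d, dtype=np.float64)
    d = d / np.linalg.norm(d)
    xy, z = d[:2], d[2]
    xy_norm = np.linalg.norm(xy)
    # If the gaze points straight along the depth axis, the 2D direction is undefined.
    xy_unit = xy / xy_norm if xy_norm > 1e-8 else np.zeros(2)
    return xy_unit, z

def fov_mask(head_xy, xy_unit, height, width):
    """Hypothetical field-of-view mask: clipped cosine between the 2D gaze
    direction and the vector from the head position to each pixel."""
    ys, xs = np.mgrid[0:height, 0:width]
    offsets = np.stack([xs - head_xy[0], ys - head_xy[1]], axis=-1).astype(np.float64)
    norms = np.linalg.norm(offsets, axis=-1, keepdims=True)
    # The head pixel itself has zero offset; leave its unit vector at zero.
    unit_offsets = np.divide(offsets, norms, out=np.zeros_like(offsets), where=norms > 0)
    return np.clip(unit_offsets @ xy_unit, 0.0, 1.0)

def depth_mask(depth_map, head_depth, z):
    """Hypothetical depth mask: keep pixels lying on the side of the head
    (nearer or farther) that the depth component of the gaze points toward."""
    return np.clip(z * (np.asarray(depth_map, dtype=np.float64) - head_depth), 0.0, None)

# Usage: a gaze pointing right and away from the camera, head at pixel (x=2, y=2).
xy_unit, z = decompose_gaze([3.0, 0.0, 4.0])
mask_2d = fov_mask((2, 2), xy_unit, height=5, width=5)
mask_d = depth_mask([[1.0, 3.0]], head_depth=2.0, z=z)
```

In a pipeline like the one described, masks of this kind would be concatenated with the scene image and head position as input to the second stage; the actual mask definitions used by GazeVAE may differ.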
