National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Detailed Record

Author: Tzu-Shuo Wei (魏資碩)
Title: A Transformer Approach to Recovering Missing Joints in Skeleton-Based Human Activity Recognition
Title (Chinese): 以變換器方法來修補缺失關節點用於骨架為基礎的動作識別
Advisor: Yung-Jen Hsu (許永真)
Committee members: Lun-Wei Ku (古倫維), Yen-Ling Kuo (郭彥伶), Jun-Cheng Chen (陳駿丞), Suh-Fang Jeng (鄭素芳)
Oral defense date: 2023-07-26
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Networking and Multimedia
Discipline: Computer Science
Subfield: Networking
Document type: Academic thesis
Year of publication: 2023
Academic year of graduation: 111 (ROC calendar, 2022-2023)
Pages: 73
Keywords (Chinese): 資料缺失修補、變換器方法、人體骨架表示法、深度學習、人類活動辨識
Keywords (English): Missing Joints Recovery, Transformer Approach, Human Skeleton Representation, Deep Learning, Human Activity Recognition
DOI: 10.6342/NTU202302931
Usage statistics:
  • Cited by: 0
  • Views: 92
  • Downloads: 0
  • Bookmarked: 0
Abstract:
Human Activity Recognition (HAR) often employs skeleton coordinates to express dynamic relationships. However, when RGB videos are processed by a pose estimation model to generate skeleton data, data loss may occur because the subject is cut off by the edge of the frame. This loss typically occurs in the four limbs of the human body and starts from the joints farthest from the torso. Such data loss has a negative impact on human activity recognition.
To avoid errors in human activity recognition, many studies have focused on recovering missing human skeleton joints. However, previous research has primarily targeted the repair of a small number of missing joints scattered across the skeleton sequence; when the joints of a limb are missing over a long period of time, no existing method recovers them accurately. This study therefore targets prolonged data loss in the distal joints of a single limb within the human skeleton sequence and employs a deep learning model to recover the missing joints.
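As a rough illustration of this failure mode, the sketch below corrupts a skeleton sequence the way an edge-of-frame cut would: the most distal joints of one limb go missing for a long, contiguous window. The joint indices, window length, and zero-fill convention are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

# Hypothetical indices for one arm, ordered from the torso outward
# (shoulder -> elbow -> wrist); real indices depend on the skeleton layout.
LEFT_ARM = [11, 12, 13]

def simulate_limb_occlusion(seq, limb=LEFT_ARM, t_start=20, t_end=80, depth=2):
    """Zero out the `depth` most distal joints of one limb over [t_start, t_end).

    seq: (T, J, C) array of T frames, J joints, C coordinates.
    Returns the corrupted sequence and a (T, J) boolean mask of missing joints.
    """
    corrupted = seq.copy()
    mask = np.zeros(seq.shape[:2], dtype=bool)
    mask[t_start:t_end, limb[-depth:]] = True   # loss starts farthest from the torso
    corrupted[mask] = 0.0                       # missing joints encoded as zeros
    return corrupted, mask

# Example: a 100-frame, 17-joint, 3-D sequence whose wrist and elbow
# are missing for 60 consecutive frames.
seq = np.random.randn(100, 17, 3).astype(np.float32)
corrupted, mask = simulate_limb_occlusion(seq)
```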
In this thesis, we propose a group-based sampling method to increase data diversity and quantity. For training, we design a two-stage training strategy along with a masking strategy that progressively masks the skeleton using a masked language modeling training technique. This allows the model to gradually learn the motion trajectories of missing areas across different actions. We also implement a hybrid transformer-based model that extracts both structural and motion features from the skeleton and effectively combines these features. This enables accurate prediction of missing areas by subsequent prediction modules of the model.
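A minimal sketch of what such a two-stream ("hybrid") model could look like, assuming one encoder attends across joints within each frame (structure) and another across frames for each joint (movement), with the two feature sets fused by concatenation before a regression head. The dimensions, the fusion by concatenation, the PyTorch framing, and the omission of temporal positional encoding are all assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class HybridSkeletonTransformer(nn.Module):
    """Two-stream transformer sketch: a spatial encoder attends across joints
    within each frame (structure features), a temporal encoder attends across
    frames for each joint (movement features), and the concatenated features
    feed a linear head that regresses joint coordinates."""

    def __init__(self, n_joints=17, coords=3, d_model=64, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(coords, d_model)
        # Learnable per-joint positional embedding; temporal positional
        # encoding is omitted here for brevity.
        self.joint_pos = nn.Parameter(torch.zeros(1, 1, n_joints, d_model))
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=n_layers)
        self.spatial = make_encoder()    # attends over the joint axis
        self.temporal = make_encoder()   # attends over the time axis
        self.head = nn.Linear(2 * d_model, coords)

    def forward(self, x):                # x: (B, T, J, C), missing joints zeroed
        B, T, J, _ = x.shape
        h = self.embed(x) + self.joint_pos           # (B, T, J, D)
        s = self.spatial(h.reshape(B * T, J, -1)).reshape(B, T, J, -1)
        t = h.permute(0, 2, 1, 3).reshape(B * J, T, -1)
        t = self.temporal(t).reshape(B, J, T, -1).permute(0, 2, 1, 3)
        return self.head(torch.cat([s, t], dim=-1))  # reconstructed (B, T, J, C)

# Example: reconstruct a batch of two 50-frame, 17-joint sequences.
out = HybridSkeletonTransformer()(torch.randn(2, 50, 17, 3))
```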
We first conducted experiments on the Human3.6M dataset and found that extracting structural and motion features simultaneously achieves the highest accuracy in recovering missing regions and reconstructing the whole skeleton sequence. Although our method currently addresses only long-term missing joints in a single limb of the human skeleton, it surpasses state-of-the-art methods in both recovering the missing regions and reconstructing the skeleton sequence.
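The abstract does not say which metric "accuracy" refers to; for joint recovery on Human3.6M, results of this kind are commonly reported as Mean Per-Joint Position Error (MPJPE). A hedged formulation, restricted here (as an assumption) to the set $\mathcal{M}$ of masked (frame, joint) pairs, would be:

```latex
\mathrm{MPJPE}_{\mathcal{M}}
  = \frac{1}{|\mathcal{M}|} \sum_{(t,j) \in \mathcal{M}}
    \left\lVert \hat{\mathbf{p}}_{t,j} - \mathbf{p}_{t,j} \right\rVert_2
```

where $\hat{\mathbf{p}}_{t,j}$ and $\mathbf{p}_{t,j}$ are the predicted and ground-truth coordinates of joint $j$ at frame $t$; lower is better.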
Table of Contents:
Acknowledgements
Abstract (Chinese)
Abstract
1 Introduction
1.1 Background and Motivation
1.2 Problem Description
1.3 Proposed Method
1.4 Outline of the Thesis
2 Literature Review
2.1 Applying Masked Language Modeling Techniques in Computer Vision
2.2 Human Skeleton Missing Joints Recovery
2.2.1 Prior-based Methods
2.2.2 Deep Neural Network Methods
3 Problem Definition
3.1 Notations
3.2 Missing Joints Recovery in Human Limb Skeleton
4 Methodology
4.1 Human Skeleton Data Preprocessing
4.2 Group-based Sampling Method
4.3 Masking Method
4.4 Two-stage Training Method
4.5 Masking Strategy with our Two-stage Training Method
4.6 Transformer-based Model Architecture
4.6.1 Structure-based Model
4.6.2 Movement-based Model
4.6.3 Hybrid Model
5 Experiments
5.1 Experiment Setup
5.1.1 The Dataset
5.1.2 Experiment Details
5.1.3 Evaluation Protocol
5.2 Experiment Results
5.2.1 The Impact of Different Skeleton Feature Extraction Methods on Model Structures
5.2.2 The Impact of the Length of the Skeleton Sequence Input on Performance
5.2.3 The Impact of Different Sampling Methods on the Results
5.2.4 The Impact of the Two-stage Training Strategy and Different Masking Strategies
5.2.5 Comparing the Performance of Our Method with Other Methods
5.3 Discussion
5.3.1 Structure-based Model Results Observation and Analysis
5.3.2 Movement-based Model Results Observation and Analysis
5.3.3 Hybrid Model Results Observation and Analysis
5.3.4 Analysis of Missing Joints at Various Limb Locations
6 Conclusion
6.1 Summary of Work
6.2 Contribution
6.3 Limitation
6.4 Future Study
Bibliography
A Application on the AIMS Dataset
A.1 Introduction to the AIMS Dataset
A.2 Skeleton Data Collection
A.3 Testing on Real Lost Video Data