National Digital Library of Theses and Dissertations in Taiwan

Author (Chinese): 邱信傑
Author (English): Hsin-Chieh Chiu
Title (Chinese): 基於堆疊沙漏型網路進行漸進式人體姿態估計
Title (English): Progressive Processes for Human Pose Estimation by Modified Stacked Hourglass Networks
Advisor (Chinese): 蘇順豐
Advisor (English): Shun-Feng Su
Committee Members (Chinese): 蔡清池、莊鎮嘉、王乃堅
Committee Members (English): Ching-Chih Tsai, Chen-Chia Chuang, Nai-Jian Wang
Oral Defense Date: 2019-07-12
Degree: Master's
Institution: National Taiwan University of Science and Technology
Department: Department of Electrical Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Document Type: Academic thesis
Publication Year: 2019
Graduation Academic Year: 107 (2018-2019)
Language: English
Pages: 66
Keywords (Chinese): 深度學習、人體姿態估計、卷積神經網路、影像辨識
Keywords (English): deep learning, human pose estimation, convolutional neural networks, image recognition
Usage statistics:
  • Cited: 0
  • Views: 371
  • Rating: (none)
  • Downloads: 49
  • Bookmarks: 0
In this thesis, progressive processes are proposed for human pose estimation with modified stacked hourglass networks. Human keypoints are estimated through a progressive process: first the person is detected, then the skeleton of the body is predicted, and finally the keypoint locations are estimated. With this strategy, an effective cascade of hourglass blocks can be built, where each hourglass block consists of an encoder-decoder architecture and an Atrous Spatial Pyramid Pooling (ASPP) module [1]. The encoder-decoder structure allows an hourglass block to combine local and global context, while ASPP is well suited to extracting multi-scale features. When the object to be detected occupies only a small portion of the image, bootstrapped cross-entropy [2] lets the model focus its updates on the area around the target object rather than on the entire image. Experimental results show that the modified stacked hourglass networks achieve better performance and efficiency: the detection time is 26.12 ms, faster than the original stacked hourglass networks, and the PCKh score is 12.4% higher. PCKh, the Percentage of Correct Keypoints with a matching threshold of 50% of the head segment length, is the accuracy measure commonly used on the MPII Human Pose Dataset. By exploiting the advantages of the progressive strategy, this thesis provides an efficient system for human pose estimation and a useful foundation for research on human behavior.
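The two ingredients the abstract names, bootstrapped cross-entropy for training and PCKh@0.5 for evaluation, can be summarized with a short sketch. The PyTorch-style code below is a minimal illustration only, not the implementation used in the thesis; the function names, the k_ratio bootstrapping fraction, and the tensor shapes are assumptions made for the example (only the 0.5 head-length threshold comes from the abstract).

```python
import torch
import torch.nn.functional as F

def bootstrapped_cross_entropy(logits, targets, k_ratio=0.25):
    """Average per-pixel cross-entropy over only the hardest k_ratio
    fraction of pixels, so gradients concentrate on the small region
    around the target instead of the whole image.

    logits:  (N, C, H, W) raw class scores
    targets: (N, H, W) integer class labels
    """
    n, _, h, w = logits.shape
    pixel_loss = F.cross_entropy(logits, targets, reduction="none")  # (N, H, W)
    pixel_loss = pixel_loss.view(n, -1)
    k = max(1, int(k_ratio * h * w))          # number of "hard" pixels kept
    hardest, _ = pixel_loss.topk(k, dim=1)    # highest-loss pixels per image
    return hardest.mean()

def pckh(pred, gt, head_len, thr=0.5):
    """PCKh: fraction of predicted keypoints whose distance to the
    ground truth is below thr * head segment length.

    pred, gt: (K, 2) keypoint coordinates; head_len: head segment length.
    """
    dist = torch.linalg.norm(pred - gt, dim=1)
    return (dist < thr * head_len).float().mean()
```

In practice the bootstrapping fraction would be tuned; the value 0.25 above is purely illustrative.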
Abstract (Chinese)......I
Abstract......II
Acknowledgements......III
Table of Contents......IV
List of Figures......VII
List of Tables......IX
Chapter 1 Introduction......1
1.1 Background......1
1.2 Motivation......2
1.3 Research Objective......3
1.4 Thesis Contributions......4
1.5 Thesis organization......5
Chapter 2 Related Work......6
2.1 Object Detection......6
2.2 Object Segmentation......6
2.3 Convolutional layer design......7
Chapter 3 Methodology......9
3.1 Deep learning task......10
3.2 Human pose estimation......11
3.2.1 The whole model......11
3.2.2 DCNN......12
3.2.3 Object detection......13
3.2.4 Predict skeleton......14
3.2.5 Estimate keypoints......15
3.2.6 Feature combination......16
3.2.7 Modified hourglass......17
3.2.8 Atrous Spatial Pyramid Pooling......17
3.2.9 Residual block......20
3.3 Dataset......22
3.3.1 MS COCO......22
3.3.2 MPII Human Pose Dataset......23
3.4 Accuracy function......24
3.5 Data Augmentation......25
3.5.1 Random Horizontal Flip......26
3.5.2 Random Crop......28
3.5.3 Color Jitter......30
3.6 Data generator......31
3.6.1 Dataset......31
3.6.2 DataLoader......33
3.7 Loss function......35
3.7.1 Bootstrapped cross-entropy loss......35
Chapter 4 Experiments......36
4.1 Hardware......36
4.2 Software......37
4.3 Hyper-parameter......38
4.3.1 Batch size......38
4.3.2 Epoch......38
4.3.3 Learning rate......39
4.3.4 Image size......39
4.4 Label preprocess......40
4.4.1 Object......40
4.4.2 Skeleton......40
4.4.3 Keypoints......40
4.5 Training step......42
4.5.1 Pre-train model......42
4.5.2 Fine-tune model......44
4.6 Experiment Results......46
4.6.1 Performance Comparisons......46
4.6.2 Efficiency comparison......49
Chapter 5 Conclusions and future work......50
5.1 Conclusions......50
5.2 Future work......51
References......52
[1] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
[2] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, "Full-resolution residual networks for semantic segmentation in street scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4151-4160, 2017.
[3] Y. Yang and D. Ramanan, "Articulated human detection with flexible mixtures of parts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2878-2890, 2013.
[4] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653-1660, 2014.
[5] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, "Efficient object localization using convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648-656, 2015.
[6] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724-4732, 2016.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[8] J. Deng, et al., "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.
[9] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[10] C. Szegedy, et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
[11] T.-Y. Lin, et al., "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, pp. 740-755, 2014.
[12] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2D human pose estimation: New benchmark and state of the art analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3686-3693, 2014.
[13] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in European Conference on Computer Vision, pp. 483-499, 2016.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, 2014.
[15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788, 2016.
[16] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[17] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 801-818, 2018.
[18] M. Lin, Q. Chen, and S. Yan, "Network in network," arXiv preprint arXiv:1312.4400, 2013.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[20] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234-241, 2015.
[21] J. Dai, K. He, Y. Li, S. Ren, and J. Sun, "Instance-sensitive fully convolutional networks," in European Conference on Computer Vision, pp. 534-549, 2016.