National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Author: Kuo-Wei Tseng (曾國維)
Title: Learning Representative Feature Expression Algorithm and Architecture for Image and Action Recognition (學習影像和動作辨識之代表性特徵值之演算法與架構設計)
Advisor: Liang-Gee Chen (陳良基)
Committee: Shao-Yi Chien (簡韶逸), Yeong-Kang Lai (賴永康), Chao-Tsung Huang (黃朝宗)
Oral Defense Date: 2015-07-27
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Electronics Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2015
Academic Year of Graduation: 103 (2014-2015)
Language: English
Pages: 55
Keywords (Chinese): 動作辨識, 特徵值萃取, K 群集分類法, 硬體架構設計, 訓練資料合成
Keywords (English): action recognition, feature extraction, k-means clustering, hardware architecture design, training data synthesis
Statistics:
  • Cited by: 0
  • Views: 206
  • Downloads: 0
  • Bookmarks: 0
Chinese Abstract

In the past decade, research in computer vision has made considerable progress
and has had a significant impact on our lives. Building on big-data analysis
and machine-learning algorithms, many intelligent devices have been developed.
Take Google Glass as an example: this wearable device can capture images of
the faces of people nearby and analyze those images to recognize who they are.
Another example is the automatic license plate recognition system, which
recognizes the plate of an entering vehicle and records its entry time on a
server; we no longer have to stop at the gate and press a button for a parking
ticket, and at payment time we simply enter the plate number, which is very
convenient. These applications combine research in computer vision and machine
learning and give us a chance to glimpse, and to realize, the future life we
envision. Within the field of recognition, extending beyond still images,
action recognition is a problem we urgently need to solve, because in the near
future intelligent robots will be built to interact with us, assist with our
daily lives, and carry out the most difficult tasks on our behalf. To reach
this goal, we must let machines learn the meanings of the objects and actions
contained in images and videos, just as we humans think.

Video recognition is even more complex than image recognition: videos contain
not only pixel intensities and their spatial relationships but also temporal
information, that is, the changes from frame to frame. As technology advances
and robots are about to enter our lives, computer vision in the video domain
is flourishing. Many related algorithms have been proposed in recent years,
but for most of them the training computation is so complex that it can only
be carried out in the cloud.

In this thesis, we first introduce the background of computer vision and its
potential, and then the commonly used visual recognition pipeline: (1) image
or video pre-processing, (2) feature extraction, and (3) classification. In
our approach, we focus on the pre-processing and feature-extraction stages,
using simple, well-known algorithms to achieve good results.

K-means clustering is commonly used to generate the codebook in BOVW and is
well known for its computational speed. In [8], K-means clustering is used to
learn representative patches. Compared with approaches that learn
representative features through hierarchical deep-learning algorithms, this
method needs only a relatively short training time of tens of minutes to
achieve good recognition performance on the CIFAR-10 dataset. In our approach,
we extend this algorithm to video: K-means clustering is used to learn
representative spatio-temporal volumes that include information along the
time axis. However, because video training data are scarce compared with image
data, training performs worse, so we propose a method that synthesizes new
training data from the original data, enabling learning across different
datasets.

To summarize, we propose an action recognition system based on K-means
clustering that can learn better results from different datasets, and we
present a hardware design concept for this algorithm; with slight tuning, the
architecture applies to both image and video recognition.

English Abstract

In the past decade, computer vision has made great progress and has had a
significant impact on our daily life. Various intelligent devices have been
developed based on big-data analysis and machine-learning algorithms. Take
Google Glass, for example: this wearable device can capture pictures of the
people around you and analyze the images to recognize who they are. In some
parking lots, license plate recognition systems are used for automatic
check-in, and no parking token is needed to get through the gate. These
applications show us the possibility of achieving a future lifestyle by
combining computer vision and machine learning. Thinking further about visual
tasks, action recognition must be a top-priority problem to solve. In the
near future, intelligent robots will be invented that can interact with human
beings and do the most dangerous jobs for us. To do so, machines must learn
the meanings of images and actions, just as we do.

Visual tasks in video recognition are much more complex than those in image
recognition. A video sequence contains not only intensity and spatial
information, but also temporal features that capture the changes between
frames. With the advancement of technology, intelligent robots will be
invented in the near future. Machine vision in the video domain, which lets
robots learn about our world, is therefore a vital issue, and action
recognition in particular. Several algorithms for video tasks have been
proposed in recent years, but their training procedures are too
computationally complex.
In this thesis, we first introduce some applications and the commonly used
recognition pipeline in the field of computer vision. A general visual
recognition pipeline consists of three parts: (i) image/video pre-processing,
(ii) feature extraction, and (iii) classification. In our approach, we focus
on the pre-processing and feature-extraction parts, using simple algorithms
to achieve high performance.
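
To make the pipeline concrete, the following is a minimal Python sketch of
the three stages. It is an illustration under assumed interfaces (the
function names and the use of scikit-learn are placeholders), not the
thesis's actual implementation.

    import numpy as np
    from sklearn.svm import LinearSVC  # stage (iii): e.g. a linear SVM [17]

    def preprocess(x):
        # Stage (i): flatten each sample and normalize it to zero mean and
        # unit variance (brightness/contrast normalization).
        x = x.reshape(len(x), -1).astype(np.float64)
        return (x - x.mean(axis=1, keepdims=True)) / \
               (x.std(axis=1, keepdims=True) + 1e-8)

    def extract_features(x):
        # Stage (ii): placeholder; the thesis learns this stage with K-means
        # clustering (see the sketch after the next paragraph).
        return x

    def train(x_train, y_train):
        # Wire the three stages together for training.
        clf = LinearSVC()
        clf.fit(extract_features(preprocess(x_train)), y_train)
        return clf
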
K-means clustering is broadly used for codebook generation in the Bag of
Visual Words (BOVW) [6, 7] method and is known for its computational speed.
The idea in [8] is to use K-means clustering not to learn a codebook from
high-level features but to learn representative patches from raw pixel
values. In contrast to constructing hierarchical, deep architectures to learn
complex features, this method needs only tens of minutes of training to
achieve good performance on the CIFAR-10 dataset. In our approach, we extend
the method from the image domain to the video domain: K-means clusters
representative volumes of frames instead of patches. However, the
dimensionality of a volume is much larger than that of a patch, and the
training set of a video dataset is usually smaller than that of an image
dataset, so it is not large enough to train a good K-means model. We
therefore propose a method that learns volumes from different datasets to
solve this problem.
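
As a hedged sketch of this idea, the fragment below applies the single-layer
K-means feature learning of [8] to spatio-temporal volumes instead of 2D
patches, as described above. The volume size, the number of centroids, and
the use of scikit-learn's KMeans are illustrative assumptions, not the
thesis's parameters.

    import numpy as np
    from sklearn.cluster import KMeans

    def sample_volumes(videos, size=(6, 6, 6), n=10000, seed=0):
        # Randomly crop n spatio-temporal volumes from videos, where each
        # video is an array shaped (frames, height, width).
        rng = np.random.default_rng(seed)
        st, sy, sx = size
        crops = []
        for _ in range(n):
            v = videos[rng.integers(len(videos))]
            t = rng.integers(v.shape[0] - st + 1)
            y = rng.integers(v.shape[1] - sy + 1)
            x = rng.integers(v.shape[2] - sx + 1)
            crops.append(v[t:t+st, y:y+sy, x:x+sx].ravel())
        return np.asarray(crops, dtype=np.float64)

    def learn_centroids(volumes, k=256):
        # Learn k representative volumes ("3D patches") with plain K-means.
        return KMeans(n_clusters=k, n_init=3).fit(volumes).cluster_centers_

    def triangle_encode(volumes, centroids):
        # Soft activation from [8]: f_k = max(0, mean_j(z_j) - z_k), where
        # z_k is the Euclidean distance from a volume to centroid k.
        z = np.linalg.norm(volumes[:, None, :] - centroids[None, :, :], axis=2)
        return np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)

Because volumes sampled from any source share the same fixed size, crops from
several datasets can simply be concatenated before clustering; this is one
simple way to realize the cross-dataset learning described above, though the
thesis's training-data synthesis method may be more involved.
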
To sum up, an action recognition system based on K-means clustering is
designed that can learn and extract features from different datasets.
Furthermore, we propose a hardware architecture for this algorithm; with
slight parameter changes, the architecture can be used for both image and
action recognition.

Table of Contents

Abstract  vii
1 Introduction  1
  1.1 Motivation  1
  1.2 The Applications of Computer Vision  1
  1.3 The Concept of Image and Action Recognition  5
  1.4 Thesis Organization  7
2 Background Knowledge of Visual Recognition Tasks  9
  2.1 Introduction  9
  2.2 Fundamental Pipeline of Recognition Tasks  10
    2.2.1 Pre-processing Stage  10
    2.2.2 Feature Extraction Stage  10
    2.2.3 Classification Stage  10
  2.3 Basics of Appearance-based Methods: Gabor Filters and Linear Filtering  11
  2.4 Basics of Appearance-based Methods: HMAX  13
  2.5 Conclusion  16
3 Learning Representative Features from K-means Clustering  19
  3.1 Introduction  19
  3.2 System Overview  20
  3.3 Pre-Processing Stage  22
    3.3.1 Normalization by Brightness and Contrast  22
    3.3.2 Whitening Transformation  23
  3.4 Prototypes Generation: K-means Clustering  26
  3.5 Feature Extraction  29
    3.5.1 Soft-Activation  29
    3.5.2 Summation Pooling  29
  3.6 Conclusion  30
4 Proposed Action Recognition System and Architecture Design  33
  4.1 Extension from Image to Video  33
  4.2 Experiment Results  34
  4.3 Training Data Synthesis  37
  4.4 Architecture Design  39
  4.5 Conclusion  46
5 Conclusion  49
Bibliography  51

Bibliography

[1] S. lifestyle, "Lily camera." http://szlifestyle.com/sz/2015/05/14/lily-drone-can-flies-alone-and-follows-you/, 2015. [Online; accessed 02-July-2014].
[2] T. Mukai, S. Hirano, H. Nakashima, Y. Kato, Y. Sakaida, S. Guo, and S. Hosoe, "Development of a nursing-care assistant robot RIBA that can lift a human in its arms," in IROS, pp. 5996-6001, IEEE, 2010.
[3] Wikipedia, "Google driverless car." http://en.wikipedia.org/wiki/Google_driverless_car, 2014. [Online; accessed 16-April-2014].
[4] A. Hyvärinen, J. Hurri, and P. O. Hoyer, Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, vol. 39. Springer Science & Business Media, 2009.
[5] M. Haghighat, "Gabor example." http://www.mathworks.com/matlabcentral/fileexchange/44630-gabor-feature-extraction, 2013. [Online; accessed 04-July-2014].
[6] X. Peng, L. Wang, X. Wang, and Y. Qiao, "Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice," arXiv preprint arXiv:1405.4506, 2014.
[7] J. Wu, Y. Zhang, and W. Lin, "Towards good practices for action video encoding," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 2577-2584, IEEE, 2014.
[8] A. Coates, A. Y. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in International Conference on Artificial Intelligence and Statistics, pp. 215-223, 2011.
[9] H. P. Moravec, "Obstacle avoidance and navigation in the real world by a seeing robot rover," tech. rep., DTIC Document, 1980.
[10] C. Harris and M. Stephens, "A combined corner and edge detector," in Alvey Vision Conference, vol. 15, p. 50, Citeseer, 1988.
[11] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[12] P. Scovanner, S. Ali, and M. Shah, "A 3-dimensional SIFT descriptor and its application to action recognition," in Proceedings of the 15th International Conference on Multimedia, pp. 357-360, ACM, 2007.
[13] A. Kläser and M. Marszalek, "A spatio-temporal descriptor based on 3D-gradients," 2008.
[14] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Computer Vision - ECCV 2006, pp. 404-417, Springer, 2006.
[15] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 9, pp. 1627-1645, 2010.
[16] O. Linde and T. Lindeberg, "Object recognition using composed receptive field histograms of higher dimensionality," in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, vol. 2, pp. 1-6, IEEE, 2004.
[17] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[18] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[19] B. Schiele and J. L. Crowley, "Object recognition using multidimensional receptive field histograms," in Computer Vision - ECCV '96, pp. 610-619, Springer, 1996.
[20] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in cortex," Nature Neuroscience, vol. 2, no. 11, pp. 1019-1025, 1999.
[21] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, "Robust object recognition with cortex-like mechanisms," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, no. 3, pp. 411-426, 2007.
[22] J. Mutch and D. G. Lowe, "Object class recognition and localization using sparse features with limited receptive fields," International Journal of Computer Vision, vol. 80, no. 1, pp. 45-57, 2008.
[23] T. Serre, L. Wolf, and T. Poggio, "Object recognition with features inspired by visual cortex," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2, pp. 994-1000, IEEE, 2005.
[24] D. B. Graham, "Cifar10 1st place." http://blog.kaggle.com/2015/01/02/cifar-10-competition-winners-interviews-with-dr-ben-graham-phil-culliton-zygmunt-zajac/, 2015. [Online; accessed 07-July-2014].
[25] Q. V. Le, "Building high-level features using large scale unsupervised learning," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8595-8598, IEEE, 2013.
[26] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, Oakland, CA, USA, 1967.
[27] S. P. Lloyd, "Least squares quantization in PCM," Information Theory, IEEE Transactions on, vol. 28, no. 2, pp. 129-137, 1982.
[28] E. W. Forgy, "Cluster analysis of multivariate data: efficiency versus interpretability of classifications," Biometrics, vol. 21, pp. 768-769, 1965.
[29] G. Hamerly and C. Elkan, "Alternatives to the k-means algorithm that find better clusterings," in Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 600-607, ACM, 2002.
[30] Y.-L. Boureau, J. Ponce, and Y. LeCun, "A theoretical analysis of feature pooling in visual recognition," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 111-118, 2010.
[31] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009.
[32] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, vol. 3, pp. 32-36, IEEE, 2004.
