Graduate student (English name): Chih-ta Lin
Thesis title (English): An Efficient Feature Selection and Extraction Analysis for Malware Behavior Classification
Advisor (English name): Nai-Jian Wang
Oral defense committee member (English name): Nai-Jian Wang
Keywords (English): Dynamic Malware Analysis; Data Classification; Dimensionality Reduction; Feature Selection; Feature Extraction
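This thesis proposes a generic and efficient method for analyzing and predicting the behavior of each type of malware. The method combines feature selection with feature extraction: it first selects effective features and then reduces their dimensionality, and this two-stage approach greatly reduces the dimensionality of the feature space before a classification model is built. After the system, network, and registry behaviors of each program under test have been recorded in a sandbox environment, feature reduction and classification proceed in five steps: (1) extracting n-gram text features of function calls from the log files, (2) building a malware classifier with SVM, (3) selecting effective feature sets with TF-IDF, (4) transforming and reducing the features with methods such as PCA and KPCA, and (5) combining the above steps into a fast training and prediction model.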
The explosive growth of malware poses a continuing threat to networks and operating systems. Signature-based methods are widely used for detecting malware. Unfortunately, they are unable to identify malware variants on the fly. On the other hand, behavior-based methods can effectively characterize the behaviors of malware. However, training and prediction for each specific malware family are time-consuming.
We propose a generic and efficient algorithm to classify malware. Our method combines the selection and the extraction of features, which significantly reduces the dimensionality of features for training and classification. Based on malware behaviors collected from a sandbox environment, our method proceeds in five steps: (a) extracting n-gram feature space data from behavior logs, (b) building a support vector machine (SVM) classifier for malware classification, (c) selecting a subset of features, (d) transforming high-dimensional feature vectors into low-dimensional feature vectors, and (e) selecting models.
Furthermore, we propose a Multi-Grouping (MG) algorithm for each feature reduction method. During the feature selection and extraction process, we show an easy way to identify the major behaviors of each malware type. Experiments were conducted on a real-world data set with 4,288 samples from 9 families. As a proof of concept, we evaluated our method in an online training simulation experiment. Our two-stage dimensionality reduction approach reduces the time cost significantly. The combination of MG TF-IDF, PCA, and SVM for online training can finish re-training and classification in seconds, which is sufficient to meet the online learning requirement of collecting malware behavior every minute. The experiments demonstrate the effectiveness and the efficiency of our approach.
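
Steps (a) and (c) of the method can be illustrated with a minimal sketch, using only the Python standard library. It is not the thesis implementation. The API call names, the function names, and the choice to rank n-grams by their summed TF-IDF score are assumptions made for illustration; the abstract does not specify the exact selection criterion or the details of the Multi-Grouping variant.

```python
import math
from collections import Counter


def api_ngrams(calls, n=2):
    """Return the n-grams (as tuples) over a sequence of API call names."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def tfidf_scores(docs):
    """Sum the TF-IDF weight of every n-gram over all samples.

    docs is a list of n-gram lists, one list per sample.
    Ranking by the summed score is an assumption of this sketch.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each n-gram once per sample

    scores = Counter()
    for doc in docs:
        if not doc:
            continue
        for gram, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[gram])
            scores[gram] += tf * idf
    return scores


def select_top_k(docs, k):
    """Keep the k highest-scoring n-grams; ties are broken alphabetically."""
    ranked = sorted(tfidf_scores(docs).items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]


def to_vector(doc, vocab):
    """Count vector of one sample restricted to the selected n-grams."""
    counts = Counter(doc)
    return [counts[gram] for gram in vocab]


# Hypothetical behavior logs: one list of API call names per sample.
logs = [
    ["CreateFile", "WriteFile", "CloseHandle"],
    ["CreateFile", "WriteFile", "RegSetValue"],
    ["Connect", "Send", "CloseHandle"],
]
docs = [api_ngrams(log, n=2) for log in logs]
vocab = select_top_k(docs, k=4)
vectors = [to_vector(doc, vocab) for doc in docs]
```

The reduced vectors would then be passed to the extraction step (PCA or KPCA) and to the SVM classifier, which this sketch leaves out.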
