Graduate student (English name): Ng, Ka Yan
Thesis title (English): A Data and Feature Selection Mechanism Based on Data Quality Evaluation for Intrusion Detection in IoT Networks
Advisor (English name): Tseng, Chin-yang Henry
Oral defense committee (English names): Huang, Chun-Ying; Tseng, Chin-yang Henry; Shen, Victor R.L.; Tsaur, Woei-Jiunn; Lin, Daw-Tung
Keywords (English): Data Selection; Data Sampling; Data Analyze; Dimensional Reduction; Feature Selection; Intrusion Detection; Information Security
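Therefore, to move freely through such a massive dataset, feature selection is one effective tool for filtering the data. However, this is far from enough. Because noisy data also affects the accuracy of a model's predictions, removing noisy data from the dataset is another option for addressing the class imbalance problem and improving model performance. Random sampling is a common tool for data sampling. However, it depends on luck, cannot be controlled, and cannot explain why particular data were selected. This thesis therefore proposes four data scoring methods to define and identify high-value data. In addition, this study proposes a system for selecting data and features to increase control over the data. The experiments use the large IoT-23 dataset for validation. The results show that the proposed data scoring, data sampling, and feature extraction raise the accuracy of multi-class intrusion detection to 97.709%, higher than the intrusion detection classification results obtained after feature extraction with Random Forest and Correlation.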

With the arrival of the big data era, datasets keep growing, and some traditional techniques have become outdated because they cannot process data at this scale. Skills such as data analysis, statistical techniques, and data-driven technologies are in high demand. To improve classification performance, combining several technologies, such as artificial intelligence, is a must. At the same time, we believe that the more valuable the data used in training, the higher the accuracy a model can achieve. When working with such a large volume of data, feature selection is one of the general tools for filtering. However, that is not enough. Because noisy data also affects the accuracy of a model's predictions, removing noisy data from the dataset is another option that addresses the class imbalance problem and improves model performance in one go. Random sampling is a common tool for data sampling. However, it depends on luck, cannot be controlled, and offers no explanation of why particular data were chosen. To address this problem, this thesis proposes four data scoring methods for defining and identifying high-value data. In addition, data and feature selection systems are proposed to increase control over the data. The experiments use a large dataset, IoT-23. The results show that the proposed data scoring, data sampling, and feature extraction raise the accuracy of multi-class intrusion detection to 97.709%, which is higher than the results of intrusion detection classification models that used Random Forest and Correlation for feature extraction.
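For readers unfamiliar with the baselines the abstract compares against, the following is a minimal sketch of correlation-based feature ranking and plain random undersampling. It is not the thesis's scoring or selection mechanism, which this record does not describe. The function names (`pearson`, `rank_features_by_correlation`, `random_undersample`), the binary toy data, and the seeds are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features_by_correlation(rows, labels):
    """Return feature indices sorted by |correlation with the label|, strongest first."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Ties are broken by feature index so the ranking is deterministic.
    return [j for _, j in sorted(scores, key=lambda s: (-s[0], s[1]))]


def random_undersample(rows, labels, seed=0):
    """Randomly keep as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], target))
    kept.sort()
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data (not IoT-23): feature 0 tracks the label,
# feature 1 is uniform noise, feature 2 is constant.
rng = random.Random(42)
labels = [0] * 90 + [1] * 10
rows = [[label + rng.gauss(0, 0.1), rng.random(), 1.0] for label in labels]

ranking = rank_features_by_correlation(rows, labels)
balanced_rows, balanced_labels = random_undersample(rows, labels, seed=1)
print(ranking)
print(Counter(balanced_labels))
```

The undersampling step balances the classes, but which majority-class rows survive depends only on the seed. That is the lack of control and explanation the abstract attributes to random sampling.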
Acknowledgements
Table of Contents
Table of Figures
List of Tables
Chapter 1. Introduction
1.1 Background
1.2 Problem Statement
Chapter 2. Related works
2.1 Resampling
2.1.1 Oversampling
2.1.2 Undersampling
2.2 Feature selection
2.3 Gini index
2.4 Entropy
2.5 Autoencoder
Chapter 3. Proposed Mechanism
3.1 System Architecture
3.2 Statistic procedure
3.3 Scoring phase
3.3.1 Scoring Method 1: Method 1
3.3.2 Scoring Method 2: Neighbour Class Difference (NCD)
3.3.3 Scoring Method 3: Total Class Difference (TCD)
3.3.4 Scoring Method 4: Baseline Class Difference (BCD)
3.4 Feature selection
3.5 Data selection
3.6 Preprocessing
3.7 Classification
3.8 Evaluation metrics
4.1 Experimental Equipment and Environment
4.2 Feature selection
4.3 Data Sampling
4.4 Classification
4.5 Experimental Result Comparison
Reference

Electronic full text (Internet release date: 2027-08-21)