National Digital Library of Theses and Dissertations in Taiwan

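Graduate student: 吳宗祐
Graduate student (romanized): WU, TSUNG-YU
Thesis title: 基於集成學習的數據外洩防護技術 (A Data Leakage Prevention Technique Based on Ensemble Learning)
Thesis title (English): MFEEL: An Ensemble Learning based Technique for Data Leakage Prevention
Advisor: 鄭伯炤
Advisor (romanized): CHENG, BO-CHAO
Oral defense committee: 蘇暉凱; 鄭伯炤; 邱茂清; 吳承崧; 侯廷昭
Oral defense committee (romanized): SU, HUI-KA; CHENG, BO-CHAO; CHIU, MAO-CHING; WU, CHENG-SHONG; HO, TING-CHAO
Oral defense date: 2024-01-09
Degree: Master's
Institution: 國立中正大學 (National Chung Cheng University)
Department: 通訊工程研究所 (Graduate Institute of Communications Engineering)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Publication year: 2024
Graduation academic year: 112 (ROC calendar)
Language: Chinese
Number of pages: 52
Keywords (Chinese): 集成學習; 特徵提取; 自然語言處理; 文件分類; 數據洩露防護
Keywords (English): Ensemble learning; Feature extraction; Natural language processing; Document classification; Data Leakage Prevention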
With the rapid development of network technology and the Internet of Things, unauthorized leakage of sensitive data poses a serious security threat to many businesses and organizations. Traditional defenses such as firewalls, virtual private networks (VPNs), and intrusion detection systems (IDS) are effective to some extent but offer no protection tailored to sensitive data. To identify and protect sensitive data more effectively and keep it from leaking, Data Leakage Prevention Systems (DLPS) have been developed in recent years. Among DLPS techniques, content analysis, which inspects exchanged files, stored data, and network traffic for sensitive information, has become a popular approach. However, detection with a single tool inevitably misses some cases. This study therefore proposes Multi Feature Extraction Ensemble Learning (MFEEL), an ensemble learning method that combines natural language processing with multiple feature extraction techniques to improve the identification of confidential documents. By selecting important sentences and classifying them with an ensemble of classifiers, MFEEL performs well across a range of performance metrics. Experimental results show that, compared with traditional single models, the method not only improves accuracy but also adapts better to complex datasets, outperforming traditional methods overall.
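The abstract does not name the specific feature extractors or classifiers that MFEEL combines, so the following is only a minimal sketch of the general idea: several feature extractors each feed their own classifier, and the ensemble decides by majority vote. The three extractors (word counts, word bigrams, character trigrams), the nearest-centroid classifier, the class names, and the toy training sentences are all hypothetical choices made for illustration, not the thesis's actual pipeline or data.

```python
import math
from collections import Counter


def word_features(text):
    """Bag of lowercase word tokens (hypothetical extractor 1)."""
    return Counter(text.lower().split())


def word_bigram_features(text):
    """Bag of adjacent word pairs (hypothetical extractor 2)."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def char_trigram_features(text):
    """Bag of overlapping character trigrams (hypothetical extractor 3)."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidClassifier:
    """Nearest-centroid classifier built on a single feature extractor."""

    def __init__(self, extract):
        self.extract = extract
        self.centroids = {}

    def fit(self, texts, labels):
        # Sum the feature counts of all training texts per class.
        for text, label in zip(texts, labels):
            self.centroids.setdefault(label, Counter()).update(self.extract(text))
        return self

    def predict(self, text):
        feats = self.extract(text)
        return max(self.centroids, key=lambda lab: cosine(feats, self.centroids[lab]))


class VotingEnsemble:
    """Majority vote over one classifier per feature extractor."""

    def __init__(self, extractors):
        self.members = [CentroidClassifier(e) for e in extractors]

    def fit(self, texts, labels):
        for member in self.members:
            member.fit(texts, labels)
        return self

    def predict(self, text):
        votes = Counter(member.predict(text) for member in self.members)
        return votes.most_common(1)[0][0]


# Toy training data invented for this sketch (not from the thesis).
train_texts = [
    "confidential merger plan and salary budget for the board",
    "internal salary report marked confidential do not share",
    "team lunch menu and weekend hiking trip photos",
    "public newsletter about the hiking club and lunch schedule",
]
train_labels = ["sensitive", "sensitive", "public", "public"]

ensemble = VotingEnsemble(
    [word_features, word_bigram_features, char_trigram_features]
).fit(train_texts, train_labels)

print(ensemble.predict("confidential salary budget report"))
print(ensemble.predict("hiking trip and lunch"))
```

Using an odd number of members avoids tied votes in the two-class case. A realistic experiment would replace the toy sentences with a corpus such as 20 Newsgroups, which the thesis lists as its dataset.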
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.1.1 Data Leakage Prevention (DLP)
1.1.2 Ensemble Learning
1.2 Research Motivation
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Overview
2.2 DLP-AG [1]
2.3 TCBT [2]
2.4 DLP-CoBAn [3]
2.5 Comparison of Related Work
Chapter 3 Methodology
3.1 Overview
3.2 MFEEL System Architecture
3.3 Phase Details
3.3.1 Preprocessing Phase
3.3.2 Training Phase
3.3.3 Testing Phase
Chapter 4 Experiments and Result Analysis
4.1 Experimental Environment
4.2 Experimental Datasets
4.2.1 20Newsgroups Dataset
4.3 Experimental Procedure
4.4 Experimental Results
4.4.1 Original Text
4.4.2 After Preprocessing
4.4.3 Performance Metrics
Chapter 5 Conclusion and Future Work
References


[1] S. Alneyadi, E. Sithirasenan, and V. Muthukkumarasamy, “Adaptable n-gram classification model for data leakage prevention,” in 2013, 7th International Conference on Signal Processing and Communication Systems (ICSPCS). IEEE, 2013, pp. 1–8.
[2] G. Lu, Y. Xia, J. Wang, and Z. Yang, “Research on text classification based on textrank,” in 2016 International Conference on Communications, Information Management and Network Security. Atlantis Press, 2016, pp. 319–322.
[3] G. Katz, Y. Elovici, and B. Shapira, “Coban: A context based model for data leakage prevention,” Information sciences, vol. 262, pp. 137–158, 2014.
[4] “What is dlp? how does dlp work? why should we use it?” https://technoveraco.com/what-is-dlp-how-does-dlp-work-why-should-we-use-it/, 2023, accessed: 2023-12-10.
[5] S. Alneyadi, E. Sithirasenan, and V. Muthukkumarasamy, “A semantics-aware classification approach for data leakage prevention,” in Information Security and Privacy: 19th Australasian Conference, ACISP 2014, Wollongong, NSW, Australia, July 7-9, 2014. Proceedings 19. Springer, 2014, pp. 413–421.
[6] U.S. Securities and Exchange Commission, “Sec adopts rules on cybersecurity risk management, strategy, governance, and incident disclosure by public companies,” Press Release, Jul. 2023, https://www.sec.gov/news/press-release/2023-139.
[7] “KPMG Taiwan releases the ‘2022 Taiwan Cybersecurity Trends Report’,” https://kpmg.com/tw/zh/home/media/press-releases/2022/09/kpmg-tw-released-cyber-risk-report-2022.html, 2022, accessed: 2023-12-10.
[8] S. Alneyadi, E. Sithirasenan, and V. Muthukkumarasamy, “A survey on data leakage prevention systems,” Journal of Network and Computer Applications, vol. 62, pp. 137–152, 2016.
[9] P. Ferguson and G. Huston, “What is a vpn?” 1998.
[10] P. S. Kenkre, A. Pai, and L. Colaco, “Real time intrusion detection and prevention system,” in Proceedings of the 3rd international conference on Frontiers of intelligent computing: theory and applications (FICTA) 2014: volume 1. Springer, 2015, pp. 405–411.
[11] S. Kaur, P. Kumar, and P. Kumaraguru, “Automating fake news detection system 2020.
[12] D. Gupta and R. Rani, “Improving malware detection using big data and ensemble learning,” Computers & Electrical Engineering, vol. 86, p. 106729, 2020.
[13] J. Kittler, M. Hatef, and R. Duin, “Combining classifiers,” in Proceedings of 13th International Conference on Pattern Recognition, vol. 2, 1996, pp. 897–901.
[14] W. B. Cavnar, J. M. Trenkle et al., “N-gram-based text categorization,” in Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, vol. 161175. Las Vegas, NV, 1994, p. 14.
[15] X. Tian, “Study on keyword extraction using word position weighted textrank,” Data Analysis and Knowledge Discovery, vol. 29, no. 9, pp. 30–34, 2013.
[16] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd, “Industrial-strength natural language processing in python,” spaCy, 2020.
[17] S. Bird, E. Klein, and E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc., 2009.
[18] R. Pramana, J. J. Subroto, A. A. S. Gunawan et al., “Systematic literature review of stemming and lemmatization performance for sentence similarity,” in 2022 IEEE 7th International Conference on Information Technology and Digital Applications (ICITDA). IEEE, 2022, pp. 1–6.
[19] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of massive data sets. Cambridge university press, 2020.
[20] K. Sparck Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of documentation, vol. 28, no. 1, pp. 11–21, 1972.
[21] P. F. Brown, V. J. Della Pietra, P. V. Desouza, J. C. Lai, and R. L. Mercer, “Class based n-gram models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467–480, 1992.
[22] R. Rosenfeld, “Two decades of statistical language modeling: Where do we go from here?” Proceedings of the IEEE, vol. 88, no. 8, pp. 1270–1278, 2000.
[23] C. D. Manning, An introduction to information retrieval. Cambridge university press, 2009.
[24] Ken Lang, “20 newsgroups data set,” http://qwone.com/~jason/20Newsgroups/, accessed: 2008-01-14.
[25] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in European conference on machine learning. Springer, 1998, pp. 137–142.
[26] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, pp. 273–297, 1995.

Electronic full text (Internet public release date: 2029-01-28)