National Digital Library of Theses and Dissertations in Taiwan

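Author:王冠渝
Author (English):Guan-Yu Wang
Thesis title:基於 CodeBERT/GraphCodeBERT 和深度學習模型之網頁木馬偵測研究
Thesis title (English):WebShell Detection Based on CodeBERT/GraphCodeBERT and Deep Learning Model
Advisor:王尉任
Advisor (English):Wei-Jen Wang
Degree:Master's
Institution:National Central University (國立中央大學)
Department:Department of Computer Science and Information Engineering
Discipline:Engineering
Field:Electrical Engineering and Computer Science
Thesis type:Academic thesis
Publication year:2024
Graduation academic year:112 (ROC calendar)
Language:English
Number of pages:80
Keywords (Chinese):網頁木馬 (WebShell); CodeBERT; GraphCodeBERT; 門控遞迴單元 (gated recurrent unit); 雙向門控遞迴單元 (bidirectional gated recurrent unit); 位元組對編碼 (byte pair encoding)
Keywords (English):WebShell; CodeBERT; GraphCodeBERT; GRU; Bidirectional GRU; BPE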
WebShell attacks have long been a significant challenge for website administrators. The scalability and distributed nature of cloud services can exacerbate the potential risks and impact of such attacks, making them one of the main security threats in cloud environments. Consequently, various strategies have been proposed in recent years to guard against WebShell attacks. This thesis presents two effective deep-learning-based methods for detecting WebShells. Both methods use Byte Pair Encoding (BPE) to encode the WebShell source code as strings and split the input into tokens. To generate word embedding vectors, Method 1 uses CodeBERT, while Method 2 uses GraphCodeBERT. The pre-trained CodeBERT and GraphCodeBERT models share the same base architecture and both model code effectively, but GraphCodeBERT further improves code comprehension by considering the relationships within the code and its internal structure. Both methods then use a GRU or a Bidirectional GRU to detect whether the code contains a WebShell. In the experimental phase, the two methods were trained with various hyperparameters, and K-Fold cross-validation was used to confirm the best results and the corresponding models. The resulting models for Methods 1 and 2 were then evaluated on a test dataset, and the results were compared with related work. Method 1 achieved an accuracy of 99.54%, a precision of 98.42%, a recall of 99.29%, and an F1 score of 98.85%. Method 2 performed even better, with an accuracy of 99.65%, a precision of 99.29%, a recall of 99.29%, and an F1 score of 99.29%. These results demonstrate significant improvements over previous methods. Moreover, compared with other open-source or commercial tools, the proposed methods perform better on all metrics. Notably, they show outstanding accuracy and precision on unseen data and obfuscated code, demonstrating strong detection capability and practical value.
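As a quick consistency check on the figures reported in the abstract, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The sketch below is illustrative only; the function name `f1_from_precision_recall` is an assumption and does not come from the thesis.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall.

    Both inputs are expected on the same scale (percent here).
    The function name is hypothetical, chosen for this illustration.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (in percent).
method1_f1 = f1_from_precision_recall(98.42, 99.29)
method2_f1 = f1_from_precision_recall(99.29, 99.29)

print(f"Method 1 F1: {method1_f1:.2f}%")  # abstract reports 98.85%
print(f"Method 2 F1: {method2_f1:.2f}%")  # abstract reports 99.29%
```

Both recomputed values agree with the reported F1 scores to two decimal places. For Method 2, precision and recall are equal, so the harmonic mean is that same value.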
Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter I Introduction
1-1 Research Background
1-2 Motivation
1-3 Contribution
Chapter II Background Knowledge
2-1 Malware Analysis
2-1-1 Static and Dynamic Malware Analysis
2-2 PHP WebShell
2-2-1 Obfuscated Code
2-3 Tokenization Method
2-3-1 Subword Encoding
2-3-2 Byte Pair Encoding (BPE)
2-4 Word Embedding Model
2-4-1 Transformer
2-4-2 BERT
2-4-3 RoBERTa
2-4-4 CodeBERT
2-4-5 GraphCodeBERT
2-4-6 Difference Between CodeBERT and GraphCodeBERT
2-5 Classification Model
2-5-1 LSTM
2-5-2 GRU and Bidirectional GRU
Chapter III Related Work
3-1 Based on Machine Learning
3-2 Based on Deep Learning
Chapter IV Proposed Methods
4-1 Data Pre-processing
4-2 Tokenizer
4-2-1 Method 1 and Method 2
4-3 Word Embedding model
4-3-1 Method 1
4-3-2 Method 2
4-4 Classification Model
4-4-1 Method 1 and Method 2
4-4-2 Training procedures
Chapter V Experiment and Evaluation
5-1 Dataset and Configuration
5-2 Evaluation Matrix
5-3 Performance of Proposed methods
5-3-1 Stage 1 of Training Procedure: Identify optimal hyper-parameters
5-3-2 Stage 2 of Training Procedure: K-Fold Cross-Validation
5-3-3 Experiment 1: Test Data for generalizable testing
5-3-4 Experiment 2: Obfuscated Code
5-3-5 Experiment 3: Comparison with related works
5-3-6 Experiment 4: Comparison with other related tools
Chapter VI Conclusion and Future Work
References