National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

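Graduate student: Si-Chen Lin (林思辰)
Thesis title: A Static Malware Detection Approach Based on Vectorized Binary Semantic Representation (基於程式語義表示向量之靜態惡意程式偵測方法)
Advisor: Chin-Laung Lei (雷欽隆)
Oral examination committee: Sy-Yen Kuo (郭斯彥), Hsu-Chun Yen (顏嗣鈞)
Oral defense date: 2021-07-23
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Electrical Engineering
Discipline: Engineering
Sub-discipline: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2021
Graduation academic year: 109 (ROC calendar, i.e. 2020-2021)
Language: English
Number of pages: 60
Keywords: Binary Embedding Vector; Explainable Machine Learning; Graph Neural Network; Malware Detection; Semantic Representation
DOI: 10.6342/NTU202101787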
Malicious binaries have caused both data and monetary losses, and their number continues to grow rapidly. Faced with large volumes of unknown attack binaries, security analysts must quickly identify the malicious parts and report the critical behaviors within them. Because manual analysis is slow and inefficient, we believe that an automated approach for identifying essential functions in binaries is the key to accelerating the analysis process.

This study proposes a malware detection system that differentiates malicious binaries from benign ones. In addition, by combining function call graphs with function embeddings and graph neural networks, the proposed system identifies essential functions in binaries and visualizes the call relationships among the functions involved.

We evaluate the proposed system on executable binaries for the Windows operating system, which has the largest market share and the most malware. The evaluation results show that our system achieves detection performance (97.0% accuracy and 97.6% recall) similar to that of state-of-the-art malware detection models. Moreover, it gives an intuitive, easy-to-understand explanation of the model's predictions by visualizing and correlating essential functions.
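
The pipeline described in the abstract, a function call graph whose nodes carry function embeddings and are processed by a graph neural network, can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: it uses a single plain graph-convolution layer (normalized adjacency, ReLU) instead of the deeper model named in the table of contents, random vectors stand in for learned function embeddings, and the toy call graph, all dimensions, and every function name are hypothetical. The weights are untrained, so the resulting score carries no meaning beyond showing the data flow.

```python
import math
import random


def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected view of the call graph."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        adj[i][i] = 1.0  # self-loop so each function keeps its own features
    for caller, callee in edges:
        adj[caller][callee] = 1.0
        adj[callee][caller] = 1.0
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(a_hat, features, weight):
    """One graph-convolution step: ReLU(A_hat @ X @ W)."""
    hidden = matmul(matmul(a_hat, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


def malware_score(a_hat, features, weight, readout):
    """Mean-pool node states into one graph vector, then apply a logistic unit."""
    hidden = gcn_layer(a_hat, features, weight)
    pooled = [sum(col) / len(hidden) for col in zip(*hidden)]
    logit = sum(p * r for p, r in zip(pooled, readout))
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
edges = [(0, 1), (0, 2), (2, 3)]  # hypothetical (caller, callee) pairs
num_functions, embed_dim, hidden_dim = 4, 8, 4

# Random placeholders for function embeddings and model weights (assumptions).
features = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(num_functions)]
weight = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(embed_dim)]
readout = [rng.uniform(-1, 1) for _ in range(hidden_dim)]

a_hat = normalized_adjacency(num_functions, edges)
score = malware_score(a_hat, features, weight, readout)
print(f"untrained malware score: {score:.3f}")
```

In a trained system the weights would be fitted on labeled binaries, and an explanation method would then rank which functions and call edges contributed most to the score.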
Verification Letter from the Oral Examination Committee
Acknowledgements
Abstract (in Chinese)
Abstract
Contents
List of Figures
List of Tables
Notations
Chapter 1 Introduction
1.1 Motivation
1.2 Solution Overview
Chapter 2 Background
2.1 Binary Analysis
2.1.1 Static Analysis
2.1.2 Dynamic Analysis
2.2 Representation Vector
2.2.1 Embedding Methods
2.2.2 Embedding Levels
2.3 Graph Neural Network
2.3.1 Models
2.3.2 Explanation
Chapter 3 Related Work
3.1 Classical Machine Learning-Based Approaches
3.1.1 PE Header Information Feature
3.1.2 Comprehensive Static Analysis Feature
3.2 Neural Network Embedding-Based Approaches
3.2.1 Raw Binary Feature
3.2.2 Binary Image Feature
3.2.3 Graph Structure Feature
Chapter 4 Methodology
4.1 Problem Definition
4.2 System Overview
4.3 Models
4.3.1 Self-Attentive Function Embeddings
4.3.2 DeepGCN
4.4 Explanation
Chapter 5 Evaluation
5.1 Datasets
5.2 Experiment Results
5.2.1 RQ1: Detection Performance
5.2.2 RQ2: Detect Real-World Samples
5.2.3 RQ3: How Many Samples are Required
5.2.4 RQ4: Impacts of Imbalanced Dataset
5.3 Case Study
5.3.1 Samples in Goodware and VirusShare Datasets
5.3.2 APT Group Sample in APTMalware Dataset
5.3.3 Well-known Malware Samples
5.4 Discussion
5.4.1 Real-world Scenario
5.4.2 Packed Samples
Chapter 6 Conclusion and Future Work
References
Appendix A: Implementation Details
A.1 Model Selection
A.2 Hyperparameter Optimization