National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

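Graduate student: 李向融
Graduate student (romanized): LI, XIANG-RONG
Thesis title (Chinese): 網路機器人分類與行為剖析
Thesis title (English): Classification and Behavior Analysis of Web Robot
Advisor: 施東河
Advisor (romanized): SHIH, DONG-HER
Oral defense committee: 張怡秋, 張碩毅, 施東河
Oral defense committee (romanized): CHANG, I-CHIU; CHANG, SHE-I; SHIH, DONG-HER
Oral defense date: 2017-07-04
Degree: Master's
Institution: 國立雲林科技大學 (National Yunlin University of Science and Technology)
Department: 資訊管理系 (Department of Information Management)
Discipline: Computer science
Sub-discipline: General computer science
Thesis type: Academic thesis
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar, 2016-2017)
Language: Chinese
Number of pages: 44
Keywords (Chinese): 網路機器人, 爬蟲, 偵測, 隱私, 資訊安全, k-平均演算法
Keywords (English): Web Robot, Crawler, Detection, Privacy, Information Security, K-means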
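In today's information-rich era, the web holds far more data, of far more varied types, than anyone could collect by hand. Programmers therefore build web robots (also called crawlers or spiders) that gather data automatically for a specific purpose; price-comparison sites, for example, use crawlers to collect product and price information from the major shopping sites. No strict specification or standard governs how web robots are designed, so robots of different types or design logic can access sites in quite different ways. Poorly designed or malicious robots tend to place extra load on web servers, and some crawlers fetch and publish documents that a site holds but never made public, infringing privacy or copyright. This study uses a real dataset from an educational institution: it extracts web robot sessions, clusters them with K-means, names each cluster according to its feature values, and examines what characterizes the robots in each cluster and whether their behavior threatens system security or infringes privacy or copyright. We then use logistic regression to build a classification model from the clustering results and apply it to two public datasets from different domains, classifying the web robots found in those datasets. The clustering yields 12 groups of web robots with distinct behavior, each named for its characteristics. We also discuss the behavior of well-known web robots and the problems they may cause, such as system security, privacy, and copyright issues, giving site administrators a reference for deciding whether to restrict or block web robots according to their behavior.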
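The abstract mentions extracting web robot sessions from server logs without giving the details. As a rough illustration only, the following Python sketch groups requests into sessions and derives a few per-session features of the kind the table of contents hints at (robots.txt access, empty referrer rate). The 30-minute inactivity timeout, the grouping key (IP plus User-Agent), the record field names, and the function names `split_sessions` and `session_features` are all assumptions made for this example, not the thesis's actual procedure.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: a 30-minute inactivity timeout; the abstract states no threshold.
SESSION_TIMEOUT = timedelta(minutes=30)


def split_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group requests by (ip, user_agent), then split on gaps longer than timeout.

    Each request is a dict with the hypothetical keys
    'ip', 'user_agent', 'time' (datetime), 'path', and 'referrer'.
    """
    by_client = defaultdict(list)
    for req in requests:
        by_client[(req["ip"], req["user_agent"])].append(req)

    sessions = []
    for client_requests in by_client.values():
        client_requests.sort(key=lambda r: r["time"])
        current = [client_requests[0]]
        for req in client_requests[1:]:
            # A long pause since the previous request closes the session.
            if req["time"] - current[-1]["time"] > timeout:
                sessions.append(current)
                current = []
            current.append(req)
        sessions.append(current)
    return sessions


def session_features(session):
    """Compute a few illustrative features for one session."""
    total = len(session)
    empty_referrers = sum(1 for r in session if r["referrer"] in ("", "-"))
    return {
        "requests": total,
        "robots_txt": any(r["path"] == "/robots.txt" for r in session),
        "empty_referrer_rate": empty_referrers / total,
    }
```

Feature dictionaries like these would then be turned into numeric vectors before clustering; which features the thesis actually uses is not stated in this record.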

As web robots are applied to an ever wider range of data collection purposes, they cause excessive access, web attacks, and privacy and copyright infringement. Identifying the behavior of web robots, malicious ones in particular, gives site administrators a basis for blocking or restricting a particular robot. Much work has addressed distinguishing humans from web robots, but studies that classify web robots and analyze their behavior remain few. This study therefore classifies web robots using real-world public datasets and highlights the characteristics of each class.
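The Chinese abstract names K-means as the clustering method applied to robot sessions. The sketch below is a minimal, generic implementation of Lloyd's algorithm in plain Python, included only to make that step concrete. The function name `kmeans`, the explicit initial centroids, and the two-dimensional toy feature vectors are assumptions for illustration; the thesis's own features, initialization, and tooling are not given in this record.

```python
import math


def kmeans(points, centroids, iterations=100):
    """Lloyd's algorithm: alternate assignment and update steps until stable.

    `points` and `centroids` are sequences of equal-length numeric tuples.
    Returns (labels, centroids).
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its centroid
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return labels, centroids


# Hypothetical session vectors: (empty referrer rate, share of image requests).
sessions = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]
labels, centers = kmeans(sessions, centroids=[sessions[0], sessions[2]])
```

The abstract reports 12 clusters; the toy call above uses two centroids only to stay readable. In practice features are usually scaled to comparable ranges first, since K-means relies on Euclidean distance.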
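Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
1. Introduction
2. Literature Review
2.1 Web Robots
2.1.1 Crawler Classification
2.1.2 Crawling Strategies
2.1.3 Purposes of Web Robots
2.1.4 Robots Exclusion Protocol
2.1.5 Malicious Web Robots
2.2 Classification of Web Robot Detection Techniques
2.2.1 Syntactic Analysis
2.2.2 Traffic Patterns
2.3 Web Robot Classification
2.4 Data Mining Methods
3. Research Method
3.1 Scenario Diagram of Web Robots Accessing Web Pages
3.2 Session Identification
3.3 Feature Extraction
3.4 Web Robot Extraction
3.4.1 User-Agent
3.4.2 robots.txt Access Behavior
3.4.3 Empty Referrer Rate
3.4.4 Web Robot Database
4. Experimental Design and Results
4.1 Dataset Description
4.1.1 Educational Institution
4.1.2 Tianjin Pilot Free Trade Zone, Tianjin Airport Area
4.1.3 Secrepo
4.2 Experimental Procedure
4.2.1 Data Preprocessing
4.2.2 Cluster Analysis
4.2.3 Classifier Training
4.3 Experimental Results
4.3.1 Web Robot Clustering and Naming
4.3.2 Classification of Other Public Datasets
4.4 Malicious Robots
4.4.1 System Security Issues
4.4.2 Privacy and Copyright Issues
4.4.3 Benign Web Robots
4.5 Well-Known Search Engine Crawlers
4.6 Findings and Discussion
4.6.1 Malicious Web Robots
4.6.2 Indexing Robots
4.6.3 Image Robots
4.6.4 Document Robots
5. Conclusions and Contributions
5.1 Research Limitations
5.2 Future Research
References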

