National Digital Library of Theses and Dissertations in Taiwan

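Student: Yen-Cheng Juan (阮彥程)
Thesis title: Exploring User Browsing Behaviors for Objectionable Content Filtering (探索使用者瀏覽行為於不當內容過濾)
Advisor: Hsin-Hsi Chen (陳信希)
Oral examination committee: Pu-Jen Cheng (鄭卜仁), June-Jei Kuo (郭俊桔), Ming-Feng Tsai (蔡銘峰)
Oral defense date: 2013-07-16
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Computer Science and Information Engineering (資訊工程學研究所)
Field: Engineering
Discipline: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2013
Academic year of graduation: 101 (ROC academic year)
Language: Chinese
Number of pages: 68
Keywords (translated from Chinese): user intent; browsing behavior; information retrieval; web page classification; web click data
Keywords (English): Objectionable Content Filtering; Internet Censorship; User Browsing Log; User Click Behavior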
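This thesis studies the behavior and intent of Internet users as they browse the web, uses them to predict the category of the next page a user will visit, and applies the predictions to filtering objectionable web content such as pornography and gambling.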
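Every page a user visits within a period of time belongs to a category, so a run of browsing actions can be represented as a category sequence and used to predict the category of the page the user visits next. Beyond the category sequence, and in order to judge a page's category at the earliest possible moment, this study extracts additional features from the URL itself to raise prediction accuracy. The method does not read the page content behind the URL or any information in its source code. This is what allows the category to be determined in real time, at the moment the user clicks, and a decision to be made on whether the user should be allowed to view the page.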
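As an illustration of URL-only features, here is a minimal Python sketch. It is not the thesis's extraction code: the function name `url_features`, the choice of fields, and the tokenization rule are assumptions made for this example. The English abstract also lists an IP feature; resolving a hostname to an address needs a network lookup, so the sketch only records whether the host is already an IP literal.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Hypothetical helper: derives features from the URL string alone,
# without fetching the page it points to.
def url_features(url: str) -> dict:
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Fall back to the scheme's well-known port when none is given.
    port = parts.port or {"http": 80, "https": 443}.get(parts.scheme, 0)
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    # Last hostname label, e.g. "com"; empty for IP literals.
    tld = "" if host_is_ip or "." not in host else host.rsplit(".", 1)[1]
    # Bag-of-words tokens taken from the path and query only (an assumption).
    text = (parts.path + " " + parts.query).lower()
    tokens = [t for t in re.split(r"[^a-z0-9]+", text) if t]
    return {"hostname": host, "tld": tld, "port": port,
            "host_is_ip": host_is_ip, "tokens": tokens}

if __name__ == "__main__":
    print(url_features("http://www.example.com:8080/news/sports?id=7"))
```

Because nothing is downloaded, a feature vector of this kind is available at click time, which is the property the paragraph above relies on.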
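The study first examines the category prediction models adopted in previous work, analyzes their limitations in classification and the problems they may encounter, and proposes improvements to raise classification accuracy. On the same dataset, the proposed improvements raise accuracy by about 20%.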
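The sequence model used in this study, which draws on the user's sequence of clicked categories together with several kinds of URL-related features, reaches 74.60% classification accuracy on multiclass prediction over the 78 categories used by TMUFE. For objectionable content filtering, it reaches 93.97% accuracy and a 92.65% blocking rate while holding the over-blocking (false-positive) rate to 5.71%.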
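The three binary figures can be read through a confusion matrix. The sketch below assumes the usual definitions: blocking rate as the share of objectionable accesses that are blocked, and over-blocking rate as the share of acceptable accesses that are wrongly blocked. The thesis gives its own definitions in Section 4.3.2, which this record does not reproduce, so both the definitions and the name `filtering_metrics` are assumptions.

```python
# Hypothetical helper; True means "objectionable" in both lists.
def filtering_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # blocked, correctly
    fn = sum(1 for t, p in pairs if t and not p)      # missed
    fp = sum(1 for t, p in pairs if not t and p)      # blocked, wrongly
    tn = sum(1 for t, p in pairs if not t and not p)  # allowed, correctly
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        # Assumed definition: recall on the objectionable class.
        "blocking_rate": tp / (tp + fn) if tp + fn else 0.0,
        # Assumed definition: false-positive rate on acceptable pages.
        "over_blocking_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

if __name__ == "__main__":
    truth = [True, True, True, False, False, False, False, False]
    preds = [True, True, False, True, False, False, False, False]
    print(filtering_metrics(truth, preds))
```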
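In addition, the thesis proposes a way to use the classifier's output to build a dynamic blacklist of objectionable pages. This blacklist raises recall (95.81%) without raising the over-blocking rate (4.94%). It achieves recall comparable to that of a traditional blacklist while avoiding the sharp rise in over-blocking that traditional blacklists incur.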
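One way such a blacklist could work is sketched below. The abstract does not describe the actual construction, so everything here is hypothetical: the class `DynamicBlacklist`, the hostname-level granularity, and the `min_hits` threshold that delays listing a host until the classifier has flagged it several times.

```python
from collections import Counter
from urllib.parse import urlsplit

class DynamicBlacklist:
    """Hypothetical sketch: hosts are listed after repeated classifier hits."""

    def __init__(self, min_hits: int = 2):
        self.min_hits = min_hits      # assumed threshold, not from the thesis
        self.hits = Counter()         # per-host count of flagged URLs
        self.blocked_hosts = set()

    def should_block(self, url: str, classify) -> bool:
        host = urlsplit(url).hostname or ""
        if host in self.blocked_hosts:
            return True               # blocked by the list, no model call
        if classify(url):             # classify(url) -> True if objectionable
            self.hits[host] += 1
            if self.hits[host] >= self.min_hits:
                self.blocked_hosts.add(host)
            return True
        return False

if __name__ == "__main__":
    toy = lambda url: "casino" in url   # stand-in for a real classifier
    bl = DynamicBlacklist(min_hits=2)
    for u in ["http://bad.example/casino/a", "http://bad.example/casino/b",
              "http://bad.example/about", "http://good.example/about"]:
        print(u, bl.should_block(u, toy))
```

A higher threshold trades recall for a lower risk of over-blocking a host on the strength of a single misclassified URL.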
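Finally, the thesis performs a multi-faceted error analysis of the best-performing model to understand the causes of misclassification. It also runs a series of experiments to examine whether different experimental designs can improve model performance, and compares the strengths and weaknesses of each approach.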


This research explores users’ browsing intents to predict the category of a user’s next access during web surfing, and applies the results to objectionable content filtering. A user’s access trail, represented as a sequence of URLs, reveals the contextual information of web browsing behavior. We extract behavioral features of each clicked URL (hostname, gTLD, IP, port, and bag-of-words) to develop a linear-chain CRF model for context-aware category prediction.
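For reference, a linear-chain CRF in its standard textbook form models the label sequence y = (y_1, ..., y_T), here the page categories, given the observation sequence x, here the clicked URLs and their features. The notation below (feature functions f_k, weights λ_k, normalizer Z) is the conventional one and is not copied from the thesis, whose own derivation is in Section 3.2.1.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
```

Each f_k can look at the previous category, the current category, and any feature of the URL at position t, which is how both the click context and the URL features enter a single model.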
Large-scale experiments show that our method achieves 74.60% accuracy for multiclass classification on a dataset with 78 categories and 93.97% accuracy for identifying objectionable accesses, without requesting the corresponding page content. The proposed model attains a blocking rate of 92.65% while maintaining a low over-blocking rate of 5.71% for collaboratively filtering objectionable content on the dynamic web.
Furthermore, this research proposes a method for generating a dynamic blacklist. It achieves a high blocking rate (95.81%), like traditional blacklists, but without their over-blocking problem (4.94%). In practice, it complements intelligent content analysis by keeping up with rapidly changing objectionable content from the perspective of users’ behavior.


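Oral Examination Committee Certification
Acknowledgements
Abstract (Chinese)
ABSTRACT
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.2.1 User Browsing Behavior
1.2.2 The Problem of Blocking Objectionable Web Pages
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Related Work on Objectionable Content Filtering
2.2 Hybrid Model (混合模型)
2.2.1 Hybrid Model Architecture
2.2.2 Limitations and Problems of the Hybrid Model
Chapter 3 Web Page Category Prediction and Filtering
3.1 Collaborative Hybrid Model
3.2 Conditional Random Field Model
3.2.1 Introduction to the Conditional Random Field Model and Derivation of Its Formulas
3.2.2 Applying the Conditional Random Field Model to Predicting URLs of "Unknown" Category
3.2.3 Extracting URL Features
3.3 Dynamic Blacklist
Chapter 4 Datasets, Tools, and Evaluation Criteria
4.1 User Click Data
4.1.1 Categories in the Dataset
4.1.2 Training and Test Datasets
4.2 Tools Used in the Experiments
4.3 Evaluation Criteria for the Experimental Results
4.3.1 Multiclass Evaluation
4.3.2 Binary Classification Evaluation
Chapter 5 Experimental Results and Filtering Simulation
5.1 Improvements to the Collaborative Hybrid Model
5.2 Experimental Results of the Conditional Random Field with Different Feature Combinations
5.2.1 Effect of Different Feature Combinations on Conditional Random Field Performance
5.2.2 Performance Difference between the Conditional Random Field and the Support Vector Machine with the Same Features
5.3 Effect of Context Length on Conditional Random Field Performance
5.4 Comparison of Experimental Results across Models
5.5 Filtering Simulation with the Dynamic Blacklist
Chapter 6 Extended Discussion
6.1 Category Concentration among URLs Sharing a Domain Name
6.2 Performance Comparison across Generic Top-Level Domain Types
6.3 Performance on Each Objectionable Content Category
6.4 Per-User and User-Cluster Models
6.4.1 Personal Model
6.4.2 Cluster Model
Chapter 7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
References
Appendix A List of Generic Top-Level Domains
Appendix B Category Counts in the Training Dataset
Appendix C Category Counts in the Test Dataset
Appendix D Category Counts in the Test Cases
Appendix E Binary Classification Results of Each Model
Appendix F Daily Binary Classification Performance of the Domain-Name-Strategy Dynamic Blacklist
Appendix G Dataset Format Used by the CRF Model
Appendix H Template File Used by the CRF Model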



