
National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

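Graduate student: 李龍豪
Graduate student (romanized): Lung-Hao Lee
Thesis title (Chinese): 一個蒐集色情黑名單的網路內容分類機制之研究
Thesis title (English): A Web Content Classification System for Pornographic Blacklist Generation
Advisor: 陸承志
Advisor (romanized): Cheng-Jye Luh
Degree: Master's
Institution: Yuan Ze University (元智大學)
Department: Graduate Institute of Information Management
Discipline: Computer Science
Field: General Computer Science
Thesis type: Academic thesis
Year of publication: 2005
Academic year of graduation: 93 (ROC calendar)
Language: English
Number of pages: 56
Keywords (Chinese): 不當資訊過濾; 色情黑名單; 網路內容分級; 卡方分配; 遞增式更新機制
Keywords (English): Inappropriate Content Filtering; Pornographic Blacklist; Web Content Rating; Chi-Square Distribution; Incremental Update Mechanism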
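With the Internet now widespread and open, users can easily reach inappropriate website content simply by entering relevant keywords into a search engine, so classifying and managing web content has become an urgent issue. Focusing on the pornographic category of inappropriate content, this study proposes and implements a classification method based on the chi-square distribution to collect a pornographic blacklist. The method works on the textual content of websites: it first estimates the pornographic tendency of each individual token and then uses the chi-square distribution to compute an indicator value, which places each web page in one of three classes: Porn, Unsure, or Non-Porn. The URLs of pages classified as Porn are recorded in a blacklist that can serve as the basis for filtering online pornography. We also design an incremental update mechanism that uses the list of pornographic hubs already collected to gather newly added pornographic pages efficiently. The implemented system can classify both Chinese and English web pages and has so far collected more than 600,000 pornographic web pages. Experimental results show that the system not only achieves higher precision than other systems but also requires markedly less training time.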
The proliferation of the Web has given users access to a growing amount of inappropriate material (e.g., pornography, drugs, violence) on the Internet. Web content rating and filtering have therefore received intensive attention. This study proposes and implements a chi-square based method for classifying web pages in order to generate a pornographic blacklist. For each web page under investigation, an indicator value of pornography is calculated by applying a chi-square combining scheme to the porn tendencies of the tokens contained in that page. According to its indicator value, a web page is classified into one of three categories: Porn, Unsure, or Non-Porn. Web pages in the Porn category are put on a blacklist. An incremental update mechanism is also provided for collecting newly added pornographic sites by recursively crawling pornographic hubs. The present implementation can classify English and Chinese web pages and has collected more than 0.6 million pornographic URLs. Experimental results indicate that the present implementation achieves a higher precision rate in detecting pornographic web pages than related work while requiring less training time.
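The abstract does not spell out the combining formula, so the following is only a minimal sketch of one plausible reading: Fisher's inverse chi-square procedure as popularized for spam filtering by Gary Robinson and used in SpamBayes. The function names, the clamping constant `eps`, and the thresholds `low=0.1` and `high=0.9` are hypothetical illustration values, not figures taken from the thesis.

```python
import math


def chi2_survival(x2, dof):
    """Return P(X >= x2) for a chi-square variable with an even number of
    degrees of freedom, using the closed-form series for even dof."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0.
    return min(total, 1.0)


def indicator_value(tendencies, eps=1e-6):
    """Combine per-token porn tendencies (each in [0, 1]) into one score
    in [0, 1]. Values near 1 suggest Porn, values near 0 suggest Non-Porn."""
    if not tendencies:
        return 0.5  # no evidence either way
    # Clamp so that log() never sees 0 (eps is an assumed constant).
    ps = [min(max(p, eps), 1.0 - eps) for p in tendencies]
    dof = 2 * len(ps)
    porn = 1.0 - chi2_survival(-2.0 * sum(math.log(1.0 - p) for p in ps), dof)
    clean = 1.0 - chi2_survival(-2.0 * sum(math.log(p) for p in ps), dof)
    return (porn - clean + 1.0) / 2.0


def classify(tendencies, low=0.1, high=0.9):
    """Map the indicator value to one of the three categories named in the
    abstract. The thresholds are hypothetical placeholders."""
    value = indicator_value(tendencies)
    if value >= high:
        return "Porn"
    if value <= low:
        return "Non-Porn"
    return "Unsure"
```

Under these assumptions, a page whose tokens all lean strongly one way lands in Porn or Non-Porn, while pages with weak or conflicting evidence fall into Unsure, which matches the three-way split described in the abstract.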
Title Page
Oral Defense Committee Approval Form
Ministry of Education Authorization Form
National Science Council Authorization Form
National Central Library Authorization Form
Abstract (Chinese)
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables

Chapter 1 Introduction
1.1 Motivation
1.2 Goals
Chapter 2 Literature Review
2.1 Detection of Pornographic Web Pages
2.1.1 Content-based Approach
2.1.2 Log-based Approach
2.2 Hub Sites
2.3 Statistical Text Analysis
Chapter 3 Collection of Pornographic Web Pages
3.1 System Architecture
3.2 Web Crawling
3.3 Training
3.4 Classification
3.5 Incremental Update Mechanism
Chapter 4 Simulation of System Configurations
4.1 Simulation Setup
4.2 Thresholds Configuration
4.3 Number of Effective Tokens
4.4 The Best Configuration
Chapter 5 System Implementation and Experiments
5.1 System Implementation
5.2 Experiment Setting
5.3 Experimental Results
5.3.1 False Positive Rate
5.3.2 Precision Rate
5.3.3 Incremental Ratio
Chapter 6 Conclusion and Future Directions
6.1 Conclusion
6.2 Future Directions
References
Appendix