National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

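Author: Fu-Hsiang Huang (黃福祥)
Thesis title: A Fast Accurate Proxy for Multi-Language Text Webpage Classification (一個針對多語系網頁內容過濾的快速精確之代理伺服器)
Advisor: Ying-Dar Lin (林盈達)
Degree: Master's
Institution: National Chiao Tung University (國立交通大學)
Department: Computer and Information Science (資訊科學系所)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2004
Academic year of graduation: 92 (ROC calendar, i.e. 2003-2004)
Language: English
Number of pages: 26
Keywords (Chinese): 內容過濾; 文件分類; N-gram; 及早阻擋; 及早通過
Keywords (English): content filtering; text classification; N-gram; early blocking; early bypassing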
Real-time content analysis is an important technique for Web content filtering because it offers low maintenance cost and low storage requirements. However, it can suffer from lower accuracy and longer processing time. Because Web pages written in different languages complicate content analysis and reduce accuracy, we extract keywords from training samples with the N-gram algorithm, add them to the content filter, and evaluate how much the added keywords affect accuracy. To shorten the processing time, we propose an early decision algorithm with two parts: early blocking and early bypassing. Early blocking makes the blocking decision as soon as there is enough evidence that the Web page belongs to a forbidden category, while early bypassing lets the page through as soon as it is judged to be a normal one. Experiments on NetBSD 1.6 with a 1 GHz Pentium III CPU show that our approach improves throughput about sixfold over the original filter and reduces latency by about two thirds. Furthermore, the blocking ratio rises from 70% to 99%.
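The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the keyword selection rule (keep character n-grams that are frequent in forbidden samples and absent from normal ones), and the thresholds `block_limit` and `bypass_after` are all assumptions chosen for illustration. Character n-grams are used because they need no word segmentation, which is what makes the approach applicable to pages in languages such as Chinese.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def extract_keywords(forbidden_docs, normal_docs, n=2, top_k=3):
    """Pick the top_k n-grams seen in forbidden samples but never in normal ones.

    The selection rule is an illustrative assumption, not the thesis's rule.
    """
    forbidden = Counter(g for d in forbidden_docs for g in char_ngrams(d, n))
    normal = Counter(g for d in normal_docs for g in char_ngrams(d, n))
    candidates = [(g, c) for g, c in forbidden.items() if g not in normal]
    # Most frequent first; ties broken alphabetically for a stable result.
    candidates.sort(key=lambda gc: (-gc[1], gc[0]))
    return [g for g, _ in candidates[:top_k]]


def classify_early(chunks, weights, block_limit=100, bypass_after=2000):
    """Scan a page chunk by chunk and stop as soon as a decision is possible.

    chunks:       iterable of text pieces as they arrive from the server
    weights:      dict mapping keyword -> score added per occurrence
    block_limit:  hypothetical score at which the page is blocked
    bypass_after: hypothetical number of characters after which a page
                  with a zero score is let through without further scanning
    Returns (decision, characters_scanned).
    Keywords split across two chunks are not counted in this sketch.
    """
    score = 0
    scanned = 0
    for chunk in chunks:
        for keyword, weight in weights.items():
            score += weight * chunk.count(keyword)
        scanned += len(chunk)
        if score >= block_limit:
            return "block", scanned    # early blocking
        if scanned >= bypass_after and score == 0:
            return "bypass", scanned   # early bypassing
    # The whole page was scanned without reaching the blocking score.
    return "pass", scanned
```

For example, `classify_early(["hello bad", " bad world", "more text"], {"bad": 60})` stops after the second chunk with a blocking decision and never reads the third, which is the source of the processing-time saving the abstract describes.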
CHAPTER 1. INTRODUCTION
CHAPTER 2. RELATED WORKS
CHAPTER 3. METHODOLOGY
3.1 Improving the accuracy
3.2 The language issue
3.3 Accelerating the filtering
CHAPTER 4. IMPLEMENTATION
4.1 Architecture of the DansGuardian
4.2 Possible problems and improvement in DG
4.3 Implementation Details
CHAPTER 5. BENCHMARKING
5.1 Benchmarking methodology
5.2 External benchmarking results
5.3 Internal benchmarking results
CHAPTER 6. CONCLUSIONS AND FUTURE WORKS
REFERENCES