National Digital Library of Theses and Dissertations in Taiwan

Graduate student: Kai-Liang Ko (柯凱量)
Thesis title: Websites Classification Using Web Page Layout Weighting Scheme and Multi-class Support Vector Machine
Advisor: Roung-Shiunn Wu (吳榮訓)
Degree: Master's
Institution: National Chung Cheng University (國立中正大學)
Department: Graduate Institute of Information Management
Discipline: Computer science
Sub-discipline: General computer science
Thesis type: Academic thesis
Year of publication: 2006
Academic year of graduation: 95 (ROC calendar)
Language: English
Number of pages: 112
Keywords: Support Vector Machine; Data Mining; Machine Learning; Kernel Function; Websites Classification
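With the rapid development of the World Wide Web (WWW), transmitting vast amounts of information over the Internet has become commonplace, and the long-standing problem of information overload on the Internet has grown steadily worse. While people use the Internet to obtain a wide range of information, they also struggle because they do not know how to organize this sprawling, disordered information effectively. To organize the enormous volume of information on the Internet, website classification has therefore become a basic problem that must be solved first.

This study proposes a web page layout weighting scheme, a weighting mechanism that assigns weights to information by analyzing page layout. The scheme judges the importance of a website's information from the user's visual perspective and assigns weights that separate more important information from less important information, thereby improving the accuracy of website classification. We treat a block in the hypertext structure as one unit of information and use a continuous right-skewed distribution to quantify the block's position, size, and depth within the website, so that the importance of the information can be estimated more accurately.

This study also designs a classification architecture that combines the web page layout weighting scheme with multi-class Support Vector Machines (SVMs). Because a binary SVM can handle only two classes, it is not suited to the diverse World Wide Web; to match the nature of the problem, this study adopts multi-class SVMs as the classification algorithm in the architecture. Based on the conclusions of the literature review, we hold that the suitable multi-class SVM and kernel function differ across problem domains. Accordingly, besides incorporating the web page layout weighting scheme, the proposed architecture is evaluated experimentally with several multi-class SVMs and kernel functions in order to find the best classification architecture for website classification.

The experimental results show that the web page layout weighting scheme effectively improves classification accuracy and reduces the number of support vectors. Among multi-class SVMs, the One-Against-One SVM achieves the highest accuracy for website classification. Among kernel functions, the RBF kernel achieves the highest accuracy, while the polynomial kernel yields the fewest support vectors.
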
With the rapid development of the World Wide Web (WWW), the information published on the Internet has grown dramatically in both size and popularity. This has made the information-overload problem steadily worse, and the enormous information space of the WWW has become more complicated to deal with. Hence, web classification is an essential and primary task for managing information on the Internet.

We develop a web page layout weighting scheme for HTML pre-processing that estimates the degree of importance of information from the user's visual perception point of view. By mapping a block's position, size, and depth to a degree of freedom, the layout weighting scheme generates a continuous right-skewed distribution that estimates the importance of a block on a web page of a website. The information components in a block can then be weighted by importance to make classification more accurate.
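The weighting idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the thesis's formula: the abstract does not name the distribution, so a chi-square density stands in as one continuous right-skewed distribution with a degree-of-freedom parameter, and the hypothetical functions `location_weight` and `block_weight`, their scaling constants, and the product of location, size, and page-level factors are all invented for this example.

```python
import math


def chi_square_pdf(x: float, dof: float) -> float:
    """Chi-square density with `dof` degrees of freedom (right-skewed)."""
    if x <= 0:
        return 0.0
    log_pdf = ((dof / 2 - 1) * math.log(x) - x / 2
               - (dof / 2) * math.log(2) - math.lgamma(dof / 2))
    return math.exp(log_pdf)


def location_weight(centre_in_screens: float, dof: float = 4.0,
                    scale: float = 4.0) -> float:
    """Hypothetical location weight in [0, 1].

    `centre_in_screens` is the vertical centre of a block, measured in
    screen heights from the top of the page. With the assumed defaults the
    weight peaks in the middle of the first screen and decays with a long
    tail further down the page.
    """
    mode = dof - 2  # mode of the chi-square density, valid for dof > 2
    return chi_square_pdf(scale * centre_in_screens, dof) / chi_square_pdf(mode, dof)


def block_weight(centre_in_screens: float, area_ratio: float,
                 link_depth: int) -> float:
    """Combine assumed location, size, and page-level factors into one weight."""
    size = area_ratio                  # share of the page area the block covers
    level = 1.0 / (1 + link_depth)     # pages deeper in the site count less
    return location_weight(centre_in_screens) * size * level


# Usage: a large block in the first screen of the home page versus a small
# block far down a page two links away from the home page.
prominent = block_weight(centre_in_screens=0.5, area_ratio=0.40, link_depth=0)
marginal = block_weight(centre_in_screens=3.0, area_ratio=0.05, link_depth=2)
```

Under these assumptions, term counts extracted from each block could be multiplied by the block's weight before the feature vector for a website is built.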

Along with this layout weighting scheme, multi-class support vector machine (SVM) classifiers are used to implement a classification architecture for websites classification. A standard SVM can deal only with binary classification problems. This is an oversimplification, since website content can belong to many categories rather than just two, so a multi-class classifier is needed. We consider that the appropriate multi-class SVM and kernel function depend on the application domain. In this research, we therefore combine the proposed layout weighting scheme with several multi-class SVM classifiers and kernel functions in experiments to find the best-performing approach.
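How a multi-class decision can be assembled from binary classifiers is sketched below for the One-Against-One strategy: one binary model per pair of classes, with the final class chosen by majority vote. This is an illustration only. A nearest-centroid rule stands in for the binary SVM so that the example needs no external library, and the function names, toy feature vectors, and category labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_one_against_one(samples, labels):
    """Train one binary model for every unordered pair of classes."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    models = {}
    for a, b in combinations(sorted(by_class), 2):
        # Stand-in binary learner: a real system would fit a binary SVM
        # on the samples of classes a and b here.
        models[(a, b)] = (centroid(by_class[a]), centroid(by_class[b]))
    return models


def predict_one_against_one(models, x):
    """Let every pairwise model vote; return the class with the most votes."""
    votes = Counter()
    for (a, b), (centre_a, centre_b) in models.items():
        winner = a if sq_dist(x, centre_a) <= sq_dist(x, centre_b) else b
        votes[winner] += 1
    # Ties are broken by the order in which classes first received a vote.
    return votes.most_common(1)[0][0]


# Toy data: three hypothetical website categories in a 2-D feature space.
samples = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
labels = ["news", "news", "shop", "shop", "blog", "blog"]
models = train_one_against_one(samples, labels)
prediction = predict_one_against_one(models, [0.2, 0.4])
```

With k classes this strategy trains k(k-1)/2 binary models, each on the data of only two classes.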

Experiments show that our web page layout weighting scheme improves classification accuracy and reduces the percentage of support vectors. In addition, the experimental results suggest that, among multi-class SVM classifiers, the One-Against-One classifier is the best for websites classification. Classifiers with the RBF kernel achieve higher accuracy and F1 score, while the polynomial kernel produces a lower percentage of support vectors.
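For reference, the two kernel functions compared above can be written out as a short sketch. The parameterisation below (`gamma`, `coef0`, `degree`) follows the form used by LIBSVM and is an assumption here; the abstract does not state which parameter values the experiments used.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def polynomial_kernel(x, y, gamma, coef0, degree):
    """Polynomial kernel: (gamma * <x, y> + coef0) ** degree."""
    return (gamma * dot(x, y) + coef0) ** degree


# Usage with illustrative parameter values.
k_rbf = rbf_kernel([0.0, 0.0], [1.0, 0.0], gamma=1.0)
k_poly = polynomial_kernel([1.0, 2.0], [3.0, 4.0], gamma=1.0, coef0=1.0, degree=2)
```

The RBF kernel depends only on the distance between two feature vectors and always lies in (0, 1], while the polynomial kernel grows with their inner product.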
ABSTRACT
TABLE OF CONTENT
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Objectives
1.4 Contribution
Chapter 2 Related Works
2.1 Overview of HTML Pre-process
2.1.1 Text Document Generation
2.1.2 HTML Tag Generation and Weighting Scheme
2.1.3 Layout Analysis Generation
2.2 Websites Classification
2.2.1 Research of Websites Classification in Academic Community
2.2.2 Industrial Trend of Websites Classification
Chapter 3 Multi-class Support Vector Machines
3.1 Multi-class SVMs Based On Combinatorial of Binary SVM
3.1.1 One-Against-All SVM
3.1.2 One-Against-One SVM
3.1.3 Directed Acyclic Graph SVM
3.2 Multi-class SVMs Based On Solving Single Optimization Problem
3.2.1 Vapnik and Weston’s Method
3.2.2 Crammer and Singer’s Method
3.3 Comparison of Multi-class SVM
3.4 Summary
Chapter 4 Web Page Layout Weighting Scheme
4.1 Introduction
4.2 Block Location Weight
4.2.1 Location Weight
4.2.2 Window Segmentation Weight
4.3 Block Size Weight
4.4 Page Level Weight
4.5 Summary
Chapter 5 Websites Classification Architecture
5.1 Introduction
5.2 Pre-process Phase
5.3 Training Phase
5.4 Validation Phase
5.5 Testing Phase
Chapter 6 Experiments
6.1 Experimental Environment
6.2 Parameters Optimization of SVMs and Kernel Functions
6.3 Experimental Design
6.4 Experimental Results
6.5 Testing Results
Chapter 7 Conclusions
Reference
Appendix A An Instance of Page Profile