National Digital Library of Theses and Dissertations in Taiwan
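Graduate student: 葉柏毅
Graduate student (English name): Po-Yi Yen
Thesis title (Chinese): 以樣版分群方法評估網頁區塊重要性-應用於多樣版網站之研究
Thesis title (English): Block Importance Evaluation for Multi-Template Web Sites by Using Template Clustering
Advisor: 李漢銘
Advisor (English name): Hahn-Ming Lee
Degree: Master's
Institution: National Taiwan University of Science and Technology
Department: Department of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Publication year: 2005
Graduation academic year: 93 (ROC calendar)
Language: English
Number of pages: 58
Keywords (translated from Chinese): Web Mining; Clustering; Information Extraction
Keywords (English): Web Mining; Clustering; Informative Extraction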
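On commercial web sites, the information in each block of a web page may differ in importance. Evaluating block importance is therefore an important task, so that unimportant blocks can be removed for web mining or when browsing the web on small-screen devices. When applying current block importance evaluation methods to multi-template web sites, we found two problems: the multi-template problem and the problem of informative blocks with fewer contents. Both problems reduce the accuracy of block importance evaluation.
In this thesis, we propose a new block importance evaluation technique to solve the two problems described above. We use template clustering to group web page blocks with similar templates and then analyze each group of blocks separately, so that a multi-template web site is transformed into a single-template web site. Experimental analysis shows that the proposed technique can be applied to multi-template web sites and does improve quality.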
The information in each block of a web page may not be equally important, especially on commercial web sites. Block importance evaluation is therefore an important task: it allows noisy blocks to be removed for web mining and for web browsing on small-screen devices. With current block importance evaluation approaches, we identify two problems that occur when a web site uses several predefined templates: the multi-template problem and the problem of informative blocks with fewer contents. Both problems reduce the precision of block importance evaluation.
In this thesis, we propose a novel block importance evaluation method, named Block Analyzer, to solve the multi-template problem and the problem of informative blocks with fewer contents. The method uses template clustering to group blocks with similar templates and then analyzes each cluster individually, so that a multi-template web site can be treated as a single-template web site. Experiments on several news web sites with multiple predefined templates show that Block Analyzer works well on multi-template web sites and improves performance.
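The cluster-then-analyze idea in the abstract can be illustrated with a minimal sketch. This is not the Block Analyzer implementation: the record does not give the thesis's clustering features or importance measure, so the tag-path signature, the sample blocks, and the distinct-text ratio used as a score are all assumptions made only for illustration.

```python
from collections import defaultdict

# Hypothetical input: each block is (page_id, tag_path, text). The tag_path
# stands in for whatever structural features identify a block's template.
BLOCKS = [
    ("p1", ("div", "ul", "li"), "Home | World | Sport"),
    ("p2", ("div", "ul", "li"), "Home | World | Sport"),
    ("p3", ("div", "ul", "li"), "Home | World | Sport"),
    ("p1", ("div", "h1", "p"), "Election results announced"),
    ("p2", ("div", "h1", "p"), "Storm hits the coast"),
    ("p3", ("table", "tr", "td"), "Match report: 2-1"),
]


def cluster_by_template(blocks):
    """Group block texts by structural signature (assumed: exact tag path)."""
    clusters = defaultdict(list)
    for _page_id, tag_path, text in blocks:
        clusters[tag_path].append(text)
    return dict(clusters)


def cluster_scores(clusters):
    """Score each cluster separately by its fraction of distinct texts.

    An assumed stand-in measure: content repeated across pages scores low,
    content that varies from page to page scores high.
    """
    return {sig: len(set(texts)) / len(texts) for sig, texts in clusters.items()}


if __name__ == "__main__":
    for signature, score in cluster_scores(cluster_by_template(BLOCKS)).items():
        print("/".join(signature), round(score, 2))
```

Because each cluster is scored on its own, blocks drawn from different templates are never compared against each other, and the score in this sketch does not depend on how much text a block contains.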
Abstract
Acknowledgements
Content
List of Figures
List of Tables

Chapter 1 Introduction
1.1 Motivation
1.2 Problems of current block importance evaluating methods
1.2.1 Multi-template problem
1.2.2 The problem of informative block with fewer contents
1.3 Goals
1.4 Outlines of the thesis

Chapter 2 Background
2.1 Block importance evaluation
2.2 Approaches for block importance evaluation
2.2.1 Web site based approaches
2.2.1.1 Data-rich Subtree Extraction (DSE)
2.2.1.2 Site Style Tree based approaches
2.2.1.3 Link Analysis of Mining Informative Structure (LAMIS) and Discovering informative content blocks (InfoDiscoverer)
2.2.2 Web page based approaches
2.2.2.1 Presentational layout analysis
2.2.2.2 Block feature based approach
2.3 Summary for related work

Chapter 3 Block Analyzer
3.1 Concept of Block Analyzer
3.2 System architecture of Block Analyzer
3.2.1 Page Segmentation Unit
3.2.2 Structure Based Clustering Unit
3.2.3 Cluster Importance Degree Analyzer
3.2.4 Pattern Matching Unit
3.3 Characteristics of Block Analyzer

Chapter 4 Experiment
4.1 Experimental design
4.1.1 Ranking criterions
4.2 Experimental results

Chapter 5 Conclusion
5.1 Discussion
5.2 Conclusion
5.3 Further work

References