National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

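Graduate student: 朱軍柏 (Kuan-Pak Chu)
Thesis title: Automatic Site Map Generator based on Page-Block Identification and Hyperlink Analysis
Thesis title (Chinese): 以網頁區塊辨識和超鏈結分析為基礎之網站地圖自動產生系統
Advisor: 林宣華 (Shian-Hua Lin)
Degree: Master's
Institution: National Chi Nan University (國立暨南國際大學)
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2006
Graduation academic year: 94 (ROC calendar, i.e. 2005-2006)
Language: English
Number of pages: 49
Keywords: Web data mining; Hyperlink analysis; Site map; Hierarchical structure; Content blocks; Structure blocks; Redundant blocks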
Site maps designed by Web site developers not only present the main usage flows for users but also organize the hierarchical concepts of a Web site. However, Web sites seldom provide site map pages, and those that do usually design them for human browsing rather than for machine understanding. In this thesis, we develop a system, Site Map Generator (SMG), that automatically generates the hierarchical site map of a Web site. SMG consists of three components. First, the Page Partitioner translates a page's HTML source into a protein sequence and splits the page into blocks by analyzing the sequence complexity. Second, the Block Identifier classifies each block as one of three types: content blocks, structure blocks, or redundant blocks. Finally, the Hyperlink Analyzer transforms page-to-page hyperlinks into block-to-block links and applies Kleinberg’s HITS algorithm to estimate the authority and hub values of each block; block entropy, derived from the entropies of the block's features, is used to improve HITS. The widely used sequence search tool BLAST is also employed to calculate similarities between blocks, so that similar blocks are clustered and each cluster is given an occurrence frequency that serves as evidence when generating the site map. Experiments on several systematic Web sites show that SMG achieves an average recall of 69% and an average precision of 58%. Further experiments demonstrate the effectiveness of each component of SMG.
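As a rough illustration of the link-analysis step described in the abstract, the following minimal sketch runs plain HITS over a small block-to-block link graph. The block names, the `hits` function, and the toy graph are hypothetical and are not taken from the thesis; the entropy-based weighting that SMG uses to improve HITS is not specified in this record and is therefore left out.

```python
import math


def hits(links, iterations=50):
    """Plain HITS on a directed graph given as {source: [targets]}.

    Returns two dicts: authority scores and hub scores, each
    normalized to unit Euclidean length.
    """
    # Collect every node that appears as a source or a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority of a block: sum of hub scores of blocks linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}

        # Hub score of a block: sum of authority scores of blocks it links to.
        new_hub = {n: sum(auth[d] for d in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}

    return auth, hub


# Hypothetical block-to-block links: "page/block" -> blocks it points to.
block_links = {
    "home/nav": ["news/body", "about/body", "products/body"],
    "news/sidebar": ["news/body"],
    "products/body": ["about/body"],
}

auth, hub = hits(block_links)
for name in sorted(hub, key=hub.get, reverse=True):
    print(f"{name:15s} hub={hub[name]:.3f} authority={auth[name]:.3f}")
```

In this toy graph the navigation block links to every content block, so it ends up with the highest hub score, which matches the intuition that structure blocks with high hub values are good candidates for site map nodes.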
CHINESE ABSTRACT
ABSTRACT
CONTENTS
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
2. RELATED WORKS
2.1 BLOCK EXTRACTION
2.1.1. Tag-based Method
2.1.2. Vision-based Method
2.2 BLOCK IDENTIFICATION
2.2.1. Content-based Method
2.2.2. Classification-based Method
2.3 HYPERLINK ANALYSIS
2.3.1. Hyperlink Induced Topics Search (HITS)
2.3.2. PageRank
2.3.3. Entropy-Based Link Analysis
2.3.4. Block-level Link Analysis
3. CONCEPTS AND IDEAS
3.1 CONCEPTS AND DEFINITIONS
3.2 IDEAS AND METHODS
4. METHODS AND THE IMPLEMENTATION
4.1 THE SYSTEM ARCHITECTURE
4.2 THE CRAWLER AND THE HTML PARSER
4.3 SEQUENCE TRANSLATOR
4.4 PAGE PARTITIONER
4.4.1. SEG
4.4.2. Partitioning Pages into Blocks
4.5 BLOCK IDENTIFIER
4.6 BLOCK CLUSTER
4.6.1. BLAST Environment
4.6.2. Hierarchically applying BLAST
4.7 HYPERLINK ANALYZER
4.7.1. Hyperlink Expansions and Weight Distributions
4.8 SITE MAP ANALYZER
5. EXPERIMENTS AND EVALUATIONS
5.1 THE DATASETS
5.2 EVALUATIONS OF PAGE PARTITIONER
5.3 EVALUATIONS OF BLOCK IDENTIFIER
5.4 EVALUATIONS OF SITE MAP ANALYZER
6. CONCLUSION AND FUTURE WORKS
7. REFERENCES
APPENDIX A – THE MODIFIED SUBSTITUTION SCORING MATRIX OF BLAST
[1] Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J., “Basic local alignment search tool,” J. Mol. Biol. 215, 1990, pp. 403-410.
[2] Bharat, K. and Henzinger, M. R., “Improved Algorithms for Topic Distillation in a Hyperlinked Environment,” Proc. of 21st ACM SIGIR Conf. on Research and Development in Information Retrieval, 1998.
[3] Brin, S. and Page, L., “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” In Proceedings of the 7th International World Wide Web Conference, Vol. 7, 1998.
[4] Cai, D., He, X., Wen, J.-R. and Ma, W.-Y., “Block-level Link Analysis,” In Proceedings of the 27th Annual International ACM SIGIR Conference, Sheffield, UK, July 2004.
[5] Cai, D., Yu, S., Wen, J.-R. and Ma, W.-Y., “VIPS: a vision-based page segmentation algorithm,” Microsoft Technical Report, MSR-TR-2003-79, 2003.
[6] CGI, “The Common Gateway Interface,” http://hoohoo.ncsa.uiuc.edu/cgi/.
[7] Chakrabarti, S., Joshi, M. and Tawde, M., “Enhanced Topic Distillation using Text, Markup Tags, and Hyperlinks,” Proc. of 24th ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
[8] Chakrabarti, S., Dom, B., Kumar, S., Raghavan, P., Rajagopalan, S., Tomkins, A., Gibson, D. and Kleinberg, J. M., “Mining the Web's link structure,” IEEE Computer, 32(8), pp. 60-67, August 1999.
[9] Chakrabarti, S., van den Berg, M. and Dom, B., “Focused Crawling: A New Approach for Topic-Specific Resource Discovery,” Proc. of the 8th International World-Wide Web Conference, 1999.
[10] Debnath, S., Mitra, P., Pal, N. and Giles, C. L., “Automatic Identification of Informative Sections of Web Pages,” IEEE Trans. Knowledge and Data Eng., 2005.
[11] Gruber, T. R., “A Translation Approach to Portable Ontology Specifications,” Knowledge Acquisition, 1993.
[12] Kao, H.-Y., Lin, S.-H., Ho, J.-M. and Chen, M.-S., “Entropy-Based Link Analysis for Mining Web Informative Structures,” Proc. ACM 11th Int’l Conf. Information and Knowledge Management (CIKM), 2002.
[13] Kao, H.-Y., Lin, S.-H., Ho, J.-M. and Chen, M.-S., “Mining Web Information Structures and Contents based on Entropy Analysis,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 1, Jan. 2004.
[14] Kleinberg, J. M., “Authoritative sources in a hyperlinked environment,” In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
[15] Lin, S.-H. and Ho, J.-M., “Discovering Informative Content Blocks from Web Documents,” The Eighth ACM SIGKDD, 2002.
[16] Mayoraz, E. and Alpaydin, E., “Support vector machines for multiclass classification,” In the proceedings of the international workshop on artificial intelligence neural networks, 1999.
[17] Page, L., Brin, S., Motwani, R. and Winograd, T., “The pagerank citation ranking: Bringing order to the web,” Tech. Rep., Computer Systems Laboratory, Stanford University, Stanford, 1998.
[18] Song, R., Liu, H., Wen, J.-R. and Ma, W.-Y., “Learning Block Importance Models for Web Pages,” Proceedings of the 13th conference on World Wide Web, 2004.
[19] W3C CSS, “Cascading Style Sheets (CSS),” http://www.w3.org/Style/CSS/.
[20] W3C DOM, “Document Object Model (DOM),” http://www.w3.org/DOM/.
[21] W3C HTML, “HyperText Markup Language (HTML),” http://www.w3.org/MarkUp/.
[22] W3C XML, “Extensible Markup Language (XML),” http://www.w3.org/XML/.
[23] W3C XSL, “Extensible Stylesheet Language (XSL),” http://www.w3.org/Style/XSL/.
[24] Wootton, J. C. and Federhen, S., “Statistics of local complexity in amino acid sequences and sequence databases,” Computers & Chemistry 17, 1993, pp. 149-163.
[25] Wootton, J. C. and Federhen, S., “Analysis of compositionally biased regions in sequence databases,” Methods in Enzymology 266, 1996, pp. 554-571.
[26] Yi, L. and Liu, B., “Eliminating Noisy Information in Web Pages for Data Mining,” In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Washington, DC, USA, August 2003.