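Graduate student: Wei-Ting Cho (卓威廷)
Thesis title: Visual Tree Evaluation on Block Extraction (基於視覺樹評估之區塊擷取)
Advisor: Hung-Yu Kao (高宏宇)
Degree: Master's
Institution: National Cheng Kung University (國立成功大學)
Department: Department of Computer Science and Information Engineering, master's and doctoral program (資訊工程學系碩博士班)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2008
Academic year of graduation: 96 (ROC calendar)
Language: English
Number of pages: 52
Keywords: Visual Tree; CSS; Block Extraction; Entropy; Information Extraction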
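Abstract (translated from the Chinese): A growing number of people use Cascading Style Sheets (CSS) to manage their Web pages, because CSS makes layout convenient and easy. For Web data extraction, however, CSS pages usually blur the page structure, which causes systems that extract data by matching page structure to make misjudgments. To address this problem, this thesis describes a system that exploits the properties of CSS Web pages to extract data blocks. The system consists of three parts: tree generation, an entropy evaluation model, and block identification. First, in the tree generation step, the system uses the visual information and tag names of page nodes to convert a Web page into a visual tree. The visual tree describes how data is visually presented in the browser. In the entropy evaluation model step, the entropy of each node in the visual tree is computed according to a specific model; these entropy values then serve in the block identification step as the criteria for labeling block types. Entropy is well suited to measuring the amount of information within a block: if a block contains a wide variety of content, its entropy is relatively high. The experimental results show that our system extracts container blocks more effectively than other systems, and that node features and the visual tree do improve block extraction on CSS Web pages.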
Abstract (English): More and more people use Cascading Style Sheets (CSS) to manage their Web pages, because CSS makes typesetting easy and convenient. However, CSS makes the structure of a Web page ambiguous, so data extraction systems based on Web page structure matching tend to make more misjudgments. Furthermore, such systems only identify blocks with similar structures. Some systems use specific HTML tags, such as TABLE, TR, TD and P, to partition Web pages, but in CSS Web pages these tags are generally less frequent than DIV tags. To overcome these limitations, this thesis presents a system that uses the properties of CSS Web pages to extract data blocks. The system comprises three modules: Visual Tree Generation (VTG), Entropy Evaluation Model (EEM) and Block Identification (BI). Web pages are first converted into tree objects in the VTG module, which transforms DOM trees into visual trees by using the visual information and HTML tag names of nodes to modify the tree structure. The proposed visual tree represents the arrangement of data as displayed in the Web browser, which matches the visual intent needed for evaluating informative blocks. If a block consists of diverse content, its entropy will be relatively high. Entropy is therefore a suitable measure of the information content of blocks for distinguishing presentation blocks from others. In the EEM module, the entropy attributes of each node in a visual tree are calculated. These attributes are used to identify block types by the BI module, which comprises block marking and block refining. The experimental results show that the node attributes and the visual tree are useful for extracting blocks from CSS Web pages. Our system also outperforms other systems on container block extraction.
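The claim that a block with diverse content has relatively high entropy can be illustrated with a minimal sketch. This is not the thesis's Entropy Evaluation Model, whose child, path and partial path entropy definitions are not given in this record. It only assumes plain Shannon entropy over the labels of a node's direct children. The `Node` class, the `tag.class` labelling and the function names are hypothetical and chosen for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical visual-tree node: a tag name, a CSS class, and children."""
    tag: str
    css_class: str = ""
    children: List["Node"] = field(default_factory=list)


def shannon_entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def child_entropy(node: Node) -> float:
    """Entropy of the 'tag.class' labels of a node's direct children.

    Assumption: the label is tag name plus CSS class; the thesis may use
    different node attributes.
    """
    return shannon_entropy([f"{c.tag}.{c.css_class}" for c in node.children])


# A list-like block: four children sharing one label, so no diversity.
uniform = Node("div", "list", [Node("div", "item") for _ in range(4)])

# A mixed block: four children with four distinct labels.
mixed = Node("div", "main", [Node("div", "nav"), Node("div", "ad"),
                             Node("p", "text"), Node("img", "logo")])

print(child_entropy(uniform))  # 0.0 bits: every child looks the same
print(child_entropy(mixed))    # 2.0 bits: four equally likely labels
```

Under these assumptions a repetitive block scores 0 bits and a block with four distinct child types scores 2 bits, which matches the direction of the abstract's statement. How such scores map to block types would depend on the thresholds and refinement steps of the actual system.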
CHINESE ABSTRACT III
ABSTRACT IV
CONTENT V
FIGURE LISTING VII
TABLE LISTING IX
1. INTRODUCTION 1
2. RELATED WORK 5
2.1 DOM-BASED PAGE SEGMENTATION 5
2.2 VISION-BASED PAGE SEGMENTATION 6
2.3 HYBRID APPROACH 7
2.4 OTHER 8
2.5 IB APPLICATION 8
3. THE PROPERTIES OF CSS WEB PAGE 11
3.1 CSS SELECTOR ENTROPY 11
3.2 LAYER CONTAINMENT 12
3.2.1 Structural Containment 12
3.2.2 Visual Containment 13
3.3 CSS DEFINITION 14
4. SYSTEM ARCHITECTURE 16
4.1 VISUAL TREE GENERATION 16
4.1.1 DOM Parser 17
4.1.2 Tree Constructer 17
4.1.3 Tag Filter 18
4.1.4 Visual Information Module 19
4.2 ENTROPY EVALUATION MODEL 20
4.2.1 Attribute Entropy 20
4.2.2 Node Attributes 21
4.2.3 Child Entropy Evaluation 21
4.2.4 Path Entropy Evaluation 23
4.2.5 Partial Path Entropy Evaluation 25
4.2.6 Aggregation Function 26
4.3 BLOCK IDENTIFICATION 27
4.3.1 Block Marker 27
4.3.2 Block Refinement 27
4.3.3 The relationship between EEM and Block Identification 31
5. EXPERIMENT AND RESULT 33
5.1 TAG ANALYSIS 33
5.1.1 Block Tag Analysis 33
5.1.2 CSS Tag Analysis 36
5.2 SYSTEM MODULES PERFORMANCE EVALUATION 38
5.2.1 Visual Tree Evaluation 39
5.2.2 Entropy Evaluation Model Comparison 39
5.2.3 Block Non-refinement Evaluation 41
5.3 THE PERFORMANCE OF CB EXTRACTION 44
5.4 SEARCH ENGINE RESULTS EVALUATION 46
5.5 BLOCK THRESHOLDS EVALUATION 47
5.6 ITERATIVE AND NON-ITERATIVE EVALUATION 49
6. CONCLUSIONS AND FUTURE WORK 50
7. REFERENCES 51