( 您好!臺灣時間:2021/03/01 18:31
字體大小: 字級放大   字級縮小   預設字形  


研究生(外文):Shui-lung Chuang
論文名稱(外文):Automatic Generation of Tree-Structured Templates for Information Extraction from HTML Documents
指導教授(外文):Jane Yung-jen Hsu
外文關鍵詞:template-based information extractiontemplate generationgrammatical inference
  • 被引用被引用:1
  • 點閱點閱:126
  • 評分評分:系統版面圖檔系統版面圖檔系統版面圖檔系統版面圖檔系統版面圖檔
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:1
網際網路的快速成長已經改變了人們處理日常生活資訊的方法及習慣。有愈來愈豐富的資料是以 HTML 文件的格式呈現在 Web 上,為了使這些大量的線上資料能夠被有效地利用,各式各樣的資訊擷取系統被發展出來。然而面對著日益龐大的資料量以及應用程式需求,過去以人工分析來手動建構所需之資訊擷取系統已無法滿足現階段大量的需求,因而許多的研究人員正極力發展各種可行的方法來自動建構所需之資訊擷取系統。
我們採取的資訊擷取方法是樣板式資訊擷取法 (Template-based Information Extraction)。一份 HTML 文件可以根據它的標籤而被表達成一棵文件樹,以期能表達出該文件的結構資訊。而相似的文件通常具有相同的文件結構,因此我們利用一個樹狀結構樣板來表達這個相同的文件結構特性。透過一個樹狀配對法,我們可以決定樣板和文件之間的對應關係,進而從文件中擷取出所要的資訊。

The rapid growth of the World Wide Web has changed the way in which people exchange and share information. As the Internet serves as an important source of information, answers to questions are often scattered over a multitude of Web pages. To make huge amounts of on-line documents available and manageable, the various information extraction systems are unexpendable. However, manually constructing such information extraction systems is a laborious task. Automatic methods have the potential to help this development process.
This thesis follows a structure-based approach to extracting target information from HTML documents. Each document can be transformed into a unique ``document tree,'' which captures the structural properties defined by its HTML tags. On the other hand, a class of documents can be characterized as sharing a common tree-structured template. Through an approximate tree matching approach, the mapping between a document tree and a template tree can be established. According to the matching result, the target information can be determined and extracted.
Writing the required tree-structured templates manually is tedious and error-prone. To alleviate this engineering bottleneck, this thesis seeks to automate the process of constructing an appropriate template for a set of similar Web pages. With the idea from grammatical inference, we have developed an algorithm to automatically generate a template from a set of annotated training Web pages.
We have applied the proposed generation approach on several search engines and on-line news sources. The experimental results show that our method can effectively and efficiently generate appropriate templates for the test sites with a handful of training sample pages.
As the result of combining the structure-based information extraction approach and the automatic template generation method, we do make developing an information extraction process by providing a set of training pages, exceedingly better than constructing it manually.

Abstract i
List of Figures vi
List of Tables viii
List of Algorithms ix
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Overview 5
1.3 Organization 11
Chapter 2 Template-based Information Extraction 13
2.1 The Basic Idea 14
2.2 Formalism 18
2.2.1 HTML Document Tree 19
2.2.2 Template Tree 21
2.3 Extraction Approach 23
2.3.1 Extracting Information 24
2.3.2 Building HDT 25
2.3.3 Matching Template 25
2.4 Summary 32
Chapter 3 Template Generation 33
3.1 Template Generation Problem 34
3.2 The Basic Idea 36
3.3 Definition 39
3.3.1 Labeled HTML Document Tree 39
3.3.2 Generated Template Tree 41
3.4 System Overview 43
3.5 Labeling Training Pages 44
3.6 Preprocessing Training Instances 46
3.7 Summary 48
Chapter 4 Template Generation Algorithm 49
4.1 Preliminary 50
4.1.1 Restricted Regular Expression 50
4.1.2 Multiple Sequence Alignment 52
4.1.3 Inducing Pattern from an Alignment 58
4.2 Generating Template 65
4.2.1 Generating one Node 65
4.2.2 Generating Child Nodes 67
4.2.3 Complexity Analysis 71
4.3 An Illustrative Example 72
4.4 Summary 76
Chapter 5 Experimental Results 77
5.1 Query-based Information Resources 78
5.2 Periodical Information Resources 82
5.3 Discussion 84
Chapter 6 Related Work 89
6.1 Systems for Wrapper Induction 90
6.2 Systems for Learning Information Extraction Rules 93
Chapter 7 Future Work and Conclusions 99
7.1 Thesis Summary 100
7.2 Future Work 101
7.3 Conclusions 103
Bibliography 105

[Altschul and Lipman, 1989] Altschul, S. F. and Lipman, D. J. (1989). Trees, stars, and multiple biological sequence alignment. SIAM Journal on Applied Mathematics, 49(1):197-209.
[Ashish and Knoblock, 1997] Ashish, N. and Knoblock, C. (1997). Semi-automatic wrapper generation for Internet information sources. In Proceedings of Cooperative Information Systems.
[Bunke and Sanfeliu, 1990] Bunke, H. and Sanfeliu, A., editors (1990). Syntactic and Structure Pattern Recognition: Theory and Application. World Scientific.
[Califf and Mooney, 1997] Califf, M. and Mooney, R. (1997). Relational learning of pattern-match rules for information extraction. Working Papers of the ACL-97 Workshop in Natural Language Learning, pages 9-15.
[Cardie, 1997] Cardie, C. (1997). Empirical methods in information extraction. AI Journal, 18(4):65-79.
[Clark, 1997] Clark, J. (1997). ``SP'' - an SGML system conforming to international standard ISO 8879 - standard generalized markup language. http://www.jclark.com/sp/index.htm.
[Cowie and Lehnert, 1996] Cowie, J. and Lehnert, W. (1996). Information extraction. Communications of the ACM, 39(1):80-91.
[Doorenbos et al., 1997] Doorenbos, R., Etzioni, O., and Weld, D. (1997). A scalable comparison-shopping agent for the World-Wide Web. In Proceedings of Autonomous Agents, pages 39-48.
[Freitag, 1998] Freitag, D. (1998). Information extraction from HTML: Application of a general machine learning approach. In Proceedings of the Fifteenth National Conference on Artifical Intelligence (AAAI-98), pages 517-523.
[Fukuda and kamata, 1984] Fukuda, H. and Kamata, K. (1984). Inference of tree automata from sample set of trees. International Journal of Computer and Information Sciences, 13(3):177-196.
[Hao et al., 1996] Hao, X., Wang, J. T. L., and Ng, P. A. (1996). Information extraction from the structured part of office documents. Information Sciences, 91:245-274.
[Hsu and Dung, 1998] Hsu, C.-N and Dung, M.-T. (1998). Generating finite-state transducers for semi-structured data extraction from the Web. Journal of Information Systems, 23(8):521-538. Special issue on Semi-structured Data.
[Hsu and Yih, 1997] Hsu, J. Y.-J and Yih, W.-T. (1997). Template-based information mining from HTML documents. In Proceedings of AAAI-97, pages 256-262.
[Huffman, 1995] Huffman, S. (1995). Learning information extraction patterns from examples. Workshop on new approaches to learning for natural language processing (ijCai-95), pages 127-142.
[Kushmerick, 1997] Kushmerick, N. (1997). Wrapper induction for information extraction. PhD thesis, Department of Computer Science, University of Washington. Technical Report UW-CSE-97-11-04.
[Kushmerick et al., 1997] Kushmerick, N., Weld, D., and Doorenbos, R. (1997). Wrapper induction for information extraction. In Proceedings of Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), pages 729-735.
[Leu, 1998] Leu, C.-H (1998). Implementation and application of approximate tree matching for information extraction from HTML documents. Master's thesis, Department of Computer science and Information Engineering, National Taiwan University.
[Muslea et al., 1998] Muslea, I., Minton, S., and Knoblock, C. A. (1998). STALKER: Learning extraction rules for semi-structured, Web-based information sources. Technical Report WS-98-01, AAAI Press, Menlo Park, CA.
[Myers et al., 1997] Myers, G., Selznick, S., Zhang, Z., and Miller, W. (1997). Progressive multiple alignment with constraints. In Proceedings of the First Annual International Conference on Computational Molecular Biology, pages 220-225.
[Pevzner, 1992] Pevzner, P. A. (1992). Multiple alignment, communication cost, and graph matching. SIAM Journal on Applied Mathematics, 52(6):1763-1779.
[Reinert et al., 1997] Reinert, K., Lenhof, H.-P., Mutzel, P., Mehlhorn, K., and Kececioglu, J. (1997). Branch-and-cut algorithm for multiple sequence alignment. In Proceedings of the First Annual International Conference on Computational Molecular Biology, pages 241-250.
[Riloff, 1993] Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks. In Proceedings of the Eleventh Annual Conference on Artificial Intelligence, pages 811-816.
[Riloff, 1996] Riloff, E. (1996). Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth Annual Conference on Artificial Intelligence, pages 1044-1049.
[Rus and Subramanian, 1997] Rus, D. and Subramanian, D. (1997). Customizing information capture and access. ACM Transactions on Information Systems, 15(1):67-101.
[Shibuya and Imai, 1997] Shibuya, T. and Imai, H. (1997). New flexible approaches for multiple sequence alignment. In Proceedings of the First Annual International Conference on Computational Molecular Biology, pages 267-276.
[Soderland, 1997] Soderland, S. (1997). Learning to extract text-based information from the World-Wide Web. In Proceedings of Third International Conference on Knowledge Discovery and Data Mining (KDD-97), pages 251-254.
[Soderland, 1999] Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, pages 1-44.
[Soderland et al., 1995] Soderland, S., Fisher, D., Aseltine, J., and Lehnert, W. (1995). CRYSTAL: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), pages 1314-1319.
[Waterman, 1995] Waterman, M. S. (1995). Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall.
[Yih, 1997] Yih, W.-T. (1997). Template-based information extraction from tree-structured HTML documents. Master's thesis, Department of Computer science and Information Engineering, National Taiwan University.
[Zhang et al., 1994] Zhang, K., Shasha, D., and Wang, J. (1994). Approximate tree matching in the presence of variable length don't care. Journal of Algorithms, 16:33-66.

第一頁 上一頁 下一頁 最後一頁 top
系統版面圖檔 系統版面圖檔