National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

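Graduate student: 范綱岷 (Fung, Kung-Ming)
Thesis title: Multi-Page Information Extraction and Fusion Agent Using HTML Parse Tree
Thesis title (Chinese): 使用超本文標記語言剖析樹建構多網頁資訊萃取及融合代理人
Advisor: 李漢銘 (Lee, Hahn-Ming)
Degree: Master's
Institution: National Taiwan University of Science and Technology (國立臺灣科技大學)
Department: Department of Electronic Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2001
Academic year of graduation: 89 (ROC calendar)
Language: English
Number of pages: 71
Keywords (Chinese): 資訊萃取; 資訊融合; 包裝代理人; 智慧型代理人; 超本文標記語言剖析樹; 資訊檢索
Keywords (English): Information Extraction; Information Fusion; Wrapper Agent; Intelligent Agent; HTML Parse Tree; Information Retrieval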
Abstract
Information on the Internet is organized and distributed by millions of people with different purposes, backgrounds, and expertise. To locate the information they need, people rely on tools such as search engines and wrapper agents. However, current search engines and wrapper agents have limitations that keep users from obtaining more detailed information. For example, most web documents, such as news pages and job-listing pages, contain useful information fields. A detailed job description page on a recruitment website typically includes fields such as job title, company name, company address, work location, and salary. These fields can be used to improve the precision of search results. Moreover, some job-related information is scattered across other websites and web pages.

We therefore propose a multi-page information extraction and fusion agent as a more effective tool for integrating information on the Internet. Because the HTML parse tree approach is easy to implement in a programming language, we modify it and combine it with object-oriented (OO) concepts. We propose a representation for extraction and fusion rules that combines a lightweight HTML parse tree with a relational database, replacing the complex rule scripting languages of traditional wrapper agents. We also build a tool that helps the system constructor define these rules. Finally, we implement a multi-page information extraction and fusion agent called OOWrapper, and we present an application of OOWrapper to show how it works on the Internet. The application and experiments indicate that the agent is effective and useful.
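The abstract describes extraction and fusion rules expressed over a lightweight HTML parse tree. The thesis text itself is not reproduced on this page, so the following is only a minimal sketch of that general idea and not the OOWrapper implementation. It assumes that an extraction rule can be written as a path of (tag, index) steps from the root of the tree, that rules live in plain dictionaries standing in for relational database rows, and that fusion joins records on a shared company name. All function names, rule tables, and sample pages (including "Acme" and "Taipei") are hypothetical.

```python
from html.parser import HTMLParser


class Node:
    """One element of the lightweight parse tree: a tag, its children, its text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML, tolerating unclosed and stray tags."""

    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID_TAGS:  # void elements never hold children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse_html(html):
    """Return the root of the parse tree for an HTML string."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def extract_field(root, path):
    """Follow (tag, index) steps from the root; return the text or None."""
    node = root
    for tag, index in path:
        matches = [child for child in node.children if child.tag == tag]
        if index >= len(matches):
            return None
        node = matches[index]
    return node.text


def apply_rules(html, rules):
    """Apply a {field name: path} rule table to one page."""
    root = parse_html(html)
    return {field: extract_field(root, path) for field, path in rules.items()}


def fuse(job, companies, key="company"):
    """Add fields from every company record whose key matches the job record."""
    merged = dict(job)
    for company in companies:
        if company.get(key) == job.get(key):
            for field, value in company.items():
                merged.setdefault(field, value)
    return merged


# Hypothetical pages and rule tables (stand-ins for rows in a rule database).
JOB_PAGE = ("<table><tr><td>Job title</td><td>Engineer</td></tr>"
            "<tr><td>Company</td><td>Acme</td></tr></table>")
COMPANY_PAGE = "<ul><li>Acme</li><li>Taipei</li></ul>"
JOB_RULES = {"title": [("table", 0), ("tr", 0), ("td", 1)],
             "company": [("table", 0), ("tr", 1), ("td", 1)]}
COMPANY_RULES = {"company": [("ul", 0), ("li", 0)],
                 "address": [("ul", 0), ("li", 1)]}

job = apply_rules(JOB_PAGE, JOB_RULES)
company = apply_rules(COMPANY_PAGE, COMPANY_RULES)
record = fuse(job, [company])
print(record)
```

Keeping the rules as data rather than as a scripting language means a layout change on a source site only requires updating a path in the rule table, which is one plausible reading of why the abstract pairs the parse tree with a relational database.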
Contents
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.1.1 Limitations of Search Engines
1.1.2 Limitations of Wrapper Agents
1.2 Challenge of Information Extraction and Fusion
1.3 Goal
1.4 Summary of the Thesis
Chapter 2 Background Review
2.1 Intelligent Agents on the Internet
2.2 Information Retrieval and Information Extraction
2.2.1 Information Retrieval
2.2.2 Information Extraction
2.3 Information Fusion
2.4 Wrapping Semi-structured Web Page
Chapter 3 Multi-Page Information Extraction and Fusion
3.1 Overview
3.2 Spider and Tag Normalizer
3.3 HTML Parse Tree
3.3.1 Structure of HTML Documents
3.3.2 Constructing HTML Parse Tree
3.4 Multi-Page Extraction and Fusion Rules
3.5 Extraction, Fusion and Alarm Agent
Chapter 4 Application: Jobbot - A Job's Information Integration Website
4.1 System Architecture
4.2 Discussion with Jobbot
Chapter 5 Discussion and Conclusion
5.1 Discussion
5.2 Conclusion
5.3 Future Work
References