Author: 何信良 (Sin-Liang Ho)
Title: 利用資訊擷取於建立領域入口網站-以旅遊為例
Title (English): Applying Information Extraction to Construct Domain Portals - A Case Study on Travel
Advisor: 林宣華 (Shian-Hua Lin)
Degree: Master's
Institution: National Chi Nan University
Department: Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Document type: Academic thesis
Year published: 2008
Graduation academic year: 96 (ROC calendar)
Language: English
Pages: 40
Keywords (Chinese): 資訊擷取, 資料探勘, 入口網站, 序列分析
Keywords (English): Information Extraction, Web Data Mining, Portal, Sequence Analysis
Cited by: 0
Views: 189
Downloads: 18
Bookmarked: 2
Abstract (Chinese, translated): With the rapid growth and spread of the Web, users rely on search engines to find relevant information among tens of billions of pages to support their decisions. Suppose a user plans a trip to Nantou: searching Google for "南投 旅遊" (Nantou travel) returns 874,000 results, most of which are likely irrelevant to the user's purpose. The user must re-search or click through many pages to collect and organize a large amount of information before the travel decision can finally be made. The crux of the problem is that the search engines most Web users depend on (e.g., Google) only provide "large amounts of relevant information"; they cannot integrate that information into "knowledge" better suited to the user. What users usually need is cross-domain knowledge integration. In the travel example above, the "knowledge" the user wants is valuable information fully integrated from sources such as maps, official websites, travel sites, news, and blogs, not fragments scattered across a large number of pages. In this thesis, we develop the Intelligent Internet Information System (I3S) as a general platform for building domain portals quickly and effectively. Using data mining and sequence analysis techniques, we develop a semi-automatic information extraction system (the I3 Metadata Extractor, I3ME) and design human-computer interaction procedures and system integration to minimize the manual effort of building domain portals. Taking the travel domain as a case study, I3S uses I3ME to extract structured metadata from the major travel websites in Taiwan and integrates it into 7,768 travel objects. Compared against fully manual extraction, the system's metadata extraction reaches a precision of up to 84.98% and a recall of up to 94.01%.
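The abstract above names sequence analysis as the core extraction technique but this record does not spell out I3ME's actual algorithm. The sketch below is only an illustration of the general idea: flatten HTML into a sequence of tag tokens and score the similarity of candidate blocks, standing in for BLAST-style alignment. The tokenization scheme, the LCS-based score, and the sample blocks are all assumptions, not the thesis's implementation.

```python
# Illustrative sketch only: tokenization and scoring are assumptions,
# not I3ME's actual method.
from html.parser import HTMLParser

class TagSequenceTranslator(HTMLParser):
    """Flatten an HTML fragment into a sequence of tag tokens, the usual
    first step before sequence-alignment-style block comparison."""
    def __init__(self):
        super().__init__()
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        self.sequence.append(tag)

    def handle_data(self, data):
        if data.strip():                 # ignore whitespace-only text nodes
            self.sequence.append("#text")

def translate(html):
    parser = TagSequenceTranslator()
    parser.feed(html)
    return parser.sequence

def similarity(seq_a, seq_b):
    """Crude alignment score: longest-common-subsequence ratio,
    a stand-in for a BLAST-style local alignment score."""
    m, n = len(seq_a), len(seq_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if seq_a[i] == seq_b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n)

# Two records with different text but identical structure align perfectly,
# which is what makes repeated metadata blocks detectable.
block_a = translate("<tr><td>Sun Moon Lake</td><td>Nantou</td></tr>")
block_b = translate("<tr><td>Chingjing Farm</td><td>Nantou</td></tr>")
print(similarity(block_a, block_b))  # prints 1.0
```

Real travel pages would of course need attribute-aware tokens and a local (not global) alignment, but the principle of "same template, different values" is the same.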
Abstract (English): With the exponential growth of the Web, users rely on search engines to find the information they need for making decisions. In the example of searching for "南投 旅遊" (Nantou travel), Google returns 874,000 result pages, of which only a few are valuable to the user. The user must click through page after page to piece together fragments of information from this large set of results. The key problem is that the user requires not only the "relevant pages" retrieved by search engines but also "useful knowledge" integrated to solve problems in the travel domain. This "knowledge" is the complete information integrated from the pages of different domains (websites). In the case of travel, the user needs at least the following information resources: maps; metadata for scenic spots, hotels, and transportation; and related news and blog posts. That is, a domain portal must integrate at least these information objects. To achieve this goal, we develop the Intelligent Internet Information System (I3S) as a general platform for constructing various domain portals efficiently and effectively. Based on data mining and sequence analysis methods, we build the I3S Metadata Extractor (I3ME) to semi-automatically extract metadata from web pages, and we propose a procedure of human-computer interaction to reduce the manual cost of building domain portals. In a case study of the travel domain, our system, Travel@I3S, extracts metadata from well-known travel websites in Taiwan and integrates it into 7,768 travel objects. Compared with fully manual operation, the precision and recall of I3ME (applied to Travel@I3S) are 84.98% and 94.01%, respectively.
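The precision and recall figures quoted above compare extracted metadata against a fully manual gold standard. A minimal sketch of that computation, using hypothetical field/value pairs (the actual gold set and schema are not given in this record):

```python
# Sketch of the evaluation protocol: extracted (field, value) pairs
# compared against a fully manual gold set. All data here is made up.
def precision_recall(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                       # correctly extracted pairs
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {("name", "Sun Moon Lake"),
        ("address", "Yuchi Township, Nantou"),
        ("category", "scenic spot")}
extracted = {("name", "Sun Moon Lake"),
             ("address", "Yuchi Township, Nantou"),
             ("category", "hotel")}                  # one wrong value

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f}, recall={r:.2f}")  # prints precision=0.67, recall=0.67
```

In the thesis's setting the two rates diverge (84.98% vs. 94.01%) because the extractor recovers most gold pairs (high recall) while also emitting some spurious ones (lower precision).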
Contents:
Chinese Abstract
Abstract
Contents
List of Tables
List of Figures
1. Introduction
2. Related Work
2.1. Focused Crawler
2.2. Page Partition
2.2.1. Tag-based Method
2.2.2. Vision-based Method
2.3. Information Extraction
3. Concepts and Ideas
3.1. Definitions of Blocks
3.2. Metadata Block (MB) Identification
3.3. Metadata Attributes Recognition
3.4. Concept of Mashup
3.5. Mashup Method
4. Intelligent Internet Information System
4.1. Domain Data Collection
4.2. Domain Metadata Extraction
4.3. Domain Integration (Mashup)
5. Methods and the Implementation
5.1. I3S’s Metadata Extraction
5.2. Attribute Extractor
5.2.1. Significance Estimation
5.2.2. Substring Filter
5.3. Metadata Block Extractor
5.3.1. Sequence Translator
5.3.2. SEG
5.3.3. Block Extractor
5.3.4. Metadata Block Identifier
5.3.5. Metadata Block Clustering
5.3.6. BLAST Environment
5.3.7. Value Extractor
6. Experiments and Evaluations
6.1. Data Source
6.2. Evaluations of Attribute Extractor
6.3. Evaluation of Metadata Blocks Extractor
6.4. Results of Mashup
7. Conclusion & Future Work
8. References
[1]Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J., “Basic local alignment search tool,” J. Mol. Biol. 215, 1990, pp. 403-410.
[2]Baker, T. and Birlinghoven, S., “A Grammar of Dublin Core,” D-Lib Magazine, October 2000, Vol. 6 NO. 10, http://www.dlib.org/dlib/october00/baker/10baker.html.
[3]Cai, D., He, X., Wen, J.-R. and Ma, W.-Y., “Block-level Link Analysis,” In Proceedings of the 27th Annual International ACM SIGIR Conference, Sheffield, UK, July 2004.
[4]Cai, D., Yu, S., Wen, J.-R. and Ma, W.-Y., “VIPS: a vision-based page segmentation algorithm,” Microsoft Technical Report, 2003.
[5]Chakrabarti, S., van den Berg, M. and Dom, B., “Focused crawling: A new approach to topic-specific web resource discovery,” Proceedings of the 8th World Wide Web Conference, Toronto, 1999.
[6]Chen, M. S., Han, J., and Yu, P. S., “Data mining: An overview from a database perspective,” IEEE Transactions on Knowledge and Data Engineering, 8(6):866–883, 1996.
[7]Chen, M. S., Jong, S. P., and Yu, P. S., “Efficient Data Mining for Path Traversal Patterns,” IEEE Transactions on Knowledge and Data Engineering, 10(2): 209-221, 1998.
[8]Chien, L. F., “PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval,” ACM SIGIR 1997.
[9]Chu K.P., “Automatic Site Map Generator based on Page-Block Identification and Hyperlink Analysis,” 2006.
[10]Debnath, S., Mitra, P., Pal, N. and Giles, C. L., “Automatic Identification of Informative Sections of Web Pages,” IEEE Trans. Knowledge and Data Eng., 2005.
[11]Dublin Core Metadata Element Set, Version 1.1: Reference Description, http://dublincore.org/documents/1999/07/02/dces/.
[12]Hu, Y., Li H., Cao Y., Meyerzon D., Zheng Q., “Automatic Extraction of Titles from General Documents using Machine Learning,” JCDL’05, 2005.
[13]Kao, H.-Y., Lin, S.-H., Ho, J.-M. and Chen, M.-S., “Entropy-Based Link Analysis for Mining Web Informative Structures,” Proc. ACM 11th Int’l Conf. Information and Knowledge Management (CIKM), 2002.
[14]Kao, H.-Y., Lin, S.-H., Ho, J.-M. and Chen, M.-S., “Mining Web Information Structures and Contents based on Entropy Analysis,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 1, Jan. 2004.
[15]Lin, S.-H. and Ho, J.-M., “Discovering Informative Content Blocks from Web Documents,” The Eighth ACM SIGKDD, 2002.
[16]W3C, “The World Wide Web Consortium (W3C),” http://www.w3.org/.
[17]W3C CSS, “Cascading Style Sheets (CSS),” http://www.w3.org/Style/CSS/.
[18]W3C DOM, “Document Object Model (DOM),” http://www.w3.org/DOM/.
[19]W3C HTML, “HyperText Markup Language (HTML),” http://www.w3.org/MarkUp/.
[20]W3C Semantic Web, http://www.w3.org/2001/sw/.
[21]Wootton, J. C., and Federhen, S., “Statistics of local complexity in amino acid sequences and sequence databases,” Computers & Chemistry 17, 1993, pp. 149-163.
[22]Wootton, J. C., and Federhen, S., “Analysis of compositionally biased regions in sequence databases,” Methods in Enzymology 266, 1996, pp. 554-571.
[24]Yang, L. C. and Lin, S. H., “A Workflow Generation System for Databases Based on XML Web Services,” Master Thesis of Department of Computer Science and Information Engineering, National Chi-Nan University, 2005.
[25]Yi, L. and Liu, B., “Eliminating noisy information in Web pages for data mining,” ACM SIGKDD, 2003.