National Digital Library of Theses and Dissertations in Taiwan

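Graduate student: 王信傑
Graduate student (romanized): Xin-Jie Wang
Thesis title: 使用視覺化訊息的資料記錄探勘及屬性標記於網際網路資料擷取之研究
Thesis title (English): Mining Data Records and Attribute Labeling in Web Data Extraction with Visual Information
Advisor: 蔡志忠
Advisor (romanized): Jyh-Jong Tsay
Degree: Master's
Institution: National Chung Cheng University
Department: Graduate Institute of Communications Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2009
Graduation academic year: 97
Language: English
Number of pages: 50
Keywords (Chinese): 視覺化訊息; 屬性標記; 資料記錄; 網際網路資料探勘
Keywords (English): Visual Information; Attribute Labeling; Web Data Extraction; Data Records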
The rapid growth of the Internet has made online transactions more convenient. More and more people buy the goods they need online, and shopping websites have become increasingly popular and diverse. Demand for comparing products across sites is growing as well, so cross-site information extraction is increasingly important. We propose a method that is closer to how users browse information: visual information is used when searching a web page for data records. Combinations of adjacent blocks are compared with other combinations of adjacent blocks. Once blocks with the same combination type have been found, a visual model is built to describe them. We then return to the page as rendered in the browser and find the remaining blocks that match the visual model. Finally, these data records are expressed through their HTML tag structure as a tag model. When building a wrapper for a shopping search engine, these tag models can be integrated to obtain the wrapper for that search engine. When other result pages from the same search engine need to be extracted, the wrapper that has already been built can be reused.
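The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the `Block` fields, the 10% tolerance, the similarity test, and every function name here are assumptions made for illustration, and each "combination of adjacent blocks" is simplified to a single block. The sketch finds a seed run of adjacent, visually similar blocks, averages them into a visual model, collects every block on the page that fits the model, and records the tag paths of the matches as a stand-in for the tag model.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Block:
    """A rendered page block (hypothetical fields, in pixels)."""
    tag_path: str
    x: int
    width: int
    height: int


@dataclass(frozen=True)
class VisualModel:
    """Average geometry of the seed blocks."""
    x: float
    width: float
    height: float


def close(u, v, tol):
    # Relative comparison; the max(..., 1) guard avoids a zero tolerance at 0.
    return abs(u - v) <= tol * max(u, v, 1)


def similar(a, b, tol):
    # Works for Block/Block and VisualModel/Block pairs alike.
    return (close(a.x, b.x, tol)
            and close(a.width, b.width, tol)
            and close(a.height, b.height, tol))


def find_seed(blocks, tol=0.1):
    """Longest run of adjacent blocks that look alike (at least two)."""
    best, run = [], []
    for blk in blocks:
        run = run + [blk] if run and similar(run[-1], blk, tol) else [blk]
        if len(run) > len(best):
            best = run
    return best if len(best) >= 2 else []


def extract_records(blocks, tol=0.1):
    """Return (records, tag_model) for one page."""
    seed = find_seed(blocks, tol)
    if not seed:
        return [], set()
    model = VisualModel(mean(b.x for b in seed),
                        mean(b.width for b in seed),
                        mean(b.height for b in seed))
    # Second pass over the whole page: pick up records not adjacent to the seed.
    records = [b for b in blocks if similar(model, b, tol)]
    return records, {b.tag_path for b in records}


# Invented example page: two adjacent records, a banner, then a third record.
page = [
    Block("html/body/div", 0, 960, 80),        # header
    Block("html/body/table/tr", 20, 600, 120),
    Block("html/body/table/tr", 20, 600, 118),
    Block("html/body/div", 20, 600, 40),       # banner between records
    Block("html/body/table/tr", 20, 600, 122),
    Block("html/body/div", 0, 960, 60),        # footer
]
records, tag_model = extract_records(page)
```

In this invented page the banner interrupts the run of records, so only the first two form the seed; the third is recovered in the second pass because it fits the visual model. A wrapper in the sense of the abstract would then combine the tag models gathered from several result pages of the same search engine.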
1 Introduction
2 System Overview and Data Reprocessing
2.1 Related Work
2.2 System Overview
2.3 Data Reprocessing
2.3.1 HTML DOM
2.3.2 Building the HTML Vision Tree
3 Method Fundamentals
3.1 Finding Data Regions
3.1.1 Comparing of Generalized Nodes
3.1.2 Similarity of Generalized Nodes
3.1.3 Comparing of Candidate Data Regions
3.1.4 Finding Data Region
3.2 Building Visual Model
3.3 Attribute Labeling and Transforming to Tag Model
4 Wrapper Building
5 Experiments
5.1 Performance Measures
5.2 Shopping Search Engine
5.3 General Search Engine
6 Conclusion