National Digital Library of Theses and Dissertations in Taiwan

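Author: Wen-Feng Yao (姚文鋒)
Thesis title: Block-level Ranking for Intra-Website Pages (網站內網頁之區塊等級分析)
Advisor: I-Chen Wu (吳毅成)
Degree: Master's
Institution: National Chiao Tung University (國立交通大學)
Department: In-service Master's Program, College of Computer Science (Computer Science Division)
Field: Engineering
Subfield: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Number of pages: 37
Keywords (translated from Chinese): website; block level; link analysis
Keywords (English): Intra-Website; Block-level; Link Analysis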
According to published statistics, the web held more than 14 billion pages as of June 2007. Using a data source of this size effectively is an important problem. When the location of the desired information is unknown, users typically rely on a search engine to find it; when the location is already known, data extraction techniques are used to retrieve the data more efficiently.
BODE (Browser Oriented Data Extraction), developed in our laboratory, is one such web data extraction system. Through its graphical interface, users point and click on the data they want to extract; the system then generates the script needed for the extraction (a BODE script) and performs the extraction.
Building a BODE script, however, requires some familiarity with BODE script syntax, XPath, and HTML tags. To lower the barrier to using BODE, this thesis proposes an algorithm that automatically identifies the useful data blocks within a single website, as a step toward generating BODE scripts automatically.
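As a rough illustration of why XPath and HTML knowledge matter for this kind of extraction, the Python sketch below selects one block of a page by XPath and lists the text it contains. It is neither BODE script nor the algorithm proposed in the thesis. The page markup, the `content` id, and the `extract_block` helper are hypothetical and chosen only for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML page; the ids and layout are illustrative only.
PAGE = """
<html>
  <body>
    <div id="nav"><a href="/">Home</a><a href="/news">News</a></div>
    <div id="content">
      <h1>Product A</h1>
      <p class="price">120</p>
    </div>
    <div id="footer"><p>Contact us</p></div>
  </body>
</html>
"""


def extract_block(page, xpath):
    """Return the non-empty text fragments under the first element matching xpath."""
    root = ET.fromstring(page)
    block = root.find(xpath)  # ElementTree supports a limited XPath subset
    if block is None:
        return []
    return [text.strip() for text in block.itertext() if text.strip()]


# Someone has to know that the useful data sits in div#content;
# that is the manual step the thesis aims to automate.
fields = extract_block(PAGE, ".//div[@id='content']")
print(fields)
```

Here the path `.//div[@id='content']` is written by hand. Picking such a block automatically, without the user reading the HTML, is the kind of task the abstract describes.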
Chinese Abstract
English Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Thesis Overview and Outline
Chapter 2 Related Work
2.1 Web Page Block Segmentation
2.2 Link Analysis
2.3 Block-level Link Analysis
Chapter 3 Algorithm
3.1 Block Merging and Adjacency Matrix Construction
3.1.1 Block Merging
3.1.2 Adjacency Matrix Construction
3.2 Weight Assignment
3.2.1 Hub Weights
3.2.2 Authority Weights
3.3 Block-level Link Analysis with Merging
Chapter 4 System Implementation and Experiments
4.1 System Implementation Overview
4.1.1 Development Environment
4.1.2 Modules
4.1.3 System Flow
4.2 Experiments
4.2.1 Data Sources
4.2.2 Experimental Results
Chapter 5 Conclusion
5.1 Summary
5.2 Related Applications
5.3 Future Work
References