National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Detailed Record

Author: 張毅民
Author (English): Yi Min Chang
Title: 基於服務導向架構之Web Crawler的設計與實作
Title (English): Design and Implementation of a Web Crawler Based on Service Oriented Architecture
Advisor: 張賢宗
Advisor (English): H. T. Chang
Degree: Master's
Institution: 長庚大學 (Chang Gung University)
Department: 資訊工程學系 (Computer Science and Information Engineering)
Discipline: Engineering
Field: Electrical and Information Engineering
Thesis type: Academic thesis
Year of publication: 2012
Graduation academic year: 100
Pages: 70
Keywords (Chinese): 網路爬蟲系統、重複網址、服務導向架構、服務模組
Keywords (English): Web Crawler System, URLs Overlapping, SOA, service module
Usage counts:
  • Cited: 3
  • Views: 1074
  • Rating: (none)
  • Downloads: 133
  • Bookmarked: 0
The Web content we are all familiar with today has grown at an astonishing rate ever since the concept of the World Wide Web was first proposed, and it has gradually reshaped business models, people's reading habits, and even their daily lives. With the help of search engines, this vast and rich information platform has dramatically changed how information circulates; because the Web is dynamic in nature, it is only through search engines that its information can be put to effective use.
Modern search engines are crawler-based, and a search engine's quality is determined mainly by the quality of its data collection. The Web Crawler System is responsible for this work, so it is no exaggeration to say that the quality of a Web Crawler System determines the quality of a search engine.
The architecture of a Web Crawler System falls into two categories: Centralized Distributed and Non-Centralized Distributed. Most modern crawler-based search engines adopt the first, in which most of the work (such as DNS lookup and URL filtering) is handled by a Control Center. When the number of downloaded web pages grows very large, the Control Center runs into bottlenecks such as URLs overlapping, and the other machines in the Web Crawler System are left without assigned work, sitting idle and wasting resources. I therefore designed a Web Crawler System based on Service-Oriented Architecture (SOA) that simplifies the Control Center's work and splits the functions of a large Web Crawler System into several service modules, lowering the probability that the slave servers sit idle so that resources are used effectively. Chapter 3 of this thesis describes this design in detail.
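The idea of keeping slave servers busy without routing every decision through the Control Center can be illustrated with a pull-based work queue. This is only an illustrative sketch: the names `frontier`, `fetch_page`, and `slave_worker` are invented here and are not the thesis's actual service-module interfaces.

```python
import queue
import threading

# Illustrative sketch only: these names are assumptions, not the
# thesis's actual module interfaces.
frontier = queue.Queue()          # shared pool of URLs waiting to be crawled
results = []
results_lock = threading.Lock()

def fetch_page(url):
    # Stand-in for a page-download service module; a real one would
    # issue an HTTP request here.
    return "<html>placeholder for %s</html>" % url

def slave_worker():
    # Each slave server pulls its next task itself instead of waiting
    # for a control center to assign one, so it is rarely idle.
    while True:
        try:
            url = frontier.get(timeout=0.1)
        except queue.Empty:
            return                # no more work: let the thread exit
        page = fetch_page(url)
        with results_lock:
            results.append((url, page))
        frontier.task_done()

for u in ("http://example.com/1", "http://example.com/2", "http://example.com/3"):
    frontier.put(u)

workers = [threading.Thread(target=slave_worker) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The point of the pull model is that the coordinator only has to keep the queue filled; it never has to decide which slave gets which URL, which is the kind of per-task bookkeeping that bottlenecks a Control Center at scale.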
Chapter 4 presents the performance of the implemented Web Crawler Based on Service-Oriented Architecture, with statistics on how many web pages the system can fetch in one day; it also tests whether the URL Filter Module I designed can quickly filter out duplicate URLs. The URL Filter Module itself is described in detail in Section 3.3.
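Section 4.1's title indicates the URL Filter is measured with a Bloom filter. The sketch below shows the general technique in Python; the class, bit-array size, and hash count are illustrative assumptions, not the thesis's implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for URL de-duplication (illustrative sketch;
    sizes and hashing scheme are assumptions, not the thesis's design)."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from one MD5 digest (16 bytes = four
        # 4-byte chunks, enough for num_hashes=4).
        digest = hashlib.md5(url.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def seen_before(self, url):
        """Return True if the URL may have been seen before (false
        positives possible, false negatives not), then record it."""
        hit = True
        for pos in self._positions(url):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                hit = False
                self.bits[byte] |= 1 << bit
        return hit
```

`seen_before(url)` returns False the first time a URL is offered and True afterwards. Because a Bloom filter can report false positives, some genuinely new URLs are discarded as duplicates, which is presumably why Section 4.2 measures "New URLs vs. False Positive URLs".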
Table of Contents
Advisor's Recommendation Letter
Oral Defense Committee Approval
National Central Library Electronic Thesis Authorization
Chang Gung University Thesis Copyright Authorization
Acknowledgments
Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
Chapter 2: Related Work
2.1 Non-Centralized Distributed Web Crawler System
2.2 Centralized Distributed Web Crawler System
Chapter 3: System Design and Implementation
3.1 System Architecture
3.2 System Operation Flow
3.3 URL Filter Module
Chapter 4: Experiment Design and Analysis
4.1 Time Cost of URL Filter by Using Bloom Filter
4.2 New URLs vs. False Positive URLs
4.3 Operation Results of the Web Crawler System
Chapter 5: Conclusion
Chapter 6: References

List of Figures
Figure 1: Growth of worldwide Web sites and hostnames, Aug. 1995 – Apr. 2012
Figure 2: Simplified view of the Control Center's work
Figure 3: Web Crawler architecture
Figure 4: Web Crawler operation flow
Figure 5: Non-Centralized Distributed Web Crawler System Architecture with Communication
Figure 6: Non-Centralized Distributed Web Crawler System Architecture without Communication
Figure 7: Three Modes of Parallel Web Crawler
Figure 8: Centralized Distributed Web Crawler System Architecture
Figure 9: KapokCrawler system architecture
Figure 10: SOA Based Web Crawler System Architecture
Figure 11: SOA Based Web Crawler System Architecture with Important Modules
Figure 12: Information of HeaderMetaData
Figure 13: Information of ExtractLinks
Figure 14: SOA Based Web Crawler System Work Flow
Figure 15: A hierarchical method for solving URLs overlapping
Figure 16: Bloom Filter Table
Figure 17: Total time cost of URL uniqueness checks for 1–10 million URLs (with no old URLs on disk at first)
Figure 18: Number of URLs in 24 hours

List of Tables
Table 1: Recorded information of each service module (Reg.log)
Table 2: HeaderMetaData XML tag definitions
Table 3: ExtractLinks XML tags
Table 4: Overhead time in our strategy
Table 5: New URLs vs. False Positive URLs
Table 6: Number of URLs per hour
Table 7: System operation results