National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Detailed Record

Author: 張毅民
Author (English): Yi Min Chang
Title: 基於服務導向架構之Web Crawler的設計與實作
Title (English): Design and Implementation of a Web Crawler Based on Service Oriented Architecture
Advisor: 張賢宗
Advisor (English): H. T. Chang
Degree: Master's
Institution: 長庚大學 (Chang Gung University)
Department: 資訊工程學系 (Computer Science and Information Engineering)
Discipline: Engineering
Field: Electrical and Information Engineering
Thesis type: Academic thesis
Year of publication: 2012
Graduation academic year: 100
Pages: 70
Keywords (Chinese): 網路爬蟲系統、重複網址、服務導向架構、服務模組
Keywords (English): Web Crawler System, URLs Overlapping, SOA, service module
Usage counts:
  • Cited: 3
  • Views: 1074
  • Rating: (none)
  • Downloads: 133
  • Bookmarked: 0
The Web content we are all familiar with today has grown at an astonishing rate ever since the concept of the World Wide Web was first proposed, and it has gradually reshaped business models, people's reading habits, and even their daily lives. With the help of search engines, this vast and rich information platform has dramatically changed how information circulates; because the Web is dynamic in nature, it is only through search engines that its information can be put to effective use.
Modern search engines are crawler-based, and a search engine's quality is determined mainly by the quality of its data collection. The Web Crawler System is responsible for this work, so it is no exaggeration to say that the quality of a Web Crawler System determines the quality of a search engine.
The architecture of a Web Crawler System falls into two categories: Centralized Distributed and Non-Centralized Distributed. Most modern crawler-based search engines adopt the first, in which most of the work (such as DNS lookup and URL filtering) is handled by a Control Center. When the number of downloaded web pages grows very large, the Control Center runs into bottlenecks such as URLs overlapping, and the other machines in the Web Crawler System are left without assigned work, sitting idle and wasting resources. I therefore designed a Web Crawler System based on Service-Oriented Architecture (SOA) that simplifies the Control Center's work and splits the functions of a large Web Crawler System into several service modules, lowering the probability that the slave servers sit idle so that resources are used effectively. Chapter 3 of this thesis describes this design in detail.
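The idea of keeping slave servers busy without routing every decision through the Control Center can be illustrated with a pull-based work queue. This is only an illustrative sketch: the names `frontier`, `fetch_page`, and `slave_worker` are invented here and are not the thesis's actual service-module interfaces.

```python
import queue
import threading

# Illustrative sketch only: these names are assumptions, not the
# thesis's actual module interfaces.
frontier = queue.Queue()          # shared pool of URLs waiting to be crawled
results = []
results_lock = threading.Lock()

def fetch_page(url):
    # Stand-in for a page-download service module; a real one would
    # issue an HTTP request here.
    return "<html>placeholder for %s</html>" % url

def slave_worker():
    # Each slave server pulls its next task itself instead of waiting
    # for a control center to assign one, so it is rarely idle.
    while True:
        try:
            url = frontier.get(timeout=0.1)
        except queue.Empty:
            return                # no more work: let the thread exit
        page = fetch_page(url)
        with results_lock:
            results.append((url, page))
        frontier.task_done()

for u in ("http://example.com/1", "http://example.com/2", "http://example.com/3"):
    frontier.put(u)

workers = [threading.Thread(target=slave_worker) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The point of the pull model is that the coordinator only has to keep the queue filled; it never has to decide which slave gets which URL, which is the kind of per-task bookkeeping that bottlenecks a Control Center at scale.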
Chapter 4 presents the performance of the implemented Web Crawler Based on Service-Oriented Architecture, with statistics on how many web pages the system can fetch in one day; it also tests whether the URL Filter Module I designed can quickly filter out duplicate URLs. The URL Filter Module itself is described in detail in Section 3.3.
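Section 4.1's title indicates the URL Filter is measured with a Bloom filter. The sketch below shows the general technique in Python; the class, bit-array size, and hash count are illustrative assumptions, not the thesis's implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for URL de-duplication (illustrative sketch;
    sizes and hashing scheme are assumptions, not the thesis's design)."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from one MD5 digest (16 bytes = four
        # 4-byte chunks, enough for num_hashes=4).
        digest = hashlib.md5(url.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def seen_before(self, url):
        """Return True if the URL may have been seen before (false
        positives possible, false negatives not), then record it."""
        hit = True
        for pos in self._positions(url):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                hit = False
                self.bits[byte] |= 1 << bit
        return hit
```

`seen_before(url)` returns False the first time a URL is offered and True afterwards. Because a Bloom filter can report false positives, some genuinely new URLs are discarded as duplicates, which is presumably why Section 4.2 measures "New URLs vs. False Positive URLs".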
Table of Contents
Advisor's Recommendation Letter
Oral Defense Committee Approval
National Central Library Electronic Thesis Authorization
Chang Gung University Thesis Copyright Authorization
Acknowledgments
Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
Chapter 2: Related Work
2.1 Non-Centralized Distributed Web Crawler System
2.2 Centralized Distributed Web Crawler System
Chapter 3: System Design and Implementation
3.1 System Architecture
3.2 System Operation Flow
3.3 URL Filter Module
Chapter 4: Experiment Design and Analysis
4.1 Time Cost of URL Filter by Using Bloom Filter
4.2 New URLs vs. False Positive URLs
4.3 Operation Results of the Web Crawler System
Chapter 5: Conclusion
Chapter 6: References

List of Figures
Figure 1: Growth of worldwide Web sites and hostnames, Aug. 1995 – Apr. 2012
Figure 2: Simplified view of the Control Center's work
Figure 3: Web Crawler architecture
Figure 4: Web Crawler operation flow
Figure 5: Non-Centralized Distributed Web Crawler System Architecture with Communication
Figure 6: Non-Centralized Distributed Web Crawler System Architecture without Communication
Figure 7: Three Modes of Parallel Web Crawler
Figure 8: Centralized Distributed Web Crawler System Architecture
Figure 9: KapokCrawler system architecture
Figure 10: SOA Based Web Crawler System Architecture
Figure 11: SOA Based Web Crawler System Architecture with Important Modules
Figure 12: Information of HeaderMetaData
Figure 13: Information of ExtractLinks
Figure 14: SOA Based Web Crawler System Work Flow
Figure 15: A hierarchical method for solving URLs overlapping
Figure 16: Bloom Filter Table
Figure 17: Total time cost of URL uniqueness checks for 1–10 million URLs (with no old URLs on disk at first)
Figure 18: Number of URLs in 24 hours

List of Tables
Table 1: Recorded information of each service module (Reg.log)
Table 2: HeaderMetaData XML tag definitions
Table 3: ExtractLinks XML tags
Table 4: Overhead time in our strategy
Table 5: New URLs vs. False Positive URLs
Table 6: Number of URLs per hour
Table 7: System operation results