Graduate student (English name): Hsin-Mao Huang
Thesis title (English): Web Data Retrieval, Management, and Analysis
Advisor (English name): Ming-Syan Chen
Keywords (English): Web Data Mining; Distribution Computing; Peer-to-Peer System
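To capture individual users' access patterns, we designed and implemented an access pattern collection server to carry out Web data mining. Through the concept of page conversion, the proposed method resolves in practice the difficulty that proxy servers pose for collecting user behavior. The results confirm that the traversal patterns produced with our method not only contain more information than the patterns produced by the original Web server, but are also more accurate.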
The World Wide Web is a popular, interactive medium for disseminating information, and it has become a huge and mostly unstructured data repository. Peer-to-peer (P2P) systems have also become a popular file-sharing platform in recent years. In this dissertation, we consider three issues: capturing individual users' access patterns for Web data mining, the influence of users' clicking behavior and interests on Web structure mining, and the search policy for P2P systems.
To capture individual users' access patterns, we design and implement an access pattern collection server (APCS) to conduct data mining on the Web. By using the concept of page conversion, the proposed method resolves the difficulty imposed by proxy servers and captures Web user behavior effectively. To validate our results, traversal patterns generated with the devised mechanism are compared with those produced from ordinary Web servers.
In addition, to account for page readers' contribution to Web structure mining, we discuss the influence of users' interests in the VIPAS system. We devise a new algorithm, called Adjustable Cluster based VIPAS (AC-VIPAS), that adjusts Web pages' scores according to the recommendations of users with similar interests. An experiment is conducted to evaluate the performance of the content-based user cluster.
Finally, to improve search performance in peer-to-peer systems, we propose a cluster-based peer-to-peer system called PeerCluster. In PeerCluster, all participating computers are grouped into interest clusters, each of which contains computers that share the same interests. To route and broadcast messages efficiently across and within interest clusters, a hypercube topology is employed. Moreover, we augment PeerCluster with a system recovery mechanism to make it robust against unpredictable computer and network failures.
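As an illustration of why a hypercube topology suits broadcasting, the following is a minimal sketch of the textbook duplicate-free hypercube broadcast. It is not taken from the dissertation: the function name `hypercube_broadcast`, the integer node IDs, and the assumption of a complete hypercube with 2^dims nodes are all hypothetical choices made for this example, and PeerCluster's actual protocol may differ.

```python
def hypercube_broadcast(source: int, dims: int) -> dict[int, int]:
    """Simulate a duplicate-free broadcast in a complete hypercube.

    Nodes are the integers 0 .. 2**dims - 1 (an assumption of this sketch).
    Two nodes are neighbors when their IDs differ in exactly one bit.
    Returns a mapping {node: number of hops from source}.
    """
    hops = {source: 0}
    # Each pending entry is (node, lowest dimension it may still forward on).
    pending = [(source, 0)]
    while pending:
        node, start_dim = pending.pop()
        for dim in range(start_dim, dims):
            neighbor = node ^ (1 << dim)  # flip one bit to reach a neighbor
            hops[neighbor] = hops[node] + 1
            # The receiver forwards only on higher dimensions, so no node
            # is ever sent the same message twice.
            pending.append((neighbor, dim + 1))
    return hops


if __name__ == "__main__":
    reached = hypercube_broadcast(source=0, dims=3)
    for node in sorted(reached):
        print(f"node {node:03b} reached in {reached[node]} hop(s)")
```

Under these assumptions every one of the 2^dims nodes receives the message exactly once, using 2^dims - 1 transmissions and at most dims hops.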
1 Introduction
1.1 Motivation and Overview of the Dissertation
1.2 Organization of the Dissertation
2 Capturing User Access Patterns in the Web for Data Mining
2.1 Introduction
2.2 Access Pattern Collection Server
2.2.1 Enciphering Module
2.3 Employing the APCS Logs for Traversal Pattern Derivation
2.3.1 Mining on Logs from Ordinary Web Servers
2.3.2 Mining on Logs from APCS
2.3.3 Remark
2.4 Summary
3 AC-VIPAS: Adjustable Cluster Based Virtual Link Powered Authority Search
3.1 Introduction
3.2 Preliminary
3.2.1 The Notion of Virtual Links
3.2.2 VIPAS Algorithm
3.3 AC-VIPAS: Adjustable Cluster Based VIPAS Algorithm
3.3.1 Content Based User Cluster
3.3.2 Adjustment of Web Pages' Scores
3.3.3 Discussion
3.4 Experimental Analysis
3.5 Summary
4 PeerCluster: A Cluster-Based Peer-to-Peer System
4.1 Introduction
4.2 Preliminaries
4.3 Cluster-based Peer-to-Peer System
4.3.1 Design of PeerCluster
4.3.2 Description of Protocols
4.3.3 Scalability
4.4 Performance Analysis
4.4.1 Simulation Model
4.4.2 Experimental Results
4.5 Summary
5 Conclusions
