National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

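Author: 黃信貿
Author (English): Hsin-Mao Huang
Title: 全球資訊網中之資料擷取、管理與分析
Title (English): Web Data Retrieval, Management, and Analysis
Advisor: 陳銘憲
Advisor (English): Ming-Syan Chen
Degree: Doctoral
Institution: National Taiwan University
Department: Graduate Institute of Electrical Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2005
Academic year of graduation: 93 (Republic of China calendar)
Language: English
Pages: 70
Keywords (translated from Chinese): Web data mining; distributed computing; peer-to-peer systems
Keywords (English): Web Data Mining; Distributed Computing; Peer-to-Peer System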
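The World Wide Web is today a popular, interactive medium for disseminating information. The Internet has become a huge and largely unstructured data repository, and peer-to-peer (P2P) systems have become widely used file-sharing platforms. In this dissertation, we examine three topics: capturing individual users' access patterns for Web data mining, the influence of users' clicking behavior and interests on Web structure mining, and search policies for P2P systems.
To capture individual users' access patterns, we design and implement an access pattern collection server for Web data mining. Using the concept of page conversion, the proposed method resolves in practice the difficulty that proxy servers cause in collecting user behavior. The results confirm that the traversal patterns generated with our method are not only more informative but also more accurate than those generated from ordinary Web servers.
In addition, to examine the contribution of page readers to Web structure mining, the influence of users' reading behavior has been discussed in the VIPAS system. We devise a new algorithm, called AC-VIPAS, that fine-tunes the ranking of Web pages according to the recommendations of users with similar interests. We set up experiments to evaluate the performance of content-based user clustering. The experimental results show that the accuracy of the proposed content-based user clustering algorithm is higher than that of the traditional count-based user clustering algorithm.
Finally, to improve search efficiency in P2P systems, we propose a cluster-based P2P system called PeerCluster. In PeerCluster, every participating computer is assigned to an interest cluster, and all computers within an interest cluster share an interest in the same topic. To route and broadcast quickly among interest clusters, we implement the system on a hypercube network topology. We also equip PeerCluster with an automatic system recovery mechanism to withstand unpredictable computer failures and network outages.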
The World Wide Web is a popular and interactive medium for disseminating information today. The Web has become a huge and mostly unstructured data repository. Peer-to-peer (P2P) systems have also become popular file-sharing platforms in recent years. In this dissertation, we consider three issues: capturing individual users' access patterns for Web data mining, the influence of users' clicking behavior and interests on Web structure mining, and the search policy for P2P systems.
To capture individual users' access patterns, we design and implement an access pattern collection server (APCS) to conduct data mining on the Web. By using the concept of page conversion, the proposed method resolves the difficulty imposed by proxy servers and captures Web user behavior effectively. Using the devised mechanism, traversal patterns are generated and compared with those produced by ordinary Web servers to validate our results.
In addition, to account for page readers' contribution to Web structure mining, we discuss the influence of users' interests in the VIPAS system. We devise a new algorithm, called Adjustable Cluster Based VIPAS (AC-VIPAS), that adjusts Web pages' scores according to the recommendations of users with similar interests. An experiment is conducted to evaluate the performance of content-based user clustering.
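The idea of adjusting page scores from the recommendations of like-minded users can be illustrated with a minimal sketch. This is not the AC-VIPAS algorithm, whose details the abstract does not give. It assumes that each user is represented by a bag of terms drawn from the pages they read, that similarity is measured by cosine similarity, and that a page's score is raised by a small amount for each sufficiently similar user who visited it. The function names and the `threshold` and `weight` parameters are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency profiles."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjust_scores(base_scores, user_profile, other_profiles, other_visits,
                  threshold=0.5, weight=0.1):
    """Raise the score of pages visited by users similar to `user_profile`.

    All parameters are illustrative assumptions:
    base_scores    -- page -> score from some link-based ranking
    other_profiles -- user id -> Counter of terms from pages that user read
    other_visits   -- user id -> list of pages that user visited
    """
    adjusted = dict(base_scores)
    for uid, profile in other_profiles.items():
        sim = cosine_similarity(user_profile, profile)
        if sim < threshold:
            continue  # ignore users whose interests differ too much
        for page in other_visits.get(uid, []):
            if page in adjusted:
                adjusted[page] += weight * sim
    return adjusted
```

A content-based profile of this kind compares what users read, whereas a count-based approach would compare only how often they visit the same pages.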
Finally, to improve search performance in peer-to-peer systems, we propose a cluster-based peer-to-peer system called PeerCluster. In PeerCluster, all participating computers are grouped into interest clusters, each of which contains computers that share the same interests. To route and broadcast messages efficiently across and within interest clusters, a hypercube topology is employed. Moreover, we augment PeerCluster with a system recovery mechanism that makes it robust against unpredictable computer and network failures.
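To show why a hypercube topology suits routing and broadcasting, the following sketch implements the textbook hypercube schemes rather than PeerCluster's own protocols, which the abstract does not specify. It assumes that nodes carry integer IDs of `dim` bits and that two nodes are neighbors when their IDs differ in exactly one bit. The function names are hypothetical.

```python
def hypercube_route(src: int, dst: int) -> list[int]:
    """Return the node IDs visited from src to dst, correcting one
    differing address bit per hop (lowest bit first)."""
    path = [src]
    current = src
    diff = src ^ dst  # bits in which the two addresses differ
    bit = 0
    while diff:
        if diff & 1:
            current ^= 1 << bit  # hop to the neighbor across this dimension
            path.append(current)
        diff >>= 1
        bit += 1
    return path


def hypercube_broadcast(root: int, dim: int) -> list[tuple[int, int]]:
    """Return the (sender, receiver) pairs of a binomial-tree broadcast
    that reaches all 2**dim nodes in dim rounds, each node exactly once."""
    sends = []
    informed = [root]
    for bit in range(dim):
        newly_informed = []
        for node in informed:
            neighbor = node ^ (1 << bit)
            sends.append((node, neighbor))
            newly_informed.append(neighbor)
        informed.extend(newly_informed)
    return sends
```

The number of hops on a route equals the number of bits in which the two IDs differ, so it never exceeds `dim`, and a broadcast sends exactly `2**dim - 1` messages with no duplicates.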
1 Introduction
1.1 Motivation and Overview of the Dissertation
1.2 Organization of the Dissertation
2 Capturing User Access Patterns in the Web for Data Mining
2.1 Introduction
2.2 Access Pattern Collection Server
2.2.1 Enciphering Module
2.3 Employing the APCS Logs for Traversal Pattern Derivation
2.3.1 Mining on Logs from Ordinary Web Servers
2.3.2 Mining on Logs from APCS
2.3.3 Remark
2.4 Summary
3 AC-VIPAS: Adjustable Cluster Based Virtual Link Powered Authority Search
3.1 Introduction
3.2 Preliminary
3.2.1 The Notion of Virtual Links
3.2.2 VIPAS Algorithm
3.3 AC-VIPAS: Adjustable Cluster Based VIPAS Algorithm
3.3.1 Content Based User Cluster
3.3.2 Adjustment of Web Pages' Scores
3.3.3 Discussion
3.4 Experimental Analysis
3.5 Summary
4 PeerCluster: A Cluster-Based Peer-to-Peer System
4.1 Introduction
4.2 Preliminaries
4.3 Cluster-Based Peer-to-Peer System
4.3.1 Design of PeerCluster
4.3.2 Description of Protocols
4.3.3 Scalability
4.4 Performance Analysis
4.4.1 Simulation Model
4.4.2 Experimental Results
4.5 Summary
5 Conclusions