(34.204.201.220) 您好!臺灣時間:2021/04/20 11:54
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果

詳目顯示:::

我願授權國圖
: 
twitterline
研究生:陳姿伊
研究生(外文):Tsu-I Chen
論文名稱:支援網際網路個人化資訊過濾的索引之設計
論文名稱(外文):Design of Index Structures for Supporting Personalized Information Filtering on the Internet
指導教授:張玉盈
指導教授(外文):Ye-In Chang
學位類別:碩士
校院名稱:國立中山大學
系所名稱:資訊工程學系研究所
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2003
畢業學年度:91
語文別:英文
論文頁數:125
中文關鍵詞:個人化索引資料過濾使用者檔案網頁挖掘
外文關鍵詞:indexinformation filteringpersonalizationprofileweb usage mining
相關次數:
  • 被引用被引用:0
  • 點閱點閱:81
  • 評分評分:系統版面圖檔系統版面圖檔系統版面圖檔系統版面圖檔系統版面圖檔
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:0
由於網際網路的蓬勃發展,使得資料過濾遇到許多新的挑戰。資料過濾是一個研究的領域,其目的乃是發展出一個工具可用來在一堆資料中,過濾出有用的資料。資料過濾可以經由網頁和使用者需要的資訊比對,找出好的結果。首先,使用者先描述他們需要的是什麼,這叫做使用者檔案(profile)。將這些檔案建立出一個索引,一堆網頁將會和這個索引作比對。每一個網頁的表示方式也和使用者檔案的表示方式相同。經由網頁和使用者檔案的比對,使用者有興趣的網頁就可以被找出來。最後,根據比對的結果,過濾出來的網頁就可以推薦給使用者。因此,在資料過濾的服務中,我們所考慮的重點是如何建立一個索引可以做有效率的比對。當一個使用者檔案的索引被建立之後,我們可以減少修正使用者檔案所需花的空間和時間。在這篇論文中,首先,我們提出一個方法,稱為 count-based tree method,它會計算每一個關鍵字出現的次數,用來減少 the tree method 所需花費的大量空間,。接下來,我們提出三個以大的項目集合(large itemset)為基礎的方法,分別是 the count-major large itemset method、the weighted large itemset method 和 the hybrid method。在這三個以大的項目集合為基礎的方法中,我們首先將有相似興趣的使用者分在同一群。接下來,針對每一群,我們利用關連式法則的挖掘技術來幫助我們建立出索引。我們利用在關連式法則的挖掘中,其中之一個有名的方法 Apriori 演算法的概念,來設計這三個方法。但是,我們修改了在 Apriori 演算法中的決定最小的贊成值(support)和最後的目標。我們不一定總是得到最大的項目集合(Lk)。有時,我們只會選擇Lw,在這裡 w < k。總結來說,我們提出的這四個方法所花費的空間都比 Yan 和 Garcia-Molina 他們所提的 the tree method所花費的空間來的少。根據我們的模擬結果,我們的這四個方法會根據不同的資料而能提供最好的結果。接下來,我們提出一個儲存有不同輕重關鍵字的索引的方法,來減少更新的花費。當使用者的興趣常常在改變的時候,我們必須考慮如何提供一個索引結構,使得更新的花費比較少。在這個方法中,我們將每一個關鍵字的重量考慮進來。那就是每一個關鍵字可以被區分為 long-term interests和 short-term interests,當這個關鍵字的重量大於某個門檻,我們就稱它為 long-term interests,如果這個關鍵字的重量小於某個門檻,我們就稱它為 short--term interests。由於修改 short-term interests 的機會比修改 long-term interests 的機會來的大很多,因此我們可以更新的範圍是局部地。根據我們模擬的結果,我們的方法確實可以減少 Wu 和 Chen 他們的方法在更新上的花費。
Owing to the booming development of the WWW, it creates many new challenges for information filtering. Information Filtering (IF) is an area of research that develops tools for discriminating between relevant and irrelevant information. IF can find good matches between the web pages and the users'' information needs. Users first give descriptions about what they need, i.e., user profiles, to start the services. A profile index is built on these profiles. A series of incoming web pages will be put into the matching process. Each incoming web page is represented in the same form of the user profile. In this way, the users who are interested in an incoming web page can be identified by comparing the descriptions of the web page with each user profile. At last, the web page will be recommended to the users whose profiles belong to the filtered results. Therefore, a critical issue of the information filtering service is how to index the user profiles for an efficient matching process. When we index the user profile, we can reduce the costs of storage space and the processing time for modifying the user profiles. In this thesis, first, we propose a count-based tree method, which takes the count of each keyword into consideration, to reduce the large storage spaces as needed by the tree method. Next, three large-itemset-based methods are proposed to reduce the storage space, which are called the count-major large itemset method, the weighted large itemset method and the hybrid method. In these three large-itemset-based methods, we first cluster profiles with similar interests into the same group. Next, for each cluster, we apply the mining association rules techniques to help us to construct the index strategies. We design three methods by using
the idea of the Apriori algorithm which is one of well-known approaches in mining association rules. But, we modify the minimum support and the goal in the Apriori algorithm. We may not always output the large itemset Lk. That is, we may only use Lw, where w < k. In summary, the cost of storage spaces of our four methods are less than that of the tree method proposed by Yan and Garcia-Molina. According to our simulation results, each of our four methods may provide the best result when different input data sets. Next, we propose a large-itemset-based approach to the incremental update of the index structure for storing keywords to
reduce the update cost. When someone''s interests are often changed, we must care about the way how to provide the low update cost of the index structure. We take the weight of each keyword into consideration. That is, each keyword can be distinguished the long-term interest which has weight above the threshold from the short-term interest which has weight below the threshold. Owing to that the probability of modifying the short-term interests is higher than that of modifying the long-term interests, we can update the short-term interests locally. According to our simulation results, our method really can reduce the update cost as needed by Wu and Chen'' methods.
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Information Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Two Kinds of Filtering Approaches . . . . . . . . . . . . . . . 3
1.1.2 Search Engines . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Path Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Index Structures on the Internet . . . . . . . . . . . . . . . . . . . . . 9
1.4 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 20
2. A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 The Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . . . . 21
2.1.1 The Brute Force (BF) Method . . . . . . . . . . . . . . . . . . 22
2.1.2 The Pro le Indexing (PI) Method . . . . . . . . . . . . . . . . 22
2.1.3 The Selective Pro le Indexing (SPI) Method . . . . . . . . . . 23
2.2 The Boolean Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.1 Yan and Garcia-Molona''s Methods Under the Boolean Model 27
2.2.2 Wu and Chen''s Methods Under the Boolean Model . . . . . . 33
2.2.2.1 Index Path with Path Signatures . . . . . . . . . . . 33
2.2.2.2 Index Graph with Path Signatures . . . . . . . . . . 35
2.2.2.3 Index Path with Pro le Sets . . . . . . . . . . . . . . 37
2.2.2.4 Index Graph with Pro le Sets . . . . . . . . . . . . . 39
3. The Count-Based Tree Method . . . . . . . . . . . . . . . . . . . . . 41
3.1 The Count-Based Tree Method . . . . . . . . . . . . . . . . . . . . . 41
3.2 A Special Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 The Search Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4. The Largest-Itemset-Based Methods . . . . . . . . . . . . . . . . . . 51
4.1 A Survey of the Apriori Algorithm . . . . . . . . . . . . . . . . . . . 51
4.2 The Count-Major Large Itemset Method . . . . . . . . . . . . . . . . 53
4.3 The Weighted Large Itemset Method . . . . . . . . . . . . . . . . . . 64
4.4 The Hybrid Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5. The Large-Itemset-Based Approach to the Incremental Update . 84
5.1 The Update Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 The Update Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.1 The Simulation Model . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 Simulation Results of the Boolean Model-Based Methods . . . . . . . 106
6.3 Simulation Results of the Vector Space Model-Based Methods . . . . 115
7. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 122
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int''l Conf. Very Large Data Bases, pp. 490-501, Sept. 1994.
[2] Sergey Brin and Larry Page. Google Search Engine. URL: http://google.stanford.edu.
[3] E. Y. Chang, C. Li, H. Garcia-Molina and G. Wiederhold, Clustering for Approximate Similarity Search in High-Dimensional Spaces," IEEE Trans. on Knowledge and Data Engineering, Vol. 14, No. 4, pp. 792-808, July 2002.
[4] M. S. Chen, J. S. Park and P. S. Yu, EÆcient Data Mining for Path Traversal Patterns," IEEE Trans. on Knowledge and Data Engineering, Vol. 10, No. 2, pp. 209-221, March 1998.
[5] J. Cho, N. Shivakumar and H. Garcia-Molina, Finding Replicated Web Collections," Proc. SIGMOD Conf., pp. 355-366, 2000.
[6] E. J. Glover, S. Lawrence, M. D. Gordon, W. P. Birmingham and C. L. Giles, Web Search-YourWay," Communications of the ACM, Vol. 44, No. 12, pp. 97-102, 2001.
[7] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon and J. Riedl, Applying Collaborative Filtering to Usenet News," Communications of the ACM, Vol. 40, No. 3, pp. 77-87, March 1997.
[8] T. Kuik and P. Shoval, User Pro le Generation for Intelligent Information Agents-Research in Progress," Proc. of the Second Int''l Workshop Agent-
Oriented Information Systems, pp. 63-72, 2000.
[9] S. Melnik, S. Raghavan, B. Yang and H. Garcia-Molina, Building a Distributed Full-Text Index for the Web," ACM Trans. on Information Systems, Vol. 19, No. 3, pp. 217-241, July 2001.
[10] L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web," URL: http://google.stanford.edu/backrub/pageranksub.ps, 1998.
[11] Y. W. Park and E. S. Lee, A New Generation Method of an User Pro le for
Information Filtering on the Internet," Proc. of Int. Conf. on Data Engineering, pp. 337-347, 1994.
[12] J. S. Park, M. S. Chen and P. S. Yu, An EÆcient Hash-Based Algorithm for
Mining Association Rules," ACM SIGMOD Int. Conf. Management of Data, pp. 175-186, 1995.
[13] J. Picard and J. Savoy, Using Probailistic Argumentation Systems to Search and ClassifyWeb Sites," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, Vol. 24, No. 3, pp. 33-41, Sep. 2001.
[14] C. Shahabi, A. M. Zarkesh, J. Adibi and V. Shah, Knowledge Discovery from
Users Web-Page Navigation," Proc. of IEEE Workshop Research Issues in Data
Engineering, pp. 20-29, 1997.
[15] D. H. Widyantoro, T. R. Ioerger and J. Yen, An Adaptive Algorithm for Learning Changes in User Interests," Proc. of the 8th Int. Conf. on Information and Knowledge Management, pp. 405-412, 1999.
[16] Y. H. Wu and Arbee L. P. Chen, Index Structures of User Pro les for EÆ-
cient Web Page Filtering Services," Proc. of the 20th Int. Conf. on Distributed
Computing Systems, pp. 644-653, 2000.
[17] Y. H. Wu, Y. C. Chen and Arbee L. P. Chen, Enabling Personalized Recom-
mendation on the Web Based on User Interests and Behaviors," Proc. of IEEE
Workshop Research Issues in Data Engineering, pp. 17-24, 2001.
[18] Yahoo, URL: http://www.yahoo.com, 1999.
[19] T. W. Yan and H. Garcia-Molina, Distributed Selective Dissemination of Information," Proc. of Parallel and Distributed Information Systems, pp. 89-98, 1994.
[20] T. W. Yan and H. Garcia-Molina, Index Structures for Information Filtering Under the Vector Space Model," Proc. of Int. Conf. on Data Engineering, pp. 337-347, 1994.
[21] T. W. Yan and H. Garcia-Molina, Index Structures for Selective Dissemination of Information Under the Boolean Model," ACM Trans. on Database Systems, Vol. 19, No. 2, pp. 322-364, Nov. 1994.
[22] T. W. Yan and H. Garcia-Molina, SIFT{A Tool for Wide-Area Information Dissemination," Proc. of the 1995 USENIX Technical Conf., pp. 177-186, 1995.
[23] O. Zamir, Fast and Intuitive Clustering of Web Documents," Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data Mining, pp. 287-290, 1997.
[24] O. Zamir and O. Etzioni, Web Document Clustering: A Feasibility Demonstration," Proc. of the 21st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 46-54, 1998.
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top
系統版面圖檔 系統版面圖檔