(100.26.179.251) 您好!臺灣時間:2021/04/15 17:38
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果

詳目顯示:::

我願授權國圖
: 
twitterline
研究生:林佩穎
研究生(外文):Pei-Ying Lin
論文名稱:一個以集合檢查來從資料串流中搜尋最大頻繁項目集的方法
論文名稱(外文):A Set-Checking Algorithm for Mining Maximal Frequent Itemsets from Data Streams
指導教授:張玉盈
指導教授(外文):Ye-In Chang
學位類別:碩士
校院名稱:國立中山大學
系所名稱:資訊工程學系研究所
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2011
畢業學年度:99
語文別:英文
論文頁數:79
中文關鍵詞:最大頻繁項目集晶格地標視窗模型項目集資料串流
外文關鍵詞:ItemsetLandmark Window ModelLatticeMaximal Frequent ItemsetData Stream
相關次數:
  • 被引用被引用:0
  • 點閱點閱:115
  • 評分評分:系統版面圖檔系統版面圖檔系統版面圖檔系統版面圖檔系統版面圖檔
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:0
在資料探勘的領域中,從資料串流(data stream)中找出最大頻繁項目集(maximal frequent itemset)在線上探勘是一個重要的問題。最大頻繁項目集就是每個項目集(itemset)出現的次數大於等於支持值且此項目集不會是任何一個項目集的子集或超集。傳統資料庫搜尋最大頻繁項目集的演算法無法應用在資料串流中是因為資料串流有以下幾個特性: (1) 連續 (2) 快速 (3) 無資料上限 (4) 即時 (5) 一次搜尋完成。資料串流的探勘有幾個新的挑戰,第一,因為資料串流是連續且沒有限制資料量大小所以我們無法將所有的資料存在主要的記憶體或第二存取空間。第二,傳統探勘資料的方法存取資料是採用多次搜尋,因為資料串流只能一次搜尋所以不適用。第三,為了維持資料快速抵達以及期望快速回報答案,所以探勘資料串流需要快速和即時。為了解決以地標視窗模型(landmark window model)為基礎來從資料串流中搜尋最大頻繁項目集,Mao等人提出INSATNT演算法。在地標視窗模型裡是從一開始的時間到目前的時間搜尋資料,使用地標視窗模型的好處是我們可以很精確的找出我們要的答案。INSTANT演算法因為資料結構簡單,所以可以節省許多記憶體空間,但是在搜尋最大頻繁項目集要花費很多時間。當新的交易進來,INSTANT演算法和舊的交易做比較的次數會很多。在本論文裡,我們提出Set-Checking的方法在資料串流中用地標視窗模型搜尋最大頻繁項目集。我們提出的Set-Checking演算法是以晶格(lattice)的結構去存取我們目前所知道的資訊,所謂的晶格結構就是存在一個子節點是父節點的子集合。在每一個節點,我們會記錄項目集和支持值。當新的交易進來時,我們會針對交易的特性考慮五種集合: (1) equivalent (2) superset (3) subset (4) intersection (5) empty set。根據晶格的結構和以上五種集合的概念,我們可以很快新增我們的交易以及更新支持值。從我們的模擬結果可得知,我們提出的Set-Checking演算法比INSTANT演算法執行時間還要快。
Online mining the maximal frequent itemsets over data streams is an important problem in data mining. The maximal frequent itemset is the itemset which the support is large or equal to the minimal support and the itemset is not the subset or superse of each itemset. Previous algorithms to mine the maximal frequent itemsets in the traditional database are not suitable for data streams. Because data streams have some characteristics: (1) continuous (2) fast (3) no data limit (4) real time (5) searching once, mining data streams have many new challenges. First, they are unrealistic to keep the entire stream in the main memory or even in a secondary storage area, since a data stream comes continuously and the amount of data is unbounded. Second, traditional methods of mining on stored datasets by multiple scans are infeasible, since the streaming data is passed only once. Third, mining streams requires fast, real-time processing in order to keep up with the high data arrival rate and mining results are expected to be available within short response time. In order to solve mining maximal frequent itemsets from data streams using the landmark window model, Mao et. al. propose the INSTANT algorithm. In the landmark window model, knowledge discovery is performed based on the values between the beginning time and the present. The advantage of using the landmark window model is that the results are correct as compared to the other models. The structure of the INSTANT algorithm is simple and it can save many memory space. But it takes long time in mining the maximal frequent itemsets. When the new transactions comes, the number of comparisons between the old transactions of INSATNT algorithm is too much. In this thesis, we propose the Set-Checking algorithm to mine frequent itemsets from data streams using the landmark window model. We use the structure of lattice to store our information. The structure of lattice records the subset relationship between the child node and the father node. For every node, we can record the itemset and the support. When the new transaction comes, we consider five relations: (1) equivalent (2) superset (3) subset (4) intersection (5) empty relations. According to the lattice structure of the five sets , we can add the transaction and the renew support efficiently. From our simulation result, we find that the process time of our Set-Checking algorithm is faster than that of the INSTANT algorithm.
LIST OF FIGURES.............................................................iii
LIST OF TABLES................................................................vii
1. Intrduction........................................................................1
1.1 Mining Association Rules........................................2
1.2 Data Stream Mining...................................................2
1.2.1 Applications.........................................................2
1.2.2 Window Models in Data Streams..................4
1.3 Mining Maximal Frequent Itemsets in Data Streams...5
1.4 Related Work...9
1.5 Motivations.................................................................10
1.6 Organization of the Thesis......................................11
2. A Survey of Data Mining Algorithm.............................13
2.1 The Apriori Algoruthm for Association Rules Mining...13
2.2 The BBM Algorithm for Maximal Frequent Itemsets Mining...15
2.3 The MFI-TransSW ALgorithm for Frequent Itemsets Mining in Data Streams...18
2.4 Maximal Frequent Itemsets Mining in Data Streams...20
2.4.1 The INSATNT Algorithm.................................21
2.4.2 The DSM-MFI Algorithm.................................23
3. The Set-Checking Algorithm.......................................25
3.1 Observations for Mining Maximal Frequent Itemsets...25
3.2 The Proposed Algorithm.........................................26
3.2.1 Data structure..................................................31
3.2.2 Data Initialization.............................................32
3.2.3 Data Insertion..................................................33
3.2.4 The Maximal Frequent Itemset.....................40
3.2.5 A Comparison..................................................40
4. Performance
4.1 The Simulation Model.............................................50
4.2 Experiment Results.................................................51
5. Conclusion....................................................................59
5.1 Summary...................................................................59
5.2 Future Work...............................................................60
BIBLIOGRAPHY.................................................................61
[1] R. Agrawal and R. Srikant, "Fast Algorithm for Mining Association Rules in large databases," Proc. of the 20th Int. Conf. in Very Large Data Bases, pp. 490-501, 1994.
[2] T. Calders, N. Dexters, and B. Goethals, “Mining Frequent Itemsets in a Stream,” Proc. of IEEE Int. Conf. on Data Mining, pp. 83-92, 2007.
[3] J.H. Chang and W. S. Lee, “A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Stream,” Journal of Information Science and Engineering, Vol. 4, No. 11, pp. 753-762, July 2004.
[4] M. S. Chen, J. Han and P. S. Yu, “Data Mining: An Overview from A Database Perspective,” IEEE Trans. On Knowledge and Data Eng., Vol. 8, No. 5, pp. 866-882, Dec. 1996.
[5] Y. Chi, H. Wung, P. S. Yu, and R. R. Muntz, “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window,” Knowledge and Information Systems, pp. 59-66, 2004.
[6] Y. Chi, H. Wung, P. S. Yu, and R. R. Muntz, “Catch the Moment: Maintaining Closed Frequent Itemsets over a Data Stream Sliding Window,” Proc. of the 4th IEEE Int. Conf. on Data Mining, pp. 265-294, 2006.
[7] B. Datar and R. Widom, “Models and Issues in Data Stream Systems,” Proc. of 2002 ACM Symp. On Principles of Database Systems, Citeseer, 2002.
[8] M. M. Gaber, A Zaslavsky, and S. Krishnaswamy, “Mining Data Streams” a Review,” ACM SIGMOD Record, Vol.34, No. 2, pp. 18-26, June 2005.
[9] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu, “Mining Frequent Patterns in Data Streams at Multiple Time Granularities,” Nest Gnerartion Data Mining, Vol. 212, pp. 191-212, 2003.
[10] N. Jiang and L. Gruenwald, “Research Issues in Data Stream Association Rule Mining,” ACM SIGMOD Record, Vol.35, No. 1, pp. 14-19, March 2006.
[11] N. Jiang and L. Gruenwald, “CFI-stream: Mining Closed Frequent Itemsets in Data Streams,” Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 592-597, 2006.
[12] J. L. Koh and S. N. Shin, “An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams,” Computer Science Data Warehousing and Knowledge Discovery, Vol. 4081, No. 1, 99. 352-362, Sept. 2006.
[13] C.-H. Lee, C.-R. Lin and M.-S. Chen, “Sliding Window Filtering: An Efficient Method for Incremental Mining on a Time-Variant Database,” Information Systems, pp. 227-244, 2005.
[14] W. Lee, C. T. Park, and S. J. Stolfo, “Automated Intrusion Detection Using NFR: Methods and Experiences,” Proc. of the 1st conference on Workshop on Intrusion Detection and Network Monitoring, pp. 63-72, 1999.
[15] H. F. Li, S. Y. Lee, and M. K. Shan, “An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams,” Proc. of Int. Conf. on Principals and Practice of Knowledge Discovery in Databases, pp. 20-24, 2004.
[16] H. F. Li and S. Y. Lee, “Mining Frequent Itemsets over Data Streams Using Efficient Window Sliding Techniques,” Expert Systems with Applications, Vol. 36, pp. 1466-1477, March 2009.
[17] H. F. Li, S. Y. Lee, and M. K. Shan, “Online Mining (Recently) Maximal Frequent Itemsets from Data Streams,” In Proc. of the IEEE RIDE, 2005.
[18] C. H. Lin, D. Y. Chiu, and A. L. P. Chen, “Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window,” Proc. of SIAM Int. Conf. on Data Mining, pp. 648-660, 2005.
[19] G. Mao, X. Wu, X. Zhu, G. Chen, and C. Liu, ”Mining Maximal Frequent Itemsets From Data Streams,” Journal of Information Science, Vol. 33, No. 3, pp. 251-262, March 2007.
[20] J. Srivastava, R. Cooley, M. Deshpande, and P. N. Tan, “Web Usage Mining: Discovery and Applications of Usage Patterns From Web Data,” ACM SIGMOD Explorations Newsletter, Vol. 1, No. 2, pp.12-23, Jan. 2000.
[21] Y. Tao and D. Papadias, “Maintaining Sliding Window Skylines on Data Stream,” Journal of Information Science and Engineering, Vol. 18, No. 3, pp. 377-391, March 2006.
[22] J.-W. Xin, G.-Q. Yang, J.-Z Sun, and Y.-P. Zhang, “A New Algorithm for Discovery Maximal Frequent Itemsets Based on Binary Vector Sets,” Machine Learning and Cybernetics, pp. 1120-1124, 2006.
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top
系統版面圖檔 系統版面圖檔