跳到主要內容

臺灣博碩士論文加值系統

(35.172.136.29) 您好!臺灣時間:2021/08/02 17:45
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

我願授權國圖
: 
twitterline
研究生:張源峰
研究生(外文):Yuan-feng Chang
論文名稱:一個於大型資料庫中以滑動視窗來探勘最大項目集合的方法
論文名稱(外文):A Sliding-Window Approach to Mining Maximal Large Itemsets for Large Databases
指導教授:張玉盈
指導教授(外文):Ye-in Chang
學位類別:碩士
校院名稱:國立中山大學
系所名稱:資訊工程學系研究所
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2004
畢業學年度:92
語文別:英文
論文頁數:89
中文關鍵詞:資料探索關聯式法則漸增式探索分割最大代表集
外文關鍵詞:partitionIncremental MiningMaximal Large ItemsetsData MiningAssociation Rules
相關次數:
  • 被引用被引用:0
  • 點閱點閱:107
  • 評分評分:
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:0
探索關聯式法則,指的是如何從尚未分析的資料找出有用的資訊。而找出最大代表集(maximal large itemsets)為其進階工作,是為了找出能夠代表全部資訊的代表集合。以前用來找出最大代表集的方法可以分為兩類:窮舉法(exhausted)跟捷徑法(shortcut)。捷徑法會比窮舉法產生較少的項目集,且在時間與儲存空間上會有較佳的效能。在另一方面,當資料庫有更新動作時,可以重新執行整個探索的演算法。另一方法,則為漸增性探索(incremental mining)。這個方法可以有效率地更新有關關聯式法則的資訊,且不需要重新執行整個探索的演算法。然而,以前基於捷徑法的演算法無法支援最大代表集的漸增性探索。而用於漸增性探索的演算法,如SWF演算法,卻不能夠有效地支援最大代表集的探索,因為這些方法是基植於窮舉法。因此,在本篇論文中,我們著重在設計一個同時能在找出最大代表集與漸增性探索這兩方面都有良好效能的演算法。根據一些在最大代表集方面之觀察,例如,“一個項目集如果是大代表集的,那它的子項目集必定會是大代表集,所以這些子項目集不需要再被檢查”,我們提出一個以滑動視窗(Sliding-Window)來探索的方法,SWMax演算法,能同時有效地找出最大代表集及漸增性探索。我們的SWMax演算法是一個只需掃瞄兩次,且以分割資料庫為基礎的方法。我們將在掃瞄第一遍時找出所有大小分別為1與3的候選項目集,(C1, C3),還有大小分別是1與3的大代表集(L1, L3)。我們在掃瞄第一遍資料庫之後,會去生產虛擬的最大代表集。然後,我們利用L1去生產C2,用L3生產C4,用C4生產C5,直到Ck無法被生產出來為止。在第二次掃瞄資料庫時,我們對Ck做累計同時找出最大代表集。在漸增性探索方面,我們考慮兩種情況:(1)資料新增,(2)資料刪除。在這兩種情況下,當用SWF演算法,只要有大小為1的項目集在舊的資料庫無法成為候選項目而不被保留,那在更新之後的資料庫也決不會再出現,因此導致正確結果可能會被流失。也就是,可能會有遺漏的情況發生。因為SWF演算法只保留大小為2的項目集的資訊。而我們的SWMax演算法,則可以正確地進行漸增性探索。因為,我們保留了大小為1與3的項目集來支援更新資訊。在我們的效能測試中,我們會生產一些人造的資料庫,來模擬真實交易記錄的資料庫。從我們的模擬結果得知,我們的SWMax演算法可以比SWF演算法產生較少的項目集且需要較少的執行時間。
Mining association rules, means a process of nontrivial extraction of implicit,
previously and potentially useful information from data in databases. Mining maximal
large itemsets is a further work of mining association rules, which aims to find
the set of all subsets of large (frequent) itemsets that could be representative of all large
itemsets. Previous algorithms to mining maximal large itemsets can be classified into two approaches: exhausted and
shortcut. The shortcut approach could generate smaller number of
candidate itemsets than the exhausted approach,
resulting in better performance in terms of time and storage space.
On the other hand, when updates to the transaction databases occur,
one possible approach is to re-run the mining algorithm on the whole
database. The other approach is incremental mining, which aims for efficient maintenance of discovered association rules
without re-running the mining algorithms. However,
previous algorithms for mining maximal large itemsets based on the shortcut approach
can not support incremental mining for mining maximal large itemsets.
While the algorithms for incremental mining, {it e.g.}, the SWF
algorithm, could not efficiently support mining maximal large
itemsets, since it is based on the exhausted approach.
Therefore, in this thesis, we focus on the design of an
algorithm which could provide good performance for both mining maximal itemsets and incremental mining.
Based on some observations, for example, ``{it if an itemset is large, all its
subsets must be large; therefore, those subsets need not to be examined
further}", we propose a Sliding-Window approach, the SWMax algorithm, for
efficiently mining maximal large itemsets and incremental mining. Our
SWMax algorithm is a two-passes partition-based approach. We will find all candidate
1-itemsets ($C_1$), candidate 3-itemsets ($C_3$), large 1-itemsets ($L_1$),
and large 3-itemsets ($L_3$) in the first pass.
We generate the virtual maximal large itemsets after the first pass. Then, we use $L_1$ to generate $C_2$, use $L_3$
to generate $C_4$, use $C_4$ to generate $C_5$, until there is no
$C_k$ generated. In the second pass, we use the virtual maximal large itemsets to
prune $C_k$, and decide the maximal large itemsets.
For incremental mining, we consider two cases: (1)
data insertion, (2) data deletion. Both in Case 1 and Case 2, if an itemset
with size equal to 1 is not large in the original database, it could not be found in the
updated database based on the SWF algorithm. That is, a missing case
could occur in the incremental mining process of the SWF
algorithm, because the SWF algorithm only keeps the $C_2$ information.
While our SWMax algorithm could support incremental mining
correctly, since $C_1$ and $C_3$ are maintained in our algorithm.
We generate some synthetic databases to simulate the real transaction
databases in our simulation. From our simulation, the
results show that our SWMax algorithm could generate fewer number of candidates
and needs less time than the SWF algorithm.
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Mining Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Formal Problem Description . . . . . . . . . . . . . . . . . . . 2
1.1.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Mining Maximal Large Itemsets . . . . . . . . . . . . . . . . . . . . . 5
1.3 Incremental Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 14
2. A Survey of Algorithms for Mining Maximal Large Itemsets and
Incremental Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 The Apriori Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 The Partition Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 The Pincer-Search Algorithm . . . . . . . . . . . . . . . . . . . . . . 23
2.4 The Max-Miner Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 The Pattern Decomposition Algorithm . . . . . . . . . . . . . . . . . 27
2.6 The SWF Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3. The SWMax Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1 Interesting Observations for Mining Maximal Large Itemsets . . . . . 35
3.2 The SWMax Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Interesting Observations for Incremental Mining . . . . . . . . . . . . 56
3.4 Incremental Mining Process . . . . . . . . . . . . . . . . . . . . . . . 60
3.5 A Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.1 Generation of Synthetic Data . . . . . . . . . . . . . . . . . . . . . . 74
4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3 A Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A. The Counting Approach in the SWF Algorithm . . . . . . . . . . . 89
[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules,"
Proc. of the 20th Int. Conf. on Very Large Data Bases, 1994.
[2] R. J. Bayardo, Efficiently Mining Long Patterns from Databases," Proc. of Int.
Conf. on Data Eng., pp. 85{93, 1998.
[3] D. Burdick, M. Calimlim, and J. Gehrke, Ma a: A Maximal Frequent Item-
set Algorithm for Transactional Databases," Proc. of Int. Conf. on Data Eng.,
pp. 443{452, 2001.
[4] Y. I. Chang and Y. M. Hsieh, Setm*-Lmax: An Efficient SET-Based Approach
to Find Maximal Large Itemsets," Proc. of Int. Conf. on Computer Symposium:
Workshop on Software Eng. and Database Systems, 2002.
[5] M. S. Chen, J. Han, and P. S. Yu, Data Mining: An Overview from A Database
Perspective," IEEE Trans. on Knowledge and Data Eng., Vol. 8, No. 5, pp. 866{
882, Dec. 1996.
[6] D. W. Cheung, J. Han, V. T. Ng, and C. Y. Wong, Maintenance of Discovered
Association Rules in Large Databases: An Incremental Updating Technique,"
Proc. of the 12th IEEE Int. Conf. on Data Eng., 1996.
[7] D. W. Cheung, S. D. Lee, and B. Kao, A General Incremental Technique for
Maintaining Discovered Association Rules," Proc. on Database Systems for Ad-
vanced Applications, pp. 185{194, April 1997.
[8] U. M. Fayyad, G. P. Shapiro, P. Smyth, and R. Uthurusamy, Advances in
Knowledge Discovery and Data Mining," AAAI/MIT Press, 1996.
[9] Y. Fu, Data Mining: Tasks, Techniques, and Applications," IEEE Potentials,
Vol. 16, No. 4, pp. 18{20, 1997.
[10] C. H. Lee, C. R. Lin, and M. S. Chen, Sliding-Window Filtering: An E?
cient Algorithm for Incremental Mining," Proc. of the 10th ACM Int. Conf. on
Information and Knowledge Management, pp. 263{270, 2001.
[11] D. I. Lin and Z. M. Kedem, Pincer-Search: An Efficient Algorithm for Discov-
ering the Maximum Frequent Set," IEEE Trans. on Knowledge and Data Eng.,
Vol. 14, No. 3, May/June 2002.
[12] J. S. Park, M. S. Chen, and P. S. Yu, An E ective Hash-Based Algorithm for
Mining Association Rules," Proc. of the ACM SIGMOD Int. Conf. on Manage-
ment of Data, pp. 175{186, 1999.
[13] V. Pudi and J. R. Haritsa, Quantifying The Utility of The Past in Mining Large
Databases," Information Systems, Vol. 25, No. 5, pp. 323{343, April 2000.
[14] N. L. Sarda and N. V. Srinivas, An Adaptive Algorithm for Incremental Mining
of Association Rules," Proc. of the 9th Int. Workshop on Database and Expert
Systems Applications, pp. 240{246, 1998.
[15] A. Savasere, E. Omiecinski, and S. Navath, An Efficient Algorithm for Mining
Association Rules in Large Databases," Proc. of the 21st VLDB Conf., pp. 432{
444, 1995.
[16] A. Silbersrchatz, M. Stonebraker, and J. D. Ullman, Database Research:
Achievements and Opportunities into The 21st Century," Tech. Rep. 26{27, Re-
port NSF Workshop Future of Database Systems Research, May 1995.
[17] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka, An Efficient Algorithm for
The Incremental Updation of Association Rules in Large Databases," Proc. of
the 3rd Int. Conf. on Knowledge Discovery and Data Mining, pp. 263{266, 1997.
[18] P. S. M. Tsai, C. C. Lee, and A. L. P. Chen, An Efficient Approach for In-
cremental Association Rule Mining," Proc. of Paci c-Asia Conf. on Knowledge
Discovery and Data Mining, pp. 74{83, 1999.
[19] Y. J. Tsay and Y. W. Chang-Chien, An Efficient Cluster and Decomposition Al-
gorithm for Mining Association Rules," Information Sciences, Vol. 160, pp. 161{
171, Mar. 2004.
[20] M. J. Zaki and K. Gouda, Fast Vertical Mining Using Di sets," Proc. of the
9th Int. Conf. on Knowledge Discovery and Data Mining, pp. 24{27, 2003.
[21] M. J. Zaki and C. J. Hsiao, CHARM: An Efficient Algorithm for Closed Itemset
Mining," Proc. of the SIAM Int. Conf. on Data Mining, pp. 99{110, 2002.
[22] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, New Algorithms for
Fast Discovery of Association Rules," Proc. of the 3rd Int. Conf. on Knowledge
Discovery and Data Mining, pp. 283{286, 1997.
[23] M. Zhang, D. C. B. Kao, and C. L. Yip, Efficient Algorithms for Incremental
Update of Frequent Sequences," Proc. of the 6th Paci c-Asia Conf. on Knowl-
edge Discovery and Data Mining, pp. 186{197, 2002.
[24] Q. Zou, D. J. W. Chu, and H. Chiu, Pattern Decomposition Algorithm for
Data Mining of Frequent Patterns," Knowledge and Information Systems, Vol. 4,
No. 4, pp. 466{482, Oct. 2002.
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top