
National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)


Detailed Record

Author: 郭芳甄
Author (English): Fang-Chen Kuo
Title: 從雜訊資料中探勘頻繁項目集
Title (English): Mining Approximate Frequent Itemsets from Noisy Data
Advisor: 趙景明
Advisor (English): C. M. Chao
Degree: Master's
Institution: 東吳大學 (Soochow University)
Department: 資訊科學系 (Department of Computer Science)
Discipline: Engineering
Field: Electrical and Computer Engineering
Document type: Academic thesis
Year of publication: 2008
Graduation academic year: 96 (2007-08)
Language: Chinese
Pages: 63
Keywords (Chinese): 資料探勘、關聯規則、頻繁項目集探勘、相似性頻繁項目集、雜訊資料
Keywords (English): Data Mining, Association Rules, Frequent Itemsets Mining, Approximate Frequent Itemsets, Noisy Data
Statistics:
  • Cited by: 1
  • Views: 340
  • Downloads: 39
  • Bookmarked: 1
In the field of data mining, mining frequent itemsets is an important topic: it reveals how data are distributed and uncovers the frequent patterns hidden in the data. The mining results can serve as input to other analysis systems; in addition, businesses can use them to analyze customer behavior and refine marketing strategies, thereby increasing sales. Traditional frequent itemset mining, however, uses an exact mining model, which is ill-suited to real data, because real data almost always contain noise. Mining noisy data in exact mode cannot produce the true frequent itemsets, and a business acting on incorrect mining results may make wrong decisions that ultimately do serious harm to its interests.
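The abstract's point that exact-mode mining breaks down under noise can be illustrated with a minimal sketch (plain Python with hypothetical data; this is not the thesis's code):

```python
from itertools import combinations  # handy for enumerating candidate itemsets

def exact_support(transactions, itemset):
    """Exact-mode support: fraction of transactions containing EVERY item."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# A pattern {a, b, c} truly bought together in 8 of 10 baskets...
clean = [frozenset("abc")] * 8 + [frozenset("x"), frozenset("y")]
# ...but noise drops one item from half of those baskets (simulated below).
noisy = [frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("ab")] \
        + [frozenset("abc")] * 4 + [frozenset("x"), frozenset("y")]

print(exact_support(clean, frozenset("abc")))  # 0.8 -> frequent at minsup 0.5
print(exact_support(noisy, frozenset("abc")))  # 0.4 -> missed by exact mining
```

With a minimum support of 0.5, the exact mode reports {a, b, c} on the clean data but misses it on the noisy data, even though the underlying buying pattern is unchanged; this is the failure mode that approximate (noise-tolerant) mining addresses.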
In recent years, researchers have studied how to extract frequent itemsets from noisy data, but when the data form a sparse matrix, the mining time of their methods grows sharply, so those methods are not suitable for all kinds of datasets. To address this shortcoming, this thesis proposes a new approximate frequent itemset mining method called TAFI (Trie Approximate Frequent Itemsets). TAFI replaces the original transaction database with a reduced database (reduced basket), which eliminates the sparse-matrix problem while greatly reducing the storage needed during mining and speeding up support counting. In addition, TAFI adopts a Trie data structure to efficiently accelerate the discovery of frequent itemsets, and builds the Trie with items ordered by occurrence frequency, which reduces the number of candidate itemsets. Experimental results show that TAFI outperforms the other algorithms and maintains good efficiency under different kinds of noisy data.
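As a rough illustration of the Trie idea mentioned above (re-coding items in descending frequency order so transactions merge into shared prefixes, and reading support counts off the tree), here is a minimal sketch in plain Python. It is not the thesis's TAFI implementation; all names and data are hypothetical, and TAFI's noise tolerance is not modeled.

```python
from collections import Counter

class Node:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}  # item -> Node
        self.count = 0      # number of transactions passing through this node

def build_trie(transactions):
    # Global frequency ordering: frequent items first, so common prefixes merge
    # and each shared prefix is stored (and counted) only once.
    freq = Counter(item for t in transactions for item in t)
    root = Node()
    for t in transactions:
        node = root
        for item in sorted(t, key=lambda i: (-freq[i], i)):
            node = node.children.setdefault(item, Node())
            node.count += 1
    return root

def prefix_support(root, prefix):
    """Support count of an itemset that forms a path from the root
    (items given in the same frequency order), read off in one walk
    instead of a full database scan."""
    node = root
    for item in prefix:
        if item not in node.children:
            return 0
        node = node.children[item]
    return node.count

txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
root = build_trie(txns)
print(prefix_support(root, ["a", "b"]))  # 3: {a,b} occurs in three baskets
```

Supports of itemsets that form a root path can be read off directly; general itemsets would require a traversal over subtrees, which is where frequency ordering helps by keeping common items near the root.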
To discover association rules, frequent itemset mining finds items that frequently appear together in a dataset, and it is often the first step in the analysis of data arising in a broad range of applications. Moreover, industry can use the mining results to improve marketing strategy and profitability. Traditional frequent itemset mining operates in an "exact" mode. However, exact-mode mining is not appropriate for real data: mining noisy data in exact mode cannot generate the correct frequent itemsets, and may eventually lead to incorrect decisions.
In recent years, many researchers have studied how to discover frequent itemsets from noisy data. However, existing methods can become inefficient when the dataset is sparse, so they cannot be applied to all kinds of datasets. In this thesis, we propose a new algorithm, called the TAFI algorithm, for mining approximate frequent itemsets. The TAFI algorithm not only correctly and efficiently discovers approximate frequent itemsets from noisy data, but also performs well on sparse datasets.
Acknowledgements i
Abstract (Chinese) ii
Abstract iii
Table of Contents iv
List of Figures vii
List of Tables x
1. Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Objectives 3
1.4 Research Process 4
1.5 Thesis Organization 5
2. Literature Review 6
2.1 Definition of Data Mining 6
2.1.1 Knowledge Discovery and Data Mining 7
2.1.2 Types of Data Mining 9
2.1.3 Data Mining Methods 12
2.2 Association Rules 16
2.2.1 The Two Parameters of Association Rules 18
2.2.2 Discussion of Association Rules 18
2.2.3 A Related Algorithm: Apriori 19
2.3 Algorithms for Mining Approximate Frequent Itemsets 23
2.3.1 The ETI Algorithm 23
2.3.2 The FT-Apriori Algorithm 25
2.3.3 The AFI Algorithm 26
2.3.4 The AC-Close Algorithm 27
3. Approximate Frequent Itemset Mining 30
3.1 Problem Description 30
3.2 An Approximate Frequent Itemset Mining Algorithm 33
3.2.1 Noise-Tolerant Support Pruning 33
3.2.2 0/1 Extension 35
4. Research Method 37
4.1 Reduced Transaction Database 37
4.2 The Trie Data Structure 38
4.3 Item Recoding by Occurrence Frequency 39
4.4 Trie Construction and Candidate Itemset Generation 40
4.6 Pseudocode and Flowchart of the TAFI Algorithm 45
5. Performance Evaluation 48
5.1 Experimental Environment 48
5.2 Database Parameters and Experimental Design 48
5.3 Experimental Analysis 50
5.3.1 Experiment 1: TAFI vs. AFI on dataset A under different support thresholds 50
5.3.2 Experiment 2: TAFI vs. AFI on dataset A under different noise tolerances, with equal row and column tolerances 51
5.3.3 Experiment 3: Running time of TAFI on dataset A under different support thresholds and unequal row and column noise tolerances 52
5.3.4 Experiment 4: TAFI vs. AFI on dataset A with different numbers of transactions 53
5.3.5 Experiment 5: TAFI vs. AFI on dataset A with different numbers of items 54
5.3.6 Experiment 6: TAFI vs. AFI on dataset A with different average transaction lengths 55
5.3.7 Experiment 7: TAFI vs. AFI on dataset B under different support thresholds 56
5.3.8 Experiment 8: Running time of TAFI on dataset B under different support thresholds and unequal row and column noise tolerances 57
6. Conclusions and Future Work 58
6.1 Conclusions 58
6.2 Future Work 58
References 59
[1]Agrawal, R. and Srikant, R., “Fast Algorithms for Mining Association Rules”, In Proceedings of the 20th International Conference on Very Large Data Bases, pp.487-499, 1994.
[2]Agrawal, R. and Srikant, R., “Mining Sequential Patterns”, In Proceedings of the Int’l Conference on Data Engineering, 1995.
[3]Agrawal, R., Imielinski, T. and Swami, A., “Mining Association Rules Between Sets of Items in Large Databases”, In Proceedings of the ACM SIGMOD Conference on Management of Data, pp.207-216, 1993.
[4]Alsabti, K., Ranka, S. and Singh, V., “An Efficient K-Means Clustering Algorithm”, IPPS/SPDP Workshop on High Performance Data Mining, 1997.
[5]Bayardo, R. J., “Efficiently Mining Long Patterns from Databases”, In Proceedings of the ACM SIGMOD Conference on Management of Data, pp.85-93, 1998.
[6]Brin, S., Motwani, R., Ullman, J. D. and Tsur, S., “Dynamic Itemset Counting and Implication Rules for Market Basket Data”, ACM SIGMOD Conference on Management of Data, pp.255-264, 1997.
[7]Breiman, L., Friedman, J., Olshen, R. and Stone, C., Classification and Regression Trees, Wadsworth, 1984.
[8]Brin, S., Motwani, R. and Silverstein, C., “Beyond Market Baskets: Generalizing Association Rules to Correlations”, ACM SIGMOD Conference on Management of Data, pp.265-276, 1997.
[9]Chen, M. S., Han, J. and Yu, P. S., “Data Mining: An Overview from a Database Perspective”, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
[10]Cheng, H., Yu, P. S. and Han, J., “AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery”, In Proceedings of the 6th IEEE International Conference on Data Mining, pp.839-844, 2006.
[11]Comer, D. and Sethi, R., “The Complexity of Trie Index Construction”, Journal of the ACM (JACM), Vol.24, No.3, pp. 428–440, 1977.
[12]Curt, H., “The Devil's in The Detail: Techniques, Tools, and Applications for Data Mining and Knowledge Discovery - Part 1”, Intelligent Software Strategies, Vol. 6, No. 9, pp.3, 1995.
[13]Elder, J. and Pregibon, D., “A Statistical Perspective on Knowledge Discovery in Databases”, Advances in Knowledge Discovery and Data Mining, pp.83-115, 1996.
[14]Han, J., Pei, J., Yin, Y. and Mao, R., “Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach”, Data Mining and Knowledge Discovery, Vol. 8, No. 1. , pp. 53-87, 2004.
[15]Han, J. and Kamber, M., Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufman, 2006.
[16]Hipp, J., Myka, A., Wirth, R. and Guntzer, U., “A New Algorithm for Faster Mining of Generalized Association Rules”, In Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, 1998.
[17]Kaufman, L. and Rousseeuw, P.J., Finding Groups in Data : An Introduction to Cluster Analysis, 1990.
[18]Knuth, D.E., The Art of Computer Programming, Vol. 3, 3rd Edition, Addison-Wesley, 2004.
[19]Liu, J., Paulsen, S., Wang, W., Nobel, A. and Prins, J., “Mining Approximate Frequent Itemsets from Noisy Data”, In Proceedings of the 5th IEEE International Conference on Data Mining, pp.721-724, 2005.
[20]Liu, J., Paulsen, S., Sun, X., Wang, W., Nobel, A. and Prins, J., “Mining approximate frequent itemsets in the presence of noise: Algorithm and analysis”, In Proceedings of SIAM International Conference on Data Mining, 2006.
[21]Mannila, H. and Ronkainen, P., “Similarity of Event Sequences”, In Proceedings of the Fourth International Workshop on Temporal Representation and Reasoning (TIME’97) pp.136-139, 1997.
[22]Chen, M. S., Han, J. and Yu, P. S., “Data Mining: An Overview from a Database Perspective”, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, 1996.
[23]Zaki, M. J., “Scalable Algorithms for Association Mining”, IEEE Transactions On Knowledge and Data Engineering, Vol.12, Issue 3, pp.372-390, 2000.
[24]Ng, R. and Han, J., “Efficient and Effective Clustering Methods for Spatial Data Mining”, In Proceedings of the Int’l Conference on Very Large Data Bases, pp.144-155, 1994.
[25]Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[26]Quinlan, J. R., “Induction of Decision Trees”, Machine Learning, pp.81-106, 1986.
[27]Park, J. S., Chen, M. S. and Yu, P. S., “An Effective Hash Based Algorithm for Mining Association Rules”, In Proceedings of the ACM SIGMOD, pp.175-186, 1995.
[28]Pawlak, Z., Wong, S. K. M. and Ziarko, W., “Rough Sets: Probabilistic versus Deterministic Approach”, International Journal of Man-Machine Studies, pp.81-95, 1988.
[29]Pei, J., Tung, A. K. and Han, J., “Fault-tolerant frequent pattern mining: Problems and challenges”, In Proceedings of the 2001 ACM-SIGMOD, 2001.
[30]Savasere, A., Omiecinski, E. and Navathe, S., “An Efficient Algorithm for Mining Association Rules in Large Databases”, In Proceedings of 21st VLDB, pp.432-444, 1995.
[31]Srikant, R. and Agrawal, R., “Mining Generalized Association Rules”, In Proceedings of the 21st Int’l Conference on VLDB, pp.407-419, 1995.
[32]Sun, X. and Nobel, A. B., “Significance and Recovery of Block Structures in Binary Matrices with Noise”, In Proceedings of the 19th Annual Conference on Learning Theory, 2006.
[33]Yang, C., Fayyad, U. and Bradley, P. S., “Efficient Discovery of Error-tolerant Frequent Itemsets in High Dimensions”, In Proceedings of Knowledge Discovery and Data Mining, pp.194-203, 2001.
[34]Zaki, M. J., Parthasarathy, S., Ogihara, M. and Li, W., “New Algorithms for Fast Discovery of Association Rules”, 3rd Int’l Conference on Knowledge Discovery & Data Mining, Newport Beach, CA, August 1997.
[35]IBM Almaden Research Center, “Quest Synthetic Data Generation Code”, From: http://www.almaden.ibm.com/cs/quest/syndata.html.
[36]UCI Machine Learning Repository, “Chess (King-Rook vs. King) Data Set”, From: http://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King%29.
[37]Fayyad, U. and Smyth, P., “From Data Mining to Knowledge Discovery: An Overview”, Advances in Knowledge Discovery and Data Mining, pp.1-36, 1996.
[38]Weiss, S. M. and Kulikowski, C. A., Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems, Morgan Kaufmann, 1991.