跳到主要內容

臺灣博碩士論文加值系統

(216.73.217.165) 您好!臺灣時間:2026/05/20 07:12
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

我願授權國圖
: 
twitterline
研究生:王濬哲
研究生(外文):Wang, Jun-Zhe
論文名稱:高效益循序資料樣式探勘之研究
論文名稱(外文):A Study on Efficient Mining of High Utility Sequential Patterns
指導教授:黃俊龍黃俊龍引用關係陳以錚
指導教授(外文):Huang, Jiun-LongChen, Yi-Cheng
口試委員:彭文志陳銘憲陳以錚謝孫源曾新穆張嘉惠洪宗貝黃俊龍
口試委員(外文):Peng, Wen-ChihChen, Ming-SyanChen, Yi-ChengHsieh, Sun-YuanTseng, Vincent S.Chang, Chia-HuiHong, Tzung-PeiHuang, Jiun-Long
口試日期:2016-10-13
學位類別:博士
校院名稱:國立交通大學
系所名稱:資訊科學與工程研究所
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2016
畢業學年度:105
語文別:英文
論文頁數:96
中文關鍵詞:高效益資料探勘高效益循序資料探勘前k高效益循序資料探勘漸進式高效益循序資料探勘
外文關鍵詞:High utility sequential patternhigh utility sequential pattern miningtop-k high utility sequential pattern miningincremental high utility sequential pattern mining
相關次數:
  • 被引用被引用:0
  • 點閱點閱:498
  • 評分評分:
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:0
循序資料樣式探勘(Sequential Pattern Mining)是資料探勘研究中一個非常重要且根本的研究議題,過去在這個議題上,有許多演算法被設計,目的是在資料庫中找出所有常出現的循序資料樣式,而這個議題亦在現實生活中有許多廣泛的應用。然而,常出現的循序資料樣式可能對使用者或企業較不具有價值,相對的,較罕見的高報酬,高成本,高評價,高風險 等具獨特性的循序資料樣式可能較為使用者或企業所重視。

有鑑於此,近年來,高效益循序資料樣式探勘(High Utility Sequential Pattern Mining)議題被提出,其目的是在序列資料庫中,找出高效益的循序資料樣式。有別於傳統的循序資料樣式探勘,在序列資料庫中,每個項目(item)會有一個效益值,可能代表利潤,成本...等可被量化的特性,表示使用者對此項目的重視程度。然而此問題並不容易解決,主要是由於(1)序列的效益值(sequence utility)不服從downward closure property,以及(2)某序列在其超序列中,可能含有多個案例,每個案例都會有其效益值。針對(1),已有序列加權效益(Sequence Weight Utility)被提出並且被應用到少數演算法中,進行搜尋空間中,候選序列的過濾(pruning),然而由於一個序列的序列加權效益通常遠大於其與其超序列的序列效益值,可能會造成大量不具高效益的序列被產生出,也因而造成演算法效能低落。然而針對(2),現今所有演算法都是將所有案例的效益值記下,如此一來,在探勘過程,可能會累積龐大的案例與效益值對,消耗龐大的系統記憶體資源。為了解決上述兩個問題,本論文在第一個研究主題上,提出了Prefix Extension Utility (PEU)與Reduced Sequence Utility (RSU)兩個較小的序列效益值上限來解決問題(1), 並且提出了Utility-Chain的資料結構來解決問題(2),同時可以加快效益值與序列效益上限的計算, 重要的是,我們結合了兩個序列效益上限與Utility-Chain,設計出HUS-Span演算法來探勘高效益循序資料樣式。實驗結果顯示,HUS-Span較既有方法在效能與記憶體資源使用上佔有優勢。

此外,由於設定門檻值對於較不具經驗的使用者,可能會造成困擾,而設定太高或太低的門檻,均可能會造成過少或過多的循序資料樣式產生。有鑑於此,已有Top-K高效益循序資料樣式探勘議題被提出,而在此問題上,既有方法基本上採用深度優先搜尋策略(Depth-First Search, DFS)來探勘Top-k樣式, 雖然深度優先搜尋較省記憶體,但其可能無法在探勘的過程中,較快找出Top-K樣式。有鑑於此,本論文在此議題上,提出了一個最佳優先搜尋策略(Best-First Search, BFS)的演算法,稱為TKHUS-SpanBFS,其結合PEU,RSU與Utility-Chain,並且永遠往搜尋空間中,所有已被產生但未被探索過,具有最大PEU的序列探索。實驗結果顯示,此演算法對於DFS演算法,具有相當顯著的效能優勢, 然而,最佳優先搜尋演算法可能會耗費大量的記憶體資源,因此,本論文在限制BFS的序列數目下,提出一個結合BFS與DFS的演算法,稱為TKHUS-SpanHybrid,以取得記憶資源與效能之間的平衡。

最後, 由於在現實生活中,新的資料往往隨時輸入資料庫。然而,每次資料庫更新後,便重新探勘資料庫以找出高效益循序資料樣式,尤其在更新的序列數目不多時,會顯得非常沒效率。因此,本論文的最後一個研究主題針對隨時累積的序列資料庫,進行漸進式探勘。為了減少多餘的重複運算,同時避免耗盡記憶體空間,我們提出了一個更小的序列效益上限:Tight Sequence Utility (TSU),來緩衝前次探勘結果。確切地說,我們設計一個樹狀結構緩衝舊資料庫中的高TSU序列,其優點是能減少緩衝的序列數目,並且結合Utility-Chain來設計每個序列所需儲存的資訊。針對需要更新的序列,我們設計了較有效率的節點更新方式,包含效益與效益上限等。同時,為了加速探勘效率,我們設計了資料庫掃瞄範圍減少策略,並且更新樹上相對應的節點資訊,以便支援後續的資料庫更新。結合上述幾種策略,我們提出了IncUsp-Miner演算法,在序列資料庫中進行漸進探勘高效益循序資料樣式。實驗結果顯示,在更新序列數目為原資料庫序列數目一半以下的情況下,IncUsp-Miner演算法可以有效率地探勘高效益循序資料樣式。
Sequential pattern mining is a fundamental research issue in data mining, which aims to find out all of the frequent subsequences in a sequence database. So far, many algorithms have been proposed to address this problem, and it has wide real world applications. However, frequent sequential patterns may not be informative to some users or business. By contrast, they may be interested in some rare patterns with high revenue, cost, etc.

In view of this, the concept of high utility sequential pattern mining has been introduced recently, which refers to identify sequences with high utilities (e.g., profits) but probably with low frequencies in a sequence database. To identify high utility sequential patterns, due to lack of downward closure property in this problem, most existing algorithms first generate candidate sequences with high Sequence-Weighted Utilities (SWUs), which is an upper bound of the utilities of a sequence and all its supersequences, and then calculate the actual utilities of these candidates. This causes a large number of candidates generated since SWU is usually much larger than the real utilities of a sequence and all its supersequences. In view of this, we propose two tight utility upper bounds, Prefix Extension Utility (PEU) and Reduced Sequence Utility (RSU), as well as two companion pruning strategies, and devise HUS-Span algorithm to identify high utility sequential patterns by employing these two pruning strategies. Experimental results on some real and synthetic datasets show that HUS-Span is able to generate less candidate sequences, and thus outperforms other prior algorithms in terms of mining efficiency.

In addition, since setting a proper utility threshold is usually difficult for users, we also propose algorithm TKHUS-Span to identify top-$k$ high utility sequential patterns by using these two pruning strategies. Three searching strategies, guided Depth-First Search (GDFS), Best-First Search (BFS), and hybrid search of BFS and GDFS, are also proposed to improve the efficiency of TKHUS-Span. Experimental results on some real and synthetic datasets show that TKHUS-Span with strategy BFS is able to explore less candidate sequences, and thus outperforms other prior algorithms in terms of mining efficiency.

In practice, most sequence databases usually grow over time, and it is inefficient for existing algorithms to mine HUSPs from scratch when databases grow with a small portion of updates. In view of this, we propose the IncUSP-Miner algorithm to mine HUSPs incrementally. Specifically, to avoid redundant re-computations, we propose a tighter upper bound of the utility of a sequence, called Tight Sequence Utility (TSU), and then design a novel data structure, called the candidate pattern tree, to buffer the sequences whose TSU values are greater than or equal to the minimum utility threshold in the original database. Accordingly, to avoid keeping a huge amount of utility information for each sequence, a set of concise utility information is designed to be kept in each tree node. Moreover, several strategies are also proposed to reduce the amount of computation for utility update and the scopes of database scans, thereby improving the mining efficiency. Experimental results on some real and synthetic datasets show that IncUSP-Miner is able to efficiently mine high utility sequential patterns incrementally.
摘要i
Abstract iii
誌謝v
Contents vii
List of Figures x
List of Tables xii
1 Introduction 1
2 Related Work 5
2.1 High Utility Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 High Utility Itemset Mining . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 High Utility Sequential Pattern Mining . . . . . . . . . . . . . . . . 6
2.2 Top-K High Utility Pattern Mining . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Top-K High Utility Itemset Mining . . . . . . . . . . . . . . . . . . 7
2.2.2 Top-K High Utility Sequential Pattern Mining . . . . . . . . . . . . 8
2.3 Incremental Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Incremental Mining of High Utility Itemsets . . . . . . . . . . . . . 9
2.3.2 Incremental Mining of Frequent Sequential Patterns . . . . . . . . . 9
3 High Utility Sequential Pattern Mining 11
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 HUS-Span: The Proposed Algorithm for Mining High Utility Sequential Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.1 Lexicographic Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.2 Utility-Chain Structure . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.2.1 Generating Utility-Chains of 1-Sequences . . . . . . . . . . 20
3.3.2.2 Generating Utility-Chains of l-Sequences (l≥2) . . . . . . 21
3.3.3 The Proposed Pruning Strategies . . . . . . . . . . . . . . . . . . . 24
3.3.3.1 The Prefix Extension Utility Strategy . . . . . . . . . . . 24
3.3.3.2 The Reduced Sequence Utility Strategy . . . . . . . . . . 26
3.3.4 HUS-Span Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.1 Experimental Results of High Utility Sequential Pattern Mining . . 32
3.4.1.1 Effect of Utility Threshold . . . . . . . . . . . . . . . . . . 32
3.4.1.2 Effect of Database Size . . . . . . . . . . . . . . . . . . . . 36
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Top-K High Utility Sequential Pattern Mining 38
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 TKHUS-Span: The Proposed Algorithm for Mining Top-k High Utility Sequential Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.1 TKHUS-Span with Guided Depth-First Search Strategy . . . . . . 41
4.3.2 TKHUS-Span with Best-First Search Strategy . . . . . . . . . . . . 43
4.3.3 TKHUS-Span with Hybrid Search Strategy . . . . . . . . . . . . . . 47
4.4 Experimental Results of Top-k High Utility Sequential Pattern Mining . . 48
4.4.1 Average Related Frequency of Top-k High Utility Sequential Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4.2 Effect of the Maximal Size of BFS Queue . . . . . . . . . . . . . . . 50
4.4.3 Effect of k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.4 Effect of Database Size . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5 Incremental Mining of High Utility Sequential Patterns 57
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3 Preliminary Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.1 Utility Upper Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.1.1 The Tight Sequence Utility . . . . . . . . . . . . . . . . . 62
5.4 The Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.4.2 Initial Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.2.1 Buffering Sequences in the Candidate Pattern Tree . . . . 71
5.4.2.2 USP-Miner Algorithm . . . . . . . . . . . . . . . . . . . . 72
5.4.3 Incremental Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.3.1 Tree Node Update . . . . . . . . . . . . . . . . . . . . . . 74
5.4.3.2 Node Skipping Strategy . . . . . . . . . . . . . . . . . . . 75
5.4.3.3 Database Scan Reduction Strategies . . . . . . . . . . . . 76
5.4.3.4 Supporting Multiple Database Updates . . . . . . . . . . . 77
5.4.3.5 IncUSP-Miner Algorithm . . . . . . . . . . . . . . . . . . 80
5.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.5.1 Effect of Utility Threshold . . . . . . . . . . . . . . . . . . . . . . . 83
5.5.2 Effects of Update Ratio . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5.3 Effects of Database Size . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5.4 The Effects of the Number of Distinct Items . . . . . . . . . . . . . 89
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6 Conclusions and Future Work 91
Bibliography 94
[1] J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, "Sequential pattern mining using a bitmap representation," in Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining, 2002, pp. 429-435.
[2] M. N. Garofalakis, R. Rastogi, and K. Shim, "Spirit: Sequential pattern mining with regular expression constraints," in Proceedings of the 25th International Conference on Very Large Data Bases, 1999, pp. 223-234.
[3] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000, pp. 1-12.
[4] C. Kim, J.-H. Lim, R. T. Ng, and K. Shim, "Squire: Sequential pattern mining with quantities," Journal of Systems and Software, vol. 80, no. 10, pp. 1726-1745, 2007.
[5] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "Mining sequential patterns by pattern-growth: the prefixspan approach," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1424-1440, 2004.
[6] R. Srikant and R. Agrawal., "Mining sequential patterns: Generalizations and performance improvements," in Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology, 1996, pp. 3-17.
[7] J. Wang and J. Han, "Bide: Efficient mining of frequent closed sequences," in Proceedings of the 20th IEEE International Conference on Data Engineering, 2004, pp.79-90.
[8] X. Yan, J. Han, and R. Afshar, "Clospan: Mining closed sequential patterns in large datasets," in Proceedings of the 3rd SIAM International Conference on Data Mining, 2003, pp. 166-177.
[9] H. Yao, H. J. Hamilton, and C. J. Butz, "A foundational approach to mining itemset utilities from databases," in Proceedings of the 7th SIAM International Conference on Data Mining, 2004, pp. 482-486.
[10] Y. Liu, W. Liao, and A. Choudhary, "A fast high utility itemsets mining algorithm," in Proceedings of the 1st ACM International Workshop on Utility-based Data Mining, 2005, pp. 90-99.
[11] Y.-C. Li, J.-S. Yeh, and C.-C. Chang, "Isolated items discarding strategy for discovering high utility itemsets," Data & Knowledge Engineering, vol. 64, no. 1, pp. 198-217, 2008.
[12] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee, "Efficient tree structures for high utility pattern mining in incremental databases," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 12, pp. 1708-1721, 2009.
[13] V. S. Tseng, C.-W. Wu, B.-E. Shie, and P. S. Yu, "Up-growth: an efficient algorithm for high utility itemset mining," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 253-262.
[14] M. Liu and J. Qu, "Mining high utility itemsets without candidate generation," in Proceedings of the 21st ACM international conference on Information and knowledge management, 2012, pp. 55-64.
[15] J. Liu, K. Wang, and B. Fung, "Direct discovery of high utility itemsets without candidate generation," in the 12rd IEEE International Conference on Data Mining, 2012, pp. 984-989.
[16] C. F. Ahmed, S. K. Tanbeer, and B.-S. Jeong, "A novel approach for mining high-utility sequential patterns in sequence databases," ETRI Journal, vol. 32, no. 5, pp. 676-686, 2010.
[17] G.-C. Lan, T.-P. Hong, V. S. Tseng, and S.-L. Wang, "Applying the maximum utility measure in high utility sequential pattern mining," Expert Systems with Applications, vol. 41, no. 11, pp. 5071 - 5081, 2014.
[18] J. Yin, Z. Zheng, and L. Cao, "Uspan: an efficient algorithm for mining high utility sequential patterns," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 660-668.
[19] B.-E. Shie, H.-F. Hsiao, V. S. Tseng, and P. S. Yu, "Mining high utility mobile sequential patterns in mobile commerce environments," in Proceedings of the 16th International Conference on Database Systems for Advanced Applications, 2011, pp.224-238.
[20] C. F. Ahmed, S. K. Tanbeer, and J. Byeong-Soo, "A framework for mining high utility web access sequences." IETE Technical Review, vol. 28, no. 1, pp. 3 - 16, 2011.
[21] C. W.Wu, B.-E. Shie, V. S. Tseng, and P. S. Yu, "Mining top-k high utility itemsets," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 78-86.
[22] J. Yin, Z. Zheng, L. Cao, Y. Song, and W. Wei, "Efficiently mining top-k high utility sequential patterns," in the 13rd IEEE International Conference on Data Mining, 2013, pp. 1259-1264.
[23] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed. Pearson Education, 2003.
[24] W. Zhang and R. E. Korf, "Depth-first vs. best-first search: New results," in the 11th AAAI National Conference on Artificial Intelligence, 1993, pp. 769-775.
[25] C.-W. Lin, T.-P. Hong, G.-C. Lan, J.-W. Wong, and W.-Y. Lin, "Incrementally mining high utility patterns based on pre-large concept," Applied Intelligence, vol. 40, no. 2, pp. 343-357, 2014.
[26] S. Parthasarathy, M. J. Zaki, M. Ogihara, and S. Dwarkadas, "Incremental and interactive sequence mining," in Proceedings of the 8th International Conference on Information and Knowledge Management, 1999.
[27] H. Cheng, X. Yan, and J. Han, "Incspan: Incremental mining of sequential patterns in large database," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
[28] S. Nguyen, X. Sun, and M. Orlowska, "Improvements of incspan: Incremental mining of sequential patterns in large database," in PAKDD, 2005.
[29] Y. Chen, J. Guo, Y. Wang, Y. Xiong, and Y. Zhu, "Incremental mining of sequential patterns using prefix tree," in PAKDD, 2007.
[30] "Microsoft sql server 2008 analysis services unleashed," http://www.informit.com/store/microsoft-sql-server-2008-analysis-services-unleashed-9780672330018.
[31] "Ta-feng datasets," http://aiia.iis.sinica.edu.tw/index.php?option=com frontpage&Itemid=1.
[32] K. Bache and M. Lichman, "UCI machine learning repository," http://archive.ics.uci.edu/ml, 2013.
[33] "Amazon reviews," http://snap.stanford.edu/data/web-Amazon.html.
[34] J.-Z. Wang, J.-L. Huang, and Y.-C. Chen, "On efficiently mining high utility sequential patterns," Knowledge and Information Systems, to be published.
連結至畢業學校之論文網頁點我開啟連結
註: 此連結為研究生畢業學校所提供,不一定有電子全文可供下載,若連結有誤,請點選上方之〝勘誤回報〞功能,我們會盡快修正,謝謝!
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top