跳到主要內容

臺灣博碩士論文加值系統

(18.97.14.80) 您好!臺灣時間:2024/12/04 05:59
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

我願授權國圖
: 
twitterline
研究生:李高丞
研究生(外文):Kao-ChengLee
論文名稱:基於記憶體可限制之分散式頻繁模式挖掘演算法
論文名稱(外文):On the Guarantee of Peak Memory Consumption in Map-ReduceFrequent Pattern Mining
指導教授:莊坤達莊坤達引用關係
指導教授(外文):Kun-Ta Chuang
學位類別:碩士
校院名稱:國立成功大學
系所名稱:資訊工程學系
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2017
畢業學年度:105
語文別:英文
論文頁數:41
中文關鍵詞:分散式運算頻繁模式挖掘數據挖掘
外文關鍵詞:HadoopMap-ReduceDistributed PlatformFrequent Pattern Mining
相關次數:
  • 被引用被引用:0
  • 點閱點閱:190
  • 評分評分:
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:0
近年來,數據挖掘演算法的設計模式從傳統的單機運算概念延伸至如Hadoop,Spark這類具有擴展性、容錯性以及自我調控性的分散式運算平台。在數據急遽成長的環境下,平行運算被視為解決效能瓶頸的根本。在文獻中,基於Map-Reduce程式框架的挖掘演算法幾乎可以通過完全遷移複雜的索引來減少執行時間,例如開發在分散式平台的FPGrowth算法,其複雜結構透過有效性的修剪以及搜索效率均得到了全面的展示。

然而在本文探討中,這些複雜結構將成為大數據分析中的關鍵挑戰。一些複雜天生不可分割的性質會導致在分析過程中,耗盡某些運算節點的記憶體資源。儘管數據成長的計算瓶頸可以透過平行運算來解決,但由於數據增長也相對讓複雜結構的規模增加,因此記憶體的資源消耗將會急遽的上升。為了解決記憶體耗盡的問題,我們在本篇論文探討一個實用且重要的Map-Reduce挖掘框架,我們專注於在分散式平台下,藉由指定現有的記憶體資源來完成頻繁項集挖掘。我們設計了一個名為MFMR的新型Map-Reduce框架,同時兼具頻繁模式挖掘的效能,並符合用戶可控制記憶體上限之目標。
In recent years, the design paradigm of data mining algorithms has moved far beyond the traditional execution in single powerful server into the Map-Reduce-based execution in the scalable, fault-tolerance, and self-configurable clusters such as Hadoop or Spark. The parallelism has been recognized as the cure to resolve the performance bottleneck raising from the dramatic growth of data amount nowadays. In the literature, the state-of-the-art Map-Reduce mining algorithms almost pursue the reduction of the turnaround time by completely migrating the sophisticated index, such as the FP-Tree developed in non-parallel version of FPGrowth, in the distributed platform since the pruning effectiveness or search efficiency of these in-memory structure has been comprehensively demonstrated. Surprisingly, we in this paper explore that these in-memory structures will become the critical challenge in the big data era. The inseparable nature of some complicated index leads to the unacceptable memory consumption in some computing nodes, resulting in the ”out-of-memory” fatal error. The computational bottleneck from data increase could be resolved by the Map-Reduce manner, but however, the memory usage will be inevitably aggravated since data increase will enlarge the size of centralized indexes.

To remedy the out-of-memory challenge, we explore in this paper a practically important Map-Reduce mining framework to retrieve top-k frequent patterns in the presence of the memory constraint. Specifically, we focus on the memory issue which attempts to specify an available memory size in the Hadoop platform that can be utilized for mining frequent itemsets. We devise a novel Framework, called MFMR, to efficiently discover frequent patterns while complying with the goal of user-controllable memory upper bound. In our experimental studies on real data and synthetic data, the MFMR framework shows its prominent advantages, including the memory consumption and execution time, demonstrating its promising applicability.
中文摘要 . . . . . i
Abstract . . . . . ii
Contents . . . . . iii
List of Tables . . . . . iv
List of Figures . . . . . v
1 Introduction . . . . . 1
2 Related Work . . . . . 7
3 Preliminaries . . . . . 10
3.1 Memory-Bounded Candidate Generation . . . . . 11
3.2 Memory-Bounding Techniques in Distributed Environment . . . . . 12
3.3 Asynchronous Support Counting . . . . . 15
3.4 Illustrative Example of Memory Limited Mining Process . . . . . 18
4 Memory-Flexibility Map-Reduce Framework . . . . . 20
4.1 Implementation of MFMR . . . . . 20
4.2 Performance Optimization in Distributed Platforms . . . . . 22
5 Experimental Results . . . . . 26
5.1 Experimental Setup . . . . . 26
5.2 Studies on Mining Top-k Frequent Patterns . . . . . 27
6 Conclusions . . . . . 37
Bibliography . . . . . 38
[1] Frequent itemset mining dataset repository. In http://fimi.ua.ac.be/data, 2004.
[2] Apache mahout. In http://mahout.apache.org/, 2015.
[3] Apache hadoop. In http://hadoop.apache.org/, 2017.
[4] Apache spark. In https://spark.apache.org/, 2017.
[5] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[6] F. Bodon. A fast apriori implementation. In Proceedings of the IEEE ICDM workshop on frequent itemset mining implementations (FIMI’03), volume 90, 2010.
[7] I. Cantador, P. Brusilovsky, and T. Kuflik. 2nd workshop on information heterogeneity and fusion in recommender systems. In Proc. of the ACM Conf. on Recommender systems, RecSys, 2011.
[8] C.-C. Chen, C.-Y. Tseng, and M.-S. Chen. Highly scalable sequential pattern mining based on mapreduce model on the cloud. In IEEE International Congress on Big Data, pages 310–317. IEEE, 2013.
[9] H. Chen, T. Y. Lin, Z. Zhang, and J. Zhong. Parallel mining frequent patterns over big transactional data in extended mapreduce. In Granular Computing (GrC), IEEE International Conference on, pages 43–48. IEEE, 2013.
[10] K.-T. Chuang, J.-L. Huang, and M.-S. Chen. Mining top-k frequent patterns in the presence of the memory constraint. The VLDB Journal, 17(5):1321–1344, 2008.
[11] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[12] F. Geerts, B. Goethals, and J. V. D. Bussche. Tight upper bounds on the number of candidate patterns. ACM Transactions on Database Systems (TODS), 30(2):333–363, 2005.
[13] A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan. Nimble: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 334–342. ACM, 2011.
[14] B. Goethals. Memory issues in frequent itemset mining. In Proceedings of the 2004 ACM symposium on Applied computing, pages 530–534. ACM, 2004.
[15] G. Graefe, H. Volos, H. Kimura, H. Kuno, J. Tucek, M. Lillibridge, and A. Veitch. In-memory performance for big data. Proceedings of the VLDB Endowment, 8(1):37–48, 2014.
[16] C. Guo, Y. Ma, B. Yang, C. S. Jensen, and M. Kaul. Ecomark: evaluating models of vehicular environmental impact. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems, pages 269–278. ACM, 2012.
[17] E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules, volume 26. ACM, 1997.
[18] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM Sigmod Record, volume 29, pages 1–12. ACM, 2000.
[19] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang. PFP: parallel fp-growth for query recommendation. In Proceedings of the ACM conference on Recommender systems, pages 107–114. ACM, 2008.
[20] L. Li and M. Zhang. The strategy of mining association rule based on cloud computing. In International Conference on Business Computing and Global Informatization, pages 475–478. IEEE, 2011.
[21] N. Li, L. Zeng, Q. He, and Z. Shi. Parallel implementation of apriori algorithm based on mapreduce. In Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing (SNPD) 13th ACIS International Conference on, pages 236–241. IEEE, 2012.
[22] K. Lin, J. Xu, I. M. Baytas, S. Ji, and J. Zhou. Multi-task feature interaction learning. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1735–1744. ACM, 2016.
[23] M.-Y. Lin, P.-Y. Lee, and S.-C. Hsueh. Apriori-based frequent itemset mining algorithms on mapreduce. In Proceedings of the 6th international conference on ubiquitous information management and communication, page 76. ACM, 2012.
[24] S. Moens, E. Aksehirli, and B. Goethals. Frequent itemset mining for big data. In Big Data IEEE International Conference on, pages 111–118. IEEE, 2013.
[25] J. S. Park, M.-S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules, volume 24. ACM, 1995.
[26] B. T. Rao, N. Sridevi, V. K. Reddy, and L. Reddy. Performance issues of heterogeneous hadoop clusters in cloud computing. arXiv preprint arXiv:1207.0894, 2012.
[27] H. P. Vanchinathan, A. Marfurt, C.-A. Robelin, D. Kossmann, and A. Krause. Discovering valuable items from massive data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1195–1204. ACM, 2015.
[28] J.Wang, J. Han, Y. Lu, and P. Tzvetkov. TFP: An efficient algorithm for mining top-k frequent closed itemsets. IEEE Transactions on Knowledge and Data Engineering, 17(5):652–663, 2005.
[29] H. Yang, R. Fujimaki, Y. Kusumura, and J. Liu. Online feature selection: A limited-memory substitution algorithm and its asynchronous parallel variation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1945–1954. ACM, 2016.
[30] X. Y. Yang, Z. Liu, and Y. Fu. Mapreduce as a programming model for association rules algorithm on hadoop. In Information Sciences and Interaction Sciences (ICIS) 3rd International Conference on, pages 99–102. IEEE, 2010.
[31] H. Yin, B. Cui, J. Li, J. Yao, and C. Chen. Challenging the long tail recommendation. PVLDB, 5(9):896–907, 2012.
[32] C. Zeng, J. F. Naughton, and J.-Y. Cai. On differentially private frequent itemset mining. Proceedings of the VLDB Endowment, 6(1):25–36, 2012.
[33] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 401–406. ACM, 2001.
[34] L. Zhou, Z. Zhong, J. Chang, J. Li, J. Z. Huang, and S. Feng. Balanced parallel fp-growth with mapreduce. In Information Computing and Telecommunications (YC-ICT) IEEE Youth Conference on, pages 243–246. IEEE, 2010.
連結至畢業學校之論文網頁點我開啟連結
註: 此連結為研究生畢業學校所提供,不一定有電子全文可供下載,若連結有誤,請點選上方之〝勘誤回報〞功能,我們會盡快修正,謝謝!
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top