National Digital Library of Theses and Dissertations in Taiwan

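Author: 秦秉達
Author (English): Bing-Da Chin
Thesis title: 基於Hadoop MapReduce叢集設計平行化二元分類演算法
Thesis title (English): Design of Parallel Binary Classification Algorithm Based on Hadoop Cluster with MapReduce Framework
Advisors: 陳弘明, 陳世穎
Advisors (English): Hung-Ming Chen, Shih-Ying Chen
Degree: Master's
Institution: National Taichung University of Science and Technology
Department: Master's Program, Department of Computer Science and Information Engineering
Field: Engineering
Discipline: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar)
Language: Chinese
Number of pages: 74
Keywords (Chinese, translated): data mining; classification; SVM; binary classification; Hadoop; MapReduce
Keywords (English): Data Mining; Classification; SVM; Binary-class classification; Hadoop; MapReduce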
Usage statistics:
  • Cited by: 5
  • Views: 490
  • Downloads: 141
  • Bookmarked: 1
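Now that a single-machine environment can no longer analyze large volumes of data efficiently, the storage and analysis capabilities of the Hadoop computing platform are clearly important. Data mining algorithms are a key part of large-scale data analysis. To address the high time complexity of the SVM binary classification algorithm, this study improves a binary classification algorithm so that it speeds up the filtering and classification of data within a distributed parallel computing framework. The algorithm is implemented using the parallel processing capabilities of the MapReduce programming framework and runs successfully on the Hadoop platform. When trained on the same data set, it greatly reduces execution time.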

As the amount of data grows, it is hard to analyze large data sets efficiently on a single computer. A Hadoop cluster is therefore very important, because it can both store and analyze large data. Data mining plays an important role in data analysis. Because the time complexity of the binary-class SVM classification algorithm is a major issue, we design a parallel binary SVM algorithm to solve this problem and to speed up the filtering and classification of data.
By leveraging the parallel processing property of MapReduce, we implement a multi-layer binary SVM in the MapReduce framework and run it successfully on a Hadoop cluster. Experiments with different Hadoop cluster parameters, using the same data set for training, show that the new algorithm reduces computation time significantly.
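
The cascade idea behind a multi-layer binary SVM can be illustrated with a minimal Python sketch. This is not the thesis implementation. It assumes the generic Cascade SVM scheme (train on partitions, keep only support vectors, merge pairs, retrain), swaps LIBSVM for a small Pegasos-style linear trainer, and runs the "map" and "reduce" steps as plain list operations instead of Hadoop jobs. All function names and parameters are hypothetical.

```python
import random


def decision(w, x):
    """Linear decision value w.x + b, where the last entry of w is the bias b."""
    return sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))


def train_linear_svm(points, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient training of a linear SVM.

    Assumption: this stands in for a real solver such as LIBSVM.
    `points` is a list of (features, label) pairs with label in {+1, -1}.
    """
    rng = random.Random(seed)
    w = [0.0] * (len(points[0][0]) + 1)
    t = 0
    for _ in range(epochs):
        order = list(points)
        rng.shuffle(order)
        for x, y in order:
            t += 1
            eta = 1.0 / (lam * t)            # decaying step size
            margin = y * decision(w, x)      # computed with the current weights
            w = [wi * (1.0 - 1.0 / t) for wi in w]  # regularization shrink
            if margin < 1.0:                 # hinge loss is active
                w = [wi + eta * y * xi for wi, xi in zip(w, list(x) + [1.0])]
    return w


def support_vectors(points, w, tol=1e-9):
    """Keep points on or inside the margin; fall back to the closest point."""
    sv = [(x, y) for x, y in points if y * decision(w, x) <= 1.0 + tol]
    return sv or [min(points, key=lambda p: p[1] * decision(w, p[0]))]


def cascade_svm(points, n_parts=4):
    """Train layer by layer until a single merged partition remains."""
    parts = [points[i::n_parts] for i in range(n_parts)]
    layers = 0
    while len(parts) > 1:
        # "Map" step: each partition is trained independently (parallelizable).
        filtered = [support_vectors(p, train_linear_svm(p)) for p in parts]
        # "Reduce" step: merge the surviving points of neighbouring partitions.
        parts = [sum(filtered[i:i + 2], []) for i in range(0, len(filtered), 2)]
        layers += 1
    return train_linear_svm(parts[0]), layers
```

Each layer discards points that are unlikely to be support vectors, so later layers train on much smaller sets. In a Hadoop setting the per-partition training would run as independent map tasks, which is where the time saving would be expected to come from.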


Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Process
Chapter 2 Literature Review
2.1 Cloud Computing Platforms
2.1.1 Hadoop Framework
2.1.2 Hadoop Distributed File System (HDFS)
2.1.3 MapReduce
2.2 Classification Methods in Data Mining
2.2.1 Types of Classification
2.2.1.2 Multi-label Classification
2.2.1.1 Binary-class Classification
2.2.2 Evaluation of Classification Results
2.2.2.1 Methods for Evaluating Classification Accuracy
2.2.2.2 Classification Evaluation Metrics
2.2.3 Classification Rule Algorithms
2.2.3.1 SVM (Support Vector Machine)
2.2.3.2 Naïve Bayes
2.2.3.3 Decision Tree
2.3 Cascade SVM
2.4 LIBSVM
2.5 MapReduce-based Distributed SVM
Chapter 3 Research Method
3.1 Design of the Parallel MapReduce-based Binary Classification Algorithm
3.1.1 HDFS
3.1.2 MapReduce
3.1.3 Algorithm Implementation
3.1.3.1 LIBSVM
3.1.3.2 Cascade SVM
3.1.3.3 Multilayer-MapReduce-based Cascade SVM
Chapter 4 Experimental Results
4.1 Experimental Environment
4.2 Experimental Data
4.3 Experimental Procedure
4.4 Experimental Results
4.4.1 Performance Analysis on Massive Data
4.4.2 Skin Segmentation Analysis Results
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References


Table 4-1 Hardware configuration of the Hadoop cluster environment
Table 4-2 Hardware configuration of the single-machine environment
Table 4-3 Hadoop cluster environment configuration
Table 4-4 Multilayer-MapReduce-based Cascade SVM: total computation time of the parallel stage (node = 1), in seconds
Table 4-5 Multilayer-MapReduce-based Cascade SVM: total computation time of the parallel stage (node = 3), in seconds
Table 4-6 Multilayer-MapReduce-based Cascade SVM: total computation time of the parallel stage (node = 6), in seconds
Table 4-7 Execution time comparison of the parallel implementations (node = 6), in seconds
Table 4-8 Multilayer-MapReduce-based Cascade SVM execution time (node = 1), in seconds
Table 4-9 Multilayer-MapReduce-based Cascade SVM execution time (node = 3), in seconds
Table 4-10 Multilayer-MapReduce-based Cascade SVM execution time (node = 3), in seconds
Table 4-11 Çatak's MapReduce-based Distributed SVM execution time (task = 1), in seconds
Table 4-12 Çatak's MapReduce-based Distributed SVM execution time (task = 2), in seconds
Table 4-13 Çatak's MapReduce-based Distributed SVM execution time (task = 4), in seconds


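Figure 1-1 Research process and architecture
Figure 2-1 HDFS and MapReduce framework
Figure 2-2 HDFS architecture and replica creation
Figure 2-3 HDFS file read process
Figure 2-4 HDFS file write process
Figure 2-5 MapReduce parallel processing workflow
Figure 2-6 One-against-rest classification strategy
Figure 2-7 One-against-one classification strategy
Figure 2-8 DAG testing phase
Figure 2-9 Classification result evaluation matrix
Figure 2-10 SVM maximum-margin classifier
Figure 2-11 SVM in the linearly non-separable case
Figure 2-12 Transformation using a kernel function
Figure 2-13 Decision tree
Figure 2-14 Cascade SVM
Figure 2-15 Cascade SVM flowchart
Figure 2-16 Basic LIBSVM usage flow
Figure 3-1 Multilayer-MapReduce-based Cascade SVM system flowchart
Figure 3-2 Implementation of the MapRed-based Cascade SVM algorithm
Figure 3-3 Multilayer-MapReduce-based Cascade SVM stage 1 pseudocode
Figure 3-4 Multilayer-MapReduce-based Cascade SVM stage 2 pseudocode
Figure 3-5 Multilayer-MapReduce-based Cascade SVM stage 3 job configuration pseudocode
Figure 3-6 Multilayer-MapReduce-based Cascade SVM stage 3 map task pseudocode
Figure 3-7 Multilayer-MapReduce-based Cascade SVM stage 3 reduce task pseudocode
Figure 4-1 Data in LIBSVM format
Figure 4-2 Experimental flowchart
Figure 4-3 Multilayer-MapReduce-based Cascade SVM execution time comparison across different numbers of nodes
Figure 4-4 Parallel-stage time comparison of the parallel methods
Figure 4-5 Total execution time comparison (task = 1)
Figure 4-6 Total execution time comparison (task = 2)
Figure 4-7 Total execution time comparison (task = 4)
Figure 4-8 Multilayer-MapReduce-based Cascade SVM execution time comparison across different numbers of nodes
Figure 4-9 Accuracy comparison (task = 1)
Figure 4-10 Accuracy comparison (task = 2)
Figure 4-11 Accuracy comparison (task = 4)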



