Author: 林玠言
Author (English): Chieh-Yen Lin
Title: 大規模羅吉斯回歸與線性支持向量機在Spark上之應用
Title (English): Large-scale Logistic Regression and Linear Support Vector Machines Using Spark
Advisor: 林智仁
Advisor (English): Chih-Jen Lin
Oral examination committee: 林軒田, 李育杰
Oral examination committee (English): Hsuan-Tien Lin, Yuh-Jye Lee
Oral defense date: 2014-07-11
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Networking and Multimedia (資訊網路與多媒體研究所)
Discipline: Computer Science
Subject area: Networking
Document type: Academic thesis
Year of publication: 2014
Academic year of graduation: 102
Language: English
Pages: 41
Keywords (Chinese): 大規模學習, 分散式運算, 羅吉斯回歸, 支持向量機, 牛頓法
Keywords (English): large-scale learning, distributed computing, logistic regression, support vector machine, Newton method
Abstract (translated from Chinese): Logistic regression and linear support vector machines are both highly useful methods for learning on large-scale classification problems. However, distributed implementations of these two models have not been thoroughly and completely studied. In addition, because the typical MapReduce framework suffers a computational-efficiency bottleneck when implementing iterative machine-learning methods, the in-memory cluster-computing platform Spark has emerged in recent years. Owing to its capabilities for data processing and analytics, Spark has become a widely used framework. In this thesis, we propose a distributed Newton algorithm and implement it on Spark. We identify and analyze implementation details that strongly affect computation and communication time, and propose solutions to these issues. Finally, after careful consideration and investigation, we implement the algorithm proposed in this thesis as an efficient, publicly available tool.
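For context, the regularized problems both abstracts refer to can be written as follows (a standard formulation from the linear-classification literature; the notation below is ours, not necessarily the thesis's):

```latex
% L2-regularized linear classification: given instances (x_i, y_i),
% y_i \in \{-1, +1\}, i = 1, \dots, l, solve
\min_{w} \; f(w) = \frac{1}{2} w^{\top} w
  + C \sum_{i=1}^{l} \xi\!\left(y_i w^{\top} x_i\right),
\qquad
\xi_{\mathrm{LR}}(z) = \log\!\left(1 + e^{-z}\right),
\quad
\xi_{\mathrm{SVM}}(z) = \max(0,\, 1 - z)^2 .
% A Newton step solves  \nabla^2 f(w)\, s = -\nabla f(w).  The expensive
% gradient and Hessian-vector computations decompose as sums over data
% partitions, which is what makes the method amenable to Spark's
% map/reduce-style aggregation.
```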

Logistic regression and linear SVM are useful methods for large-scale classification. However, their distributed implementations have not been well studied. Recently, because of the inefficiency of the MapReduce framework on iterative algorithms, Spark, an in-memory cluster-computing platform, has been proposed. It has emerged as a popular framework for large-scale data processing and analytics. In this work, we consider a distributed Newton method for solving logistic regression as well as linear SVM and implement it on Spark. We carefully examine many implementation issues that significantly affect running time and propose our solutions. After conducting thorough empirical investigations, we release an efficient and easy-to-use tool for the Spark community.
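The core pattern the abstract describes can be sketched in a few lines: each data partition computes a partial gradient of the regularized logistic-regression objective over its local instances, and the partial results are summed, mimicking Spark's mapPartitions-then-reduce workflow. This is a minimal pure-Python sketch, not the thesis's actual code; all names here (`partial_gradient`, `distributed_gradient`, `partitions`) are illustrative.

```python
import math

def partial_gradient(w, instances):
    """Gradient contribution of one data partition.

    instances: list of (x, y) with x a feature list and y in {-1, +1}.
    The loss for one instance is log(1 + exp(-y * w.x)).
    """
    g = [0.0] * len(w)
    for x, y in instances:
        wx = sum(wi * xi for wi, xi in zip(w, x))
        # d/dw log(1 + exp(-y w.x)) = -y * sigma(-y w.x) * x
        coef = -y / (1.0 + math.exp(y * wx))
        for j, xj in enumerate(x):
            g[j] += coef * xj
    return g

def distributed_gradient(w, partitions, C=1.0):
    """Sum per-partition gradients (the 'reduce' step) and add the
    regularization term w, giving the gradient of
    f(w) = 0.5 * w.w + C * sum_i log(1 + exp(-y_i w.x_i))."""
    total = [0.0] * len(w)
    for part in partitions:          # on Spark: rdd.mapPartitions(...)
        pg = partial_gradient(w, part)
        total = [t + C * p for t, p in zip(total, pg)]  # ...then reduce
    return [wi + ti for wi, ti in zip(w, total)]

# Tiny example: two partitions, two features, starting from w = 0.
parts = [
    [([1.0, 0.0], 1), ([0.0, 1.0], -1)],
    [([1.0, 1.0], 1)],
]
print(distributed_gradient([0.0, 0.0], parts))  # → [-1.0, 0.0]
```

In the real implementation the same decomposition applies to the Hessian-vector products inside the trust-region Newton solver, which is why the per-partition map plus a global reduce is the dominant communication pattern.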

Oral Examination Committee Certification i
Chinese Abstract ii
ABSTRACT iii
LIST OF FIGURES vi
LIST OF TABLES vii

I. Introduction 1
II. Apache Spark 3
2.1 Hadoop Distributed File System 3
2.2 Resilient Distributed Datasets 4
2.3 Lineage and Fault Tolerance of Spark 5
III. Logistic Regression, Support Vector Machines and Distributed Newton Method 6
3.1 Logistic Regression and Linear SVM 6
3.2 A Trust Region Newton Method 7
3.3 Distributed Algorithm 8
IV. Implementation Design 11
4.1 Loop Structure 12
4.2 Data Encapsulation 14
4.3 Using mapPartitions Rather Than map 15
4.4 Caching Intermediate Information or not 17
4.5 Using Broadcast Variables 18
4.6 The Cost of the reduce Function 19
V. Related Works 21
5.1 LR Solver in MLlib 21
5.2 MPI LIBLINEAR 22
VI. Experiments 23
6.1 Different Loop Structures 24
6.2 Encapsulation 25
6.3 mapPartitions and map 25
6.4 Broadcast Variables and the coalesce Function 26
6.5 Analysis on Scalability 27
6.6 Comparing with MLlib 28
6.7 Comparison of Spark LIBLINEAR and MPI LIBLINEAR 29
VII. Discussions and Conclusions 33
APPENDICES 34
BIBLIOGRAPHY 40

