National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

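Graduate student: Hong-Yu Hsing (邢弘宇)
Thesis title: A Performance Analysis and Estimation of the Data Stream of Spark Streaming (Spark Streaming串流資料處理架構效能之分析與估算)
Advisors: Shih-Ying Chen (陳世穎), Hung-Ming Chen (陳弘明)
Degree: Master's
Institution: National Taichung University of Science and Technology (國立臺中科技大學)
Department: Master's Program, Department of Computer Science and Information Engineering (資訊工程系碩士班)
Field: Engineering
Discipline: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar)
Language: Chinese
Pages: 51
Keywords: Real-time processing; Spark Streaming; batch duration; micro-batching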
Spark Streaming processes data in a coarse-grained fashion, handling one micro-batch at a time, so some latency is unavoidable: records must accumulate into a batch before they are processed, and this delay is inherent in the framework's design. How to tune Spark Streaming's operating parameters therefore deserves careful study. In short, both running time and memory usage must be optimized, but rerunning the program for every tuning attempt is very time-consuming. This study analyzes and estimates the DStreamGraph inside the Spark Streaming framework. Considering increased parallelism, reduced serialization and deserialization overhead, and a reasonable batch processing time, it evaluates which parameter settings suit a given processing job without repeated trial runs. The study proposes a formula-based estimation model for transformation parameters that analyzes and estimates the batch interval. With this model, developers can quickly and accurately find a suitable batch interval, sparing subsequent tuning work the tedious cycle of restarting and retesting the program. The model can also serve as a reference for setting the Spark Streaming batch interval so that latency stays under one second.
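The abstract does not reproduce the thesis's estimation formula, so the following is only a minimal sketch of the general idea of estimating a batch interval without rerunning the job. It assumes a hypothetical linear cost model (a fixed per-batch overhead plus a per-record cost), which is not taken from the thesis; the function names and the parameters `input_rate`, `throughput`, and `overhead_s` are illustrative. The only Spark behaviour relied on is the documented rule of thumb that each receiver produces roughly batch interval / block interval blocks (and thus tasks) per batch, and that a job is stable only when a batch is processed no more slowly than it is collected.

```python
import math


def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Approximate number of blocks (and hence tasks) per receiver per batch."""
    return batch_interval_ms // block_interval_ms


def is_stable(batch_interval_s: float, input_rate: float,
              throughput: float, overhead_s: float) -> bool:
    """Check whether a batch can be processed within its own interval.

    Hypothetical linear model (an assumption, not the thesis's formula):
    processing time = overhead_s + records_in_batch / throughput.
    """
    records_in_batch = input_rate * batch_interval_s
    processing_time_s = overhead_s + records_in_batch / throughput
    return processing_time_s <= batch_interval_s


def min_stable_batch_interval(input_rate: float, throughput: float,
                              overhead_s: float) -> float:
    """Smallest interval T satisfying overhead_s + input_rate * T / throughput <= T.

    Returns infinity when records arrive at least as fast as they can be
    processed, because no batch interval can then keep up.
    """
    if input_rate >= throughput:
        return math.inf
    return overhead_s / (1.0 - input_rate / throughput)


if __name__ == "__main__":
    # Illustrative numbers only: 5,000 records/s arriving, 20,000 records/s
    # processing capacity, 0.3 s of fixed scheduling overhead per batch.
    t_min = min_stable_batch_interval(5000, 20000, 0.3)
    print(f"minimum stable batch interval: {t_min:.3f} s")
    print("tasks per receiver at 1 s batch, 200 ms block:",
          blocks_per_batch(1000, 200))
```

Under this assumed model, any batch interval at or above the returned value keeps the job stable, and a result below one second is consistent with the sub-second latency goal the abstract describes. In practice the overhead and throughput figures would have to be measured for the specific transformation chain.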

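Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Real-Time Stream Processing Applications
2.2 Sliding Window
2.3 Big Data Processing Architectures
2.3.1 The Spark Distributed Processing Architecture
2.3.1.1 RDD Computing Capabilities
2.3.1.2 Spark Subsystems
Chapter 3 Problem Analysis and Mechanism Design
3.1 Problem Analysis
3.2 Mechanism Design
Chapter 4 Experimental Results
4.1 Experimental Environment
4.2 Experimental Design
4.3 Experimental Results and Analysis
4.3.1 Relationship between Overall Stream Processing Time and Processing Speed
4.3.2 Analysis of Factors Affecting the Input Rate
4.3.2.1 Effect of the Block Interval on the Input Rate
4.3.2.2 Effect of the Number of Receivers on the Input Rate
4.3.3 Effect of the Data Receiving Rate on the Estimated Processing Speed
4.3.3.1 Relationship between the Data Receiving Rate and the Batch Interval
4.3.3.2 Relationship between the Data Receiving Rate and the Batch Data Volume
4.3.4 Effect of Estimates for Different DStream Transformations on Processing Speed
4.3.4.1 Effect of Estimates for Ordinary Transformations on Processing Speed
4.3.4.2 Effect of Window-Based Operations on Processing Speed
Chapter 5 Conclusions and Future Work
References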




