National Digital Library of Theses and Dissertations in Taiwan
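Student: 曾偉誠
Student (English name): Wei-Chen Zeng
Thesis title (Chinese): 雲端運算之XML巨量資料處理機制設計
Thesis title (English): Efficient XML Data Processing based on MapReduce Framework
Advisors: 陳世穎, 陳弘明
Advisors (English names): Shih-Ying Chen, Hung-Ming Chen
Degree: Master's
Institution: National Taichung University of Science and Technology (國立臺中科技大學)
Department: Department of Computer Science and Information Engineering, Master's Program
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar)
Language: Chinese
Number of pages: 61
Keywords (Chinese): 雲端運算 (cloud computing), MapReduce, XML tree paths, 巨量資料 (big data)
Keywords (English): Cloud computing, MapReduce, XML tree paths, Big data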
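Advances in hardware and network technology have caused the data volume of all kinds of applications to grow rapidly, and cloud computing has become one of the most important research topics for processing big data. Cloud computing provides a new service architecture that uses computing resources and storage space more efficiently and also offers development environments and cloud services. Big data is currently processed mostly in the MapReduce environment. Under MapReduce, however, data must be expressed in the format the framework prescribes, that is, as (key, value) pairs, and is split and dispatched to the individual machines to achieve parallelism.

XML (eXtensible Markup Language), on the other hand, is a new-generation markup language proposed by the W3C for transmitting and processing all kinds of complex documents. It supports applications such as information retrieval and electronic data interchange, and is a common standard format for data exchange and data storage. Although XML processing on a single machine is mature, when index values are too long or the XML document is too large, a single host cannot bear the computational load, and path exploration may fail or become too slow. This study therefore provides a way to process very large XML data in parallel in the cloud.

When MapReduce parses an XML document, the document is split into several parts that are dispatched to the compute nodes so that they can be processed in parallel. The MapReduce mechanism designed in this study parses massive XML data in a single MapReduce round. Because the default TextInputFormat is not adequate for this, the thesis first designs a dedicated XMLInputFormat class that processes massive XML data on the cloud platform, extracts every XML path, and stores the paths in the HBase cloud database for subsequent processing such as data mining and XML querying.

A Hadoop cluster of 16 cloud servers is used to test the performance of the algorithm. The performance tests are in two parts: the characteristics of the XML data and the tuning of Hadoop parameters. The experiments show that the approach, built on Hadoop's MapReduce parallel distributed processing framework, is effective for massive XML documents: the improvement is 67.4% for the largest XML file size tested (16 GB) and 89.7% for the largest number of XML paths tested (13.6 million).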
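The abstract describes extracting every XML path and expressing the data as (key, value) pairs in a single MapReduce round, but this record does not give the algorithm itself. The following is only a minimal in-memory sketch of that idea in Python, not the thesis's Hadoop implementation. The function names `map_paths`, `reduce_counts`, and `run_job`, the choice of the path string as key and an occurrence count as value, and the sample records are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def map_paths(record):
    """Map step (hypothetical): emit one (path, 1) pair per root-to-leaf path."""
    root = ET.fromstring(record)
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            # A node without child elements ends a root-to-leaf path.
            yield path, 1
        for child in children:
            stack.append((child, path + "/" + child.tag))


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(records):
    """Apply map to every record, then reduce once: a single round."""
    pairs = [pair for record in records for pair in map_paths(record)]
    return reduce_counts(pairs)


# Assumed sample input: each string is one complete XML record.
records = [
    "<person><name>A</name><address><city>X</city></address></person>",
    "<person><name>B</name><phone>1</phone></person>",
]
path_counts = run_job(records)
```

In a real cluster the map calls would run on different nodes and the resulting pairs would be written to a store such as HBase. Here everything runs in one process so that the data flow is easy to follow.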

Advances in hardware and network technology have led to rapid growth in the data volume of all kinds of applications, and cloud computing has become one of the most important research topics for processing big data. Cloud computing offers a new service architecture that uses computing resources and storage space more efficiently and also provides development environments and cloud services. At present, big data is mostly processed in the MapReduce environment. Under MapReduce, however, data must be expressed in the form the framework prescribes, namely as (key, value) pairs, and is split and assigned to the individual computers to achieve parallelism. XML (eXtensible Markup Language), on the other hand, is a new-generation markup language proposed by the W3C for transmitting and processing all kinds of complex documents. It supports applications such as information retrieval and electronic data interchange, and is a common standard format for data exchange and data storage. Although XML document processing on a single computer is already mature, when index values are too long or the XML document is too large, a single host cannot afford the amount of computation, and path exploration may fail or be too slow. This study therefore provides a parallel cloud-computing technique for very large XML data. To parse an XML document with MapReduce, the document is split into several parts that are delivered to the compute nodes, and only then can it be processed in parallel.

However, this way of splitting destroys the nested structure of the XML tags, and the nesting relationships are difficult to recover, which makes processing difficult. This study designs a MapReduce mechanism that completes in a single MapReduce round. It requires a dedicated input format class, named XMLInputFormat.class, to process the XML big data stored on HDFS on the cloud platform, extract each XML path, and store the paths in the HBase cloud database for subsequent processing such as data mining. A Hadoop cluster of 16 cloud servers is used to test the effectiveness of the algorithm. The tests are divided into two parts: XML features and Hadoop parameter tuning. The experiments show that the approach, built on Hadoop's MapReduce distributed parallel processing framework, is effective for large XML documents: the improvement is 67.4% for the largest XML file size tested (16 GB) and 89.7% for the largest number of XML paths tested (13,600,000).
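The splitting problem described above (a fixed-size split can cut an XML record in the middle of its tags) is commonly handled by making each split responsible only for the records whose start tag begins inside it, and letting the reader run past the split boundary to finish the last record. The Python sketch below illustrates that general convention. It is not the thesis's XMLInputFormat, whose internals this record does not describe. The function `records_for_split`, the `<item>` record tag, and the 30-byte split size are hypothetical, and the sketch assumes record tags carry no attributes and are not nested inside one another.

```python
def records_for_split(data, start, end, start_tag, end_tag):
    """Return the records whose start tag begins inside [start, end).

    The last record may extend past `end`; the next split skips it because
    its start tag does not begin inside that split's own range.
    """
    records = []
    pos = start
    while True:
        begin = data.find(start_tag, pos)
        if begin == -1 or begin >= end:
            break  # no further record starts in this split
        close = data.find(end_tag, begin)
        if close == -1:
            break  # unterminated record: ignore it in this sketch
        stop = close + len(end_tag)
        records.append(data[begin:stop])
        pos = stop
    return records


# Assumed sample document and an artificially small split size.
data = (
    b"<db><item><id>1</id></item><item><id>2</id></item>"
    b"<item><id>3</id></item></db>"
)
split_size = 30
splits = [
    (s, min(s + split_size, len(data)))
    for s in range(0, len(data), split_size)
]
per_split = [
    records_for_split(data, s, e, b"<item>", b"</item>") for s, e in splits
]
all_records = [record for records in per_split for record in records]
```

With these inputs the first split boundary falls inside the second `<item>` record, yet every record is still produced exactly once and in one piece, because ownership is decided by where a record starts rather than where it ends.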

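Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
Chapter 2 Related Work
2.1 Processing Encoded XML Documents with MapReduce
2.2 Processing Raw XML Documents with MapReduce
Chapter 3 Problem Analysis and Mechanism Design
3.1 Problem Analysis
3.2 Mechanism Design
Chapter 4 Experimental Results
4.1 Experimental Environment
4.2 Experimental Design
4.3 Effects on Performance in the Experimental Results
4.3.1 Effect of XML Document Characteristics on Performance
4.3.1.1 Number of XML Paths
4.3.1.2 XML Document Size
4.3.1.3 XML Tag Length
4.3.2 Effect of Hadoop Parameter Tuning on Performance
4.3.2.1 Effect of the Number of Reducers for Varying Numbers of XML Paths
4.3.2.2 Effect of the Number of Reducers for Varying Document Sizes
4.3.2.3 Effect of Block Size on Performance for Varying Document Sizes
4.3.3 Stress Testing
4.3.3.1 Stress Test of the Algorithm with XML Documents
Chapter 5 Conclusions and Future Work
References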


