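Graduate student: 吳東原
Graduate student (English): WU, DONG-YUAN
Thesis title: 基於MapReduce之快速資料相似度比對法
Thesis title (English): Set-similarity joins using MapReduce
Advisor: 蔡耀弘
Advisor (English): Tsai, Yao-Hong
Oral defense committee: 蔡耀弘, 鄭瑞恒, 楊權輝
Oral defense committee (English): Tsai, Yao-Hong; Cheng, Rei-Heng; Yang, Chyuan-Huei Thomas
Oral defense date: 2018-01-19
Degree: Master's
Institution: 玄奘大學 (Hsuan Chuang University)
Department: Master's Program, Department of Information Management
Discipline: Computer science
Sub-discipline: General computer science
Thesis type: Academic thesis
Year of publication: 2018
Academic year of graduation: 106 (ROC calendar)
Language: Chinese
Number of pages: 66
Keywords (translated from Chinese): prefix filtering; set-similarity join; inverted index
Keywords (English): MapReduce; prefix filtering; set-similarity join; inverted index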
Data search is ubiquitous today, and large-scale data analysis is routine; both depend on techniques for comparing data, which this study supports. We take as our baseline a published method for set-similarity joins in the MapReduce framework, which we refer to as the RF comparison method. We address its shortcomings and develop a more efficient algorithm, which we call the Prefix Accumulating algorithm. Our solution uses the MapReduce framework to compare two data sets and output a table of the similarities between their records. The algorithm has two phases. In the first MapReduce job, we apply prefix filtering to narrow a large volume of data down to records that could possibly match, and we collect identical candidate pairs so that their common elements can be accumulated. In the second MapReduce job, we compare the latter part of the data for each candidate pair, combine the intersection and union, and compute the similarity. Our experiments show that the Prefix Accumulating algorithm is faster than the RF comparison method. Its advantage is that, once the data has been filtered, the complete records do not need to be compared again. Its disadvantage is that the more partitions the data is split into, the higher the cost of integrating the partial results.
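The two-phase idea in the abstract (prefix filtering to find candidate pairs, then a similarity computation from the intersection and union) can be illustrated with a small single-process sketch. This is not the thesis's Prefix Accumulating algorithm and not the RF comparison method: it is the textbook prefix-filtering join with full verification, and it assumes Jaccard similarity as the measure, which the abstract does not name. The function names (`prefix_length`, `similarity_join`), the record IDs, and the threshold are all hypothetical, and the two loops only imitate what map and reduce steps would do on a cluster.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| (assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def prefix_length(size, threshold):
    """Prefix length for a Jaccard threshold: size - ceil(t * size) + 1.
    The small epsilon guards against floating-point overshoot in t * size."""
    return size - math.ceil(threshold * size - 1e-9) + 1


def similarity_join(records, threshold):
    """Return {(id_a, id_b): similarity} for all pairs with similarity >= threshold."""
    # Global token order: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for toks in records.values() for tok in set(toks))
    ordered = {
        rid: sorted(set(toks), key=lambda t: (freq[t], t))
        for rid, toks in records.items()
    }

    # Phase 1 (map-like): emit (prefix token, record id), grouped into an
    # inverted index keyed by token.
    index = defaultdict(list)
    for rid, toks in ordered.items():
        for tok in toks[:prefix_length(len(toks), threshold)]:
            index[tok].append(rid)

    # Phase 1 (reduce-like): records sharing a prefix token become candidates.
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(sorted(rids), 2))

    # Phase 2: verify each candidate pair on the full token sets.
    result = {}
    for a, b in sorted(candidates):
        sim = jaccard(ordered[a], ordered[b])
        if sim >= threshold:
            result[(a, b)] = sim
    return result


records = {
    "r1": ["a", "b", "c", "d"],
    "r2": ["a", "b", "c", "e"],
    "r3": ["x", "y", "z"],
}
print(similarity_join(records, 0.5))  # {('r1', 'r2'): 0.6}
```

Here `r3` shares no prefix token with the other records, so it is never compared in full. According to the abstract, the thesis goes further by accumulating the common elements found during filtering so that the second job only needs to examine the latter part of each record; that refinement is not reproduced in this sketch.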
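Chinese Abstract
English Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Preface
1.2 Motivation and Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Joins
2.2 Set-Similarity Joins
2.3 Similarity Measures
2.4 Prefix Filtering
2.5 Inverted Index
2.6 The MapReduce Framework
2.7 Related Work
2.7.1 RF Similarity Comparison
Chapter 3 Proposed Method
3.1 Problem Definition
3.2 Workflow and Shortcomings of the RF Comparison Method
3.3 The Prefix Accumulating Algorithm
3.4 MapReduce Implementation Workflow
3.4.1 First MapReduce Job: Finding Candidate Pairs
3.4.2 Second MapReduce Job: Computing Similarity
Chapter 4 Experiments
4.1 Introduction to Counting Comparisons
4.1.1 Counting Comparisons for the RF Comparison Method
4.1.2 Counting Comparisons for the Prefix Accumulating Algorithm
4.2 Speed Evaluation
Chapter 5 Conclusions and Future Work
References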
