跳到主要內容

臺灣博碩士論文加值系統

(216.73.216.152) 您好!臺灣時間:2025/11/02 07:39
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

我願授權國圖
: 
twitterline
研究生:駱家淮
研究生(外文):Chia-Huai Lo
論文名稱:基於運用映射歸納框架建立次世代定序資料之後綴陣列與最長共同前綴陣列
論文名稱(外文):Constructing Suffix Array and Longest-Common-Prefix Array for Next-Generation-Sequencing Data Using MapReduce Framework
指導教授:李德財李德財引用關係
口試委員:何建明賴飛羆
口試日期:2015-08-18
學位類別:碩士
校院名稱:國立臺灣大學
系所名稱:資訊工程學研究所
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2015
畢業學年度:103
語文別:英文
論文頁數:28
中文關鍵詞:次世代定序資料分析後綴陣列最長共同前綴陣列BWT 陣列de novo 基因組建
外文關鍵詞:NGS data analysissuffix arrayLCP arrayBWT arrayde novo genome assembly
相關次數:
  • 被引用被引用:0
  • 點閱點閱:466
  • 評分評分:
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:0
Next-generation sequencing (NGS) data is rapidly growing and represents a source of varieties of new knowledge in science. State-of-the-art sequencers, such as HiSeq 2500, can generate up to 1 trillion base-pairs of sequencing data in 6 days, with good quality at low cost. In genome sequencing projects today, the NGS data size often ranges from tens of billions base-pairs to several hundreds of billions base-pairs. It is time-consuming to process such a big set of NGS data, especially for applications based on sequence alignment, e.g., de novo genome assembly and correction of sequencing errors.
In literature, suffix array, longest common prefix (LCP) array and Burrows-Wheeler Transform (BWT) have been proved to be efficient indexes to speed up manifold sequence alignment tasks. For example, the all-pairs suffix-prefix matching problem, i.e., finding overlaps of reads to form the overlap graph for sequence assembly, can be solved in linear time by reading these arrays. However, constructing those arrays for NGS data remains challenging due to the huge amount of storage required to hold the suffix array. MapReduce is a promising alternative to tackle the NGS challenge, but the existing MapReduce method of suffix array construction, i.e., RPGI proposed by Menon et al [1] can only deal with input strings of size no greater than 4G base pairs and does not give LCPs in its output.
In the study, we developed a MapReduce algorithm to construct suffix and BWT arrays, as well as LCP array, for NGS data based on the framework of RPGI. In addition, the proposed method supports inputs with more than 4G base-pairs and is developed into new software. To evaluate its performance, we compare the time it takes to process subsets of the giant grouper NGS data set of size 125Gbp.


1 Introduction 1
1.1 Motivation 1
1.2 Thesis organization 2
2 Background 3
2.1 Suffix array, BWT and LCP array 3
2.2 De novo genome assembly of large NGS data based on suffix array 4
2.3 MapReduce framework 6
2.4 Existing methods of MapReduce suffix array construction 8
3 Developing a MapReduce algorithm to construct suffix array and LCP array for NGS data 12
3.1 Toward a scalable suffix array construction algorithm 12
3.2.2 Reducing memory usage with reducer number tuning 15
3.3 Embedded LCP array construction algorithm 16
4 Experiments 21
4.1 Datasets 21
4.2 Results 21
5 Discussion and conclusions 24
5.1 Discussion 24
5.1.1 Replacing memory mapping 25
5.2 Conclusions 26

Bibliography…………………………………………………………………………....30


[1] Menon, R. K., Bhat, G. P., & Schatz, M. C. (2011, June). Rapid parallel genome indexing with MapReduce. In Proceedings of the second international workshop on MapReduce and its applications (pp. 51-58). ACM.
[2] Wikipedia. DNA_sequencing: Next-generation methods.
https://en.wikipedia.org/wiki/DNA_sequencing#Next-generation_methods
[3] Mount, D. W., & Mount, D. W. (2001). Bioinformatics: sequence and genome analysis (Vol. 2). New York:: Cold spring harbor laboratory press.
[4] Abouelhoda, M. I., Kurtz, S., & Ohlebusch, E. (2004). Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1), 53-86.
[5] Manber, U., & Myers, G. (1993). Suffix arrays: a new method for on-line string searches. siam Journal on Computing, 22(5), 935-948.
[6] Burrows, M., & Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm.
[7] Illumina: HiSeq 2500 Scientific Data
http://www.illumina.com/systems/hiseq_2500_1500/scientific_data.html
[8] Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.
[9] Chang, Y. J., Chen, C. C., Chen, C. L., & Ho, J. M. (2012). A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC genomics, 13(Suppl 7), S28.


QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top
無相關論文