National Digital Library of Theses and Dissertations in Taiwan

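Graduate student: 駱家淮
Graduate student (English): Chia-Huai Lo
Thesis title: 基於運用映射歸納框架建立次世代定序資料之後綴陣列與最長共同前綴陣列
Thesis title (English): Constructing Suffix Array and Longest-Common-Prefix Array for Next-Generation-Sequencing Data Using MapReduce Framework
Advisor: 李德財
Oral defense committee: 何建明、賴飛羆
Oral defense date: 2015-08-18
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2015
Graduation academic year: 103
Language: English
Keywords (Chinese): 次世代定序資料分析、後綴陣列、最長共同前綴陣列、BWT 陣列、de novo 基因組建
Keywords (English): NGS data analysis, suffix array, LCP array, BWT array, de novo genome assembly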


Next-generation sequencing (NGS) data are growing rapidly and are a source of a wide variety of new scientific knowledge. State-of-the-art sequencers, such as the HiSeq 2500, can generate up to 1 trillion base pairs of sequencing data in 6 days, with good quality and at low cost. In today's genome sequencing projects, the size of the NGS data often ranges from tens of billions to several hundred billion base pairs. Processing such a large NGS data set is time-consuming, especially for applications based on sequence alignment, e.g., de novo genome assembly and correction of sequencing errors.
In the literature, the suffix array, the longest common prefix (LCP) array and the Burrows-Wheeler Transform (BWT) have been shown to be efficient indexes that speed up many sequence alignment tasks. For example, the all-pairs suffix-prefix matching problem, i.e., finding overlaps between reads to form the overlap graph for sequence assembly, can be solved in linear time by reading these arrays. However, constructing these arrays for NGS data remains challenging because of the huge amount of storage required to hold the suffix array. MapReduce is a promising alternative for tackling the NGS challenge, but the existing MapReduce method for suffix array construction, i.e., RPGI, proposed by Menon et al. [1], can only handle input strings of at most 4G base pairs and does not output the LCP array.
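For readers unfamiliar with the three indexes named above, the following minimal Python sketch builds them for one short string. It is an illustration only, not the thesis's method: the construction is a naive in-memory sort, the LCP array uses the well-known linear-time scan of Kasai et al., and the function names and the example string `GATTACA$` (with `$` as an assumed end-of-string sentinel that sorts before every base) are chosen here for demonstration.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction (sorts full suffix strings); suitable for tiny inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Return lcp, where lcp[k] is the length of the longest common prefix of
    the suffixes starting at sa[k-1] and sa[k], and lcp[0] is 0 (Kasai-style scan).
    """
    n = len(text)
    rank = [0] * n                      # rank[i] = position of suffix i in sa
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = [0] * n
    h = 0                               # current matched length, reused across steps
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]         # suffix that precedes suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1                  # the next suffix shares at least h-1 characters
        else:
            h = 0
    return lcp


def bwt(text, sa):
    """Return the BWT: the character preceding each sorted suffix (cyclically)."""
    return "".join(text[i - 1] for i in sa)


text = "GATTACA$"                       # '$' is an assumed sentinel, smaller than A, C, G, T
sa = suffix_array(text)
print(sa)                               # [7, 6, 4, 1, 5, 0, 3, 2]
print(lcp_array(text, sa))              # [0, 0, 1, 1, 0, 0, 0, 1]
print(bwt(text, sa))                    # ACTGA$TA
```

The suffix array stores one integer per input character, which is why its storage cost dominates for inputs of billions of base pairs.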
In this study, we developed a MapReduce algorithm, based on the RPGI framework, that constructs the suffix array and BWT array, as well as the LCP array, for NGS data. In addition, the proposed method supports inputs of more than 4G base pairs and has been developed into new software. To evaluate its performance, we compare the time it takes to process subsets of the giant grouper NGS data set of size 125 Gbp.
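The general idea of sorting suffixes with a map, shuffle and reduce pipeline can be sketched in a few lines of single-process Python. This is a toy illustration under stated assumptions, not the RPGI algorithm and not the method developed in the thesis: the function names, the fixed-length prefix key `k`, and the `$` read terminator are all hypothetical choices made here. Each suffix is keyed by its first `k` characters, suffixes with the same key are grouped into a bucket, each bucket is sorted independently, and the buckets are concatenated in key order.

```python
from collections import defaultdict


def map_phase(reads, k):
    """Emit (key, (read_id, offset)) for every suffix of every read.

    The key is the first k characters of the suffix (shorter near the end).
    """
    for read_id, read in enumerate(reads):
        text = read + "$"               # assumed terminator, sorts before A, C, G, T
        for offset in range(len(text)):
            yield text[offset:offset + k], (read_id, offset)


def shuffle(pairs):
    """Group emitted values by key, as the framework's shuffle step would."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return buckets


def reduce_phase(reads, bucket):
    """Sort one bucket of suffixes; ties are broken by (read_id, offset)."""
    def suffix(entry):
        read_id, offset = entry
        return reads[read_id][offset:] + "$"
    return sorted(bucket, key=lambda entry: (suffix(entry), entry))


def mapreduce_suffix_array(reads, k=2):
    """Return all (read_id, offset) pairs in lexicographic order of their suffixes."""
    buckets = shuffle(map_phase(reads, k))
    result = []
    # Truncating to a prefix preserves lexicographic order, so concatenating
    # the sorted buckets in key order yields a globally sorted list.
    for key in sorted(buckets):
        result.extend(reduce_phase(reads, buckets[key]))
    return result


reads = ["ACG", "CGA"]
print(mapreduce_suffix_array(reads))
# [(0, 3), (1, 3), (1, 2), (0, 0), (0, 1), (1, 0), (0, 2), (1, 1)]
```

In a real cluster each bucket would be handled by a separate reducer, so the memory needed per machine depends on bucket size rather than on the total input size; the sketch only mimics that division of work inside one process.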

1 Introduction
1.1 Motivation
1.2 Thesis organization
2 Background
2.1 Suffix array, BWT and LCP array
2.2 De novo genome assembly of large NGS data based on suffix array
2.3 MapReduce framework
2.4 Existing methods of MapReduce suffix array construction
3 Developing a MapReduce algorithm to construct suffix array and LCP array for NGS data
3.1 Toward a scalable suffix array construction algorithm
3.2.2 Reducing memory usage with reducer number tuning
3.3 Embedded LCP array construction algorithm
4 Experiments
4.1 Datasets
4.2 Results
5 Discussion and conclusions
5.1 Discussion
5.1.1 Replacing memory mapping
5.2 Conclusions

Bibliography

[1] Menon, R. K., Bhat, G. P., & Schatz, M. C. (2011, June). Rapid parallel genome indexing with MapReduce. In Proceedings of the Second International Workshop on MapReduce and Its Applications (pp. 51-58). ACM.
[2] Wikipedia. DNA sequencing: Next-generation methods. https://en.wikipedia.org/wiki/DNA_sequencing#Next-generation_methods
[3] Mount, D. W. (2001). Bioinformatics: Sequence and genome analysis (Vol. 2). New York: Cold Spring Harbor Laboratory Press.
[4] Abouelhoda, M. I., Kurtz, S., & Ohlebusch, E. (2004). Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1), 53-86.
[5] Manber, U., & Myers, G. (1993). Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5), 935-948.
[6] Burrows, M., & Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm.
[7] Illumina. HiSeq 2500 Scientific Data. http://www.illumina.com/systems/hiseq_2500_1500/scientific_data.html
[8] Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.
[9] Chang, Y. J., Chen, C. C., Chen, C. L., & Ho, J. M. (2012). A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC Genomics, 13(Suppl 7), S28.
