臺灣博碩士論文加值系統

English |FB 專頁 |Mobile

免費會員登入| 註冊

功能切換導覽列

(216.73.216.152) 您好！臺灣時間：2025/11/02 07:39

字體大小：

:::

詳目顯示

第 1 筆 / 共 1 筆

/1頁

論文基本資料
摘要
外文摘要
目次
參考文獻
紙本論文
QR Code

本論文永久網址:

研究生:

鐘緯駿

研究生(外文):

Wei-Chun Chung

論文名稱:

基於 MapReduce 巨量資料框架之次世代定序錯誤校正演算法

論文名稱(外文):

Algorithms for Correcting Next-Generation Sequencing Errors Based on MapReduce Big Data Framework

指導教授:

李德財

指導教授(外文):

Der-Tsai Lee

口試日期:

2017-06-12

學位類別:

博士

校院名稱:

國立臺灣大學

系所名稱:

資訊工程學研究所

學門:

工程學門

學類:

電資工程學類

論文種類:

學術論文

論文出版年:

2017

畢業學年度:

105

語文別:

英文

論文頁數:

中文關鍵詞:

de novo 基因組裝、MapReduce、巨量資料、次世代定序、錯誤校正

外文關鍵詞:

big data、de novo genome assembly、error correction、next-generation sequencing、MapReduce

相關次數:

被引用:0
點閱:179
評分:
下載:0
書目收藏:0

次世代定序(next-generation sequencing; NGS)技術的快速發展造就超大規模資料的爆炸性增長與 de novo 基因組裝(de novo genome assembly)的各式計算問題。較深的定序深度與越來越長的基因序列(read)隱含更多的序列錯誤，並增加錯誤組裝的可能性。巨大的資料量除了造成極高的硬碟讀寫負載，更會拖慢運算速度並使執行時間無法被準確預估。為了在不影響組裝品質下加快耗時的組裝過程並解決錯誤校正(error correction)的相關問題，我們著眼於使用雲端運算(cloud computing)改進可用於次世代定序資料 de novo 基因組裝的演算法設計、架構設計以及實做。

定序資料內含的錯誤使得組裝品質降低，並產生破碎的片段重疊群(contigs)。為此我們提出一套基於雲端運算的錯誤校正演算法，並參考 ALLPATHS-LG 的錯誤校正設計，使其能保守地進行錯誤校正以避免誤判。為了達成以減少磁碟讀寫的大量負載來加快執行時間的目的，我們提出名為「序列-訊息對應圖」(read-message diagram)的訊息控制策略，用以呈現計算過程中需由基因序列生成的中介資料結構。我們同時開發多種調控模式以縮減中介資料量，進而減少磁碟讀寫的操作數量。

我們已將提出的錯誤校正演算法實作於 MapReduce 雲端計算框架上，並以最先進的工具進行效能評估。我們提出的訊息控制策略也成功減少中介資料量並加快執行速度。至此，我們不僅顯著地減少組裝流程所需的時間，更提高了組裝的品質。

本論文提出並實作用以加快 de novo 基因組裝以及提高組裝品質的演算法與其架構設計。這些研究成果對於轉錄體學(transcriptomics)、總體基因體學(metagenomics)、藥物基因體學(pharmacogenomics)以及精準醫學(precision medicine)等可利用次世代定序巨量資料(NGS big data)進一步發展相關應用的生物資訊領域極具參考價值。

The rapid advancement of next-generation sequencing (NGS) technology has generated an explosive growth of ultra-large-scale data and computational problems, particularly in de novo genome assembly. Greater sequencing depths and increasingly longer reads have introduced numerous errors, which increase the probability of misassembly. The huge amounts of data cause severely high disk I/O overhead and lead to an unexpectedly long execution time. To speed up the time-consuming assembly processes without affecting its quality and to address problems pertaining to error correction, we focus on improving algorithm design, architecture design, and implementation of NGS de novo genome assembly based on cloud computing.

Errors in sequencing data result in fragmented contigs, which lead to an assembly of poor quality. We therefore propose an error correction algorithm based on cloud computing. The algorithm emulates the design of error correction algorithm of ALLPATHS-LG, and is designed to correct errors conservatively to avoid false decisions. To speed up execution time by reducing the massive disk I/O overhead, we introduce a message control strategy, the read-message (RM) diagram, to represent structure of the intermediate data generated along with each read. Then, we develop various schemes to trim off portions of the RM diagram to shrink the size of the intermediate data and thereby reduce the number of disk I/O operations.

We have implemented the proposed algorithms on the MapReduce cloud computing framework and evaluated them using state-of-the-art tools. The RM method reduces the intermediate data size and speeds up execution. Our proposed algorithms have improved not only the execution time of the pipeline dramatically, but also the quality of assembly.

This dissertation presents algorithms and architectural designs that speed up execution time and improve the quality of de novo genome assembly. These studies are valuable for further development of NGS big data applications for bioinformatics, including transcriptomics, metagenomics, pharmacogenomics, and precision medicine.

1 Introduction 1
1.1 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Dissertation organization . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Related works 5
2.1 ReadStack algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 MapReduce framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Error correction in NGS big data 11
3.1 The CloudEC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 Design of CloudEC . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 Pinch Corrector algorithm . . . . . . . . . . . . . . . . . . . . . 12
3.1.3 Spread Corrector algorithm . . . . . . . . . . . . . . . . . . . . . 16
3.2 The read-message (RM) diagram . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Flooded messages . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Design of RM diagram . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.3 RM schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 The reference-based benchmark framework . . . . . . . . . . . . . . . . 27
3.3.1 Benchmark generator . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 Two-phase evaluation approach . . . . . . . . . . . . . . . . . . 28
4 Experiments 31
4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.1 CloudEC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.2 CloudEC with RM schemes . . . . . . . . . . . . . . . . . . . . 35
4.3.3 Scalability testing . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Conclusions and future work 45
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Bibliography 47

[1] C. Alkan, S. Sajjadian, and E. E. Eichler. Limitations of next-generation genome sequence assembly. Nature methods, 8(1):61–65, 2011.
[2] E. Anderson and J. Tucek. Efficiency matters! ACM SIGOPS Operating Systems Review, 44(1):40–45, 2010.
[3] W. J. Ansorge. Next-generation DNA sequencing techniques. New biotechnology, 25(4):195–203, 2009.
[4] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al. A view of cloud computing. Communications of the ACM, 53(4):50–58, 2010.
[5] G. Berg, C. Zachow, H. Müller, J. Philipps, and R. Tilcher. Next-generation bio-products sowing the seeds of success for sustainable agriculture. Agronomy, 3(4): 648–656, 2013.
[6] Y.-J. Chang, C.-C. Chen, C.-L. Chen, and J.-M. Ho. A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC genomics, 13(7):S28, 2012.
[7] K. Chaturvedi and L. Sahijram. Plant molecular biology applications in horticulture: An overview. In Plant Biology and Biotechnology, pages 113–129. Springer, 2015.
[8] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD skills: new analysis practices for big data. Proceedings of the VLDB Endowment, 2(2):1481–1492, 2009.
[9] F. F. Costa. Big data in biomedicine. Drug discovery today, 19(4):433–440, 2014.
[10] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[11] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: a runtime for iterative MapReduce. In Proceedings of the 19th ACM international symposium on high performance distributed computing, pages 810–818. ACM, 2010.
[12] T. C. Glenn. Field guide to next-generation DNA sequencers. Molecular ecology resources, 11(5):759–769, 2011.
[13] S. Gnerre, I. MacCallum, D. Przybylski, F. J. Ribeiro, J. N. Burton, B. J. Walker, T. Sharpe, G. Hall, T. P. Shea, S. Sykes, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108(4):1513–1518, 2011.
[14] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A self-tuning system for big data analytics. In Cidr, volume 11, pages 261–272, 2011.
[15] W. Huang, L. Li, J. R. Myers, and G. T. Marth. ART: a next-generation sequencing read simulator. Bioinformatics, 28(4):593–594, 2012.
[16] L. Ilie and M. Molnar. RACER: rapid and accurate correction of errors in reads. Bioinformatics, 29(19):2490–2493, 2013.
[17] L. Ilie, F. Fazayeli, and S. Ilie. HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics, 27(3):295–302, 2011.
[18] E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for MapReduce programs. Proceedings of the VLDB Endowment, 4(6):385–396, 2011.
[19] L. Jourdren, M. Bernard, M.-A. Dillies, and S. Le Crom. Eoulsan: a cloud computing-based framework facilitating high throughput sequencing analyses. Bioinformatics, 28(11):1542–1543, 2012.
[20] W.-C. Kao, A. H. Chan, and Y. S. Song. ECHO: a reference-free short-read error correction algorithm. Genome research, 21(7):1181–1192, 2011.
[21] D. R. Kelley, M. C. Schatz, and S. L. Salzberg. Quake: quality-aware detection and correction of sequencing errors. Genome biology, 11(11):R116, 2010.
[22] F. Krueger. Trim Galore!: A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, 2015.
[23] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg. Searching for SNPs with cloud computing. Genome biology, 10(11):R134, 2009.
[24] B. Langmead, K. D. Hansen, and J. T. Leek. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome biology, 11(8):R83, 2010.
[25] S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in MapReduce. In Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures, pages 85–94. ACM, 2011.
[26] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon. Parallel data processing with MapReduce: a survey. AcM sIGMoD Record, 40(4):11–20, 2012.
[27] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[28] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078–2079, 2009.
[29] L. Liu, Y. Li, S. Li, N. Hu, Y. He, R. Pong, D. Lin, L. Lu, and M. Law. Comparison of next-generation sequencing systems. BioMed Research International, 2012, 2012.
[30] Y. Liu, J. Schröder, and B. Schmidt. Musket: a multistage k-mer spectrum-based error corrector for illumina sequence data. Bioinformatics, 29(3):308–315, 2013.
[31] R. Luo, B. Liu, Y. Xie, Z. Li, W. Huang, J. Yuan, G. He, Y. Chen, Q. Pan, Y. Liu, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience, 1(1):18, 2012.
[32] P. Medvedev, E. Scott, B. Kakaradov, and P. Pevzner. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics, 27(13): i137–i141, 2011.
[33] P. Mell, T. Grance, et al. The NIST definition of cloud computing. 2011.
[34] M. L. Metzker. Sequencing technologies–the next generation. Nature reviews. Genetics, 11(1):31, 2010.
[35] N. Nagarajan and M. Pop. Sequence assembly demystified. Nature reviews. Genetics, 14(3):157, 2013.
[36] H. Nordberg, K. Bhatia, K. Wang, and Z. Wang. BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics, 29(23):3014–3019, 2013.
[37] A. O’Driscoll, J. Daugelaite, and R. D. Sleator. ‘big data’, Hadoop and cloud computing in genomics. Journal of biomedical informatics, 46(5):774–781, 2013.
[38] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 165–178. ACM, 2009.
[39] M. Pop. Genome assembly reborn: recent computational challenges. Briefings in bioinformatics, 10(4):354–366, 2009.
[40] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, pages 13–24. IEEE, 2007.
[41] K. Robasky, N. E. Lewis, and G. M. Church. The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics, 15(1):56–62, 2014.
[42] D. H. Roukos and C.-S. Ku. Clinical cancer genome and precision medicine. Annals of surgical oncology, 19(12):3646–3650, 2012.
[43] L. Salmela and J. Schröder. Correcting errors in short reads by multiple alignments. Bioinformatics, 27(11):1455–1461, 2011.
[44] S. L. Salzberg, A. M. Phillippy, A. Zimin, D. Puiu, T. Magoc, S. Koren, T. J. Treangen, M. C. Schatz, A. L. Delcher, M. Roberts, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome research, 22(3):557–567, 2012.
[45] M. C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363–1369, 2009.
[46] M. C. Schatz and B. Langmead. The DNA data deluge. IEEE Spectrum, 50(7): 28–33, 2013.
[47] M. C. Schatz, B. Langmead, and S. L. Salzberg. Cloud computing and the DNA data race. Nature biotechnology, 28(7):691–693, 2010.
[48] J. Schröder, H. Schröder, S. J. Puglisi, R. Sinha, and B. Schmidt. SHREC: a short-read error correction method. Bioinformatics, 25(17):2157–2163, 2009.
[49] A. Schumacher, L. Pireddu, M. Niemenmaa, A. Kallio, E. Korpelainen, G. Zanetti, and K. Heljanko. SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics, 30(1):119–120, 2014.
[50] D. Sims, I. Sudbery, N. E. Ilott, A. Heger, and C. P. Ponting. Sequencing depth and coverage: key considerations in genomic analyses. Nature Reviews Genetics, 15(2): 121–132, 2014.
[51] L. D. Stein. The case for cloud computing in genome informatics. Genome biology, 11(5):207, 2010.
[52] R. C. Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC bioinformatics, 11(12):S1, 2010.
[53] T. J. Treangen and S. L. Salzberg. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature reviews. Genetics, 13(1):36, 2011.
[54] J. D. Ullman. Designing good MapReduce algorithms. XRDS: Crossroads, The ACM Magazine for Students, 19(1):30–34, 2012.
[55] G. Wang, A. R. Butt, P. Pandey, and K. Gupta. A simulation approach to evaluating design decisions in MapReduce setups. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009. MASCOTS’09. IEEE International Symposium on, pages 1–11. IEEE, 2009.
[56] X. Yang, K. S. Dorman, and S. Aluru. Reptile: representative tiling for short read error correction. Bioinformatics, 26(20):2526–2533, 2010.
[57] X. Yang, S. P. Chockalingam, and S. Aluru. A survey of error-correction methods for next-generation sequencing. Briefings in bioinformatics, 14(1):56–66, 2013.
[58] D. R. Zerbino and E. Birney. Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome research, 18(5):821–829, 2008.
[59] Q. Zhang, L. Cheng, and R. Boutaba. Cloud computing: state-of-the-art and research challenges. Journal of internet services and applications, 1(1):7–18, 2010.
[60] Q. Zou, X.-B. Li, W.-R. Jiang, Z.-Y. Lin, G.-L. Li, and K. Chen. Survey of MapReduce frame operation in bioinformatics. Briefings in bioinformatics, 15(4):637–647, 2013.

國圖紙本論文

推文
網路書籤
推薦
評分
引用網址
轉寄

top

相關論文
相關期刊
熱門點閱論文

1.	基於Hadoop 框架建立巨量資料分析處理模型研究
2.	雲端醫療紀錄之巨量資料存取與處理平台建置
3.	海量資料時代發展與未來應用趨勢之研究
4.	使用Hadoop技術建立巨量資料分析處理模型：以空氣汙染資料為例
5.	改善MapReduce效能之區域性感知排程
6.	一個運用MapReduce框架之頻繁項目集探勘演算法
7.	Apache Spark 運用於虛擬化技術之效益研究
8.	霧邊界運算平台運用於雲端大數據分析
9.	基於天氣之車速預測系統─使用大數據分析
10.	巨量資料處理系統之分析與比較
11.	結合MapReduce與模糊聚類作為大數據的資料分析
12.	運用雲端運算於巨量資料之頻繁項目集探勘
13.	基於Hadoop之網際網路訊務萃取器設計與實作
14.	基於MapReduce計算架構的巨量資料高頻項目探勘演算法
15.	巨量醫療資料之時序事件追蹤與分析

無相關期刊

1.	基於阿貝原則分析旋轉平台角度定位誤差之方法與補償
2.	大動態範圍與高精準度光波前量測方法之探討
3.	鈣網蛋白突變在骨髓增生性腫瘤的角色
4.	有容量的點覆蓋問題在放寬條件下之可近似性
5.	構建水資源、能源與糧食交織系統風險評估方法-以金門地區為例
6.	小型四旋翼無人機應用於水稻作業之可行性評估
7.	商譽減損會計準則之探討
8.	新創事業的策略發展討論-以A公司為例
9.	網路與實體購物的服務品質對顧客滿意與忠誠的影響：以服飾業為例。
10.	以交易紀錄與商品屬性的BPR個人化推薦應用-以一電商為例
11.	益達胺對於蜜蜂複眼視覺光感度之影響
12.	液滴在具結構表面的濕潤行為與親疏水表面接觸角定量定義探討
13.	人口老化下的居家醫療照護整合網絡—以嘉義地區為例
14.	由創投業者與新創團隊視角剖析創業失敗與成功要素
15.	袋裝預拌式乾粉水泥砂漿使用現況成本效益與推廣策略之研究

簡易查詢 | 進階查詢 | 熱門排行 | 我的研究室