跳到主要內容

臺灣博碩士論文加值系統

(216.73.216.152) 您好!臺灣時間:2025/11/02 07:39
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

我願授權國圖
: 
twitterline
研究生:鐘緯駿
研究生(外文):Wei-Chun Chung
論文名稱:基於 MapReduce 巨量資料框架之次世代定序錯誤校正演算法
論文名稱(外文):Algorithms for Correcting Next-Generation Sequencing Errors Based on MapReduce Big Data Framework
指導教授:李德財李德財引用關係
指導教授(外文):Der-Tsai Lee
口試日期:2017-06-12
學位類別:博士
校院名稱:國立臺灣大學
系所名稱:資訊工程學研究所
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2017
畢業學年度:105
語文別:英文
論文頁數:52
中文關鍵詞:de novo 基因組裝MapReduce巨量資料次世代定序錯誤校正
外文關鍵詞:big datade novo genome assemblyerror correctionnext-generation sequencingMapReduce
相關次數:
  • 被引用被引用:0
  • 點閱點閱:179
  • 評分評分:
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:0
次世代定序(next-generation sequencing; NGS)技術的快速發展造就超大規模資料的爆炸性增長與 de novo 基因組裝(de novo genome assembly)的各式計算問題。較深的定序深度與越來越長的基因序列(read)隱含更多的序列錯誤,並增加錯誤組裝的可能性。巨大的資料量除了造成極高的硬碟讀寫負載,更會拖慢運算速度並使執行時間無法被準確預估。為了在不影響組裝品質下加快耗時的組裝過程並解決錯誤校正(error correction)的相關問題,我們著眼於使用雲端運算(cloud computing)改進可用於次世代定序資料 de novo 基因組裝的演算法設計、架構設計以及實做。

定序資料內含的錯誤使得組裝品質降低,並產生破碎的片段重疊群(contigs)。為此我們提出一套基於雲端運算的錯誤校正演算法,並參考 ALLPATHS-LG 的錯誤校正設計,使其能保守地進行錯誤校正以避免誤判。為了達成以減少磁碟讀寫的大量負載來加快執行時間的目的,我們提出名為「序列-訊息對應圖」(read-message diagram)的訊息控制策略,用以呈現計算過程中需由基因序列生成的中介資料結構。我們同時開發多種調控模式以縮減中介資料量,進而減少磁碟讀寫的操作數量。

我們已將提出的錯誤校正演算法實作於 MapReduce 雲端計算框架上,並以最先進的工具進行效能評估。我們提出的訊息控制策略也成功減少中介資料量並加快執行速度。至此,我們不僅顯著地減少組裝流程所需的時間,更提高了組裝的品質。

本論文提出並實作用以加快 de novo 基因組裝以及提高組裝品質的演算法與其架構設計。這些研究成果對於轉錄體學(transcriptomics)、總體基因體學(metagenomics)、藥物基因體學(pharmacogenomics)以及精準醫學(precision medicine)等可利用次世代定序巨量資料(NGS big data)進一步發展相關應用的生物資訊領域極具參考價值。
The rapid advancement of next-generation sequencing (NGS) technology has generated an explosive growth of ultra-large-scale data and computational problems, particularly in de novo genome assembly. Greater sequencing depths and increasingly longer reads have introduced numerous errors, which increase the probability of misassembly. The huge amounts of data cause severely high disk I/O overhead and lead to an unexpectedly long execution time. To speed up the time-consuming assembly processes without affecting its quality and to address problems pertaining to error correction, we focus on improving algorithm design, architecture design, and implementation of NGS de novo genome assembly based on cloud computing.

Errors in sequencing data result in fragmented contigs, which lead to an assembly of poor quality. We therefore propose an error correction algorithm based on cloud computing. The algorithm emulates the design of error correction algorithm of ALLPATHS-LG, and is designed to correct errors conservatively to avoid false decisions. To speed up execution time by reducing the massive disk I/O overhead, we introduce a message control strategy, the read-message (RM) diagram, to represent structure of the intermediate data generated along with each read. Then, we develop various schemes to trim off portions of the RM diagram to shrink the size of the intermediate data and thereby reduce the number of disk I/O operations.

We have implemented the proposed algorithms on the MapReduce cloud computing framework and evaluated them using state-of-the-art tools. The RM method reduces the intermediate data size and speeds up execution. Our proposed algorithms have improved not only the execution time of the pipeline dramatically, but also the quality of assembly.

This dissertation presents algorithms and architectural designs that speed up execution time and improve the quality of de novo genome assembly. These studies are valuable for further development of NGS big data applications for bioinformatics, including transcriptomics, metagenomics, pharmacogenomics, and precision medicine.
1 Introduction 1
1.1 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Dissertation organization . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Related works 5
2.1 ReadStack algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 MapReduce framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Error correction in NGS big data 11
3.1 The CloudEC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 Design of CloudEC . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 Pinch Corrector algorithm . . . . . . . . . . . . . . . . . . . . . 12
3.1.3 Spread Corrector algorithm . . . . . . . . . . . . . . . . . . . . . 16
3.2 The read-message (RM) diagram . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Flooded messages . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Design of RM diagram . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.3 RM schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 The reference-based benchmark framework . . . . . . . . . . . . . . . . 27
3.3.1 Benchmark generator . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 Two-phase evaluation approach . . . . . . . . . . . . . . . . . . 28
4 Experiments 31
4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.1 CloudEC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.2 CloudEC with RM schemes . . . . . . . . . . . . . . . . . . . . 35
4.3.3 Scalability testing . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Conclusions and future work 45
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Bibliography 47
[1] C. Alkan, S. Sajjadian, and E. E. Eichler. Limitations of next-generation genome sequence assembly. Nature methods, 8(1):61–65, 2011.
[2] E. Anderson and J. Tucek. Efficiency matters! ACM SIGOPS Operating Systems Review, 44(1):40–45, 2010.
[3] W. J. Ansorge. Next-generation DNA sequencing techniques. New biotechnology, 25(4):195–203, 2009.
[4] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al. A view of cloud computing. Communications of the ACM, 53(4):50–58, 2010.
[5] G. Berg, C. Zachow, H. Müller, J. Philipps, and R. Tilcher. Next-generation bio-products sowing the seeds of success for sustainable agriculture. Agronomy, 3(4): 648–656, 2013.
[6] Y.-J. Chang, C.-C. Chen, C.-L. Chen, and J.-M. Ho. A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC genomics, 13(7):S28, 2012.
[7] K. Chaturvedi and L. Sahijram. Plant molecular biology applications in horticulture: An overview. In Plant Biology and Biotechnology, pages 113–129. Springer, 2015.
[8] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD skills: new analysis practices for big data. Proceedings of the VLDB Endowment, 2(2):1481–1492, 2009.
[9] F. F. Costa. Big data in biomedicine. Drug discovery today, 19(4):433–440, 2014.
[10] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[11] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: a runtime for iterative MapReduce. In Proceedings of the 19th ACM international symposium on high performance distributed computing, pages 810–818. ACM, 2010.
[12] T. C. Glenn. Field guide to next-generation DNA sequencers. Molecular ecology resources, 11(5):759–769, 2011.
[13] S. Gnerre, I. MacCallum, D. Przybylski, F. J. Ribeiro, J. N. Burton, B. J. Walker, T. Sharpe, G. Hall, T. P. Shea, S. Sykes, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108(4):1513–1518, 2011.
[14] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A self-tuning system for big data analytics. In Cidr, volume 11, pages 261–272, 2011.
[15] W. Huang, L. Li, J. R. Myers, and G. T. Marth. ART: a next-generation sequencing read simulator. Bioinformatics, 28(4):593–594, 2012.
[16] L. Ilie and M. Molnar. RACER: rapid and accurate correction of errors in reads. Bioinformatics, 29(19):2490–2493, 2013.
[17] L. Ilie, F. Fazayeli, and S. Ilie. HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics, 27(3):295–302, 2011.
[18] E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for MapReduce programs. Proceedings of the VLDB Endowment, 4(6):385–396, 2011.
[19] L. Jourdren, M. Bernard, M.-A. Dillies, and S. Le Crom. Eoulsan: a cloud computing-based framework facilitating high throughput sequencing analyses. Bioinformatics, 28(11):1542–1543, 2012.
[20] W.-C. Kao, A. H. Chan, and Y. S. Song. ECHO: a reference-free short-read error correction algorithm. Genome research, 21(7):1181–1192, 2011.
[21] D. R. Kelley, M. C. Schatz, and S. L. Salzberg. Quake: quality-aware detection and correction of sequencing errors. Genome biology, 11(11):R116, 2010.
[22] F. Krueger. Trim Galore!: A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, 2015.
[23] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg. Searching for SNPs with cloud computing. Genome biology, 10(11):R134, 2009.
[24] B. Langmead, K. D. Hansen, and J. T. Leek. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome biology, 11(8):R83, 2010.
[25] S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in MapReduce. In Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures, pages 85–94. ACM, 2011.
[26] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon. Parallel data processing with MapReduce: a survey. AcM sIGMoD Record, 40(4):11–20, 2012.
[27] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[28] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078–2079, 2009.
[29] L. Liu, Y. Li, S. Li, N. Hu, Y. He, R. Pong, D. Lin, L. Lu, and M. Law. Comparison of next-generation sequencing systems. BioMed Research International, 2012, 2012.
[30] Y. Liu, J. Schröder, and B. Schmidt. Musket: a multistage k-mer spectrum-based error corrector for illumina sequence data. Bioinformatics, 29(3):308–315, 2013.
[31] R. Luo, B. Liu, Y. Xie, Z. Li, W. Huang, J. Yuan, G. He, Y. Chen, Q. Pan, Y. Liu, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience, 1(1):18, 2012.
[32] P. Medvedev, E. Scott, B. Kakaradov, and P. Pevzner. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics, 27(13): i137–i141, 2011.
[33] P. Mell, T. Grance, et al. The NIST definition of cloud computing. 2011.
[34] M. L. Metzker. Sequencing technologies–the next generation. Nature reviews. Genetics, 11(1):31, 2010.
[35] N. Nagarajan and M. Pop. Sequence assembly demystified. Nature reviews. Genetics, 14(3):157, 2013.
[36] H. Nordberg, K. Bhatia, K. Wang, and Z. Wang. BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics, 29(23):3014–3019, 2013.
[37] A. O’Driscoll, J. Daugelaite, and R. D. Sleator. ‘big data’, Hadoop and cloud computing in genomics. Journal of biomedical informatics, 46(5):774–781, 2013.
[38] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 165–178. ACM, 2009.
[39] M. Pop. Genome assembly reborn: recent computational challenges. Briefings in bioinformatics, 10(4):354–366, 2009.
[40] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, pages 13–24. IEEE, 2007.
[41] K. Robasky, N. E. Lewis, and G. M. Church. The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics, 15(1):56–62, 2014.
[42] D. H. Roukos and C.-S. Ku. Clinical cancer genome and precision medicine. Annals of surgical oncology, 19(12):3646–3650, 2012.
[43] L. Salmela and J. Schröder. Correcting errors in short reads by multiple alignments. Bioinformatics, 27(11):1455–1461, 2011.
[44] S. L. Salzberg, A. M. Phillippy, A. Zimin, D. Puiu, T. Magoc, S. Koren, T. J. Treangen, M. C. Schatz, A. L. Delcher, M. Roberts, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome research, 22(3):557–567, 2012.
[45] M. C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363–1369, 2009.
[46] M. C. Schatz and B. Langmead. The DNA data deluge. IEEE Spectrum, 50(7): 28–33, 2013.
[47] M. C. Schatz, B. Langmead, and S. L. Salzberg. Cloud computing and the DNA data race. Nature biotechnology, 28(7):691–693, 2010.
[48] J. Schröder, H. Schröder, S. J. Puglisi, R. Sinha, and B. Schmidt. SHREC: a short-read error correction method. Bioinformatics, 25(17):2157–2163, 2009.
[49] A. Schumacher, L. Pireddu, M. Niemenmaa, A. Kallio, E. Korpelainen, G. Zanetti, and K. Heljanko. SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics, 30(1):119–120, 2014.
[50] D. Sims, I. Sudbery, N. E. Ilott, A. Heger, and C. P. Ponting. Sequencing depth and coverage: key considerations in genomic analyses. Nature Reviews Genetics, 15(2): 121–132, 2014.
[51] L. D. Stein. The case for cloud computing in genome informatics. Genome biology, 11(5):207, 2010.
[52] R. C. Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC bioinformatics, 11(12):S1, 2010.
[53] T. J. Treangen and S. L. Salzberg. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature reviews. Genetics, 13(1):36, 2011.
[54] J. D. Ullman. Designing good MapReduce algorithms. XRDS: Crossroads, The ACM Magazine for Students, 19(1):30–34, 2012.
[55] G. Wang, A. R. Butt, P. Pandey, and K. Gupta. A simulation approach to evaluating design decisions in MapReduce setups. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009. MASCOTS’09. IEEE International Symposium on, pages 1–11. IEEE, 2009.
[56] X. Yang, K. S. Dorman, and S. Aluru. Reptile: representative tiling for short read error correction. Bioinformatics, 26(20):2526–2533, 2010.
[57] X. Yang, S. P. Chockalingam, and S. Aluru. A survey of error-correction methods for next-generation sequencing. Briefings in bioinformatics, 14(1):56–66, 2013.
[58] D. R. Zerbino and E. Birney. Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome research, 18(5):821–829, 2008.
[59] Q. Zhang, L. Cheng, and R. Boutaba. Cloud computing: state-of-the-art and research challenges. Journal of internet services and applications, 1(1):7–18, 2010.
[60] Q. Zou, X.-B. Li, W.-R. Jiang, Z.-Y. Lin, G.-L. Li, and K. Chen. Survey of MapReduce frame operation in bioinformatics. Briefings in bioinformatics, 15(4):637–647, 2013.
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top
無相關期刊