National Digital Library of Theses and Dissertations in Taiwan


Detailed record

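Author: 王亮博
Author (English): Liang-Bo Wang
Thesis title: BioCloud:線上定序分析平台
Thesis title (English): BioCloud: an online sequencing analysis platform
Advisor: 莊曜宇
Advisor (English): Eric Y. Chuang
Committee members: 盧子彬, 蔡孟勳, 賴亮全, 陳倩瑜
Committee members (English): Tzu-Pin Lu, Mong-Hsun Tsai, Liang-Chuan Lai, Chien-Yu Chen
Oral defense date: 2016-07-21
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Biomedical Electronics and Bioinformatics
Discipline: Engineering
Field: Biomedical Engineering
Thesis type: Academic thesis
Publication year: 2016
Graduation academic year: 104 (ROC calendar)
Language: English
Number of pages: 75
Keywords (Chinese): 次世代定序, 線上分析平台
Keywords (English): Next-generation sequencing, online analysis platform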
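With the advent of next-generation sequencing (NGS), it has become one of the most important data sources in genomic research. Compared with traditional methods, NGS delivers high-throughput sequences and the ability to quantify biological activity within a feasible time and budget. However, before biologically meaningful results can be obtained, an NGS analysis requires a series of command-line tools and substantial computing resources. This sets a high barrier to entry for biologists and clinicians, some of whom cannot even interpret their own sequencing data. This study therefore proposes BioCloud, an online NGS analysis platform that automates common sequencing analysis workflows and generates overview reports from the results. In addition, users can design custom workflows and extend the existing workflow implementations to support more types of sequencing and analysis methods. By analyzing NGS data on BioCloud, researchers can understand their data in a more interactive and convenient way, and the whole analysis becomes easier to reproduce.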

Since its advent, next-generation sequencing (NGS) has become one of the most important data sources in genome-wide studies. Compared with traditional methods, NGS provides high-throughput sequencing reads and the ability to quantify biological activity within a feasible time and budget. However, obtaining biologically meaningful results from NGS data requires a series of command-line tools and extensive computational resources, which imposes a high barrier for biologists and clinicians who want to perform NGS analysis or even interpret their own data. Therefore, this study proposes BioCloud, an online NGS analysis platform that automates common analysis pipelines and generates a summary report from the analysis results. Furthermore, users can design custom analysis pipelines and extend the existing implementation to support a wider set of sequencing types and analysis methods. By conducting NGS analyses on BioCloud, researchers can explore their data in a more interactive and convenient way, and the analyses become easier to reproduce.


Oral Examination Committee Certification
Acknowledgements
Abstract (Chinese)
Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Specific aims
1.3 Next-generation sequencing
1.4 Genome reference
1.5 General NGS analysis workflow
2 Related work
2.1 Commercial online analysis platforms
2.2 Open source online analysis platform
2.3 Pipeline execution tools
2.4 Analysis report generation
3 Methods
3.1 RNA-Seq pipelines
3.2 DNA-Seq pipelines
3.3 BioCloud website
3.3.1 Overview
3.3.2 Data integrity check and authentication
3.3.3 User account management
3.3.4 Data source management
3.3.5 Experiment design
3.3.6 Genome reference
3.3.7 Analysis submission
3.3.8 Job queue management
3.3.9 Report and result access control
3.4 Report generation
3.4.1 Analysis result structure and information
3.4.2 BCReport: result processing framework
3.5 Implementation
3.5.1 Website
3.5.2 Deployment
3.5.3 Report
4 Results
4.1 Datasets
4.2 Account registration and user dashboard
4.3 Data source discovery
4.4 Experiment design
4.5 Analysis design
4.6 Job queue monitoring
4.7 Summary report
4.7.1 Quality control
4.7.2 Genome alignment – STAR
4.7.3 Cuffdiff
4.7.4 Integration with external genome browsers
4.8 Admin interface
5 Discussions
5.1 Data source uploading
5.2 Supported analyses
5.3 Pipeline extension
5.4 Integration with other frameworks
6 Conclusions
Bibliography

