National Digital Library of Theses and Dissertations in Taiwan

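Author: Ping-Han Hsieh (謝秉翰)
Title: Effect of de novo transcriptome assembly on quality of read mapping and transcript quantification
Title (Chinese): 探討轉錄體序列組裝對序列回貼以及基因表現定量的影響
Advisors: Yen-Jen Oyang (歐陽彥正), Chien-Yu Chen (陳倩瑜)
Oral defense date: 2017-07-07
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Biomedical Electronics and Bioinformatics (生醫電子與資訊學研究所)
Discipline: Engineering
Field: Biomedical engineering
Type of thesis: Academic thesis
Year of publication: 2017
Graduation academic year: 105 (ROC calendar)
Language: English
Number of pages: 67
Keywords (translated from Chinese): RNA sequencing technology; reference-free transcriptome assembly; assembly errors; transcript abundance estimation; supervised machine learning
Keywords (English): RNA-Seq; de novo transcriptome assembly; wrongly-assembled contigs; quantification of transcript abundance; supervised machine learning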
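Abstract (translated from Chinese): RNA sequencing reveals how the transcriptome is expressed at different growth stages or physiological states, and thereby helps elucidate the gene regulatory pathways of an organism. Moreover, because RNA sequencing does not require a reference genome or transcriptome in advance, it is particularly suited to species that lack a well-annotated genome or have never been studied. Without reference sequences, researchers must assemble and reconstruct the transcriptome from the short sequenced RNA fragments. However, redundant or erroneous sequences produced during assembly are likely to seriously affect downstream quantification. How to computationally estimate transcript abundance correctly is therefore an important problem. This thesis evaluates how the quality of transcriptome assembly affects algorithms that quantify transcript abundance. The assembled sequences are classified into twelve categories of assembly outcome, and quantification is analyzed and compared for each category. The results show that, even though the transcripts of an organism share a great deal of similarity, this has little effect on quantification against a reference genome or transcriptome; it does, however, cause assembly errors, which in turn lead to larger quantification errors for the assembled sequences. Errors that merge several similar sequences into a single sequence have the most severe consequences. In addition, this thesis proposes a supervised learning algorithm for predicting assembly errors, which can help future researchers better understand the assembled sequences they analyze. In summary, by comparing multiple assembly and quantification algorithms, this study gives researchers a better understanding of transcriptome assembly and quantification in species without reference sequences.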
Abstract (English): Accurate quantification of transcript abundance is essential for understanding the functional products of the genome under different physiological conditions and developmental stages. High-throughput RNA sequencing (RNA-Seq) allows researchers to perform transcriptome analysis on organisms that lack a reference genome and transcriptome. For such projects, de novo transcriptome assembly must be carried out before quantification. However, the large number of fragmented contigs and redundant sequences produced by assemblers may result in unreliable abundance estimation. This study therefore first investigates how assembly quality affects read mapping and count estimation, and then proposes a classifier to characterize the assembled sequences. The experiments and analyses examine several factors that can seriously affect the accuracy of RNA-Seq analysis. First, the effects on quantification quality of twelve distinct assembly groups, together with the intrinsic similarity present in the reference transcriptome, were examined. The results showed that similar subsequences in the reference transcriptome only slightly influence mapping quality but lead to many poorly assembled contigs. Contigs that merge multiple transcripts into one reduced the reliability of abundance estimation the most. Second, a prediction algorithm was proposed to help researchers estimate quantification reliability for further analyses. In summary, the analyses conducted in this study provide valuable insights for future studies related to RNA-Seq data analysis.
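The abstract's point that contigs merging several transcripts are the most damaging to abundance estimation can be illustrated with a toy calculation. The sketch below is not the thesis's pipeline: the transcript names, lengths, read counts, and the merged contig length are all made-up values, and transcripts per million (TPM) is used only as one common abundance measure. It assumes every read from transcripts A and B is assigned to a single merged contig "AB".

```python
def tpm(counts, lengths):
    """Transcripts per million from read counts and sequence lengths (bases)."""
    rates = {name: counts[name] / lengths[name] for name in counts}
    total = sum(rates.values())
    return {name: rate / total * 1e6 for name, rate in rates.items()}


def relative_error(estimate, truth):
    """Absolute difference from the truth, as a fraction of the truth."""
    return abs(estimate - truth) / truth


# Hypothetical ground truth: three transcripts with known read counts.
true_lengths = {"A": 1000, "B": 2000, "C": 1500}
true_counts = {"A": 100, "B": 800, "C": 300}

# Hypothetical assembly: A and B are wrongly merged into one contig "AB",
# which therefore collects the reads of both; C is assembled correctly.
assembled_lengths = {"AB": 2200, "C": 1500}
assembled_counts = {
    "AB": true_counts["A"] + true_counts["B"],
    "C": true_counts["C"],
}

true_tpm = tpm(true_counts, true_lengths)
assembled_tpm = tpm(assembled_counts, assembled_lengths)

# Error if the merged contig's abundance is taken as the estimate for A or B.
merged_errors = {
    name: relative_error(assembled_tpm["AB"], true_tpm[name])
    for name in ("A", "B")
}
# Even the correctly assembled contig C shifts, because TPM is normalized
# over all sequences in the assembly.
c_error = relative_error(assembled_tpm["C"], true_tpm["C"])

for name, err in merged_errors.items():
    print(f"AB used as estimate for {name}: relative error {err:.2f}")
print(f"C: relative error {c_error:.2f}")
```

In this toy setting the lower-abundance member of the merged pair is misestimated far more than the higher-abundance one, and the normalization spreads some error onto sequences that were assembled correctly.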
Acknowledgement (i)
中文摘要 (ii)
Abstract (iii)
Content (1)
1 Introduction (1)
2 Literature Survey (3)
3 Materials and Methods (5)
3.1 Datasets (5)
3.2 Overall Analysis Workflow (5)
3.3 Identify Similar Subsequences (6)
3.4 De novo Transcriptome Assembly (7)
3.5 Relation Network and Assembly Types (8)
3.6 Read Mapping and Quantification (9)
3.7 Statistical Analysis and Evaluation Metrics (10)
3.8 Prediction of Assembly Type (11)
4 Results (12)
4.1 Unique Transcripts Showed Higher Quantification Accuracy (12)
4.2 De novo Transcriptome Assembly (12)
4.3 Quantification Error Varied Greatly in Different Assembly Types (14)
4.4 Prediction of Assembly Types (15)
5 Discussion (16)
5.1 Analysis of RNA-Seq from Experiments (16)
5.2 Effect of the Length of Contigs (16)
5.3 The Wrongly-Aggregation of Transcripts (17)
5.4 Sequence Attributes of Transcripts (17)
5.5 Performance of Prediction Algorithm (18)
6 Conclusion (19)
Tables (21)
Figures (32)
Supplementary Materials (43)
References (64)