National Digital Library of Theses and Dissertations in Taiwan

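Author: 張寓雯
Author (romanized): CHANG, YU-WEN
Thesis title: 以機器學習方法進行裸子植物質體基因體親緣演化分析
Thesis title (English): Plastid Phylogenomic Analysis of Gymnosperms by Machine Learning
Advisor: 蕭瑛東
Advisor (romanized): HSIAO, YING-TUNG
Oral defense committee: 陳柏宏, 吳玲梅
Oral defense committee (romanized): CHEN, PO-HUNG; WU, LING-MEI
Oral defense date: 2020-06-16
Degree: Master's
Institution: National Taipei University of Education
Department: Department of Computer Science, Master's Program
Discipline: Engineering
Sub-discipline: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2020
Academic year of graduation: 108 (ROC calendar)
Language: Chinese
Number of pages: 63
Keywords (Chinese): 質體基因體 (plastid genome); 裸子植物 (gymnosperms); 親緣演化關係 (phylogenetic relationships); 機器學習 (machine learning); 最大似然法 (maximum likelihood); 貝氏推論法 (Bayesian inference)
Keywords (English): plastid genome; gymnosperm; phylogenetic evolution; machine learning; maximum likelihood; Bayesian inference
DOI: 10.6344/THE.NTUE.CS.010.2020.B02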
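Gymnosperms are one of the major lineages of seed plants and strongly affect both ecosystems and human economic activity. They have existed on Earth for more than 300 million years, yet their phylogenetic relationships remain disputed and unresolved. This study developed Python programs to download and organize plastid genome sequences of at least roughly 100 gymnosperm species from the database of the U.S. National Center for Biotechnology Information (NCBI), and extracted the sequences of the 83 protein-coding genes they share. After sorting, the result was a two-dimensional DNA data matrix about 105.5 Kb in total length and about 12.7 Mb in size. After the relevant preprocessing, the samples were analyzed with the distance, maximum parsimony, maximum likelihood, and Bayesian inference methods under different weighting schemes, using machine-learning bootstrap resampling or Markov chain Monte Carlo sampling. After repeated, systematic rounds of training and refinement, the predicted results were estimated and the differences among the inferred phylogenetic relationships were compared. The results show that, whichever method is used, the five major gymnosperm groups (cycads, ginkgo, cupressophytes, gnetophytes, and Pinaceae) are clearly separated, each with 100% support. This agrees with the accepted classification of gymnosperms and indicates that the Python programs developed in this study process and convert DNA data accurately. The study then removed the nucleotides at third codon positions from the data matrix, which both lowers the evolutionary rates of the taxa and mitigates the effect of nucleotide substitution saturation on the evolutionary model. Phylogenies built from this new DNA matrix strongly support (89 to 100%) a sister-group relationship between cycads and ginkgo under every analytical method, but the relationship between gnetophytes and the other gymnosperm groups remains unresolved. Future work could add more diverse data, such as nuclear and mitochondrial genome data, to strengthen support for the inferred relationships.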
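The bootstrap resampling named in the abstract can be illustrated with a minimal Python sketch that draws alignment columns with replacement to build one pseudo-replicate matrix. The function name `bootstrap_columns`, the toy sequences, and the random seed are assumptions made for illustration only; the thesis's own programs are not reproduced in this record.

```python
import random


def bootstrap_columns(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    `alignment` maps a taxon name to an aligned sequence; all sequences
    must have the same length. Columns (sites) are drawn with replacement,
    and the same drawn columns are used for every taxon.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    # Draw n_sites column indices with replacement.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


# Toy data (hypothetical, not from the thesis).
alignment = {
    "Cycas": "ATGGCTAAA",
    "Ginkgo": "ATGGCAAAA",
    "Pinus": "ATGTCTAAG",
}
rng = random.Random(0)  # fixed seed so the replicate is reproducible
replicate = bootstrap_columns(alignment, rng)
```

In a bootstrap analysis, a tree would be inferred from each of many such replicates, and the share of replicates that recover a given clade is reported as that clade's support value.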
Gymnosperms are one of the major branches of seed plants and have a large impact on the ecological environment and human economic activities. They have lived on Earth for more than 300 million years, but their phylogenetic relationships remain controversial and unresolved. This study implemented a Python program to download and organize the plastid genomes of at least 100 gymnosperm species from the database of the National Center for Biotechnology Information (NCBI). We extracted 83 shared protein-coding gene sequences. After sorting, we obtained a two-dimensional DNA data matrix with a total length of about 105.5 Kb and a size of about 12.7 Mb. After the related preprocessing operations, the samples were analyzed with the distance method, the maximum parsimony method, the maximum likelihood method, and the Bayesian inference method under different weighting schemes, using machine-learning bootstrap resampling or Markov chain Monte Carlo sampling. After repeated and systematic training to improve the computation, the predicted results were estimated and the differences among the inferred relationships were compared. The results show that, whichever method is used, the five major taxa of gymnosperms (cycads, ginkgo, cupressophytes, gnetophytes, and the pine family) are distinguished with 100% support, in line with the accepted classification. This indicates that the Python program implemented in this project can accurately process and convert DNA data. We then removed the third codon positions from the data matrix to reduce the effect of nucleotide substitution saturation on the evolutionary model. All analytical methods based on this new DNA matrix congruently supported the sister relationship between cycads and ginkgo, with support ranging from 89 to 100%. However, the relationship between gnetophytes and the other gymnosperms remains to be clarified. Future research can add more diverse data, such as nuclear and mitochondrial genome data, to enrich the genetic information available for phylogenetic inference of gymnosperms.
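The removal of third codon positions described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: every sequence is taken to be an in-frame concatenation of coding sequences that starts at codon position 1, and the function name `drop_third_codon_positions` and the toy matrix are hypothetical, not the thesis's actual code.

```python
def drop_third_codon_positions(seq):
    """Keep codon positions 1 and 2 and discard position 3.

    Assumes `seq` is in frame (its length is a multiple of 3) and that the
    first character is codon position 1.
    """
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Zero-based index i is a third codon position when i % 3 == 2.
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


# Toy matrix (hypothetical, not from the thesis).
matrix = {
    "Cycas": "ATGGCTAAA",
    "Ginkgo": "ATGGCAAAA",
    "Pinus": "ATGTCTAAG",
}
reduced = {name: drop_third_codon_positions(seq) for name, seq in matrix.items()}
```

Because one site in every three is dropped, the reduced matrix keeps two thirds of the original columns. Third positions tend to change fastest, so they are the ones most prone to substitution saturation.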
Table of Contents
Abstract (Chinese)
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Section 1 Research Background
Section 2 Research Motivation and Objectives
Section 3 Research Framework
Chapter 2 Literature Review
Section 1 Machine Learning
Section 2 Gymnosperms
Section 3 Plastid Genomes
Chapter 3 Research Methods
Section 1 Sources of Research Samples
Section 2 Processing and Analysis Workflow
Section 3 Bootstrap Resampling and Markov Chain Monte Carlo
Chapter 4 Results and Discussion
Section 1 Processing and Analysis of Plastid Genome Sequences
Section 2 Plastid Phylogenomic Relationships of Gymnosperms
Chapter 5 Conclusions and Recommendations
Section 1 Conclusions
Section 2 Recommendations
References

List of Tables
Table 1. Summary of sequence data
Table 2. List of gene names
Table 3. Summary of special sequences

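List of Figures
Figure 1. Research framework
Figure 2. Example diagram of the analysis steps of the bootstrap method
Figure 3. Example diagram of integration by Monte Carlo random simulation
Figure 4. Example diagram of Markov chain Monte Carlo
Figure 5. Example of Python syntax
Figure 6. Example of program syntax for estimating phylogenetic relationships
Figure 7. Example of program syntax for module blocks
Figure 8. Logo of the GNU Project
Figure 9. Rankings and percentages of the programming languages most loved and most wanted by developers
Figure 10. Extant plants
Figure 11. Proportion of threatened species among extant species on the Red List
Figure 12. Number of plastid genome nucleotides sequenced and published over the years
Figure 13. Example of the language used to set up the environment
Figure 14. Overview of the research workflow
Figure 15. Illustration of executing the written files
Figure 16. Example of a family tree
Figure 17. Example of a distance matrix
Figure 18. Process of ML phylogenetic analysis
Figure 19. Plastid phylogenomic tree of gymnosperms constructed with the NJ method
Figure 20. Plastid phylogenomic tree of gymnosperms with evolutionary rates, constructed with the NJ method
Figure 21. Plastid phylogenomic tree of gymnosperms constructed with the MP method
Figure 22. Plastid phylogenomic tree of gymnosperms showing groupings, constructed with the MP method
Figure 23. Plastid phylogenomic tree of gymnosperms constructed with the ML method
Figure 24. Plastid phylogenomic tree of gymnosperms with evolutionary rates, constructed with the ML method
Figure 25. Plastid phylogenomic tree of gymnosperms constructed with the BI method
Figure 26. Plastid phylogenomic tree of gymnosperms with evolutionary rates, constructed with the BI method
Figure 27. Plastid phylogenomic tree excluding third codon positions, constructed with the NJ method
Figure 28. Plastid phylogenomic tree excluding third codon positions, constructed with the MP method
Figure 29. Plastid phylogenomic tree excluding third codon positions, constructed with the ML method
Figure 30. Plastid phylogenomic tree excluding third codon positions, constructed with the BI method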


Electronic full text (date of public release on the Internet: 2025-07-18)