
National Digital Library of Theses and Dissertations in Taiwan

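Graduate student: Wen-Ting Wang (王文廷)
Thesis title: Analysis of protein sequence, structure and DNA binding motifs by incorporating ChIP-Seq and protein structure data
Thesis title (Chinese): 結合ChIP-Seq和蛋白質結構分析蛋白質序列、結構和DNA結合序列特徵之相關性
Advisor: Chien-Yu Chen (陳倩瑜)
Oral defense committee: Yan-Jheng Ou Yang (歐陽彥正), June-Tai Wu (吳君泰)
Oral defense date: 2019-07-10
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Bio-Industrial Mechatronics Engineering
Discipline: Engineering
Field: Mechanical Engineering
Thesis type: Academic thesis
Year of publication: 2019
Academic year of graduation: 107 (ROC calendar)
Language: Chinese
Number of pages: 37
Keywords (translated from Chinese): transcription factor; DNA-binding domain; ChIP-Seq; protein structure similarity; binding motif similarity; deep learning
DOI: 10.6342/NTU201903464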
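Abstract (translated from Chinese):
The central dogma of molecular biology states, in essence, that deoxyribonucleic acid (DNA) makes ribonucleic acid (RNA) and RNA makes protein. Proteins in turn assist both processes. The binding of transcription factors to DNA is a key step in gene regulation and thereby controls the different expression states of a cell, so the central question is which binding sites a given transcription factor binds. In recent years, a growing body of structural data on protein-DNA complexes has provided much information about DNA-protein interactions. These structures show, however, that the interaction is not a simple one-to-one correspondence between bases and residues; changes in three-dimensional geometry must also be taken into account. PiDNA, a tool previously published by our laboratory, predicts DNA binding motifs from the protein-DNA complex structures in the Protein Data Bank (PDB), linking structure to sequence. Meanwhile, rapid progress in machine learning, together with the interest that the complexity of bioinformatics holds for computer scientists, has led to a series of deep learning applications in bioinformatics. Among these, DeepBind uses a convolutional neural network (CNN) to predict the binding of a single transcription factor to DNA sequences, with accuracy exceeding that of earlier prediction tools. Its success demonstrated that deep learning can capture binding sequence features.
This study takes chromatin immunoprecipitation sequencing (ChIP-Seq) data from the ENCODE database as the DNA sequence input and uses protein-DNA complex structures collected from the PDB to extract the binding motifs and the degree of structural similarity of the proteins. It then analyzes, for transcription factors within the same Pfam family, the relationships among DNA-binding domain sequences, binding motifs, and transcription factor structures, and uses this analysis to test whether DeepBind can better distinguish the similarities and differences between the binding sets in ChIP-Seq data.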
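To make concrete the idea of a convolutional filter picking up a binding motif, here is a minimal sketch in plain Python. It is not DeepBind's code and makes several assumptions: the helper names `one_hot` and `scan` are hypothetical, the filter weights are made up, and the only steps shown are a one-hot encoding, a sliding dot product, rectification, and max-pooling.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of four 0/1 values in A, C, G, T order.
    Any other character (e.g. N) becomes an all-zero row."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def scan(seq, motif_filter):
    """Slide a (width x 4) filter along seq, rectify each window score,
    and return the maximum (0.0 if seq is shorter than the filter)."""
    encoded = one_hot(seq)
    width = len(motif_filter)
    best = 0.0
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * motif_filter[i][j]
            for i in range(width)
            for j in range(4)
        )
        best = max(best, score)  # starting at 0.0 doubles as rectification
    return best

# Toy filter (made-up weights): +1 for a match to CACGTG, -1 for a mismatch.
toy_filter = [[1.0 if b == c else -1.0 for b in BASES] for c in "CACGTG"]

hit = scan("TTCACGTGAA", toy_filter)   # contains an exact CACGTG window
miss = scan("TTTTTTTT", toy_filter)    # no window scores above zero
```

In a trained network the filter weights are learned from data rather than written by hand, and many filters feed further layers; the sketch only shows why a high pooled score indicates that a motif-like pattern occurs somewhere in the sequence.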
Abstract (English):
The binding of transcription factors to DNA is the main process of gene regulation, so identifying which binding sites a transcription factor binds is the central problem. In recent years, the growing amount of structural data on protein-DNA complexes has provided information about the interaction between DNA and protein. PiDNA, previously developed by our lab, uses the structures of protein-DNA complexes in the Protein Data Bank (PDB) to predict binding motifs. Meanwhile, advances in deep learning have led information scientists to apply it to many problems in bioinformatics. DeepBind used a convolutional neural network (CNN) to demonstrate that DNA sequences have binding characteristics that can be recognized by specific proteins. The success of DeepBind revealed the value of deep learning in characterizing binding sequences.
In this study, chromatin immunoprecipitation sequencing (ChIP-Seq) data from the ENCODE database were collected as the DNA sequence input, and protein sequence and structural data collected from the PDB were used to capture the DNA binding sequence characteristics of proteins in the same family. The study further analyzed the relationship among DNA-binding domain sequences, transcription factor binding sequence characteristics, and transcription factor structures for several Pfam families, showing the importance of combining deep learning with protein-DNA complex structures in this computational biology problem. In this way, the study tests whether DeepBind can distinguish the differences between the binding sets of ChIP-Seq data.
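The relationship analysis described in the abstract can be pictured as correlating two pairwise similarity tables over the same set of transcription factors. The sketch below is an illustration only: the record does not state which statistic the thesis used, so Pearson correlation is an assumption, and the names `pearson`, `paired_values`, `TF_A` to `TF_C`, and all numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

def paired_values(names, table_a, table_b):
    """Collect matching entries from two pairwise tables keyed by
    (name1, name2) tuples, with names in sorted order within each key."""
    xs, ys = [], []
    for a, b in combinations(sorted(names), 2):
        xs.append(table_a[(a, b)])
        ys.append(table_b[(a, b)])
    return xs, ys

# Made-up similarity scores for three hypothetical family members.
tfs = ["TF_A", "TF_B", "TF_C"]
motif_sim = {("TF_A", "TF_B"): 0.9, ("TF_A", "TF_C"): 0.4, ("TF_B", "TF_C"): 0.5}
struct_sim = {("TF_A", "TF_B"): 0.8, ("TF_A", "TF_C"): 0.3, ("TF_B", "TF_C"): 0.4}

xs, ys = paired_values(tfs, motif_sim, struct_sim)
r = pearson(xs, ys)
```

A coefficient near 1 would suggest that family members with similar structures also have similar binding motifs. With only a handful of pairs per family, such a value should be read cautiously, and a rank-based statistic may be a reasonable alternative.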
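Acknowledgments
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Background
Chapter 2 Literature Review
2.1 Central Dogma of Molecular Biology
2.2 Transcription Factor
2.2.1 DNA Binding Domain (DBD)
2.2.2 Transcription Factor Binding Site
2.3 Chromatin Immunoprecipitation Sequencing (ChIP-Seq)
2.4 Introduction to the Databases
2.4.1 ENCODE Database
2.4.2 TRANSFAC Database
2.4.3 PDB Database
2.5 Binding Motif Extraction and Comparison
2.5.1 DeepBind
2.6 Protein Structure Comparison
Chapter 3 Methods
3.1 Binding Motif Analysis
3.2 Main Experimental Workflow
3.2.1 Database Search
3.2.2 Data Preprocessing
3.2.3 Transcription Factor Binding Site Similarity Comparison
3.2.4 Protein Structure Similarity Comparison
3.2.5 DBD Sequence Similarity Comparison
3.2.6 Correlation Analysis and Summary of Results
3.3 Datasets Analyzed
Chapter 4 Results and Discussion
4.1 Binding Motifs
4.2 Analysis of Binding Motif Similarity and Structural Similarity
Chapter 5 Conclusion
References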
1.Lin, C.-K. and C.-Y. Chen, PiDNA: predicting protein–DNA interactions with structural models. Nucleic Acids Research, 2013. 41(W1): p. W523-W530.
2.Bailey, T.L., et al., MEME Suite: tools for motif discovery and searching. Nucleic Acids Research, 2009. 37(suppl_2): p. W202-W208.
3.Matys, V., et al., TRANSFAC®: transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 2003. 31(1): p. 374-378.
4.Alipanahi, B., et al., Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 2015. 33: p. 831.
5.The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, 2004. 306(5696): p. 636.
6.Park, P.J., ChIP–seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 2009. 10: p. 669.
7.Bernstein, F.C., et al., The protein data bank: A computer-based archival file for macromolecular structures. Journal of Molecular Biology, 1977. 112(3): p. 535-542.
8.McGinnis, S. and T.L. Madden, BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Research, 2004. 32(suppl_2): p. W20-W25.
9.Zhang, Y. and J. Skolnick, TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Research, 2005. 33(7): p. 2302-2309.
10.Crick, F., Central Dogma of Molecular Biology. Nature, 1970. 227(5258): p. 561-563.
11.Gonzalez, D.H., Introduction to transcription factor structure and function, in Plant Transcription Factors. 2016, Elsevier. p. 3-11.
12.Hollenhorst, P.C., L.P. McIntosh, and B.J. Graves, Genomic and Biochemical Insights into the Specificity of ETS Transcription Factors. Annual Review of Biochemistry, 2011. 80(1): p. 437-471.
13.Hsu, C.-M., C.-Y. Chen, and B.-J. Liu, WildSpan: mining structured motifs from protein sequences. Algorithms for Molecular Biology, 2011. 6(1): p. 6.
14.Lis, M. and D. Walther, The orientation of transcription factor binding site motifs in gene promoter regions: does it matter? BMC Genomics, 2016. 17(1): p. 185.
15.Protein Data Bank, Protein Data Bank. Nature New Biology, 1971. 233: p. 223.
16.Pietrokovski, S., Searching Databases of Conserved Sequence Regions by Aligning Protein Multiple-Alignments. Nucleic Acids Research, 1996. 24(19): p. 3836-3845.
17.Wang, T. and G.D. Stormo, Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics, 2003. 19(18): p. 2369-2380.
18.Schones, D.E., P. Sumazin, and M.Q. Zhang, Similarity of position frequency matrices for transcription factor binding sites. Bioinformatics, 2004. 21(3): p. 307-313.
19.Gupta, S., et al., Quantifying similarity between motifs. Genome Biology, 2007. 8(2): p. R24.
20.Skolnick, J., J.S. Fetrow, and A. Kolinski, Structural genomics and its importance for gene function analysis. Nature Biotechnology, 2000. 18(3): p. 283.
21.Baker, D. and A. Sali, Protein structure prediction and structural genomics. Science, 2001. 294(5540): p. 93-96.
22.Zhang, Y. and J. Skolnick, Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics, 2004. 57(4): p. 702-710.
23.Levitt, M. and M. Gerstein, A unified statistical framework for sequence comparison and structure comparison. Proceedings of the National Academy of Sciences, 1998. 95(11): p. 5913-5920.
24.Gordân, R., et al., Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Reports, 2013. 3(4): p. 1093-1104.
25.R Development Core Team, R: A language and environment for statistical computing. 2011, R Foundation for Statistical Computing: Vienna, Austria.
26.Xu, J. and Y. Zhang, How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics, 2010. 26(7): p. 889-895.