National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

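Graduate student: 李建慶
Graduate student (English name): Chien-Ching Li
Thesis title: 利用GeneOntology與文獻透過模糊相似度分析進行基因的分群
Thesis title (English): Clustering Genes by Gene Ontology and Literatures with Fuzzy Measure-based Similarity
Advisor: 王惠嘉
Advisor (English name): Hei-Chia Wang
Degree: Master's
Institution: National Cheng Kung University
Department: Institute of Information Management
Discipline: Computer science
Sub-discipline: General computer science
Thesis type: Academic thesis
Academic year of graduation: 96 (ROC calendar, i.e. 2007-2008)
Language: Chinese
Number of pages: 70
Chinese keywords: 模糊相似性分析計算 (fuzzy similarity measure); 基因本體論 (Gene Ontology); 基因分群 (gene clustering)
English keywords: Fuzzy similarity measure; Genes clustering; Gene Ontology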
Usage statistics:
  • Cited by: 1
  • Views: 307
  • Downloads: 0
  • Saved to bibliography lists: 0
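The similarity of proteins or genes has traditionally been analyzed mostly by sequence similarity, which does not take functional similarity into account. In recent years a growing number of studies have analyzed functional similarity using GO, the literature, and similar resources. Gene Ontology (GO) is a dynamic controlled vocabulary that the Gene Ontology Consortium built from annotation data on genes and proteins in order to describe the roles that eukaryotic genes play in the cell, together with related biomedical knowledge. The consortium has organized this vocabulary into the Gene Ontology database.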
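Within the GO system, every eukaryotic gene or protein can be converted into a set of GO annotations (GO terms). Traditionally, however, GO-based clustering of gene products has relied on ontology techniques such as Information Content or Edge Counting. This thesis instead considers how the GO terms were annotated and the influence of the literature associated with them, and uses the fuzzy binary relation from fuzzy theory to find the relationships among documents. The goal is to analyze the similarity of gene products with a new and different similarity measure, so that biologists can view the relationships among gene products from another angle. Because the number of published protein functions has grown rapidly in recent years, clustering proteins can improve the accuracy of biotechnology analysis results. The system is built from three modules, applied in order: a document retrieval module, a similarity computation module, and a clustering module.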
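To make the notion of a fuzzy binary relation more concrete, the following is a minimal sketch, not the thesis's actual formulation. It stores relatedness between documents as a matrix of membership degrees in [0, 1] and applies the standard max-min composition to derive indirect relatedness. The three documents and their membership values are invented for illustration; the abstract does not say how the thesis builds or composes its relation.

```python
def max_min_compose(r, s):
    """Max-min composition of two fuzzy relations given as square matrices."""
    n = len(r)
    return [[max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(r):
    """Max-min transitive closure: union R with R composed with itself
    until the relation stops changing."""
    current = [row[:] for row in r]
    while True:
        composed = max_min_compose(current, current)
        merged = [[max(a, b) for a, b in zip(row_cur, row_comp)]
                  for row_cur, row_comp in zip(current, composed)]
        if merged == current:
            return current
        current = merged


# Hypothetical relatedness degrees among three documents (symmetric, reflexive).
relation = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.6],
    [0.0, 0.6, 1.0],
]
closure = transitive_closure(relation)
```

In this toy relation, documents 0 and 2 are not directly related, but the closure assigns them degree 0.6 through document 1, because the strength of a chain is limited by its weakest link.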
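The document retrieval module takes the annotations that GO assigns to a protein and, for this set of annotations, finds the corresponding related papers in PubMed. The similarity computation module uses a fuzzy similarity measure to score each pair of documents, weights the scores, and from them computes the similarity between sets. The clustering module then applies hierarchical agglomerative clustering to the gene products, using the similarity score matrix produced by the similarity computation module, and identifies the protein that represents each cluster, to support biologists in subsequent protein or gene analysis. Compared with the results of other traditional similarity measures, the similarity formula defined in this study not only finds the correct clusters but also reveals the similarity relationships among the members of a cluster; the similarity computation therefore correctly recovers the known clusters.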
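The last two modules can be sketched as follows, under loudly stated assumptions. The abstract does not give the thesis's fuzzy similarity formula or its weighting scheme, so this sketch substitutes a Jaccard score on keyword sets for the document-pair score and an unweighted average for the set-to-set similarity. The gene names, keyword sets, and the 0.5 stopping threshold are all hypothetical.

```python
def jaccard(doc_a, doc_b):
    """Stand-in document score (NOT the thesis's fuzzy measure)."""
    return len(doc_a & doc_b) / len(doc_a | doc_b)


def set_similarity(docs_a, docs_b, doc_score):
    """Unweighted average of pairwise document scores between two sets."""
    scores = [doc_score(a, b) for a in docs_a for b in docs_b]
    return sum(scores) / len(scores)


def hac_average_linkage(names, sim, threshold):
    """Hierarchical agglomerative clustering with average linkage.
    Repeatedly merges the most similar pair of clusters and stops
    when no pair reaches the threshold."""
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        return sum(sim[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        score, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical gene products, each with the keyword sets of its retrieved documents.
documents = {
    "geneA": [{"kinase", "signal"}, {"kinase", "phosphorylation"}],
    "geneB": [{"kinase", "signal", "phosphorylation"}],
    "geneC": [{"membrane", "transport"}],
    "geneD": [{"transport", "ion", "membrane"}],
}
names = list(documents)
similarity = {
    a: {b: set_similarity(documents[a], documents[b], jaccard) for b in names}
    for a in names
}
clusters = hac_average_linkage(names, similarity, threshold=0.5)
```

With these toy inputs the kinase-related gene products end up in one cluster and the transport-related ones in another. Stopping at a similarity threshold is one design choice among several; cutting the merge tree at a fixed number of clusters would be an equally valid variant.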
Traditionally, the similarity of genes and proteins has usually been analyzed by sequence, without considering functional similarity. In recent years, a growing number of studies have analyzed functional similarity using GO and the literature. As more and more annotated gene data are generated, an organization named the Gene Ontology Consortium has built a dynamic controlled vocabulary to describe the roles that genes or proteins play in the cell, together with biomedical knowledge about eukaryotes.
From the GO point of view, a gene or protein can be annotated in three domains: biological process, molecular function, and cellular component. GO researchers collect the genes or proteins of different eukaryotic species from databases such as SGD, MGI, and FlyBase in order to annotate and classify them.
The genes and proteins of all eukaryotes can thus be converted into GO annotations by the GO system. Traditionally, ontology techniques such as Information Content or Edge Counting are applied to cluster gene products. Recently, the number of protein and gene sequences has increased rapidly. The objective of our research is to use a new and different similarity analysis method that takes more concepts about gene function into account. We expect to raise clustering precision by analyzing GO terms and the related PubMed literature in parallel. We hope that this kind of similarity measure can help biologists find relations among genes from a different point of view. In this work, a fuzzy similarity measure is adapted to calculate a score for each pair of documents, from which we compute the similarity of each pair of sets and then cluster the gene products to find the representative gene clusters. The method also achieves good evaluation results in comparison with Information Content, Edge Count, and Blastclust, a sequence similarity tool from NCBI.
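For readers unfamiliar with the two ontology-based baselines named above, here is a minimal sketch of both on a toy ontology. The Information Content score follows the common definition in which two terms are as similar as their most informative shared ancestor, and the edge count is the shortest path between two terms through a shared ancestor. The ontology, term names, and annotation counts are hypothetical and far smaller than GO; the abstract does not state which variants the thesis evaluated.

```python
import math
from collections import deque

# Hypothetical toy ontology: each term maps to its parent terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
    "kinase": ["catalysis"],
}
# Hypothetical annotation counts (each count includes the term's descendants).
COUNTS = {"root": 100, "binding": 60, "catalysis": 40,
          "dna_binding": 30, "rna_binding": 20, "kinase": 40}


def up_distances(term):
    """Map the term and each of its ancestors to the fewest edges needed to reach it."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in PARENTS[current]:
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def information_content(term):
    """Rarely used terms are more informative: IC = -log p(term)."""
    return -math.log(COUNTS[term] / COUNTS["root"])


def ic_similarity(t1, t2):
    """Information Content of the most informative common ancestor."""
    common = up_distances(t1).keys() & up_distances(t2).keys()
    return max(information_content(t) for t in common)


def edge_count(t1, t2):
    """Length of the shortest path between two terms via a common ancestor."""
    d1, d2 = up_distances(t1), up_distances(t2)
    return min(d1[t] + d2[t] for t in d1.keys() & d2.keys())
```

In this toy ontology, `dna_binding` and `rna_binding` share the ancestor `binding`, so they are two edges apart and their Information Content similarity is positive, whereas `dna_binding` and `kinase` meet only at the root, four edges apart, with a similarity of zero.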
1. Introduction
1.1. Research Background
1.2. Research Motivation and Objectives
1.3. Research Process
1.4. Research Scope and Limitations
1.5. Thesis Organization
1.6. Summary
2. Literature Review
2.1. NCBI
2.1.1. PubMed
2.1.2. MeSH
2.2. Gene Ontology
2.3. Similarity Measures
2.3.1. Pair-based Similarity Measure
  Node-based (Information Content) Approach
  Edge-based (Distance) Approach
  Pairwise Similarity with Average Function
2.3.2. Set-based Similarity Measure
2.3.3. Graph Similarity Techniques
2.3.4. Fuzzy Measure-based Similarity
  Fuzzy Sugeno Measure Similarity
  Fuzzy Binary Relations
2.4. Clustering Algorithms
2.4.1. Graph-based Clustering Algorithms
2.4.2. Partitional Clustering Algorithms
2.4.3. Hierarchical Clustering Algorithms
2.4.4. Model-based Clustering Algorithms
2.4.5. Density-based and Grid-based Clustering Algorithms
2.5. Summary
3. Research Method
3.1. Research Framework
3.2. Document Retrieving Module
3.2.1. Retrieving GO Terms from Gene Ontology
3.2.2. Retrieving Documents from PubMed
3.3. Fuzzy Measure-based Similarity
3.4. HAC Clustering Measure
3.5. Summary
4. Implementation and Validation
4.1. System Implementation Design
4.1.1. Preprocessing
4.1.2. Document Retrieving Module
4.1.3. Fuzzy Measure-based Similarity Module
4.1.4. HAC Clustering Measure Module
4.2. Experimental Method
4.2.1. Comparison Targets and Data Sources
4.2.2. Evaluation Metrics
4.2.3. Experimental Design
4.3. Experimental Results and Analysis
4.3.1. Experimental Results of the Proposed Method
4.3.2. Experimental Results of Blastclust
4.3.3. Experimental Results of Information Content and Edge Count
4.3.4. Experimental Analysis
5. Conclusions and Future Research Directions
5.1. Research Results and Contributions
5.2. Future Research Directions
References