跳到主要內容

臺灣博碩士論文加值系統

(216.73.216.109) 您好!臺灣時間:2026/04/20 19:11
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

我願授權國圖
: 
twitterline
研究生:張喜媄
研究生(外文):Hsi-mei Chang
論文名稱:用於循序資料分群之混合式尋找特徵方法
論文名稱(外文):Hybrid Algorithms of Finding Features for Clustering Sequential Data
指導教授:張玉盈
指導教授(外文):Ye-In Chang
學位類別:碩士
校院名稱:國立中山大學
系所名稱:資訊工程學系研究所
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2010
畢業學年度:98
語文別:英文
論文頁數:92
中文關鍵詞:蛋白質資料庫特徵循序資料分群
外文關鍵詞:ClusteringProtein databasesSequential dataFeatures
相關次數:
  • 被引用被引用:0
  • 點閱點閱:175
  • 評分評分:
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:0
蛋白質是生物細胞與組織的結構元素,它對於生命有機體是個重要的
建構元素。重覆片段是指在蛋白質序列中出現的頻率夠大。這樣的片
段可以定義出蛋白質中重要的功能區域,分辨此蛋白質的家族,及找
出此蛋白質的功能。此外,它也提供了一些關物種進化有價值的資
訊。蛋白質序列分群把有著類似結構的分在同一群,如此一來便有助
於研究蛋白質序列的相似功能。有許多已經被提出的演算法都是根據
蛋白質序列的相似性去作蛋白質的分群,即根據在蛋白質資料庫中序
列的重複片段去作蛋白質的分群,例如,Feature-based clustering
演算法的全域方法和局部方法。他們利用 mining sequential
patterns 的演算法找出序列重複片段來解決在蛋白質序列資料庫中
沒有 gap 限制的序列重複片段問題,然後分別去找出全域特徵和局部
特徵來作分群。Feature-based clustering 演算法是一個截然不同
的蛋白質分群方法,不需要作所有對於所有的分析,並且使用趨近於
線性複雜度的 K - means 分群演算法。雖然 Feature-based
clustering 演算法具有可擴展性,並且可以得到相當不錯的分群結
果,但是它分開執行全域方法和局部方法會耗費較多的時間。因此,
在這論文中,我們提出了 Hybrid 演算法來找出並標記為特徵來改進
Feature-based clustering 演算法。我們從局部特徵與封閉頻繁序
列重複片段之間的關係,觀察到一個有趣的結果。我們的重要發現
是,在封閉頻繁序列重複片段中的某些特徵可以拆成幾個局部的特
徵,而且這幾個局部特徵出現在資料庫中的總數量會等於在封閉頻繁
序列重複片段中的所對應的特徵出現在資料庫中的數量。全域方法和
局部方法在找出序列重覆片段之後都有兩個階段,尋找特徵和標記特
徵。在我們的 Hybrid 演算法中的方法 1(LocalG),我們首先找到局
部的特徵並且作標記。然後,我們找到了全域的特徵。最後,我們從
已經標記好的局部特徵向量有效率地去標記出全域特徵的向量。在我
們的 Hybrid 演算法的方法 2(CLoseLG),我們首先直接找出封閉頻
繁序列的重覆片段。接下來,我們有效率地從封閉頻繁序列的重覆片
段找到局部的候選特徵,然後標記出局部特徵。最後,我們找出全域
特徵並且作標記。根據模擬的結果,我們證明了我們提出的 Hybrid
演算法 比 Feature-based clustering 演算法更有效率。
Proteins are
the structural components of living cells and tissues, and thus an
important building block in all living organisms. Patterns in
proteins sequences are some subsequences which appear frequently.
Patterns often denote important functional regions in proteins and
can be used to characterize a protein family or discover the
function of proteins. Moreover, it provides valuable information
about the evolution of species. Grouping protein sequences that
share similar structure helps in identifying sequences with similar
functionality. Many algorithms have been proposed for clustering
proteins according to their similarity, i.e., sequential
patterns in protein databases, for example, feature-based clustering
algorithms of the global approach and the local approach. They use
the algorithm of mining sequential patterns to solve the
no-gap-limit sequential pattern problem in a protein sequences
database, and then find global features and local features
separately for clustering. Feature-based clustering algorithms are
entirely different approaches to protein clustering that do not
require an all-against-all analysis and use a near-linear
complexity K-means based clustering algorithm. Although
feature-based clustering algorithms are scalable and lead to
reasonably good clusters, they consume time on performing the global
approach and the local approach separately. Therefore, in this
thesis, we propose hybrid algorithms to find and mark features for
feature-based clustering algorithms. We observe an interesting
result from the relation between the local features and the closed
frequent sequential patterns. The important observation which we
find is that some features in the closed frequent sequential
patterns can be taken apart to several features in the local
selected features and the total support number of these features in
the local selected features is equal to the support number of the
corresponding feature in the closed frequent sequential patterns.
There are two phases, find-feature and mark-feature, in the global
approach and the local approach after mining sequential patterns. In
our hybrid algorithms of Method 1 (LocalG), we first find and mark
the local features. Then, we find the global features. Finally, we
mark the bit vectors of the global features efficiently from the bit
vector of the local features. In our hybrid algorithms of Method 2
(CLoseLG), we first find the closed frequent sequential patterns
directly. Next, we find local candidate features efficiently from
the closed frequent sequential patterns and then mark the local
features. Finally, we find and mark the global features. From our
performance study based on the biological data and the synthetic
data, we show that our proposed hybrid algorithms are more efficient
than the feature-based algorithm.
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Patterns in Protein Sequences . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Definitions of Protein Sequence Patterns . . . . . . . . . . . . 2
1.3 The K-means Clustering Algorithm . . . . . . . . . . . . . . . . . . . 8
1.4 The Problem of Protein Clustering . . . . . . . . . . . . . . . . . . . 9
1.5 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 16
2. A Survey of Algorithms for Protein Clustering . . . . . . . . . . . 17
2.1 Mining Sequential Patterns . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 The PrefixSpan Algorithm . . . . . . . . . . . . . . . . . . . 19
2.2 The Vector-Space K-Means Clustering Algorithm . . . . . . . . . . . 21
2.3 Feature-Based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 CloSpan: Closed Sequential Patterns Mining . . . . . . . . . . . . . . 28
2.5 The CONTOUR Algorithm . . . . . . . . . . . . . . . . . . . . . . . 29
3. Hybrid Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 Observation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Method 1 (LocalG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Method 2 (CLoseLG) . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Page
3.4 Counterexamples for CONTOUR . . . . . . . . . . . . . . . . . . . . 45
4. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 The Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Biological Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
[1] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. of the 11th
Int. Conf. on Data Eng. , pp. 3 – 14, 1995.
[2] S. Altschul, T. Madden, A. Schafer, J. Zhang, Z. Zhang, W. Miller, and D. Lip-
man, “Gapped Blast and Psi-Blast: A New Generation of Protein Database
Search Program,” Proc. of the Nucleic Acids Research, pp. 3389 – 3402, 1997.
[3] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, “Sequential Pattern Mining Us-
ing a Bitmap Representation,” Proc. of the 8th ACM SIGKDD Int. Conf. on
Knowledge Discovery and Data Mining, pp. 429–435, 2002.
[4] M. Crochemore and M. Sagot, Motifs in Sequences: Localization and Extraction.
Marcel Dekker, first ed., 2001.
[5] P. G. Ferreira and P. J. Azevedo, “Query Driven Sequence Pattern Mining,”
Proc. of the XXI Simpsio Brasileiro de Banco de Dados(SBBD), 2006.
[6] P. G. Ferreira and P. J. Azevedo, “Evaluating Deterministic Motif Significance
Measures in Protein Databases,” Algorithms fo Molecular Biology, Vol. 2, No. 16,
Dec. 2007.
[7] V. Guralnik and G. Karypis, “A Scalable Algorithm for Clustering Sequential
Data,” Proc. of IEEE Int. Conf. on Data Mining, pp. 179–186, 2001.
[8] J. Han, J. Pei, and Y. Yin, “CURE: An Efficient Clustering Algorithm for Large
Databases,” Information Systems, Vol. 26, No. 1, pp. 35–58, March 2001.
[9] J. Han, J. Pei, B. M. Asl, Q. Chen, U. Dayal, and M. C. Hsu, “FreeSpan:
Frequent Pattern-Projected Sequential Pattern Mining,” Proc. of the 6th ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 355–359,
2000.
[10] I. Jonassen, “http://www.ii.uib.no/ inge/patterns.html,” Patterns in biose-
quences.
[11] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction
to Cluster Anslysis,” Wiley Series in Probability and Mathematical Statistics
Applied Probability and Statistics, New York: Wiley, 1990.
[12] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction
to Cluster Anslysis,” Wiley Series in Probability and Mathematical Statistics
Applied Probability and Statistics, New York: Wiley, 1990.
[13] C. Lee, “Computational Biology,”
http://www.csie.ncnu.edu.tw/ rctlee/biology.html.
[14] M. Lin and S. Lee, “Incremental Update on Sequential Patterns in Large
Database,” Proc. of the 10th IEEE Int. Conf. Tools with Artificial Intelligence,
pp. 24 – 31, 1998.
[15] G. K. M.J. Joshi and V. Kumar, “Universal Formulation of Sequential Pat-
terns,” Technical report, University of Minnesota, Department of Computer Sci-
ence Minneapolis, 1999.
[16] M.J.Zaki, “Efficient Enumeration of Frequent Sequences,” Proc. of the 7th
Int. Conf. on Information and Knowledge Management, pp. 68 – 75, 1998.
[17] J. Pei, J. Han, M. A. Behzad, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu,
“PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern
Growth,” Int. Conf. on Data Engineering, pp. 215–224, 2001.
[18] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and
Performance Improvements,” Proc. of the 5th Int. Conf. on Extending Database
Technology, pp. 3 – 17, 1996.
[19] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clus-
tering Techniques,” Proc. of the Knowledge Discovery and Data Mining, pp. 1 –
2, 2000.
[20] J. Wang and J. Han, “BIDE: Efficient Mining of Frequent Closed Sequences,”
Proc. of the 20th Int. Conf. on Data Engineering, pp. 79–90, 2004.
[21] J. Wang, Y. Zhang, L. Zhou, G. Karypis, and C. C. Aggarwal, “Discriminating
Subsequence Discovery for Sequence Clustering,” Proc. of the Sequential Data
Mining, pp. 605–610, 2007.
[22] J. Wang, Y. Zhang, L. Zhou, G. Karypis, and C. C. Aggarwal, “CONTOUR: an
Efficient Algorithm for Discovering Discriminating Subsequences,” Proc. of the
Data Mining and Knowledge Discovery, pp. 1 – 29, 2009.
[23] Wikipedia, “http://en.wikipedia.org/wiki/Protein,” .
[24] X. Yan, J. Han, and R. Afshar, “CloSpan: Mining Closed Sequential Patterns
in Large Datasets,” Proc. of the Sequential Data Mining, pp. 166–177, 2003.
[25] J. Yang, J. S. Deogun, and Z. Sun, “A New Scheme for Protein Sequence Motif
Extraction,” Proc. of the 38th Hawii Int. Conf. on System Sciences, pp. 280a –
280a, 2005.
[26] M. Zaki and C. Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset
Mining,” Proc. of the SIAM Int. Conf. on Data Mining, pp. 457 – 473, 2002.
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top
無相關論文