臺灣博碩士論文加值系統

English |FB 專頁 |Mobile

免費會員登入| 註冊

功能切換導覽列

(216.73.216.109) 您好！臺灣時間：2026/04/20 19:11

字體大小：

:::

詳目顯示

第 1 筆 / 共 1 筆

/1頁

論文基本資料
摘要
外文摘要
目次
參考文獻
紙本論文
QR Code

本論文永久網址:

研究生:

張喜媄

研究生(外文):

Hsi-mei Chang

論文名稱:

用於循序資料分群之混合式尋找特徵方法

論文名稱(外文):

Hybrid Algorithms of Finding Features for Clustering Sequential Data

指導教授:

張玉盈

指導教授(外文):

Ye-In Chang

學位類別:

碩士

校院名稱:

國立中山大學

系所名稱:

資訊工程學系研究所

學門:

工程學門

學類:

電資工程學類

論文種類:

學術論文

論文出版年:

2010

畢業學年度:

語文別:

英文

論文頁數:

中文關鍵詞:

蛋白質資料庫、特徵、循序資料、分群

外文關鍵詞:

Clustering、Protein databases、Sequential data、Features

相關次數:

被引用:0
點閱:175
評分:
下載:0
書目收藏:0

蛋白質是生物細胞與組織的結構元素，它對於生命有機體是個重要的
建構元素。重覆片段是指在蛋白質序列中出現的頻率夠大。這樣的片
段可以定義出蛋白質中重要的功能區域，分辨此蛋白質的家族，及找
出此蛋白質的功能。此外，它也提供了一些關物種進化有價值的資
訊。蛋白質序列分群把有著類似結構的分在同一群，如此一來便有助
於研究蛋白質序列的相似功能。有許多已經被提出的演算法都是根據
蛋白質序列的相似性去作蛋白質的分群，即根據在蛋白質資料庫中序
列的重複片段去作蛋白質的分群，例如，Feature-based clustering
演算法的全域方法和局部方法。他們利用 mining sequential
patterns 的演算法找出序列重複片段來解決在蛋白質序列資料庫中
沒有 gap 限制的序列重複片段問題，然後分別去找出全域特徵和局部
特徵來作分群。Feature-based clustering 演算法是一個截然不同
的蛋白質分群方法，不需要作所有對於所有的分析，並且使用趨近於
線性複雜度的 K - means 分群演算法。雖然 Feature-based
clustering 演算法具有可擴展性，並且可以得到相當不錯的分群結
果，但是它分開執行全域方法和局部方法會耗費較多的時間。因此，
在這論文中，我們提出了 Hybrid 演算法來找出並標記為特徵來改進
Feature-based clustering 演算法。我們從局部特徵與封閉頻繁序
列重複片段之間的關係，觀察到一個有趣的結果。我們的重要發現
是，在封閉頻繁序列重複片段中的某些特徵可以拆成幾個局部的特
徵，而且這幾個局部特徵出現在資料庫中的總數量會等於在封閉頻繁
序列重複片段中的所對應的特徵出現在資料庫中的數量。全域方法和
局部方法在找出序列重覆片段之後都有兩個階段，尋找特徵和標記特
徵。在我們的 Hybrid 演算法中的方法 1（LocalG），我們首先找到局
部的特徵並且作標記。然後，我們找到了全域的特徵。最後，我們從
已經標記好的局部特徵向量有效率地去標記出全域特徵的向量。在我
們的 Hybrid 演算法的方法 2（CLoseLG），我們首先直接找出封閉頻
繁序列的重覆片段。接下來，我們有效率地從封閉頻繁序列的重覆片
段找到局部的候選特徵，然後標記出局部特徵。最後，我們找出全域
特徵並且作標記。根據模擬的結果，我們證明了我們提出的 Hybrid
演算法比 Feature-based clustering 演算法更有效率。

Proteins are
the structural components of living cells and tissues, and thus an
important building block in all living organisms. Patterns in
proteins sequences are some subsequences which appear frequently.
Patterns often denote important functional regions in proteins and
can be used to characterize a protein family or discover the
function of proteins. Moreover, it provides valuable information
about the evolution of species. Grouping protein sequences that
share similar structure helps in identifying sequences with similar
functionality. Many algorithms have been proposed for clustering
proteins according to their similarity, i.e., sequential
patterns in protein databases, for example, feature-based clustering
algorithms of the global approach and the local approach. They use
the algorithm of mining sequential patterns to solve the
no-gap-limit sequential pattern problem in a protein sequences
database, and then find global features and local features
separately for clustering. Feature-based clustering algorithms are
entirely different approaches to protein clustering that do not
require an all-against-all analysis and use a near-linear
complexity K-means based clustering algorithm. Although
feature-based clustering algorithms are scalable and lead to
reasonably good clusters, they consume time on performing the global
approach and the local approach separately. Therefore, in this
thesis, we propose hybrid algorithms to find and mark features for
feature-based clustering algorithms. We observe an interesting
result from the relation between the local features and the closed
frequent sequential patterns. The important observation which we
find is that some features in the closed frequent sequential
patterns can be taken apart to several features in the local
selected features and the total support number of these features in
the local selected features is equal to the support number of the
corresponding feature in the closed frequent sequential patterns.
There are two phases, find-feature and mark-feature, in the global
approach and the local approach after mining sequential patterns. In
our hybrid algorithms of Method 1 (LocalG), we first find and mark
the local features. Then, we find the global features. Finally, we
mark the bit vectors of the global features efficiently from the bit
vector of the local features. In our hybrid algorithms of Method 2
(CLoseLG), we first find the closed frequent sequential patterns
directly. Next, we find local candidate features efficiently from
the closed frequent sequential patterns and then mark the local
features. Finally, we find and mark the global features. From our
performance study based on the biological data and the synthetic
data, we show that our proposed hybrid algorithms are more efficient
than the feature-based algorithm.

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Patterns in Protein Sequences . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Definitions of Protein Sequence Patterns . . . . . . . . . . . . 2
1.3 The K-means Clustering Algorithm . . . . . . . . . . . . . . . . . . . 8
1.4 The Problem of Protein Clustering . . . . . . . . . . . . . . . . . . . 9
1.5 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 16
2. A Survey of Algorithms for Protein Clustering . . . . . . . . . . . 17
2.1 Mining Sequential Patterns . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 The PrefixSpan Algorithm . . . . . . . . . . . . . . . . . . . 19
2.2 The Vector-Space K-Means Clustering Algorithm . . . . . . . . . . . 21
2.3 Feature-Based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 CloSpan: Closed Sequential Patterns Mining . . . . . . . . . . . . . . 28
2.5 The CONTOUR Algorithm . . . . . . . . . . . . . . . . . . . . . . . 29
3. Hybrid Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 Observation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Method 1 (LocalG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Method 2 (CLoseLG) . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Page
3.4 Counterexamples for CONTOUR . . . . . . . . . . . . . . . . . . . . 45
4. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 The Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Biological Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

[1] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. of the 11th
Int. Conf. on Data Eng. , pp. 3 – 14, 1995.
[2] S. Altschul, T. Madden, A. Schafer, J. Zhang, Z. Zhang, W. Miller, and D. Lip-
man, “Gapped Blast and Psi-Blast: A New Generation of Protein Database
Search Program,” Proc. of the Nucleic Acids Research, pp. 3389 – 3402, 1997.
[3] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, “Sequential Pattern Mining Us-
ing a Bitmap Representation,” Proc. of the 8th ACM SIGKDD Int. Conf. on
Knowledge Discovery and Data Mining, pp. 429–435, 2002.
[4] M. Crochemore and M. Sagot, Motifs in Sequences: Localization and Extraction.
Marcel Dekker, first ed., 2001.
[5] P. G. Ferreira and P. J. Azevedo, “Query Driven Sequence Pattern Mining,”
Proc. of the XXI Simpsio Brasileiro de Banco de Dados(SBBD), 2006.
[6] P. G. Ferreira and P. J. Azevedo, “Evaluating Deterministic Motif Significance
Measures in Protein Databases,” Algorithms fo Molecular Biology, Vol. 2, No. 16,
Dec. 2007.
[7] V. Guralnik and G. Karypis, “A Scalable Algorithm for Clustering Sequential
Data,” Proc. of IEEE Int. Conf. on Data Mining, pp. 179–186, 2001.
[8] J. Han, J. Pei, and Y. Yin, “CURE: An Efficient Clustering Algorithm for Large
Databases,” Information Systems, Vol. 26, No. 1, pp. 35–58, March 2001.
[9] J. Han, J. Pei, B. M. Asl, Q. Chen, U. Dayal, and M. C. Hsu, “FreeSpan:
Frequent Pattern-Projected Sequential Pattern Mining,” Proc. of the 6th ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 355–359,
2000.
[10] I. Jonassen, “http://www.ii.uib.no/ inge/patterns.html,” Patterns in biose-
quences.
[11] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction
to Cluster Anslysis,” Wiley Series in Probability and Mathematical Statistics
Applied Probability and Statistics, New York: Wiley, 1990.
[12] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction
to Cluster Anslysis,” Wiley Series in Probability and Mathematical Statistics
Applied Probability and Statistics, New York: Wiley, 1990.
[13] C. Lee, “Computational Biology,”
http://www.csie.ncnu.edu.tw/ rctlee/biology.html.
[14] M. Lin and S. Lee, “Incremental Update on Sequential Patterns in Large
Database,” Proc. of the 10th IEEE Int. Conf. Tools with Artificial Intelligence,
pp. 24 – 31, 1998.
[15] G. K. M.J. Joshi and V. Kumar, “Universal Formulation of Sequential Pat-
terns,” Technical report, University of Minnesota, Department of Computer Sci-
ence Minneapolis, 1999.
[16] M.J.Zaki, “Efficient Enumeration of Frequent Sequences,” Proc. of the 7th
Int. Conf. on Information and Knowledge Management, pp. 68 – 75, 1998.
[17] J. Pei, J. Han, M. A. Behzad, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu,
“PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern
Growth,” Int. Conf. on Data Engineering, pp. 215–224, 2001.
[18] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and
Performance Improvements,” Proc. of the 5th Int. Conf. on Extending Database
Technology, pp. 3 – 17, 1996.
[19] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clus-
tering Techniques,” Proc. of the Knowledge Discovery and Data Mining, pp. 1 –
2, 2000.
[20] J. Wang and J. Han, “BIDE: Efficient Mining of Frequent Closed Sequences,”
Proc. of the 20th Int. Conf. on Data Engineering, pp. 79–90, 2004.
[21] J. Wang, Y. Zhang, L. Zhou, G. Karypis, and C. C. Aggarwal, “Discriminating
Subsequence Discovery for Sequence Clustering,” Proc. of the Sequential Data
Mining, pp. 605–610, 2007.
[22] J. Wang, Y. Zhang, L. Zhou, G. Karypis, and C. C. Aggarwal, “CONTOUR: an
Efficient Algorithm for Discovering Discriminating Subsequences,” Proc. of the
Data Mining and Knowledge Discovery, pp. 1 – 29, 2009.
[23] Wikipedia, “http://en.wikipedia.org/wiki/Protein,” .
[24] X. Yan, J. Han, and R. Afshar, “CloSpan: Mining Closed Sequential Patterns
in Large Datasets,” Proc. of the Sequential Data Mining, pp. 166–177, 2003.
[25] J. Yang, J. S. Deogun, and Z. Sun, “A New Scheme for Protein Sequence Motif
Extraction,” Proc. of the 38th Hawii Int. Conf. on System Sciences, pp. 280a –
280a, 2005.
[26] M. Zaki and C. Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset
Mining,” Proc. of the SIAM Int. Conf. on Data Mining, pp. 457 – 473, 2002.

國圖紙本論文

推文
網路書籤
推薦
評分
引用網址
轉寄

top

相關論文
相關期刊
熱門點閱論文

無相關論文

1.	蘇麗玉。〈布瑞頓及其聖誕頌歌合唱曲之研究〉。《藝術學報》54期(6月份，1994)。頁119-131。
2.	潘宇文。〈歌聲培訓的基本常識與暖身活動〉。《國教天地》148期(6月份，2002)。頁86-93。
3.	潘宇文。〈合唱隊形配置之解析〉。《國教天地》140期(7月份，2000)。頁61-69。
4.	吳博明。〈國小器樂合奏樂曲改編之我見〉。《國民教育》22期(4月份，1979)。28-36頁。

1.	利用Mathematica計算分佈函數與特徵函數
2.	非同步多跳隨意無線網路中最佳完全動態調適之電源管理協定
3.	從系統動力學觀點審視「投資現金流量」與「營運現金流量」以探討台灣DRAM產業之投資策略
4.	結構化與物件導向系統分析與設計之比較：塑模工具與產出
5.	台灣印度洋熱帶鮪類遠洋漁船動態研究
6.	探討SMAD4基因訊息網絡抑制胰臟癌轉移與抗藥性能力
7.	金奈米粒子偵測同半胱胺酸與研究S–腺苷高半胱胺酸水解酶活性
8.	摻鉻釔鋁石榴石雙纖衣晶體光纖主動元件之近場模態分析
9.	以系統動力學方法探討藥價政策對醫療機構績效之研究---以某地區教學醫院為例
10.	淨水程序與配水管網中生物可分解有機質變化與預測之研究
11.	以活性碳生物程序去除水處理系統中生物可利用有機碳及消毒副產物潛勢之研究
12.	Sn/3.0Ag/0.5Cu錫球潛變與推力實驗之研究
13.	IEEE802.15.4協定堆疊實作與硬體ASIC設計以及應用於生醫監控之生理訊號轉送器
14.	利用位移鎖相迴路技術之寬頻頻率合成器設計與實現
15.	功率放大器之記憶效應分析與功率結合設計

簡易查詢 | 進階查詢 | 熱門排行 | 我的研究室