跳到主要內容

臺灣博碩士論文加值系統

(18.97.14.82) 您好!臺灣時間:2025/02/07 02:57
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

: 
twitterline
研究生:朱家漢
研究生(外文):Jia-Han Chu
論文名稱:具可變長度及容忍度特徵之字串搜尋比對演算法
論文名稱(外文):Efficient Pattern Matching Algorithms for Variable-Length and Tolerant Strings
指導教授:白敦文
指導教授(外文):Tun-Wen Pai
學位類別:碩士
校院名稱:國立臺灣海洋大學
系所名稱:資訊工程學系
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2004
畢業學年度:92
語文別:中文
論文頁數:74
中文關鍵詞:雜湊編碼階梯式步進法階梯式區間跳躍法投票演算法
外文關鍵詞:Hashing EncodingSteppingInterval JumpingVoting Algorithm
相關次數:
  • 被引用被引用:0
  • 點閱點閱:247
  • 評分評分:
  • 下載下載:23
  • 收藏至我的研究室書目清單書目收藏:0
本論文提出一個可以供多重核甘酸與蛋白質序列進行共同區段之比對演算法,應用數值編碼之唯一性及階梯步進式或區間跳躍式比對之法則,增進共同區段之搜尋速度。演算法共分為三個步驟:編碼、排序與搜尋,編碼階段負責將鹼基或胺基酸轉換至數值空間集合,排序階段採快速排序演算法完成排序任務,而搜尋階段則依數列大小進行步進式比對或依比對樣本長度進行數值區間最佳之均勻切割或位元切割,以增加區間跳躍之機率並提升比對之速度。實驗結果證明本演算法可以將傳統比對所需之時間複雜度由O(mLi(Li-m+1)+mLj(Lj-m+1))降低至O(|Ii|+|Ij|)。爲了能夠提供核甘酸序列兩兩比對的正確次序,本論文運用不同之統計方式針對RNase家族、靈長與非靈長類家族、Yeast GAL promoter 與Yeast MATalpha2 promoter資料分別進行分析,並且可以預測出擁有較少共同子序列統計量的兩條核甘酸序列,除此之外,應用投票演算法機制提供出現率特徵與開放式基座容忍度設定之共同子字串搜尋功能。期望經由結合這些演算法能夠提供具實用價值並且高效率的查詢方法,從多重核甘酸序列中或蛋白質家族序列快速地搜尋出具有依特徵或具有容忍度的共同子序列。
關鍵字:雜湊編碼(Hash coding)、階梯式步進法(Ladderlike Stepping)、階梯式區間跳躍法(Ladderlike Interval Jumping)、投票演算法(Voting Algorithm)。
In this thesis, a novel algorithm for searching common segments in multiple DNA or animo acid sequences is designed. To improve efficiency in pattern searching, combination of hashing encoding, quick sorting and ladderlike stepping and/or interval jumping techniques are applied. Since multiple sequence alignment of DNA or animo acid sequences from the giant genomic database is usually time consuming, we develop a three-phase methodology to search common sub-strings and reduce the time complexity in the comparison. In the first coding phase, DNA or animo acid sequences are transformed into a numerical space set. Subsequently, the quick sort algorithms are employed in the second sorting stage to reorder the encoded data. In the last searching phase, ladderlike stepping and interval jumping rules are proposed to increase efficiencies of numerical comparison. In addition, uniform and bitwise interval segmentation techniques are applied prior to interval jumping procedures. The segmenting methodologies are designed according to the length of searching pattern, and the proposed ladderlike searching algorithms provide improved performance. Experimental results show that the algorithms are capable of reducing time complexity from O(mLi(Li-m+1)+mLj(Lj-m+1)) to O(|Ii|+|Ij|). In order to provide the right pair comparing order for multiple DNA or animo acid sequences, several strategies based on statistics are designed for several classified sequences. In addition to the analysis of comparing order, based on voting mechanism, the representation and tolerant features of sub-string among sequences are also taken into consideration in this thesis.
Keywords: Stepping, Interval Jumping, Hashing Encoding, Voting Algorithm
論文口試委員‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ I
中文摘要‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ II
英文摘要‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ III
目錄‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ IV
圖目錄‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ V
表目錄‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ VII
方程式目錄‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ VIII

壹、背景與動機‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 1
貳、系統架構‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 4
參、符號定義‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 9
肆、演算法說明‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 10
一、階梯式步階比對演算法‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 10
(一)編碼階段‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 10
(二)排序階段‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 11
(三)搜尋階段‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 11
二、階梯式區間跳躍比對演算法‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 12
三、延伸式多重序列比對演算法‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 21
(一)建立共同子字串索引矩陣‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 21
(二)建立具連續性標籤或標記之群組‧‧‧‧‧‧‧‧‧‧‧‧‧ 21
(三)選擇候選群組並找出所有可能之標記組合‧‧‧‧‧‧‧‧‧ 23
(四)找尋共同最小群組之組合‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 24
四、具出現率特徵之共同子字串搜尋演算法‧‧‧‧‧‧‧‧‧‧‧‧ 24
五、具開放式基座容忍度設定之共同子字串搜尋演算法‧‧‧‧‧‧‧ 32
伍、多重序列配對組合統計模式‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 36
一、預測兩序列可能的比對次數‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 37
二、預測兩序列可能的共同子字串數量‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 38
三、章節結論‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 39
四、兩序列最多比對次數與最少比對次數之證明‧‧‧‧‧‧‧‧‧‧ 44
陸、結果與討論‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 51
一、演算法比較與分析‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 51
二、系統比較與實驗結果分析討論‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 53
(一)系統比較‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 53
(二)實驗結果分析討論 ‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 54
柒、未來展望‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 62
参考文獻‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 64
附錄‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧‧ 66
[1] Lincoln Stein, “Genome annotation : from sequence to biology”, Nature reviews genetics, 2:493 – 503, 2001.
[2] Jacek Majewski and Jurg Ott, “Distribution and characterization of regulatory elements in the human genome”, Genome research, 12:1827~1836, 2002.
[3] Len A. Pennacchio and Edward M. Rubin, “Genomic strategies to identify mammalian regulatory sequences” , Nature reviews genetics, vol. 2, 100~109, 2001.
[4] http://www.genome.ad.jp/ , Aug. 14 2003.
[5] http://www-btls.jst.go.jp/ , Aug. 14 2003.
[6] Yutaka Suzuki, Tatsuhiko Tsunoda, Jun Sese, Hirotoshi Taira et.al., “Identification and characterization of the potential promoter regions of 1031 kinds of human genes” , Genome research,11:677–684, 2001.
[7] Gabriela G. Loots, Ivan Ovcharenko, Lior Pachter, Inna Dubchak, and Edward M. Rubin, ”rVista for comparative sequence-based discovery of functional transcription factor binding sites”, Genome research, 12:832–839, 2002.
[8] Ramana V. Davuluri, Ivo Grosse, Michael Q. Zhang , “Computational identification of promoters and first exons in the human genome”, Nature Genetics, 29: 412 – 417, 2001.
[9] Nathan D. Trinklein, Shelley J. Force Aldred, Alok J. Saldanha,and Richard M. Myers, “Identification and functional analysis of human transcriptional promoters”, Genome research, 13:308~312, 2003.
[10] Jacek Majewski and Jurg Ott, “Distribution and characterization of regulatory elements in the human genome”, Genome research, 12:1827~1836, 2002.
[11] Yitzhak Pilpel, Priya Sudarsanam, George M. Church, “Identifying regulatory networks by combinatorial analysis of promoter elements”, Nature Genetics 29:153~159, 2001.
[12] Buhler,J., ”Effective large-scale sequence comparison by locality-sensitive hashing.Bioinformatics”, 17 ,419-428, 2001.
[13] Ning Z,Cox AJ, Mullikin JC., ”SSAHA:a fast search method for large DNA databases”, Genome Research, vol.11,1725-1729, October 2001.
[14] Natalia Volfovsky,Brian J Haas and Steven L Salberg, ”A clustering method for repeat analysis in DNA sequences”, Genome Biology, 2001.
[15] Pierre Baldi and Pierre-François Baisnée, “Sequence analysis by additive scales:DNA structure for sequences and repeats of all lengths”, Bioinformatics, 16, 865-889, 2000.
[16] KARP R.M., RABIN M.O., “Efficient randomized pattern-matching algorithms”, IBM J. Res. Dev. 31(2):249-260, 1987.
[17] CROCHEMORE, M., LECROQ, T., “Pattern matching and text compression algorithms”, in CRC Computer Science and Engineering Handbook, A. Tucker ed., Chapter 8, pp 162-202, CRC Press Inc., Boca Raton, FL, 1996.
[18] KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R., “Fast pattern matching in strings”, SIAM Journal on Computing 6(1):323-350, 1977.
[19] Knuth,D.E., 1998, vol. 3 of The Art of Computer Programming, 2nd edn, Addison Wesley.
[20] BOYER R.S., MOORE J.S., “A fast string searching algorithm.” Communications of the ACM., 20:762-772, 1977.
[21] COLE, R., Tight bounds on the complexity of the Boyer-Moore pattern matching algorithm, SIAM Journal on Computing 23(5):1075-1091, 1994.
[22] SUNDAY D.M., “A very fast substring search algorithm, Communications of the ACM ”, 33(8):132-142, 1990.
[23] Ricardo Baeza-Yates, Gaston H. Gonnet, “A new approach to text searching”, Communications of the ACM, 10, 74-82, vol. 35, October 1992.
[24] Hoare, C. A. R., “Quicksort”, The Computer Journal 5, 10-15, 1962.
[25] H-L Tai, Y-H Chang, J-H Chu, T-W Pai, Margaret D-T Chang, “Primate Promoter Analysis”, 第十九屆生物醫學聯合學術年會, April 2004.
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top