
National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Author: Ying-Hsueh Huang (黃櫻雪)
Thesis title: A study of statistical measures of DNA sequence similarity under mutation (突變時基因序列相似性之研究)
Advisor: Tiee-Jian Wu (吳鐵間)
Degree: Master's
Institution: National Cheng Kung University (國立成功大學)
Department: Department of Statistics (master's and doctoral program)
Discipline: Mathematics and Statistics
Field: Statistics
Thesis type: Academic thesis
Year of publication: 2002
Graduation academic year: 90 (ROC calendar)
Language: English
Number of pages: 53
Keywords: Mutation; Standardized Euclidean distance; Kullback-Leibler discrepancy; DNA sequences; BLAST; Dissimilarity measures
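Abstract (translated from the Chinese)
Quantifying the dissimilarity or similarity between biological sequences has long been an important issue in molecular biology. Earlier work has produced several methods for measuring the dissimilarity or similarity between DNA sequences. The aims of this study fall into four parts:
(1) to compare, through large-scale computer simulation, the performance of four measures: BLAST, the Euclidean distance, the standardized Euclidean distance, and the Kullback-Leibler discrepancy;
(2) to examine how each of the three word-based measures (Euclidean distance, standardized Euclidean distance, Kullback-Leibler discrepancy) relates to the amount of mutation;
(3) to determine by simulation the optimal word length for each of these three measures at different DNA sequence lengths;
(4) to build a histogram of the Kullback-Leibler discrepancy from a large-scale simulation comprising the distance scores between 100,000 randomly generated sequences and their mutated versions, and to use it for statistical inference about the dissimilarity between two sequences.
The simulation study and the real data analysis show that:
(1) for every measure, the optimal word size increases as the window size increases;
(2) when the optimal word size is chosen, the Kullback-Leibler discrepancy is the most sensitive to mutation in DNA sequences;
(3) by applying interpolation, any window size between 100 and 1,600 can be chosen and the Kullback-Leibler discrepancy used to estimate and test the dissimilarity between two DNA sequences.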
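The word-based measures named in this abstract compare how often each short word (k-mer) occurs in two sequences, with no alignment step. The record does not give the thesis's formulas, so the following is only a minimal sketch of one common formulation: the function names, the pseudocount of 0.5 (used here so the logarithm is always defined), and the use of relative rather than raw frequencies are all assumptions made for illustration. The standardized Euclidean distance is omitted because it needs variance terms that the record does not describe.

```python
from collections import Counter
from itertools import product
from math import log, sqrt

ALPHABET = "ACGT"


def word_frequencies(seq, k, pseudocount=0.5):
    """Relative frequencies of all 4**k overlapping words of length k.

    The pseudocount is an assumption of this sketch; it keeps every
    frequency strictly positive so that the KL discrepancy is finite.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def euclidean_distance(f, g):
    """Euclidean distance between two word-frequency vectors."""
    return sqrt(sum((f[w] - g[w]) ** 2 for w in f))


def kl_discrepancy(f, g):
    """Kullback-Leibler discrepancy of f from g (not symmetric)."""
    return sum(f[w] * log(f[w] / g[w]) for w in f)
```

For two sequences `x` and `y` and a word size `k`, one would call `kl_discrepancy(word_frequencies(x, k), word_frequencies(y, k))`. Both measures are zero for identical frequency vectors and grow as the word compositions diverge.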
Abstract
Quantifying the similarity between two biological sequences is an important problem in molecular biology, and several measures of DNA sequence dissimilarity have been developed. The purpose of this thesis is fourfold. First, we use extensive simulation to compare the performance of several word-based methods with the benchmark method BLAST, which requires sequence alignment. The word-based methods we consider are the Euclidean distance, the standardized Euclidean distance, and the Kullback-Leibler discrepancy. Second, we study the relation between each of these three measures and the amount of mutation between two sequences. Third, we use simulation to determine the optimal word size for each measure at different sequence lengths (or window sizes). Fourth, we use a large-scale simulation, consisting of 100,000 comparisons between a randomly generated sequence and a mutated version of that sequence, to construct a histogram of the Kullback-Leibler discrepancy. This histogram can in turn be used to make statistical inferences about the degree of dissimilarity between two DNA sequences. Our simulation study and real data analysis show that (i) for each measure, the optimal word size increases as the window size increases; (ii) when the optimal word size is used, the Kullback-Leibler discrepancy performs best; and (iii) interpolation can be applied so that any sliding window size between 100 and 1,600 can be used with the Kullback-Leibler discrepancy to estimate and test the degree of dissimilarity between any pair of DNA sequences.
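The simulation described in the abstract (score a random sequence against a mutated copy, repeat many times, then read cutoffs from the resulting distribution) might be sketched as follows. This is an illustration under stated assumptions, not the thesis's design: the mutation model here is substitution-only with an independent per-base rate, bases are drawn uniformly, the pseudocount of 0.5 is arbitrary, and every function name is hypothetical. The thesis reports 100,000 comparisons; the usage note below uses far fewer to keep the run short.

```python
import random
from collections import Counter
from itertools import product
from math import log

ALPHABET = "ACGT"


def random_sequence(n, rng):
    """Uniform random DNA sequence of length n (assumed base model)."""
    return "".join(rng.choice(ALPHABET) for _ in range(n))


def mutate(seq, rate, rng):
    """Substitute each base independently with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def kl_score(x, y, k, pseudocount=0.5):
    """KL discrepancy between the k-word frequencies of x and y."""
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    tx = sum(cx[w] for w in words) + pseudocount * len(words)
    ty = sum(cy[w] for w in words) + pseudocount * len(words)
    score = 0.0
    for w in words:
        fx = (cx[w] + pseudocount) / tx
        fy = (cy[w] + pseudocount) / ty
        score += fx * log(fx / fy)
    return score


def simulate_scores(n_pairs, length, k, rate, seed=0):
    """Sorted scores between random sequences and their mutated copies."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        x = random_sequence(length, rng)
        scores.append(kl_score(x, mutate(x, rate, rng), k))
    return sorted(scores)


def upper_point(sorted_scores, alpha):
    """Empirical score exceeded by roughly a fraction alpha of the pairs."""
    index = min(len(sorted_scores) - 1, int((1 - alpha) * len(sorted_scores)))
    return sorted_scores[index]
```

For example, `scores = simulate_scores(200, 400, 3, 0.1)` followed by `upper_point(scores, 0.05)` gives an empirical cutoff for that mutation rate. An observed score above the cutoff would suggest that two sequences differ by more than the simulated amount of mutation.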
Table of Contents

Chapter 1 Introduction
 1.1 DNA Sequence
 1.2 Mutations in the DNA
 1.3 Literature Review

Chapter 2 Dissimilarity Measures Based on Distances between Frequencies of Words

Chapter 3 Simulation Study
 3.1 Simulation Design
 3.2 Sensitivity of Dissimilarity Measures
 3.3 Performances of Dissimilarity Measures
 3.4 Estimation and Test of the Degree of Dissimilarity between Two DNA Sequences

Chapter 4 Real Data Analysis

Chapter 5 Conclusions and Further Research

References

Appendix



List of Tables

Table 1.1 GenBank Data from 1982 to 2001
Table 3.1 Example of calculating rank penalty scores using BLAST as a similarity search tool
Table 3.2 Example of calculating rank penalty scores using the Euclidean distance, standardized Euclidean distance, or Kullback-Leibler discrepancy as a dissimilarity search tool
Table 4.1 Degree of dissimilarity and similarity scores using [symbol missing in source] and BLAST, respectively, sorted from highest to lowest similarity

Appendix
Table A1 Comparison of average rank penalty scores of dissimilarity measures over 5,000 sequences
Table A2 (a)-(b) Estimation of the degree of dissimilarity using Kullback-Leibler discrepancy
Table A3 Upper points for the distribution of Kullback-Leibler discrepancy
Table A4 Query (HSLIPAS) and test dataset of 39 sequences


List of Figures

Figure 1.1 Growth of GenBank Data from 1982 to 2001
Figure 1.2 Types of point mutations
Figure 3.1 (a) Data structure for Section 3.2
Figure 3.1 (b) Data structure for Sections 3.3-3.4
Figure 3.2 Relations between the amount of mutation and the average value of d
Figure 3.3 Relations between the amount of mutation and the average value of S
Figure 3.4 Relations between the amount of mutation and the average value of I
Figure 3.5 Average rank penalty scores over 5,000 sequences at different window sizes
Figure 3.6 Average rank penalty scores over 5,000 sequences for several word-based dissimilarity measures
Figure 3.7 The optimal word size of Kullback-Leibler discrepancy and Euclidean distance at different window sizes