Graduate student: Mei-Hsien Lee (李美賢)
Thesis title: Two Methods in Association Analysis: (1) Likelihood Ratio Test with Clustered Haplotypes; (2) Kernel Canonical Correlation Analysis
Thesis title (original Chinese): 兩種進行相關性分析的方法:(1)群聚單倍體之概似函數檢定;(2)核函數之典型相關分析
Advisors: 蕭朱杏, 陳素雲
Degree: Doctoral
Institution: National Taiwan University
Department: Graduate Institute of Epidemiology
Discipline: Medicine and Health
Field: Public Health
Thesis type: Academic thesis
Graduation academic year: 96 (ROC calendar)
Language: English
Number of pages: 68
Keywords: association study; bioinformatics; clustering; evolution; haplotype; haplotype ambiguity; kernel canonical correlation analysis; likelihood function; statistical learning; SNP
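Abstract (translated from the Chinese):
  Association analysis is a common statistical method. For example, recent work on complex diseases often uses the association between genes and heritable traits to locate the genes that influence a trait; this approach is known as an association study. Machine learning theory, in turn, is mainly concerned with analyzing the association between two sets of multidimensional data.
  Association studies fall into two main designs, distinguished by experimental design and by how subjects are selected: population-based case-control studies and family-based association studies. Earlier research and the commonly used statistical genetic analyses have usually treated the two designs separately and developed dedicated software for each. The methods of analysis can in turn be divided into nonparametric and parametric approaches. Whichever method is used, however, it becomes difficult to apply when haplotype frequencies are too low, when the dimension is too high, or when the data set is too large. The first aim of this thesis concerns haplotype data. Working with the likelihood function and taking an evolutionary perspective, haplotypes that descend from a common ancestor are clustered into one group, which reduces the dimension of the model parameters, and the uncertainty about transmitted and non-transmitted haplotypes is taken into account. This addresses the excess of model parameters, the poor power, and the inefficiency caused by haplotype ambiguity, and so raises the power of the association test to detect genetic factors with small effects in complex diseases. Finally, simulation studies with family data report the type I error and power of the method under a range of scenarios.
  The second part of this thesis addresses large data sets and discusses the association between two sets of multidimensional variables from the viewpoint of statistical learning theory. This part focuses mainly on bioinformatics. Canonical correlation analysis, proposed by Hotelling (1936), measures the linear relationship between two sets of multivariate data. When the two sets are not linearly related, or when the data are not multivariate normal, canonical correlation analysis cannot adequately extract the information in the data. This study therefore proposes kernel canonical correlation analysis (KCCA) to measure the degree of association between two sets of data and, further, to test that association. The study also introduces basis selection, which handles the computational burden of large data sets more effectively. In addition, applying KCCA to association problems in large or genome-wide data avoids the problem that haplotype configurations and frequencies cannot be estimated. Finally, simulation studies and two examples illustrate the performance of the method in association testing and classification.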
Abstract (English):
Association analysis is a common statistical method. For instance, to investigate the association between diseases and genetic markers, scientists conduct association studies to detect liability loci. There are two basic study designs: population-based case-control studies and family-based association studies. Research usually focuses on one specific design and develops methodology for it. Current statistical analyses can be roughly categorized into nonparametric and parametric methods. Difficulties arise, however, when some haplotypes have small frequencies, when the degrees of freedom of the association test are large, and when the data set is enormous. In the first part of this thesis, we adopt the parametric likelihood approach, use an evolutionary clustering tool for minor haplotypes to reduce the dimensionality corresponding to the number of haplotypes, and take into account the uncertainty in the transmission phase. Simulation studies and comparisons with Famhap and FBAT show that the likelihood ratio test with clustered haplotypes outperforms these two methods.
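To illustrate why clustering minor haplotypes lowers the degrees of freedom of a likelihood ratio test, here is a minimal Python sketch. It is not the thesis's method: it assumes phase-known haplotype counts from a population-based case-control sample (no family data and no transmission uncertainty), and it replaces evolutionary clustering with a simple stand-in that merges each rare haplotype into the nearest common haplotype by Hamming distance. The counts, the 5% frequency cutoff, and the function names `cluster_rare` and `lrt_statistic` are all hypothetical.

```python
import math

def hamming(a, b):
    """Number of SNP positions at which two haplotype strings differ."""
    return sum(p != q for p, q in zip(a, b))

def cluster_rare(counts, min_freq=0.05):
    """Merge each rare haplotype into its nearest common haplotype.

    counts maps haplotype -> (case_count, control_count). The Hamming-distance
    rule is an assumed simplification, not the evolutionary clustering of the thesis.
    """
    total = sum(case + ctrl for case, ctrl in counts.values())
    common = [h for h, (case, ctrl) in counts.items()
              if (case + ctrl) / total >= min_freq]
    merged = {h: list(counts[h]) for h in common}
    for h, (case, ctrl) in counts.items():
        if h in merged:
            continue
        # Ties in distance are broken alphabetically (an arbitrary choice).
        nearest = min(common, key=lambda g: (hamming(h, g), g))
        merged[nearest][0] += case
        merged[nearest][1] += ctrl
    return merged

def lrt_statistic(table):
    """Likelihood ratio (G^2) statistic for independence in a K x 2 table.

    Returns (G^2, degrees of freedom), with df = K - 1.
    """
    rows = [tuple(r) for r in table.values()]
    n = sum(a + b for a, b in rows)
    col = (sum(a for a, _ in rows), sum(b for _, b in rows))
    g2 = 0.0
    for row in rows:
        for j, obs in enumerate(row):
            if obs > 0:
                expected = sum(row) * col[j] / n
                g2 += 2.0 * obs * math.log(obs / expected)
    return g2, len(rows) - 1

# Hypothetical haplotype counts over four SNPs: (cases, controls).
counts = {
    "0000": (60, 90), "1100": (70, 40), "0011": (40, 50),
    "1000": (4, 1), "0111": (2, 3), "1101": (5, 2), "0001": (1, 4),
}

g2_full, df_full = lrt_statistic(counts)
clustered = cluster_rare(counts)
g2_clustered, df_clustered = lrt_statistic(clustered)
print(f"full table: G2={g2_full:.2f}, df={df_full}")
print(f"clustered:  G2={g2_clustered:.2f}, df={df_clustered}")
```

Merging categories can only lower the statistic, but it also lowers the degrees of freedom of the reference chi-square distribution. When the rare haplotypes carry little signal of their own, that trade can favor the clustered test.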
The second part of this thesis tackles association testing from the perspective of statistical learning theory, with an emphasis on the bioinformatics viewpoint. To measure the association between two sets of random variables, Hotelling (1936) proposed classical linear canonical correlation analysis (LCCA). Its application, however, is limited to linear association and relies on a normality assumption. We introduce a nonparametric kernel canonical correlation analysis (KCCA) for measuring nonlinear association between two sets of variables and propose a new independence test under KCCA. KCCA can be applied directly to genotype data, which avoids inferring haplotype phase and estimating haplotype frequencies. Implementation issues are discussed, and numerical experiments comparing KCCA with other nonparametric methods are presented.
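The contrast between linear and kernel canonical correlation can be sketched in a few lines of Python with NumPy. This is an illustrative sketch under stated assumptions, not the thesis's implementation: it uses a Gaussian kernel evaluated against a random subset of the data as basis points (a reduced-kernel shortcut), adds a small ridge term for regularization, and takes the first canonical correlation as the association measure. The synthetic data, the kernel width `gamma`, the ridge value, and the function names are all hypothetical.

```python
import numpy as np

def gaussian_features(Z, basis, gamma):
    """Gaussian kernel values between rows of Z and a set of basis points."""
    d2 = (Z**2).sum(1)[:, None] + (basis**2).sum(1)[None, :] - 2.0 * Z @ basis.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def first_canonical_corr(U, V, ridge=1e-3):
    """Leading canonical correlation between column-centered U and V.

    The ridge term (an assumed value) keeps both covariance matrices
    positive definite so that the Cholesky factors exist.
    """
    U = U - U.mean(axis=0)
    V = V - V.mean(axis=0)
    n = U.shape[0]
    Suu = U.T @ U / n + ridge * np.eye(U.shape[1])
    Svv = V.T @ V / n + ridge * np.eye(V.shape[1])
    Suv = U.T @ V / n
    Lu = np.linalg.cholesky(Suu)
    Lv = np.linalg.cholesky(Svv)
    # M = Lu^{-1} Suv Lv^{-T}; its singular values are the canonical correlations.
    M = np.linalg.solve(Lu, Suv)
    M = np.linalg.solve(Lv, M.T).T
    return np.linalg.svd(M, compute_uv=False)[0]

# Synthetic data with a purely nonlinear (quadratic) relationship.
rng = np.random.default_rng(0)
n, m = 400, 20
x = rng.uniform(-1.0, 1.0, size=(n, 1))
y = x**2 + 0.05 * rng.standard_normal((n, 1))

# Linear CCA on the raw variables misses the dependence.
rho_linear = first_canonical_corr(x, y)

# Kernel CCA: map each variable to Gaussian kernel features on m basis points.
idx = rng.choice(n, size=m, replace=False)
Kx = gaussian_features(x, x[idx], gamma=2.0)
Ky = gaussian_features(y, y[idx], gamma=2.0)
rho_kernel = first_canonical_corr(Kx, Ky)

print(f"linear CCA: {rho_linear:.3f}, kernel CCA: {rho_kernel:.3f}")
```

Because the kernel features are flexible, the sample value of the kernel correlation is inflated even under independence. A test of independence therefore needs a reference distribution for the statistic (for example from permutations) rather than a fixed cutoff.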
1 Introduction

2 Likelihood Ratio Test with Clustered Haplotypes
2.1 Notations
2.2 Clustered Haplotypes
2.3 Recoding Haplotype Frequencies
2.4 Likelihood Ratio Test

3 Simulation Studies for LRT
3.1 Simulation Scheme
3.2 Results

4 Kernel Canonical Correlation Analysis
4.1 Kernel Canonical Correlation Analysis
4.2 Implementation of KCCA
4.2.1 Regularization
4.2.2 Parameter Selection
4.3 Measure and Test of Association

5 Simulation Studies and Applications for KCCA
5.1 Two Synthetic Data Sets
5.2 Tests of Independence
5.3 Applications

6 Future Work

Appendix 1 Flowcharts of likelihood ratio tests with clustered haplotypes
Appendix 2 Formulas for prevalence, penetrance, and liability allele frequencies
References
Agresti, A. (2002). Categorical Data Analysis, 2nd ed. John Wiley & Sons, New York.

Akaho, S. (2001). A kernel method for canonical correlation analysis. International Meeting of Psychometric Society (IMPS2001).

Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley, New York.

Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. J. Mach. Learning Res., 3, 1--48.

Barrett, J. C., Fry, B., Maller, J., and Daly, M. J. (2005). Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21, 263--265.

Bartlett, M. S. (1947a). Multivariate analysis. Supp. J. Roy. Statist. Soc., 9, 176--197.

Bartlett, M. S. (1947b). The general canonical correlation distribution. Ann. Math. Statist., 18, 1--17.

Becker, T., and Knapp, M. (2004) Maximum-likelihood estimation of haplotype frequencies in nuclear families. Genet. Epidemiol., 27: 21--32.

Clayton, D. (1999) A generalization of the transmission/disequilibrium test for uncertain-haplotype transmission. Am. J. Hum. Genet., 65: 1170--1177.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273--297.

Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK.

Dauxois, J. and Nkiet, G. M. (1997). Canonical analysis of two Euclidean subspaces and its applications. Linear Algebra Appl., 264, 355--388.

Dauxois, J. and Nkiet, G. M. (1998). Nonlinear canonical analysis and independence tests. Ann. Statist., 26, 1254--1278.

Dauxois, J. and Nkiet, G. M. (2002). Measure of association for Hilbert subspaces and some applications. J. Multivariate. Anal., 82, 263--298.

Dauxois, J., Nkiet, G. M. and Romain, Y. (2004). Canonical analysis relative to a closed subspace. Linear Algebra Appl., 388, 119--145.

Dauxois, J., Romain, Y. and Viguier, S. (1993). Comparison of two factor subspaces. J. Multivariate Anal., 44, 160--178.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B, 39(1): 1--38.

Eubank, R. and Hsing, T. (2006). Canonical correlation for stochastic processes. Preprint. http://www.stat.osu.edu/~hsing/papers/CCpaper-rev1.pdf.

Falk, C. T. and Rubinstein, P. (1987) Haplotype relative risk: an easy reliable way to construct a proper control sample for risk calculation. Ann. Hum. Genet., 51: 227--233.

Gretton, A., Herbrich, R. and Smola, A. (2003). The kernel mutual information. Technical Report, MPI for Biological Cybernetics, Tuebingen, Germany.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O. and Schölkopf, B. (2005). Kernel methods for measuring independence. J. Machine Learning Research, 6, 2075--2129.

Hardoon, D. R., Szedmak, S. and Shawe-Taylor, J. (2004). Canonical correlation analysis: an overview with application to learning methods. Neural Computation, 16, 2639--2664.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.

Horvath, S., Xu, X., Lake, S. L., Silverman, E. K., Weiss, S. T. and Laird, N. M. (2004) Family-based tests for associating haplotypes with general phenotype data: application to asthma genetics. Genet. Epidemiol., 26: 61--69.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321--377.

Hsing, T., Liu, L.-Y., Brun, M. and Dougherty, E. R. (2005). The coefficient of intrinsic dependence. Pattern Recognition, 38, 623--636.

Huang, C. M., Lee, Y. J., Lin, D. K. J. and Huang, S. Y. (2007). Model selection for support vector machines via uniform design. Computational Statistics and Data Analysis (special issue on Machine Learning and Robust Data Mining), 52, 335--346.

Huang, S. Y. and Hwang, C. R. (2006). Kernel Fisher discriminant analysis in Gaussian reproducing kernel Hilbert spaces: Theory. Technical report, Institute of Statistical Science, Academia Sinica. http://www.stat.sinica.edu.tw/syhuang/.

Huang, S. Y., Lee, M. H. and Hsiao, C. K. (2007). Nonlinear measures of association with kernel canonical correlation analysis and applications. Submitted.

Jensen, D. R. and Mayer, L. S. (1977). Some variational results and their applications in multiple inference. Ann. Statist., 5, 922--931.

Kuss, M. and Graepel, T. (2003). The geometry of kernel canonical correlation analysis. Technical report, Max Planck Institute for Biological Cybernetics, Germany.

Laird, N. M., Horvath, S. and Xu, X. (2000) Implementing a unified approach to family-based tests of association. Genet. Epidemiol., 19(Suppl 1): S36--S42.

Lee, Y. J. and Huang, S. Y. (2007). Reduced support vector machines: a statistical theory. IEEE Trans. Neural Networks, 18, 1--13.

Lee, Y. J. and Mangasarian, O. L. (2001). RSVM: reduced support vector machines. Proceedings of the 1st International Conference on Data Mining, SIAM.

Liang, L., Zöllner, S. and Abecasis, G. R. (2007) Genome: a rapid coalescent-based whole genome simulator. Bioinformatics, 23: 1565--1567.

Newman, D. J., Hettich, S., Blake, C. L. and Merz, C. J. (1998). UCI Repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, Department of Information and Computer Science.

Ott, J. (1989) Statistical properties of the haplotype relative risk. Genet. Epidemiol., 6: 127--130.

Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis. Springer.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M. and Poland, G. A. (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet., 70: 425--434.

Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.

Sham, P. (1998). Statistics in human genetics. Arnold, New York, N.Y.

Shannon, C. E. (1948) A mathematical theory of communication. Bell System Tech. J., 27: 379--423, 623--656.

Smola, A. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conf. on Machine Learning, 911--918. Morgan Kaufmann, San Francisco, CA.

Snelson, E. and Ghahramani, Z. (2005). Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf and J. Platt, editors, Advances in Neural Information Processing Systems, 18, MIT Press, Cambridge, MA.

Spielman, R. S., McGinnis, R. E. and Ewens, W. J. (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet., 52: 506--516.

Spielman, R. S. and Ewens, W. J. (1996). The TDT and other family-based tests for linkage disequilibrium and association. Am. J. Hum. Genet., 59: 983--989.

Tzeng, J. Y., Devlin, B., Wasserman, L. and Roeder, K. (2003) On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am. J. Hum. Genet., 72: 897--902.

Tzeng, J. Y. (2005) Evolutionary-based grouping of haplotypes in association analysis. Genet. Epidemiol., 28: 220--231.

Tzeng, J. Y., Wang, C. H., Kao, J. T., and Hsiao, C. K. (2006) Regression-based association analysis with clustered haplotypes using genotypes. Am. J. Hum. Genet., 78: 231--242.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.

Wang, J., Neskovic, P. and Cooper, L.N. (2005). Training data selection for support vector machines. In Lipo Wang, Ke Chen and Yew-Soon Ong, editors, Advances in Natural Computation: Proceedings, Part I, First International Conference, Lecture Notes in Computer Science 3610, 554--564, Springer-Verlag, Berlin.

Williams, C. K. I. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems, 13, 682--688, Cambridge, MA, MIT Press.