National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

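Graduate student: 莊涵宇 (Han-Yu Chuang)
Thesis title: An Objective-Oriented aspect of Feature Selection for Optimal Experimental Designs and Detection of Significant Genes on Microarray
Thesis title (Chinese): 目標導向式特徵篩選法解生物晶片上最佳化實驗設計與重要基因偵測問題
Advisor: 高成炎 (Cheng-Yan Kao)
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Computer Science and Information Engineering (資訊工程學研究所)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2003
Graduation academic year: 91 (ROC calendar)
Language: English
Number of pages: 81
Keywords (Chinese): 目標導向; 特徵篩選; 資料熔合; 演化式計算; 生物晶片; 最佳化實驗設計; 差異基因表現
Keywords (English): Objective-Oriented; Feature selection; Data fusion; Evolutionary approach; Microarray; Optimal experimental design; differentially expressed gene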
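Abstract (translated from the Chinese): Feature selection is a very important topic in machine learning. From pattern recognition to time-series prediction, problems in science, engineering, and bioinformatics all call for feature selection. This thesis proposes a new feature selection method, the Objective-Oriented approach of feature selection (ObO), and applies it to microarray data analysis, covering optimal experimental design and the detection of significant genes.
ObO holds that the goals of feature selection fall into two classes: 1) measuring, feature by feature, how well each feature explains the data, and ranking the features accordingly; 2) selecting a set of features that, used in combination, helps downstream analysis reach maximum performance. For problems of the first class, ObO uses data fusion techniques to rank features. For the second class, ObO uses evolutionary computation to obtain the desired optimal feature set.
ObO achieves strong results on several microarray bioinformatics problems. For optimal experimental design, we apply ObO's evolutionary computation to several experimental objectives and show that the approach is promising: 1) all treatments to be compared on the microarray are equally important; 2) certain treatments are more important. The experimental results show that the method can effectively tailor an optimized design to each specific experiment. For the detection of significant genes, we use ObO to develop three methods. Two adopt ObO's data fusion concept and rank the thousands of genes on a microarray by importance, according to how strongly their expression is associated with the experimental conditions. The other uses ObO's evolutionary computation to obtain small gene sets with the greatest power to discriminate between samples. The experimental results show that these ObO-derived methods match the performance of commonly used methods and provide more general results.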
Feature selection arises in nearly every machine learning task in science, engineering, and bioinformatics, from pattern recognition to time-series prediction. This thesis proposes a new approach to feature selection, named the objective-oriented (ObO) approach, and presents applications to microarray analysis in two categories: optimal experimental design and detection of significant genes.
Within the ObO framework, the objectives of feature selection fall into two classes: 1) ranking features individually by their ability to explain the underlying concept that generates the data, and 2) combining a set of features to obtain the best possible prediction performance from the target learning algorithm. For the first objective, ObO uses data fusion techniques to rank features by their capability of interpreting the target knowledge. For the second, ObO uses evolutionary approaches to find feature sets that perform best for the target.
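As a rough illustration of the first objective, the sketch below fuses the rankings produced by two scoring methods by averaging each feature's rank. It is a generic rank-combination example and not the thesis's own data fusion procedure; the function names (`ranks`, `fuse_by_rank`) and the toy scores are assumptions made for illustration.

```python
from statistics import mean

def ranks(scores):
    """Rank positions (1 = best) for a list of scores; a higher score is better."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for pos, idx in enumerate(order, start=1):
        r[idx] = pos
    return r

def fuse_by_rank(score_lists):
    """Average each feature's rank across scoring methods; a lower value is better."""
    per_method = [ranks(s) for s in score_lists]
    n = len(score_lists[0])
    return [mean(m[i] for m in per_method) for i in range(n)]

# Hypothetical relevance scores for 4 genes from two illustrative scoring methods.
method_a = [0.9, 0.2, 0.5, 0.7]
method_b = [0.7, 0.1, 0.8, 0.3]

fused = fuse_by_rank([method_a, method_b])
# Gene indices ordered from most to least relevant under the fused rank.
ranking = sorted(range(len(fused)), key=lambda i: fused[i])
print(fused, ranking)
```

Averaging ranks instead of raw scores keeps methods with different score scales comparable, which is one common motivation for rank-based combination.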
ObO has produced promising results on these microarray bioinformatics problems. For the optimal experimental design problem, we demonstrate that the evolutionary approach of ObO is promising for several kinds of experimental objectives: 1) comparisons between all pairs of treatments are of equal interest; 2) comparisons between some pairs of treatments are of greatest interest. The experimental results indicate that the proposed method finds all the optima for the tested problems. For the detection of significant genes, we derive three methods. Two are based on the data fusion techniques of ObO and rank genes by the relevance of their expression levels to the given experimental conditions; the third uses the evolutionary approach of ObO to search for small gene sets with the best discriminability between samples. The experimental results show that these ObO-based methods are competitive with commonly used feature selection methods and provide more general solutions.
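The second objective, searching for a small feature set with good discriminability, can be sketched with a minimal genetic algorithm over bit masks. This is a sketch under stated assumptions, not the thesis's system: the toy data, the fitness function (class-mean gap minus a size penalty), and all parameter values are hypothetical. The table of contents indicates that the thesis itself uses a Gamma test plus Pearson correlation as its fitness function.

```python
import random

# Hypothetical toy data: 6 samples x 5 features; features 0 and 3 separate the classes.
X = [
    [5.0, 1.0, 0.2, 9.0, 3.0],
    [5.2, 0.9, 0.8, 8.7, 2.9],
    [4.9, 1.1, 0.5, 9.2, 3.1],
    [1.0, 1.0, 0.6, 2.0, 3.0],
    [1.2, 0.8, 0.1, 2.3, 3.2],
    [0.9, 1.2, 0.7, 1.9, 2.8],
]
y = [1, 1, 1, 0, 0, 0]

def fitness(mask, size_penalty=0.5):
    """Illustrative fitness: class-mean gaps on selected features minus a size penalty."""
    if not any(mask):
        return float("-inf")  # an empty feature set is never acceptable
    gap = 0.0
    for j, on in enumerate(mask):
        if on:
            pos = [row[j] for row, c in zip(X, y) if c == 1]
            neg = [row[j] for row, c in zip(X, y) if c == 0]
            gap += abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return gap - size_penalty * sum(mask)

def evolve(n_features=5, pop_size=10, generations=30, p_mut=0.1, seed=0):
    """Return the best mask found and the best fitness recorded at each generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        next_pop = [pop[0][:]]  # elitism: carry the best mask over unchanged
        while len(next_pop) < pop_size:
            # Binary tournament selection for each parent.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best_mask, history = evolve()
print(best_mask, history[-1])
```

Because the best mask is copied unchanged into each new generation, the recorded best fitness never decreases. The size penalty is what pushes the search toward small sets.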
Contents
Chapter 1 Introduction
1.1 Feature Selection
1.2 Feature Selection Problems of Microarray
1.2.1 Brief stories of Microarray
1.2.2 Related Feature Selection problems
1.3 Thesis Overview
Chapter 2 The Objective-Oriented Approach of Feature Selection
2.1 Objectives of Feature Selection
2.1.1 Rank for Relevance
2.1.2 Optimal sets for best performance
2.2 Overview
2.3 Data Fusion Techniques
2.3.1 System architecture
2.3.2 Rank and Combination Analysis
2.4 Evolutionary Approaches
2.4.1 Evolutionary algorithms
2.4.2 System architecture
Chapter 3 Finding Optimal Array Sets for Microarray Experiments
3.1 Introduction
3.2 Problem Definition
3.2.1 Graph representation of designs
3.2.2 Statistical model and assumption for evaluating Microarray Designs
3.3 GA for Optimal Experimental Designs
3.3.1 Chromosome representation
3.3.2 Crossover and Mutation
3.4 Experimental Results of Optimal Designs
3.4.1 Equal interest
3.4.2 Weighted interest
3.5 Summary
Chapter 4 Ranking Genes for Discriminability on Microarray Data
4.1 Introduction
4.2 WEighted Punishment on Overlap (WEPO)
4.2.1 Scaling for comparable genes
4.2.2 Sorting, estimating and scoring
4.3 Evaluation
4.3.1 Description of Datasets
4.3.2 Classification - SVM
4.3.3 Informative testing
4.3.4 Sensitive testing
4.3.5 Significance testing
4.4 Experimental Results
4.5 Summary
Chapter 5 Combination of Methods for Identifying Informative Genes from Microarray Data
5.1 Introduction
5.2 Comparison Study of Feature Selection methods
5.2.1 Parametric approaches
5.2.2 Non-parametric approaches
5.3 Combination method
5.4 Experiments: Material and Methods
5.4.1 Classification accuracy using SVM and LOOCV
5.4.2 Weighted Recall of known informative genes
5.4.3 Datasets
5.4.4 Evaluation methods
5.5 Experimental Results
5.6 Summary
Chapter 6 Finding the most discriminant gene sets on Microarray
6.1 Introduction
6.2 A hybrid system for finding optimal gene sets
6.2.1 Preprocessing for reserving discriminative candidate genes
6.2.2 Fitness function of EA step: Gamma test + Pearson correlation
6.3 Experimental Results
6.4 Summary
Chapter 7 Conclusions
7.1 Summary
7.2 Future works
Bibliography
Appendix A List of Publications