National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Author: 唐瑋君 (Wei-Chun Tang)
Title: Selection of Useful Features with Non-linear Dependency Using a Multi-objective Evolutionary Optimization Algorithm
Original title: 以多目標演化式最佳化演算法挑選具有非線性關係之有用特徵
Advisor: 鍾翊方 (I-Fang Chung)
Degree: Master's
Institution: National Yang-Ming University (國立陽明大學)
Department: Institute of Biomedical Informatics (生物醫學資訊研究所)
Discipline: Life sciences
Field: Biochemistry
Thesis type: Academic thesis
Year of publication: 2014
Academic year of graduation: 102 (ROC calendar)
Language: English
Number of pages: 84
Keywords: gene selection; multiobjective evolutionary optimization algorithm; non-linear relationship
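Identifying a set of useful genes (features) from high-dimensional data is an important and interesting problem when designing a classification mechanism (or diagnostic system). In most studies, researchers tend to look for genes that are strongly related to the class (high relevance); in other words, each selected gene has a high correlation or high mutual information with the class labels. As a result, most of the selected genes are highly linearly dependent on one another (although, depending on the gene selection method, some of the selected genes may have non-linear dependency). For some biological studies, it may be interesting to find a set of genes that are highly relevant to the class labels and non-linearly related to each other, which might reveal special relationships among genes; this is the main focus of this study. Although this study mainly selects such gene sets from microarray data, the method can be extended to other types of research problems.
We search for useful genes using multi-objective optimization, and use the multi-objective strategy to control the number of selected genes, the non-linear dependency among genes, and the ability of the genes to discriminate between classes. In this study, we use a well-known multi-objective evolutionary algorithm, Multiobjective Evolutionary Algorithm Based on Decomposition (MOEA/D), and design three objective functions to identify useful genes, selecting genes that meet our goals by optimizing the objectives simultaneously. To our knowledge, this is the first gene selection method that attempts to use a multi-objective strategy to select genes with non-linear dependency. We mainly use two well-known cancer microarray datasets (SRBCT and Leukemia) to demonstrate the effectiveness of the method. In addition, we perform enrichment analysis on these gene sets, and the results help us better understand the biological meaning of these non-linearly related genes.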

Identifying a small set of useful genes (features) from high-dimensional data for designing a classification mechanism (or diagnostic system) is an interesting and important problem. Researchers usually prefer genes with high relevance, in the sense that the correlation between each gene and the class labels is high, or the mutual information between each gene and the class labels is high. Such approaches usually end up selecting genes that are linearly dependent on each other, although, depending on the selection method, they may also pick up some genes with non-linear dependency. For some biological studies, it may be interesting to find a set of genes that have high relevance to the class labels as well as non-linear dependency on each other; that is, we explicitly want to exclude relevant genes that are linearly correlated among themselves. Although our primary focus in this study is to find such genes in microarray datasets, such genes may also be important in other studies.
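To make the distinction between linear correlation and non-linear dependency concrete, the following minimal Python sketch compares Pearson correlation with a histogram-based mutual information estimate on synthetic data. It is an illustration only, not the thesis's implementation: the helper names (`pearson`, `discretize`, `mutual_information`), the equal-width binning with 8 bins, and the toy relation y = x² are all assumptions made here.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def discretize(values, bins):
    """Map continuous values to equal-width bin indices 0..bins-1 (assumed binning)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(a, b):
    """Plug-in estimate of mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2(c * n / (pa[i] * pb[j]))
               for (i, j), c in pab.items())


# Toy data: y depends on x deterministically, but not linearly.
xs = [i / 100 for i in range(-100, 101)]
ys = [x * x for x in xs]

r = pearson(xs, ys)
mi = mutual_information(discretize(xs, 8), discretize(ys, 8))
```

For this symmetric toy example the Pearson correlation is essentially zero, while the estimated mutual information is clearly positive, which is the kind of pair a purely correlation-based filter would overlook.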
We formulate this problem as a multi-objective optimization problem, because a multi-objective strategy can simultaneously control the number of selected genes, optimize the relevance between the selected genes and the class labels, and optimize the non-linear dependency among the selected genes. We design three new objectives and optimize them using a well-known multi-objective optimization method, the Multiobjective Evolutionary Algorithm Based on Decomposition (MOEA/D). To our knowledge, this is the first attempt at feature (gene) selection combined with identification of non-linear dependency between genes using a multi-objective method. We demonstrate the effectiveness of this gene selection method on two well-known cancer microarray datasets (SRBCT and Leukemia). In addition, we perform gene set enrichment analysis on the selected sets of useful genes to examine the biological relevance of these genes to cancer.
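MOEA/D decomposes a multi-objective problem into scalar subproblems, one per weight vector; the Tchebycheff approach is one of the decompositions described for it (Zhang and Li, 2007). The sketch below illustrates that idea on a toy gene selection problem. Everything specific in it is an assumption: this record does not define the thesis's three objectives, so `objectives` uses hypothetical stand-ins (subset size, negative mean relevance, mean linear dependency), the gene names and numbers are invented, and exhaustive enumeration replaces the evolutionary search.

```python
from itertools import combinations

# Toy inputs (hypothetical values, not taken from the thesis).
RELEVANCE = {"g1": 0.9, "g2": 0.8, "g3": 0.3}  # relevance of each gene to the class labels
LINEAR_DEP = {("g1", "g2"): 0.95, ("g1", "g3"): 0.10, ("g2", "g3"): 0.20}  # |correlation| per pair


def objectives(subset):
    """Three illustrative objectives for a gene subset, all to be minimized."""
    k = len(subset)
    f1 = float(k)                                # number of selected genes
    f2 = -sum(RELEVANCE[g] for g in subset) / k  # negative mean relevance
    pairs = list(combinations(sorted(subset), 2))
    f3 = sum(LINEAR_DEP[p] for p in pairs) / len(pairs) if pairs else 0.0
    return (f1, f2, f3)


def weight_vectors(h):
    """Evenly spread 3-objective weight vectors with components in {0, 1/h, ..., 1}."""
    return [(i / h, j / h, (h - i - j) / h)
            for i in range(h + 1) for j in range(h + 1 - i)]


def tchebycheff(fvals, weights, ideal):
    """Tchebycheff scalarization: largest weighted distance to the ideal point."""
    return max(w * abs(f - z) for f, w, z in zip(fvals, weights, ideal))


# Enumerate every non-empty subset (feasible only because the toy problem is tiny).
genes = sorted(RELEVANCE)
candidates = [c for r in range(1, len(genes) + 1) for c in combinations(genes, r)]
scores = {c: objectives(c) for c in candidates}

# Ideal point: best value observed for each objective separately.
ideal = tuple(min(s[i] for s in scores.values()) for i in range(3))

# One scalar subproblem per weight vector; keep the best subset for each.
best = {w: min(candidates, key=lambda c: tchebycheff(scores[c], w, ideal))
        for w in weight_vectors(4)}
```

Each weight vector selects the subset that best balances the three objectives under that weighting. In MOEA/D proper, these subproblems are optimized jointly by an evolutionary search that shares information between neighboring weight vectors, rather than by enumeration.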

Contents
Acknowledgments
English Abstract
Chinese Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background
1.2 Motivation and specific aims
Chapter 2 Materials and Methods
2.1 Datasets and data preprocessing
2.2 Introduction of mutual information
2.3 Three objectives
2.3.1 Objective 1
2.3.2 Objective 2
2.3.3 Objective 3
2.4 Multiobjective Evolutionary Optimization Algorithm
2.4.1 Initialization step
2.4.2 Evaluation step
2.4.3 Evolution step
2.4.4 Update step
2.5 Performance evaluation
2.6 Pathway
Chapter 3 Results and discussions
3.1 The effectiveness of the objective function
3.1.1 Impact of the objective 1 function
3.1.2 The performance of each gene set
3.1.3 Influence of the different parameter setting
3.2 Comparison of the performance of gene sets and top genes
3.3 Biological application
Chapter 4 Conclusion and future work
Reference

List of Figures
Figure 1. Flow chart of our method.
Figure 2. The effect of the tunable parameter q in objective 3.
Figure 3. Diagram of the initialization step.
Figure 4. Illustration of the evaluation step.
Figure 5. Diagram of the evolution step.
Figure 6. Diagram of the update step.
Figure 7. Histogram of the number of genes in a solution set (SRBCT dataset), with weight 3 equal to 0.05 and different values of the parameter α.
Figure 8. Histogram of the number of genes in a solution set (SRBCT dataset), with weight 3 equal to 0.1 and different values of the parameter α.
Figure 9. Histogram of the number of genes in a solution set (SRBCT dataset), with weight 3 equal to 0.2 and different values of the parameter α.
Figure 10. Histogram of the number of genes in each optimal gene set from the Leukemia dataset (weight 3 equal to 0.05 and different α values).
Figure 11. Histogram of the number of genes in each optimal gene set from the Leukemia dataset (weight 3 equal to 0.1 and different α values).
Figure 12. Histogram of the number of genes in each optimal gene set from the Leukemia dataset (weight 3 equal to 0.05 and different α values).
Figure 13. Exponential graphs of the influence of the exponential function.
Figure 14. Scatter plot of the randomly selected solutions for different α at the initialization step (SRBCT dataset).
Figure 15. Scatter plot of the randomly selected solutions for different α at the initialization step (Leukemia dataset).
Figure 16. Scatter plot showing the relationship between two objectives for the optimal gene sets from the SRBCT dataset.
Figure 17. Scatter plot of two objective values of the optimal gene sets from the Leukemia dataset.
Figure 18. Distribution plot of the number of genes in the optimal sets from the SRBCT dataset, with the weight value of objective 3 equal to 0.05.
Figure 19. Distribution plot of the number of genes in the optimal sets, with the weight value of objective 3 equal to 0.1.
Figure 20. Distribution plot of the number of genes in the optimal sets, with the weight value of objective 3 equal to 0.2.
Figure 21. Distribution plot of the number of genes in the optimal sets from the Leukemia dataset, with the weight value of objective 3 equal to 0.05.
Figure 22. Distribution plot of the number of genes in the optimal sets from the Leukemia dataset, with the weight value of objective 3 equal to 0.1.
Figure 23. Distribution plot of the number of genes in the optimal sets from the Leukemia dataset, with the weight value of objective 3 equal to 0.2.
Figure 24. Comparison of cumulative density curves of the average relevance for different α and the same value of weight 3.
Figure 25. Comparison of cumulative density curves of the average relevance for different α and the same value of weight 3 (Leukemia dataset).
Figure 26. Cumulative density curves of the average mutual information for different α and the same value of weight 3.
Figure 27. Cumulative density plot of the average mutual information under different α constraints and the same weight vector setting.
Figure 28. Cumulative density plot of the average correlation under different α constraints and the same weight vector setting (SRBCT dataset).
Figure 29. Cumulative density plot of the average correlation under different α constraints and the same weight vector setting (Leukemia dataset).
Figure 30. Cumulative density plot of the average relevance for different weight values of objective 3 and the same α constraint (SRBCT dataset).
Figure 31. Cumulative density plot of the average relevance for different weight values of objective 3 and the same α constraint (Leukemia dataset).
Figure 32. Cumulative density plot of the average mutual information for different weight values of objective 3 and the same α constraint (SRBCT dataset).
Figure 33. Cumulative density plot of the average mutual information for different weight values of objective 3 and the same α constraint (Leukemia dataset).
Figure 34. Cumulative density plot of the average correlation for different weight values of objective 3 and the same α constraint (SRBCT dataset).
Figure 35. Cumulative density plot of the average correlation for different weight values of objective 3 and the same α constraint (Leukemia dataset).
Figure 36. Pathway result (weight 3 = 0.1 and α = 2).
Figure 37. Pathway result (weight 3 = 0.1 and α = 3).
Figure 38. Pathway map of the MAPK signaling pathway (α = 2).
Figure 39. Pathway map of pathways in cancer (α = 2).
Figure 40. Pathway map of acute myeloid leukemia (α = 2).
Figure 41. Pathway map of the MAPK signaling pathway (α = 3).
Figure 42. Pathway map of pathways in cancer (α = 3).
Figure 43. Pathway map of acute myeloid leukemia (α = 3).
Figure 44. The relationship between FLT3 and IKK in the acute myeloid leukemia pathway.
Figure 45. The relationship between FLT3 and p70S6K in the acute myeloid leukemia pathway.

List of Tables
Table 1. Dataset information.
Table 2. The component 1 variant in objective 1 for the SRBCT and Leukemia datasets.
Table 3. Performance of all optimal sets from the SRBCT dataset (weight value of objective 3 equal to 0.05).
Table 4. Performance of all optimal sets from the SRBCT dataset (weight value of objective 3 equal to 0.1).
Table 5. Performance of all optimal sets from the SRBCT dataset (weight value of objective 3 equal to 0.2).
Table 6. Performance of all optimal sets from the Leukemia dataset (weight value of objective 3 equal to 0.05).
Table 7. Performance of all optimal sets from the Leukemia dataset (weight value of objective 3 equal to 0.1).
Table 8. Performance of all optimal sets from the Leukemia dataset (weight value of objective 3 equal to 0.2).
Table 9. Performance of the top 50 genes for the SRBCT dataset.
Table 10. Performance of the optimal gene sets for the SRBCT dataset.
Table 11. Performance of the top 50 genes for the Leukemia dataset.
Table 12. Performance of the optimal gene sets for the Leukemia dataset.


