National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Detailed Record

Author: 李育任 (Lee, Yu Ren)
Title: 以機械學習方式預測藥物之小腸吸收度
Title (English): Predicting Drug Human Intestinal Absorption using Machine Learning
Advisor: 林志侯 (Lin, Thy Hou)
Degree: Master's
Institution: National Tsing Hua University
Department: Institute of Molecular Medicine
Discipline: Medicine and Health
Field: Medicine
Thesis type: Academic thesis
Year of publication: 2015
Graduation academic year: 103
Language: Chinese
Number of pages: 85
Keywords (Chinese): 機械學習、小腸吸收、特徵選取、支持向量機、十折交叉比對、結構最佳化
Keywords (English): machine learning; human intestinal absorption; feature selection; support vector machine; 10-fold cross-validation; structure optimization
Record statistics:
  • Cited by: 0
  • Views: 276
  • Downloads: 0
  • Bookmarks: 0
Abstract (translated from the Chinese original)
We began with 180 drug molecules with different human intestinal absorption (HIA) values, obtained from the NCBI website. Because molecular structures must be optimized before any molecular calculation, the drug molecules were optimized with Gaussian09 at the National Center for High-performance Computing (NCHC), using DFT (density functional theory) with the 6-31G basis set and the B3LYP functional. Since the MOL files produced by Gaussian09 store the structures merely as coordinate files, they were then converted into true 3D structures with Discovery Studio 3.1. PaDEL-Descriptor, software for computing molecular features and properties developed by the National University of Singapore (NUS), offers strong computing power and calculates 1875 2D and 3D molecular descriptors. WEKA is machine-learning software developed at the University of Waikato, New Zealand, used here for machine learning, data mining, and feature selection. Feature selection was based mainly on the CfsSubsetEval evaluator combined with Particle Swarm Optimization (PSO), an evolutionary algorithm, and five other auxiliary search algorithms. A support vector machine (SVM) was the classification tool of this study; it has three main parameters, C (cost), gamma (γ), and ε. Because poorly chosen parameters lead to unsatisfactory classification or overfitting, the Pearson correlation coefficient served as the criterion for parameter selection. Classification and feature selection were alternated until the selected feature set could not be narrowed any further; feature selection makes the classification more complete, and the features chosen in the last few stages are also the most representative. This alternation yielded descriptor subsets increasingly correlated with absorption, containing 1104, 625, 280, 177, 98, 50, and 37 features respectively. Using the non-validated statistic R^2 and the 10-fold statistic Q^2, the subset with number of selected features (NFS) = 98 proved to have the best predictive power; comparing the predictions with the original intestinal absorption values gave linear correlation coefficients R^2 = 0.887 and 0.5431. These 98 descriptors are therefore the most predictive molecular features selected. Finally, an additional 13 drug molecules were predicted with the established model, giving q^2 = 0.729 and correlation coefficient R^2 = 0.7536.
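Since the Pearson correlation coefficient is the criterion used throughout for parameter selection, its standard textbook definition is worth recalling (with x_i the predicted and y_i the observed HIA values; the R^2 values quoted in the abstracts are the square of this quantity):

```latex
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}
```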

Abstract (English)
In the beginning, 180 drug compounds with different human intestinal absorption (HIA) values were collected from the literature, and the 3D structures of these 180 compounds were obtained from NCBI. Before any calculation on chemical compounds can start, their structures must be optimized. We used Gaussian09 at the NCHC (National Center for High-performance Computing) to optimize the compounds with DFT at the B3LYP/6-31G level. Discovery Studio was then used to convert the coordinate files into true 3D structures. PaDEL (Pharmaceutical Data Exploration Laboratory)-Descriptor, developed at the National University of Singapore (NUS), is a powerful program covering 1875 molecular descriptors from 2D to 3D; we used it to calculate the 2D and 3D descriptors.
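For readers unfamiliar with Gaussian09 input, a geometry-optimization job at the level used here could be generated as in the minimal sketch below. The molecule, coordinates, charge/multiplicity, and the %NProcShared resource request are placeholders, not the thesis's actual job files:

```python
# Minimal sketch of a Gaussian09 geometry-optimization input at the
# B3LYP/6-31G level used in this work; all specifics are placeholders.

def make_gjf(title, atoms, charge=0, multiplicity=1):
    """atoms: list of (symbol, x, y, z) tuples in Angstroms."""
    lines = ["%NProcShared=4",          # resource request (assumed value)
             "# opt B3LYP/6-31G",       # route card: optimization at B3LYP/6-31G
             "", title, "",
             f"{charge} {multiplicity}"]
    lines += [f"{s:2s} {x:12.6f} {y:12.6f} {z:12.6f}" for s, x, y, z in atoms]
    lines.append("")                    # Gaussian inputs end with a blank line
    return "\n".join(lines)

# Example: a water molecule as a stand-in structure
print(make_gjf("water opt", [("O", 0.0, 0.0, 0.117),
                             ("H", 0.0, 0.757, -0.470),
                             ("H", 0.0, -0.757, -0.470)]))
```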
WEKA is a data-mining and machine-learning workbench developed at the University of Waikato, New Zealand. WEKA offers classification and attribute-selection functions, and our feature selection was based mainly on the latter. Features were selected chiefly with the CfsSubsetEval evaluator combined with the PSO (Particle Swarm Optimization) and EA (evolutionary algorithm) search methods; another five algorithms were used to check whether any features possibly related to HIA might have been lost.
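CfsSubsetEval scores a candidate feature subset by Hall's correlation-based merit: subsets whose features correlate strongly with the target but weakly with one another score highest. A minimal NumPy sketch of that merit function (an illustration of the idea, not WEKA's implementation) is:

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Hall's CFS merit for the feature indices in `subset`:
       M = k * mean|r_feat,target| / sqrt(k + k(k-1) * mean|r_feat,feat|)
    X: (n_samples, n_features) descriptor matrix; y: target (e.g. HIA %)."""
    k = len(subset)
    # mean absolute feature-target correlation
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    # mean absolute pairwise feature-feature correlation
    if k > 1:
        pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# A search method (best first, PSO, EA, ...) then looks for the subset
# maximizing this merit; WEKA supplies those search algorithms.
```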
An SVM was used to classify and to validate the results once proper parameters were chosen. Three parameters are important for classification: cost (C), gamma (γ), and epsilon (ε). A bad parameter choice leads to incorrect classification or overfitting. In this research we tried many series of parameters and selected the best set according to the Pearson correlation coefficient and cross-validation statistics.
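The abstract does not give the grid actually searched, but a parameter sweep of the kind described, an RBF-kernel support vector regressor scored by the Pearson correlation between predicted and observed HIA, might look like this sketch (scikit-learn's SVR wraps the same libSVM library used in the thesis; the grid values and random stand-in data are illustrative only):

```python
import numpy as np
from itertools import product
from scipy.stats import pearsonr
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict

# X: (n_samples, n_descriptors) PaDEL descriptors, y: HIA values.
# Random data stands in here; the real study used 180 drug compounds.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(180, 98)), rng.uniform(0, 100, 180)

best = (None, -np.inf)
for C, gamma, eps in product([1, 10, 100, 1000],   # cost
                             [1e-3, 1e-2, 1e-1],   # RBF width
                             [0.01, 0.1, 1.0]):    # epsilon tube
    model = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=eps)
    pred = cross_val_predict(model, X, y, cv=10)   # 10-fold CV predictions
    r, _ = pearsonr(pred, y)                       # selection criterion
    if r > best[1]:
        best = ((C, gamma, eps), r)

print("best (C, gamma, epsilon):", best[0], "Pearson r:", round(best[1], 3))
```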
Classification and feature selection were applied alternately until the set of selected features could no longer be narrowed. With each round of feature selection, the classification improved. Step by step, the number of selected features (NFS) went from 1875 through 1104, 625, 280, 177, 98, and 50 down to 37.
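The alternation can be pictured as the loop below. As a stand-in for the WEKA CfsSubsetEval/PSO selection, which has no direct scikit-learn equivalent, this sketch uses a simple univariate filter; only the sequence of subset sizes follows the thesis, the data are placeholders:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_predict

# Placeholder data standing in for 180 compounds x 1875 PaDEL descriptors.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(180, 1875)), rng.uniform(0, 100, 180)

# Subset sizes reported in the thesis; each round narrows the previous pool,
# then the SVM is re-evaluated on the reduced descriptor set.
for k in [1104, 625, 280, 177, 98, 50, 37]:
    X = SelectKBest(f_regression, k=k).fit(X, y).transform(X)
    pred = cross_val_predict(SVR(kernel="rbf"), X, y, cv=10)
    r, _ = pearsonr(pred, y)
    print(f"NFS={k:4d}  10-fold Pearson r = {r:.3f}")
```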
Using non-validated and 10-fold cross-validation statistics, NFS = 98 gave the most predictive feature set, with correlation coefficients R^2 = 0.887 (non-validated) and 0.5431 (10-fold). That is, these 98 molecular descriptors are highly correlated with intestinal absorption. Finally, the model was tested by external validation: 13 drug compounds with different HIA values were used, giving q^2 = 0.729 and R^2 = 0.7536.
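The two kinds of statistics reported can be reproduced as follows: the non-validated R^2 compares fitted with observed values on the training set itself, while Q^2 uses out-of-fold predictions from 10-fold cross-validation. The abstract does not spell out whether Q^2 is the squared cross-validated Pearson correlation or the 1 − PRESS/SS form; this sketch assumes the former, and all data are placeholders:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
X, y = rng.normal(size=(180, 98)), rng.uniform(0, 100, 180)          # placeholder training data
X_ext, y_ext = rng.normal(size=(13, 98)), rng.uniform(0, 100, 13)    # placeholder external set

model = SVR(kernel="rbf").fit(X, y)

# Non-validated R^2: squared Pearson correlation of fitted vs. observed.
r2 = pearsonr(model.predict(X), y)[0] ** 2

# Q^2: same statistic on 10-fold out-of-fold predictions (assumed definition).
q2 = pearsonr(cross_val_predict(SVR(kernel="rbf"), X, y, cv=10), y)[0] ** 2

# External validation on the 13 held-out compounds.
r2_ext = pearsonr(model.predict(X_ext), y_ext)[0] ** 2

print(f"R^2={r2:.3f}  Q^2={q2:.3f}  external R^2={r2_ext:.3f}")
```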

Contents
Introduction
Materials and methods
  2.1 Optimization of compound structures
    2.1.1 Structures from NCBI
    2.1.2 Gaussian09 calculation
    2.1.3 File conversion via Discovery Studio
  2.2 Calculation of molecular descriptors
    2.2.1 2D & 3D molecular descriptors
    2.2.2 MMFF94 force field
  2.3 libSVM
    2.3.1 Principle and formula
    2.3.2 Parameter selection
    2.3.3 RBF kernel function
  2.4 Feature selection
    2.4.1 CfsSubsetEval
    2.4.2 Search methods
      Best first
      Evolutionary Search Algorithm (EA)
      Greedy stepwise
      Linear forward selection
      Particle Swarm Optimization (PSO) algorithm
      Subset Size Forward Selection
      Tabu Search
  2.5 Pearson correlation coefficient
  2.6 Q square and R square (non-cross and 10-fold cross validations)
Results
  3.1 Optimization
  3.2 Molecular descriptors
  3.3 Feature selection
Discussion
  4.1 The larger the HIA value, the closer the predictions are to the original values
  4.2 Feature selection reduces computation time and resources
References
Appendix

List of figures
Fig. 1: Non-optimized Aspirin (HIA=100%)
Fig. 2: Coordinates of non-optimized Aspirin (HIA=100%)
Fig. 3: Non-optimized Chlorphenesin (HIA=100%)
Fig. 4: Coordinates of non-optimized Chlorphenesin (HIA=100%)
Fig. 5: Non-optimized Fenclofenac (HIA=100%)
Fig. 6: Coordinates of non-optimized Fenclofenac (HIA=100%)
Fig. 7: Non-optimized Acetohexamide (HIA=80%)
Fig. 8: Coordinates of non-optimized Acetohexamide (HIA=80%)
Fig. 9: Non-optimized Cefetamet pivoxil (HIA=47%)
Fig. 10: Coordinates of non-optimized Cefetamet pivoxil (HIA=47%)
Fig. 11: Non-optimized Lactulose (HIA=0.6%)
Fig. 12: Coordinates of non-optimized Lactulose (HIA=0.6%)
Fig. 13: Optimized Aspirin (HIA=100%)
Fig. 14: Coordinates of optimized Aspirin (HIA=100%)
Fig. 15: 3D structure of optimized Aspirin (HIA=100%)
Fig. 16: Optimized Chlorphenesin (HIA=100%)
Fig. 17: Coordinates of optimized Chlorphenesin (HIA=100%)
Fig. 18: 3D structure of optimized Chlorphenesin (HIA=100%)
Fig. 19: Optimized Fenclofenac (HIA=100%)
Fig. 20: Coordinates of optimized Fenclofenac (HIA=100%)
Fig. 21: 3D structure of optimized Fenclofenac (HIA=100%)
Fig. 22: Optimized Acetohexamide (HIA=80%)
Fig. 23: Coordinates of optimized Acetohexamide (HIA=80%)
Fig. 24: 3D structure of optimized Acetohexamide (HIA=80%)
Fig. 25: Optimized Cefetamet pivoxil (HIA=47%)
Fig. 26: Coordinates of optimized Cefetamet pivoxil (HIA=47%)
Fig. 27: 3D structure of optimized Cefetamet pivoxil (HIA=47%)
Fig. 28: Optimized Lactulose (HIA=0.6%)
Fig. 29: Coordinates of optimized Lactulose (HIA=0.6%)
Fig. 30: 3D structure of optimized Lactulose (HIA=0.6%)
Fig. 31: Predicted HIA values versus original values; correlation coefficient R^2 = 0.887
Fig. 32: Predicted HIA values versus original values; correlation coefficient R^2 = 0.5431
Fig. 33: Predicted HIA values versus original values in external validation; correlation coefficient R^2 = 0.7536
List of Tables
Table 1: The PaDEL output molecular descriptors
Table 2: Pearson correlation using the full data as training set when features=1875
Table 3: Pearson correlation of 10-fold cross validation when features=1875
Table 4: Continuing to add cost values when features=1875
Table 5: Pearson correlation using the full data as training set when features=1104
Table 6: Pearson correlation of 10-fold cross validation when features=1104
Table 7: Predictions using the full training data when features=1104
Table 8: Predictions of 10-fold cross validation when features=1104
Table 9: Pearson correlation using the full training set when selected features=625 (Pearson correlation = 0.9677)
Table 10: Pearson correlation of 10-fold cross validation when selected features=625
Table 11: Predictions using the full training set when features=625
Table 12: Predicted HIA of the 10-fold cross validation set when features=625
Table 13: Pearson correlation using the full training set when selected features=280
Table 14: Pearson correlation of 10-fold cross validation when features=280
Table 15: Predicted HIA using the full training set when features=280
Table 16: Predicted HIA of 10-fold cross validation when features=280
Table 17: Pearson correlation using the full training set when selected features=177
Table 18: Pearson correlation of 10-fold cross validation when features=177
Table 19: C varied from 600 to 1100 with g=0.0056 and g=0.001
Table 20: Predicted HIA using the full training set when features=177
Table 21: Predicted HIA of 10-fold cross validation when features=177
Table 22: Pearson correlation using the full training set when selected features=98
Table 23: Pearson correlation of 10-fold cross validation when features=98
Table 24: Predicted HIA using the full training set when features=98
Table 25: Predicted HIA of 10-fold cross validation when features=98
Table 26: Pearson correlation using the full training set when selected features=50
Table 27: Pearson correlation of 10-fold cross validation when features=50
Table 28: Predicted HIA using the full training set when features=50
Table 29: Predicted HIA of 10-fold cross validation when features=50
Table 30: Pearson correlation using the full training set when selected features=37
Table 31: Pearson correlation of 10-fold cross validation when features=37
Table 32: Predicted HIA using the full training set when features=37
Table 33: Predicted HIA of 10-fold cross validation when features=37
Table 34: Statistics of the feature selection
Table 35: Testing compounds with HIA (%)
Table 36: External validation when NFS=98

