National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Detailed Record

Author: 李育任 (Lee, Yu Ren)
Title: 以機械學習方式預測藥物之小腸吸收度
Title (English): Predicting Drug Human Intestinal Absorption using Machine Learning
Advisor: 林志侯 (Lin, Thy Hou)
Degree: Master's
Institution: National Tsing Hua University
Department: Institute of Molecular Medicine
Discipline: Medicine and Health
Field: Medicine
Thesis type: Academic thesis
Year of publication: 2015
Graduation academic year: 103
Language: Chinese
Number of pages: 85
Keywords (Chinese): 機械學習、小腸吸收、特徵選取、支持向量機、十折交叉比對、結構最佳化
Keywords (English): machine learning; human intestinal absorption; feature selection; support vector machine; 10-fold cross-validation; structure optimization
Record statistics:
  • Cited by: 0
  • Views: 276
  • Downloads: 0
  • Bookmarks: 0
Abstract (translated from the Chinese original)
We began with 180 drug molecules with different human intestinal absorption (HIA) values, obtained from the NCBI website. Because molecular structures must be optimized before any molecular calculation, the drug molecules were optimized with Gaussian09 at the National Center for High-performance Computing (NCHC), using DFT (density functional theory) with the 6-31G basis set and the B3LYP functional. Since the MOL files produced by Gaussian09 store the structures merely as coordinate files, they were then converted into true 3D structures with Discovery Studio 3.1. PaDEL-Descriptor, software for computing molecular features and properties developed by the National University of Singapore (NUS), offers strong computing power and calculates 1875 2D and 3D molecular descriptors. WEKA is machine-learning software developed at the University of Waikato, New Zealand, used here for machine learning, data mining, and feature selection. Feature selection was based mainly on the CfsSubsetEval evaluator combined with Particle Swarm Optimization (PSO), an evolutionary algorithm, and five other auxiliary search algorithms. A support vector machine (SVM) was the classification tool of this study; it has three main parameters, C (cost), gamma (γ), and ε. Because poorly chosen parameters lead to unsatisfactory classification or overfitting, the Pearson correlation coefficient served as the criterion for parameter selection. Classification and feature selection were alternated until the selected feature set could not be narrowed any further; feature selection makes the classification more complete, and the features chosen in the last few stages are also the most representative. This alternation yielded descriptor subsets increasingly correlated with absorption, containing 1104, 625, 280, 177, 98, 50, and 37 features respectively. Using the non-validated statistic R^2 and the 10-fold statistic Q^2, the subset with number of selected features (NFS) = 98 proved to have the best predictive power; comparing the predictions with the original intestinal absorption values gave linear correlation coefficients R^2 = 0.887 and 0.5431. These 98 descriptors are therefore the most predictive molecular features selected. Finally, an additional 13 drug molecules were predicted with the established model, giving q^2 = 0.729 and correlation coefficient R^2 = 0.7536.
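Since the Pearson correlation coefficient is the criterion used throughout for parameter selection, its standard textbook definition is worth recalling (with x_i the predicted and y_i the observed HIA values; the R^2 values quoted in the abstracts are the square of this quantity):

```latex
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}
```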

Abstract (English)
In the beginning, 180 drug compounds with different human intestinal absorption (HIA) values were collected from the literature, and the 3D structures of these 180 compounds were obtained from NCBI. Before any calculation on chemical compounds can start, their structures must be optimized. We used Gaussian09 at the NCHC (National Center for High-performance Computing) to optimize the compounds with DFT at the B3LYP/6-31G level. Discovery Studio was then used to convert the coordinate files into true 3D structures. PaDEL (Pharmaceutical Data Exploration Laboratory)-Descriptor, developed at the National University of Singapore (NUS), is a powerful program covering 1875 molecular descriptors from 2D to 3D; we used it to calculate the 2D and 3D descriptors.
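For readers unfamiliar with Gaussian09 input, a geometry-optimization job at the level used here could be generated as in the minimal sketch below. The molecule, coordinates, charge/multiplicity, and the %NProcShared resource request are placeholders, not the thesis's actual job files:

```python
# Minimal sketch of a Gaussian09 geometry-optimization input at the
# B3LYP/6-31G level used in this work; all specifics are placeholders.

def make_gjf(title, atoms, charge=0, multiplicity=1):
    """atoms: list of (symbol, x, y, z) tuples in Angstroms."""
    lines = ["%NProcShared=4",          # resource request (assumed value)
             "# opt B3LYP/6-31G",       # route card: optimization at B3LYP/6-31G
             "", title, "",
             f"{charge} {multiplicity}"]
    lines += [f"{s:2s} {x:12.6f} {y:12.6f} {z:12.6f}" for s, x, y, z in atoms]
    lines.append("")                    # Gaussian inputs end with a blank line
    return "\n".join(lines)

# Example: a water molecule as a stand-in structure
print(make_gjf("water opt", [("O", 0.0, 0.0, 0.117),
                             ("H", 0.0, 0.757, -0.470),
                             ("H", 0.0, -0.757, -0.470)]))
```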
WEKA is a data-mining and machine-learning workbench developed at the University of Waikato, New Zealand. WEKA offers classification and attribute-selection functions, and our feature selection was based mainly on the latter. Features were selected chiefly with the CfsSubsetEval evaluator combined with the PSO (Particle Swarm Optimization) and EA (evolutionary algorithm) search methods; another five algorithms were used to check whether any features possibly related to HIA might have been lost.
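CfsSubsetEval scores a candidate feature subset by Hall's correlation-based merit: subsets whose features correlate strongly with the target but weakly with one another score highest. A minimal NumPy sketch of that merit function (an illustration of the idea, not WEKA's implementation) is:

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Hall's CFS merit for the feature indices in `subset`:
       M = k * mean|r_feat,target| / sqrt(k + k(k-1) * mean|r_feat,feat|)
    X: (n_samples, n_features) descriptor matrix; y: target (e.g. HIA %)."""
    k = len(subset)
    # mean absolute feature-target correlation
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    # mean absolute pairwise feature-feature correlation
    if k > 1:
        pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# A search method (best first, PSO, EA, ...) then looks for the subset
# maximizing this merit; WEKA supplies those search algorithms.
```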
An SVM was used to classify and to validate the results once proper parameters were chosen. Three parameters are important for classification: cost (C), gamma (γ), and epsilon (ε). A bad parameter choice leads to incorrect classification or overfitting. In this research we tried many series of parameters and selected the best set according to the Pearson correlation coefficient and cross-validation statistics.
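The abstract does not give the grid actually searched, but a parameter sweep of the kind described, an RBF-kernel support vector regressor scored by the Pearson correlation between predicted and observed HIA, might look like this sketch (scikit-learn's SVR wraps the same libSVM library used in the thesis; the grid values and random stand-in data are illustrative only):

```python
import numpy as np
from itertools import product
from scipy.stats import pearsonr
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict

# X: (n_samples, n_descriptors) PaDEL descriptors, y: HIA values.
# Random data stands in here; the real study used 180 drug compounds.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(180, 98)), rng.uniform(0, 100, 180)

best = (None, -np.inf)
for C, gamma, eps in product([1, 10, 100, 1000],   # cost
                             [1e-3, 1e-2, 1e-1],   # RBF width
                             [0.01, 0.1, 1.0]):    # epsilon tube
    model = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=eps)
    pred = cross_val_predict(model, X, y, cv=10)   # 10-fold CV predictions
    r, _ = pearsonr(pred, y)                       # selection criterion
    if r > best[1]:
        best = ((C, gamma, eps), r)

print("best (C, gamma, epsilon):", best[0], "Pearson r:", round(best[1], 3))
```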
Classification and feature selection were applied alternately until the set of selected features could no longer be narrowed. With each round of feature selection, the classification improved. Step by step, the number of selected features (NFS) went from 1875 through 1104, 625, 280, 177, 98, and 50 down to 37.
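The alternation can be pictured as the loop below. As a stand-in for the WEKA CfsSubsetEval/PSO selection, which has no direct scikit-learn equivalent, this sketch uses a simple univariate filter; only the sequence of subset sizes follows the thesis, the data are placeholders:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_predict

# Placeholder data standing in for 180 compounds x 1875 PaDEL descriptors.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(180, 1875)), rng.uniform(0, 100, 180)

# Subset sizes reported in the thesis; each round narrows the previous pool,
# then the SVM is re-evaluated on the reduced descriptor set.
for k in [1104, 625, 280, 177, 98, 50, 37]:
    X = SelectKBest(f_regression, k=k).fit(X, y).transform(X)
    pred = cross_val_predict(SVR(kernel="rbf"), X, y, cv=10)
    r, _ = pearsonr(pred, y)
    print(f"NFS={k:4d}  10-fold Pearson r = {r:.3f}")
```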
Using non-validated and 10-fold cross-validation statistics, NFS = 98 gave the most predictive feature set, with correlation coefficients R^2 = 0.887 (non-validated) and 0.5431 (10-fold). That is, these 98 molecular descriptors are highly correlated with intestinal absorption. Finally, the model was tested by external validation: 13 drug compounds with different HIA values were used, giving q^2 = 0.729 and R^2 = 0.7536.
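The two kinds of statistics reported can be reproduced as follows: the non-validated R^2 compares fitted with observed values on the training set itself, while Q^2 uses out-of-fold predictions from 10-fold cross-validation. The abstract does not spell out whether Q^2 is the squared cross-validated Pearson correlation or the 1 − PRESS/SS form; this sketch assumes the former, and all data are placeholders:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
X, y = rng.normal(size=(180, 98)), rng.uniform(0, 100, 180)          # placeholder training data
X_ext, y_ext = rng.normal(size=(13, 98)), rng.uniform(0, 100, 13)    # placeholder external set

model = SVR(kernel="rbf").fit(X, y)

# Non-validated R^2: squared Pearson correlation of fitted vs. observed.
r2 = pearsonr(model.predict(X), y)[0] ** 2

# Q^2: same statistic on 10-fold out-of-fold predictions (assumed definition).
q2 = pearsonr(cross_val_predict(SVR(kernel="rbf"), X, y, cv=10), y)[0] ** 2

# External validation on the 13 held-out compounds.
r2_ext = pearsonr(model.predict(X_ext), y_ext)[0] ** 2

print(f"R^2={r2:.3f}  Q^2={q2:.3f}  external R^2={r2_ext:.3f}")
```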

Contents
Introduction
Materials and methods
  2.1 Optimization of compound structures
    2.1.1 Structures from NCBI
    2.1.2 Gaussian09 calculation
    2.1.3 File conversion via Discovery Studio
  2.2 Calculation of molecular descriptors
    2.2.1 2D & 3D molecular descriptors
    2.2.2 MMFF94 force field
  2.3 libSVM
    2.3.1 Principle and formula
    2.3.2 Parameter selection
    2.3.3 RBF kernel function
  2.4 Feature selection
    2.4.1 CfsSubsetEval
    2.4.2 Search methods
      Best first
      Evolutionary Search Algorithm (EA)
      Greedy stepwise
      Linear forward selection
      Particle Swarm Optimization (PSO) algorithm
      Subset Size Forward Selection
      Tabu Search
  2.5 Pearson correlation coefficient
  2.6 Q square and R square (non-cross and 10-fold cross validations)
Results
  3.1 Optimization
  3.2 Molecular descriptors
  3.3 Feature selection
Discussion
  4.1 The larger the HIA value, the closer the predictions are to the original values
  4.2 Feature selection reduces computation time and resources
References
Appendix

List of figures
Fig. 1: Non-optimized Aspirin (HIA=100%)
Fig. 2: Coordinates of non-optimized Aspirin (HIA=100%)
Fig. 3: Non-optimized Chlorphenesin (HIA=100%)
Fig. 4: Coordinates of non-optimized Chlorphenesin (HIA=100%)
Fig. 5: Non-optimized Fenclofenac (HIA=100%)
Fig. 6: Coordinates of non-optimized Fenclofenac (HIA=100%)
Fig. 7: Non-optimized Acetohexamide (HIA=80%)
Fig. 8: Coordinates of non-optimized Acetohexamide (HIA=80%)
Fig. 9: Non-optimized Cefetamet pivoxil (HIA=47%)
Fig. 10: Coordinates of non-optimized Cefetamet pivoxil (HIA=47%)
Fig. 11: Non-optimized Lactulose (HIA=0.6%)
Fig. 12: Coordinates of non-optimized Lactulose (HIA=0.6%)
Fig. 13: Optimized Aspirin (HIA=100%)
Fig. 14: Coordinates of optimized Aspirin (HIA=100%)
Fig. 15: 3D structure of optimized Aspirin (HIA=100%)
Fig. 16: Optimized Chlorphenesin (HIA=100%)
Fig. 17: Coordinates of optimized Chlorphenesin (HIA=100%)
Fig. 18: 3D structure of optimized Chlorphenesin (HIA=100%)
Fig. 19: Optimized Fenclofenac (HIA=100%)
Fig. 20: Coordinates of optimized Fenclofenac (HIA=100%)
Fig. 21: 3D structure of optimized Fenclofenac (HIA=100%)
Fig. 22: Optimized Acetohexamide (HIA=80%)
Fig. 23: Coordinates of optimized Acetohexamide (HIA=80%)
Fig. 24: 3D structure of optimized Acetohexamide (HIA=80%)
Fig. 25: Optimized Cefetamet pivoxil (HIA=47%)
Fig. 26: Coordinates of optimized Cefetamet pivoxil (HIA=47%)
Fig. 27: 3D structure of optimized Cefetamet pivoxil (HIA=47%)
Fig. 28: Optimized Lactulose (HIA=0.6%)
Fig. 29: Coordinates of optimized Lactulose (HIA=0.6%)
Fig. 30: 3D structure of optimized Lactulose (HIA=0.6%)
Fig. 31: Predicted HIA values versus original values; correlation coefficient R^2 = 0.887
Fig. 32: Predicted HIA values versus original values; correlation coefficient R^2 = 0.5431
Fig. 33: Predicted HIA values versus original values in external validation; correlation coefficient R^2 = 0.7536
List of Tables
Table 1: The PaDEL output molecular descriptors
Table 2: Pearson correlation using the full data as training set when features=1875
Table 3: Pearson correlation of 10-fold cross validation when features=1875
Table 4: Continuing to add cost values when features=1875
Table 5: Pearson correlation using the full data as training set when features=1104
Table 6: Pearson correlation of 10-fold cross validation when features=1104
Table 7: Predictions using the full training data when features=1104
Table 8: Predictions of 10-fold cross validation when features=1104
Table 9: Pearson correlation using the full training set when selected features=625 (Pearson correlation = 0.9677)
Table 10: Pearson correlation of 10-fold cross validation when selected features=625
Table 11: Predictions using the full training set when features=625
Table 12: Predicted HIA of the 10-fold cross validation set when features=625
Table 13: Pearson correlation using the full training set when selected features=280
Table 14: Pearson correlation of 10-fold cross validation when features=280
Table 15: Predicted HIA using the full training set when features=280
Table 16: Predicted HIA of 10-fold cross validation when features=280
Table 17: Pearson correlation using the full training set when selected features=177
Table 18: Pearson correlation of 10-fold cross validation when features=177
Table 19: C varied from 600 to 1100 with g=0.0056 and g=0.001
Table 20: Predicted HIA using the full training set when features=177
Table 21: Predicted HIA of 10-fold cross validation when features=177
Table 22: Pearson correlation using the full training set when selected features=98
Table 23: Pearson correlation of 10-fold cross validation when features=98
Table 24: Predicted HIA using the full training set when features=98
Table 25: Predicted HIA of 10-fold cross validation when features=98
Table 26: Pearson correlation using the full training set when selected features=50
Table 27: Pearson correlation of 10-fold cross validation when features=50
Table 28: Predicted HIA using the full training set when features=50
Table 29: Predicted HIA of 10-fold cross validation when features=50
Table 30: Pearson correlation using the full training set when selected features=37
Table 31: Pearson correlation of 10-fold cross validation when features=37
Table 32: Predicted HIA using the full training set when features=37
Table 33: Predicted HIA of 10-fold cross validation when features=37
Table 34: Statistics of the feature selection
Table 35: Testing compounds with HIA (%)
Table 36: External validation when NFS=98

