Hyper-parameter optimization of multiple machine learning algorithms for molecular property prediction using hyperopt library

doi:10.1016/j.cjche.2022.04.004

Chinese Journal of Chemical Engineering, 2022, Vol. 52, Issue 12: 115-125. DOI: 10.1016/j.cjche.2022.04.004

Jun Zhang¹, Qin Wang², Weifeng Shen¹,³

1. School of Chemistry and Chemical Engineering, Chongqing University, Chongqing 401331, China;
2. School of Chemistry and Chemical Engineering, Chongqing University of Science & Technology, Chongqing 401331, China;
3. Chongqing Key Laboratory of Theoretical and Computational Chemistry, Chongqing 400044, China

Received: 2021-11-23; Revised: 2022-04-12; Online: 2023-01-31; Published: 2022-12-28
Contact: Qin Wang, E-mail: wangq356@mail2.sysu.edu.cn; Weifeng Shen, E-mail: shenweifeng@cqu.edu.cn
Supported by:
We acknowledge the financial support provided by the National Key Research and Development Project (2019YFC0214403) and Chongqing Joint Chinese Medicine Scientific Research Project (2021ZY023984).

Abstract

Owing to their outstanding performance in cheminformatics, machine learning algorithms are increasingly used to mine molecular properties and biomedical big data. The performance of a machine learning model is known to depend critically on its hyper-parameter configuration. However, many studies either searched for hyper-parameters with the grid search method or used arbitrarily selected values, both of which can easily yield a suboptimal configuration. In this study, the Hyperopt library, which embeds Bayesian optimization, is employed to find optimal hyper-parameters for different machine learning algorithms. Six drug discovery datasets (solubility, probe-likeness, hERG, Chagas disease, tuberculosis, and malaria) are used to compare the algorithms, with ECFP6 fingerprints as input. The aim is to evaluate whether Bernoulli naïve Bayes, logistic linear regression, AdaBoost decision tree, random forest, support vector machine, and deep neural network models with optimized hyper-parameters improve on the reference models in testing, as assessed by an array of metrics: AUC, F1-score, Cohen's kappa, Matthews correlation coefficient, recall, precision, and accuracy. Based on the rank normalized score approach, the Hyperopt models achieve better or comparable performance on 33 out of 36 models across the drug discovery datasets, showing the significant improvement achieved by employing the Hyperopt library. The open-source code for all six machine learning frameworks implemented with the Hyperopt Python package is provided to make this approach accessible to scientists who are not familiar with writing code.
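As an illustration of the threshold-based metrics named above, the following minimal sketch computes accuracy, precision, recall, F1-score, Cohen's kappa, and the Matthews correlation coefficient from the four counts of a binary confusion matrix. It uses the standard textbook definitions and is not taken from the authors' released code; the function name `classification_metrics` and the example counts are hypothetical. AUC is left out because it requires ranked prediction scores rather than counts.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Binary classification metrics from confusion-matrix counts.

    Hypothetical helper for illustration only; standard definitions,
    with 0.0 returned whenever a denominator would be zero.
    """
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for the agreement
    # expected by chance from the row and column totals.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    # Matthews correlation coefficient: correlation between the
    # predicted and the true binary labels.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts, not results from the paper.
    for name, value in classification_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

Because kappa and the Matthews correlation coefficient both account for chance agreement and class imbalance, they can rank two models differently than accuracy does, which is one reason to compare models on several metrics at once.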

Key words: Machine learning, Prediction, Optimal design, Hyper-parameter optimization, Hyperopt library

Jun Zhang, Qin Wang, Weifeng Shen. Hyper-parameter optimization of multiple machine learning algorithms for molecular property prediction using hyperopt library[J]. Chinese Journal of Chemical Engineering, 2022, 52(12): 115-125.
