Indexed in SCI and EI | Official journal of the Chemical Industry and Engineering Society of China

Chinese Journal of Chemical Engineering ›› 2022, Vol. 52 ›› Issue (12): 115-125. DOI: 10.1016/j.cjche.2022.04.004

• Full Length Article •

Hyper-parameter optimization of multiple machine learning algorithms for molecular property prediction using hyperopt library

Jun Zhang1, Qin Wang2, Weifeng Shen1,3   

  1. School of Chemistry and Chemical Engineering, Chongqing University, Chongqing 401331, China;
  2. School of Chemistry and Chemical Engineering, Chongqing University of Science & Technology, Chongqing 401331, China;
  3. Chongqing Key Laboratory of Theoretical and Computational Chemistry, Chongqing 400044, China
  • Received: 2021-11-23 Revised: 2022-04-12 Online: 2022-12-28 Published: 2023-01-31
  • Contact: Qin Wang, E-mail: wangq356@mail2.sysu.edu.cn; Weifeng Shen, E-mail: shenweifeng@cqu.edu.cn
  • Supported by:
    We acknowledge the financial support provided by the National Key Research and Development Project (2019YFC0214403) and Chongqing Joint Chinese Medicine Scientific Research Project (2021ZY023984).

Abstract: Owing to their outstanding performance in cheminformatics, machine learning algorithms are increasingly used to mine molecular properties and biomedical big data. The performance of a machine learning model is known to depend critically on its hyper-parameter configuration. However, many studies either search for hyper-parameters by grid search or use arbitrarily selected values, both of which can easily yield a suboptimal configuration. In this study, the Hyperopt library, which implements Bayesian optimization, is employed to find optimal hyper-parameters for different machine learning algorithms. Six drug discovery datasets (solubility, probe-likeness, hERG, Chagas disease, tuberculosis, and malaria) are used to compare different machine learning algorithms with ECFP6 fingerprints. This contribution evaluates whether Bernoulli Naïve Bayes, logistic linear regression, AdaBoost decision tree, random forest, support vector machine, and deep neural network algorithms with optimized hyper-parameters offer any improvement in testing over the referenced models, as assessed by an array of metrics: AUC, F1-score, Cohen’s kappa, Matthews correlation coefficient, recall, precision, and accuracy. Based on the rank normalized score approach, the Hyperopt models achieve better or comparable performance on 33 out of 36 models across the drug discovery datasets, showing a significant improvement from employing the Hyperopt library. The open-source code for all six machine learning frameworks built on the Hyperopt Python package is provided to make this approach accessible to more scientists, including those who are not familiar with writing code.

Key words: Machine learning, Prediction, Optimal design, Hyper-parameter optimization, Hyperopt library
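For readers less familiar with the evaluation metrics named in the abstract, the sketch below computes six of them from a binary confusion matrix using their standard definitions. The function name `binary_metrics` and the example counts are hypothetical and do not come from the article; AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Kappa and MCC both correct for class imbalance in different ways, which is one reason a study might report them alongside accuracy rather than relying on accuracy alone.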
