
Across the last few chapters, a variety of models have been fit to the solubility data set. How do the models compare for these data, and which one should be selected as the final model? Figs. 9.1 and 9.2 show scatter plots of the performance metrics calculated using cross-validation and the test set data.

Fig. 9.1 A plot of the R² values of the solubility models, estimated by 10-fold cross-validation and the test set

Fig. 9.2 A plot of the RMSE values of the solubility models, estimated by 10-fold cross-validation and the test set

With the exception of the poorly performing models, there is a fairly high correlation between the results derived from resampling and those from the test set (0.9 for RMSE and 0.88 for R²). For the most part, the models rank order similarly. K-nearest neighbors was the weakest performer, followed by the two single tree-based methods. While bagging these trees did help, it did not make the models very competitive. Conditional random forest models also had mediocre results.
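For illustration, this agreement can be computed directly from a collection of fitted models. The following R sketch assumes a named list of caret train() objects called models and test set data in testX and testY; these object names are hypothetical and not part of the original analysis.

    library(caret)

    ## Hypothetical inputs: `models` is a named list of train() objects,
    ## `testX`/`testY` hold the test set predictors and outcomes.
    cvRMSE <- sapply(models, function(fit) getTrainPerf(fit)$TrainRMSE)
    testRMSE <- sapply(models,
                       function(fit) postResample(predict(fit, testX),
                                                  testY)["RMSE"])

    ## Agreement between the resampled and test set estimates
    cor(cvRMSE, testRMSE)
    plot(cvRMSE, testRMSE,
         xlab = "RMSE (10-Fold Cross-Validation)",
         ylab = "RMSE (Test Set)")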

There was a “pack” of models that showed better results, including model trees, linear regression, penalized linear models, MARS, and neural networks. These models are simpler but would not be considered interpretable given the number of predictors involved in the linear models and the complexity of the model trees and MARS. For the most part, though, they would be easy to implement. Recall that this type of model might be used by a pharmaceutical company to screen millions of potential compounds, so ease of implementation should not be taken lightly.

The group of high-performance models includes support vector machines (SVMs), boosted trees, random forests, and Cubist. Each is essentially a black box with a highly complex prediction equation. The performance of these models is head and shoulders above the rest, so there is probably some value in finding computationally efficient implementations that can be used to predict large numbers of new samples.

Are there any real differences between these models? Using the resampling results, a set of confidence intervals was constructed to characterize the differences in RMSE between the models, using the techniques shown in Sect. 4.8. Figure 9.3 shows the intervals. There are very few statistically significant differences. Additionally, most of the estimated mean differences are less than 0.05 log units, which is not scientifically meaningful. Given this, any of these models would be a reasonable choice.
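In caret, one way to build such intervals is with the resamples() and diff() functions, which compute pairwise differences in a metric along with multiplicity-adjusted confidence intervals (Bonferroni by default). The sketch below assumes fitted train() objects for the four high-performance models; the object names (svmFit, gbmFit, rfFit, cubistFit) are hypothetical.

    library(caret)

    ## Hypothetical train() objects, each fit with identical resampling folds
    resamps <- resamples(list(SVM = svmFit,
                              "Boosted Tree" = gbmFit,
                              "Random Forest" = rfFit,
                              Cubist = cubistFit))

    ## Pairwise differences in RMSE with adjusted confidence intervals
    rmseDiffs <- diff(resamps, metric = "RMSE")
    summary(rmseDiffs)
    dotplot(rmseDiffs)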

Fig. 9.3 Confidence intervals for the differences in RMSE for the high-performance models