Introduction

Today, big data and artificial intelligence revolutionize many areas of our daily life, and materials science is no exception [1,2,3]. More scientific data are available now than ever before and the size of the literature is growing at an exponential rate [4,5,6,7]. This has led to multiple efforts in building the digital ecosystem for material discovery, most notably the Materials Genome Initiative (MGI) [8, 9]. The MGI is a multinational effort focused on improving the tools and techniques surrounding materials research, which recently has included suggestions to adopt the set of Findable, Accessible, Interoperable, and Reusable (FAIR) principles when reporting data [10]. In the years since the creation of the MGI, a number of large materials and chemical datasets have emerged, including the 2D Materials Encyclopedia (2DMatPedia) [11], Automatic Flow (AFLOW) database [12, 13], Computational 2D Materials Database (C2DB) [14, 15], Computational Materials Repository (CMR) [16], Joint Automated Repository for Various Integrated Simulations (JARVIS) [17], Materials Project [18], Novel Materials Discovery (NOMAD) repository [19], and the Open Quantum Materials Database (OQMD) [20]. We note that all of these are primarily computational in nature, and that there is still a scarcity of large databases containing comprehensively characterized experimental data. Despite this, at least in computational materials discovery, the current availability of data has been a boon for exploration of the materials space, as it allows for highly flexible, data-hungry [21] models to be trained.

One such approach that has seen widespread popularity in recent years is gradient boosting. Gradient boosting [22] is an ensemble technique in which a collection of weak learners (typically decision trees) is incrementally trained with respect to the gradient of the loss function [23]. A well-known variant is eXtreme Gradient Boosting (XGBoost) [24], which reformulates the algorithm to provide stronger regularization and improved protection against overfitting. In chemistry, its applications have been diverse: XGBoost has been used to predict the adsorption energy of noble gases to Metal-Organic Frameworks (MOFs) [25], biological activity of pharmaceuticals [26], atmospheric transport [27], and has even been combined with the representations found in Graph Neural Networks (GNNs) to generate accurate models of various molecular properties, as Deng et al. demonstrated for several well-known datasets including TOX-21 [28] (a toxicology dataset), FreeSolv [29] (a dataset of small molecule hydration free energies), SIDER [30] (a dataset of adverse drug reactions), and others [31].

Neural networks have also seen a lot of interest, owing to their ability to learn new features from input data. This has included the influential Behler-Parrinello [32] and Crystal Graph Convolutional Neural Network (CGCNN) [33] architectures based on chemical structure, the Representation Learning from Stoichiometry (Roost) [34] architecture based on chemical formula, and many other approaches [1, 2, 35,36,37,38,39,40,41]. Historically, interpretability of neural networks has been a major challenge, although there has also been substantial recent work in addressing this problem [42].

The modern machine learning (ML) toolbox is large, although it is still far from complete. As a result, model selection techniques are becoming increasingly necessary: this has led to the field of automated machine learning (AutoML). This area of work has seen much progress in recent years [43, 44], and has even been extended to Neural Architecture Search (NAS) [45], for the automated optimization of neural network architectures. In this work, we leverage the Tree-based Pipeline Optimization Tool (TPOT) approach to AutoML [46,47,48], which uses a Genetic Algorithm (GA) to create effective ML pipelines. Although it generally draws from the models of SciKit-Learn [49], it can also be configured to explore gradient boosting models via XGBoost [22], and neural network models via PyTorch [50]. Moreover, TPOT also performs its own hyperparameter optimization, thus providing a more hands-off solution to identifying ML pipelines. The success of GA-based approaches in ML is not isolated to AutoML. Indeed, they are a fundamental part of genetic programming, where they are used to optimize functions for a particular task [51, 52]. Eureqa [53] is a particularly successful example of this [54], leveraging a GA to generate equations fitting arbitrary functions, and has been used in several areas of chemistry, including the generation of adsorption models to nanoparticles [55] and metal atoms to oxide surfaces [56]. This approach of fitting arbitrary functions to a task is also known as “symbolic regression.” Recent work surrounding compressed sensing has yielded the Sure Independence Screening and Sparsifying Operator (SISSO) approach [57]. SISSO also generates equations mapping descriptors to a target property, proceeding by combining descriptors using various building blocks, including trigonometric functions, logarithms, addition, multiplication, exponentiation, and many others. This methodology has been highly successful in a variety of areas including crystal structure classification [58], as well as the prediction of perovskite properties [59,60,61] and 2D topological insulators [62].

While the recent rise in the availability of scientific data has led to the increased integration of machine learning and artificial intelligence techniques into materials science, one of the ongoing challenges associated with using these methods is getting physically interpretable results. By physical interpretation, we mean an understanding of the relationship between the chosen descriptors and the target property. Although a black-box model which has a high level of accuracy but little physical interpretation may lend itself well to the Edisonian screening of a wide range of materials, it may be difficult to understand exactly what feature (or combination of features) actually matters to the design of the material. Once the screening is done and the target values are calculated, little may be done to improve performance aside from including new features, adjusting the model’s hyperparameters, or increasing the size of the training set. Alternatively, consider a model which has lower accuracy, but which has an intuitive explanation, such as an equation describing an approximate relationship between features and target. Although such a model may at first glance seem less useful than a highly accurate black box, it can help deliver insight into the underlying process that results in the target property. Moreover, by understanding which features are important, the model can give clues into what may be done to further improve it, driving the rational discovery of materials. In addition, interpretability versus accuracy is not a strict trade-off, and it is possible for interpretable and black-box models to deliver similar accuracy [63]. Therefore, in this work we take steps to compare the performance of TPOT, XGBoost, SISSO, and Roost for each problem with respect to i) performance and ii) interpretability.

We leverage a diverse selection of techniques in order to draw comparisons of model accuracy and interpretability. Taking advantage of the current abundance of chemical data, we can re-use the Density Functional Theory (DFT) calculations of others stored on several FAIR chemical datasets. A set of three different problems are investigated: (1) the prediction of perovskite volumes, (2) the prediction of 2D material bandgaps, and (3) the prediction of 2D material exfoliation energies. These problems allow for coverage over a range of relevant areas within materials science. Perovskites are well-studied systems with relevance to catalysis and solar cells [64] [65], and the unit cell is a fundamental property of crystalline materials. 2D materials are an exciting new field within nanotechnology with applications in electronics [66] [67]. The bandgap in particular is a crucial property for electronics [68] and the exfoliation energy is often a key parameter in the production of 2D materials [69].

For the perovskite volume problem, we leverage the ABX3 perovskite dataset (containing 144 examples) published by Körbel, Marques, and Botti [70]; this dataset is hosted by NOMAD [19], whose repository has a strong focus on enabling researchers to report their data such that it satisfies the FAIR data principles [71]. For the 2D material problems, we apply the 2DMatPedia published by Zhou et al. [11], a set of 6,351 hypothetical 2D materials identified via a high-throughput screening of systems on the Materials Project [18]. Although at first glance this may seem like a relatively small dataset, we note the general rarity of known 2D materials in the literature. Haastrup et al. released the C2DB [14], which contains around 4,000 systems algorithmically generated from a set of prototypes. The JARVIS [17] database maintained by NIST contains, among many other systems, around 1,000 low-D materials. Thus, we choose the 2DMatPedia because it offers us access to a large number of structures out of the box, without needing to combine multiple data sources. In addition, we note that data can sometimes be hard to come by in the materials science space: datasets with large numbers of entries may not always exist for properties of interest. Thus, evaluating the performance of ML approaches on smaller datasets is an important (and realistic) benchmark to measure.

The manuscript is organized as follows: we begin by training a diverse set of four models, which are XGBoost, TPOT, Roost, and SISSO to investigate each of the three problems, resulting in a total of 12 trained models. Performance metrics (and comparative plots) are presented for each trained model to facilitate comparison, and we discuss the interpretation we can achieve from each of these models. Overall, between the four categories of model we trained, we leverage the XGBoost model as a baseline, as it is the simplest among them. Additionally, it is a common workhorse oftentimes achieving good results on tabular data. Framing our analysis as a comparison to the interpretability and accuracy relative to the XGBoost model, we can then draw conclusions about the interpretation and applicability of the other three model types. Finally, we provide a discussion of the future outlook of ML in the digital materials science ecosystem and what can be done to further accelerate materials discovery.

We find that TPOT delivers high-quality models, generally outperforming the other methods in terms of fitness metrics. Despite this, interpretability is not guaranteed, as it can create highly complex pipelines. XGBoost lends itself to interpretation more consistently, as it allows for an importance metric, although it may be harder to understand exactly what the relationship is between the different features (or combinations of different features) and the target variable. We found that Roost performed well on problems that could be approached via compositional descriptors (i.e., without structural descriptors); as a result, it can help us understand when a target property requires more than just the composition. Finally, we achieve the easiest interpretability from SISSO, as it provides access to descriptors which directly capture the relationship between the features and target variable. Using these results, we discuss the advantages and disadvantages of each method, and identify areas where the digital ecosystem surrounding materials discovery could be changed to strengthen adherence to FAIR principles. Our work provides a comparison of several common ML techniques on challenging (but relevant) materials property prediction problems.

Results

Perovskite volume prediction

XGBoost, TPOT, and SISSO were applied to investigate the volume of perovskites as a function of the compositional features described in Sect. “Compositional Descriptors.” Additionally, we trained a Roost model on the chemical formula of the perovskites to predict the volume. The train/test split resulted in a total of 129 entries in the training set, and 15 in the test set. We find generally good performance on the perovskite volume problem across all 4 models, although the TPOT and SISSO models display the best performance by all metrics investigated (see Table 1), including respective test-set \(R^2\) values of 0.996 and 0.990. We note again here that we only used the compositional descriptors for this problem, and not the structural descriptors. The Roost model also performs well with a test-set \(R^2\) of 0.935, but it also has a non-normal error, as can be seen in Figure 1. Finally, we find that while XGBoost is the worst performing method, it still has a relatively good test-set \(R^2\) of 0.866.

TABLE 1 Performance metrics for the XGBoost, TPOT, Roost, and SISSO models on the perovskite volume prediction problem.

The performance of all 4 models is summarized in Figure 1. Visually, we find a very tight fit by the TPOT model in both the training and test sets, with good correlation from the XGBoost and SISSO models. We also find a systematic under-prediction of perovskite volumes in the Roost model in both the training and test set, with the under-prediction beginning at approximately 75 Å\(^3\)/formula unit, achieving a maximum deviation at approximately 130 Å\(^3\)/formula unit, and returning to parity at approximately 200 Å\(^3\)/formula unit.

The good performance of the TPOT model results from a generated pipeline with seven stages. The first three stages are feature selectors: a Familywise Error (FWE) filter, a feature variance filter, and a second FWE filter. Together, these down-select features according to the FWE rate and remove features with a variance under 0.20.

The alpha values for the two FWE thresholds are 0.047 and 0.046, respectively, which means the highest allowed uncorrected p-value for a feature is 0.046. From here, the remaining features are passed to a series of stacked regressors: Random Forest, Extremely Randomized Trees [72], and XGBoost. The Random Forest uses 100 trees with bootstrapping, can use at most 60% of the features, and each leaf must contain at least 16 samples. The Extremely Randomized Trees model averages over 100 trees without bootstrapping and at most 20% of the features, with each leaf containing at least 16 samples. The XGBoost stage has 100 estimators, and is rather shallow, with any individual tree having a maximum depth of 1. The "stacked" component of this series of 3 regressors means that each regressor adds its own predictions to the dataset as a new column, which informs further models down the pipeline. Finally, a LASSO model is fit using Least Angle Regression.
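The stacking mechanism described above, in which each regressor appends its predictions to the feature table, can be sketched in plain Python. This is a minimal illustration only: `MeanRegressor` and `OneFeatureLinear` are hypothetical stand-ins for the tree ensembles in the actual pipeline, and the toy data are invented.

```python
from statistics import mean


class MeanRegressor:
    """Hypothetical stand-in estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class OneFeatureLinear:
    """Hypothetical stand-in: least-squares line fitted on a single column."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y):
        x = [row[self.column] for row in X]
        x_bar, y_bar = mean(x), mean(y)
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        self.slope_ = sxy / sxx if sxx else 0.0
        self.intercept_ = y_bar - self.slope_ * x_bar
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * row[self.column] for row in X]


def stack(estimators, X, y):
    """Fit each estimator in turn and append its predictions as a new column."""
    X_aug = [list(row) for row in X]
    for est in estimators:
        preds = est.fit(X_aug, y).predict(X_aug)
        X_aug = [row + [p] for row, p in zip(X_aug, preds)]
    return X_aug


# Invented toy data: two features, target equal to twice the first feature.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
y = [2.0, 4.0, 6.0, 8.0]
X_stacked = stack([MeanRegressor(), OneFeatureLinear(column=0)], X, y)
```

Each later estimator sees the original features plus one extra column per earlier estimator, which is how earlier predictions can inform the models further down the pipeline.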

Moving onward from TPOT, in the case of our XGBoost model, we can extract feature importances. Although various different feature importance metrics can be derived from XGBoost, in this case we use the “gain” metric, which describes how the model’s loss function improves when a feature is chosen for a split while constructing the trees. A large number of features were input into this model, so we display only the 10 most important features identified by XGBoost in Supporting Information Figure 5. Here, we find that the average Rahm atomic radii [73, 74] (importance score 0.48) have the highest importance score, followed by the average Van der Waals radius used by the Universal Force Field (UFF) [75] (importance score 0.27). The remaining 288 features fall off as a long tail of low importance scores, indicating that they did little to improve the model’s performance in predicting the perovskite volume.
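To make the "gain" idea concrete, the sketch below evaluates the split gain from the XGBoost formulation [24] for a single candidate split, assuming a squared-error loss (so each gradient is the prediction minus the target and each Hessian is 1). The toy targets, the regularization strength `lam`, and the complexity penalty `gamma` are illustrative assumptions, not values from this work.

```python
def split_gain(grad_left, hess_left, grad_right, hess_right, lam=1.0, gamma=0.0):
    """Loss reduction obtained by splitting one node into two children."""

    def score(grad_sum, hess_sum):
        # Structure score of a leaf with summed gradients and Hessians.
        return grad_sum * grad_sum / (hess_sum + lam)

    parent = score(grad_left + grad_right, hess_left + hess_right)
    children = score(grad_left, hess_left) + score(grad_right, hess_right)
    return 0.5 * (children - parent) - gamma


# Invented toy node: four samples, current prediction 2.0 for all of them.
targets = [1.0, 1.2, 3.0, 3.1]
predictions = [2.0] * len(targets)
grads = [p - t for p, t in zip(predictions, targets)]  # squared-error gradients
hess = [1.0] * len(targets)                            # squared-error Hessians

# Split A separates the two low targets from the two high targets.
gain_good = split_gain(sum(grads[:2]), sum(hess[:2]), sum(grads[2:]), sum(hess[2:]))
# Split B isolates only the first sample.
gain_poor = split_gain(sum(grads[:1]), sum(hess[:1]), sum(grads[1:]), sum(hess[1:]))
```

A feature that repeatedly offers splits like A accumulates a high gain, which is the behaviour the importance score summarizes.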

For SISSO, we used the feature space as outlined in Sect. “Symbolic Regression with SISSO,” with the pre-screened features listed in the Supporting Information along with the assumption we made about the units of the descriptor when fed into SISSO.

Generally, we find that the main descriptors selected by the procedure are related to volume and atomic radius. Some other descriptors with less interpretability are found, such as the C6 dispersion coefficients, polarizability, melting points, and Herfindahl-Hirschman Index (HHI) [76] production and reserve values. Although typically used to help indicate the size of a company within a particular sector of the economy, the XenonPy definition of HHI appears to come from the work of Gaultois et al [76]. In the referenced work, the HHI production value refers to the geographic distribution of elemental production (in other words, it assesses how concentrated or dispersed the global industrial effort is which produces those elements), and HHI reserve value describes the geographic distribution of known deposits of these materials (e.g., whether they are spread out over a wide area, or concentrated in a small area).

We report the best descriptor found in Eq. 1. In this equation, the variables \(c_0, a_0, a_1\) are the regression coefficients determined by SISSO.

$$V_{{Perovskite}} \approx c_{0} + a_{0} \cdot \frac{{Z^{{ave}} }}{{C^{{ave}} \cdot \left( {r_{{Slater}}^{{ave}} - r_{{pyykko,triple}}^{{ave}} } \right)}} + a_{1} \cdot \left( {V_{{gs}}^{{ave}} - V_{{gs}}^{{\min }} } \right) \cdot \frac{{r_{{pyykko,triple}}^{{ave}} }}{{r_{{pyykko}}^{{ave}} }}$$
(1)

where \(c_0 = -10.547\), \(a_0 =4.556\), \(a_1 =3.050\), \(Z^{ave}\) is the average atomic number, \(C^{ave}\) is the average mass-specific heat capacity of the elemental solid, \(r^{ave}_{Slater}\) is the average atomic covalent radius predicted by Slater, \(r^{ave}_{pyykko, triple}\) is the average triple bond covalent radius predicted by Pyykko, \(r^{ave}_{pyykko}\) is the average single bond covalent radius predicted by Pyykko, and \(V^{ave}_{gs}\) and \(V^{min}_{gs}\) are the average and minimum ground state volume per atom as calculated by DFT. Unsurprisingly, the ground state atomic volumes and covalent radii play an important role in determining the final volume of the perovskite structures. Interestingly, both the atomic number and specific heat capacity of the material appear in the final descriptor, although they do not intuitively have a connection to the unit cell volume. It is possible that these just act as another source of variance for the model to pick up on, but we also note here that it could just be a correlation with the size of the individual atoms (i.e., another source of information about the volume). For the atomic number, it is well known that it has a periodic trend with the atomic radius (e.g., He is a small atom and Cs is a very large atom).
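Because Eq. 1 is a closed-form expression, it can be evaluated directly. The function below is a minimal transcription, taking the three fitted constants as \(c_0\), \(a_0\), and \(a_1\) in the order they appear in Eq. 1. The argument names are our own, and the inputs in the usage line are placeholder numbers chosen only to exercise the arithmetic; they are not physical descriptor values, and the units expected by the fitted coefficients are not restated here.

```python
def perovskite_volume(z_ave, c_ave, r_slater_ave, r_pyykko_triple_ave,
                      r_pyykko_ave, v_gs_ave, v_gs_min,
                      c0=-10.547, a0=4.556, a1=3.050):
    """Evaluate the SISSO descriptor of Eq. 1 for one composition."""
    # First term: atomic number over heat capacity times a radius difference.
    term0 = z_ave / (c_ave * (r_slater_ave - r_pyykko_triple_ave))
    # Second term: spread in ground-state volume, scaled by a radius ratio.
    term1 = (v_gs_ave - v_gs_min) * r_pyykko_triple_ave / r_pyykko_ave
    return c0 + a0 * term0 + a1 * term1


# Placeholder inputs (not real descriptor values).
volume = perovskite_volume(z_ave=10.0, c_ave=2.0, r_slater_ave=3.0,
                           r_pyykko_triple_ave=1.0, r_pyykko_ave=2.0,
                           v_gs_ave=5.0, v_gs_min=3.0)
```

Writing the model this way makes the structure of the descriptor explicit: the first term diverges if the two average radii coincide, which is one reason to inspect descriptor ranges before extrapolating.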

Figure 1

Parity plots for the XGBoost, TPOT, Roost, and SISSO models on the perovskite unit cell volume problem. Included are the training and testing sets. A diagonal line indicating parity is drawn as a guide to the eye.

2D material bandgaps

The bandgap predictions leveraged a data filtering strategy (described in Sect. “Data Filtering”). As a result of our data filtering approach, the 6351 entries in the dataset were reduced to 1412 entries. The train/test split divided the data into a training set of 1270 rows, and a test set containing 142 entries. The performance metrics of the XGBoost, TPOT, Roost, and SISSO models of 2D material bandgap can be found in Table 2. Performance is generally worse on this problem when compared to the perovskite volume predictions. As a result, in addition to the compositional features of XenonPy (Sect. “Compositional Descriptors”) we also used several structural features (Sect. “Structural descriptors”). We also leveraged the bulk bandgap of the parent-3D material for each of the 2D materials, as we observed the performance of the TPOT, SISSO, and XGBoost models increased when this value was included.

TABLE 2 Performance metrics for the XGBoost, TPOT, Roost, and SISSO models on the 2D material bandgap problem.

Although test-set model performance was worse compared to the perovskite problem, the XGBoost, TPOT, and SISSO models all perform well, with nearly equivalent values for the test-set \(R^2\), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). We find that the Roost model overfit the data to some extent, as the test-set error metrics are considerably worse than their training-set counterparts. A parity plot summarizing these results can be found in Figure 2. In all cases, we can see a spike of misprediction for systems with a DFT bandgap of 0. We note here that a large portion of the entries had DFT bandgaps of 0: 382 of the 1412 entries in the dataset, or 27% of the data.
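For readers less familiar with the three metrics named above, the sketch below computes them from their standard definitions in plain Python. The toy bandgap values are invented for illustration and are not taken from the dataset.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return the coefficient of determination, MAE, and RMSE."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)          # residual sum of squares
    y_bar = sum(y_true) / n
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": sqrt(ss_res / n),
    }


# Invented reference and predicted bandgaps (eV).
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 2.0, 2.5]
metrics = regression_metrics(y_true, y_pred)
```

Because the RMSE squares each residual, a handful of large mispredictions (such as metallic systems predicted to have a finite gap) raises it more than it raises the MAE.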

The pipeline generated by TPOT is less complex than that of the perovskite volume problem. The first stage of the pipeline is a MaxAbsScaler unit, scaling each feature by the maximum absolute value of the feature. The second stage is then an ElasticNetCV unit, which uses 5-fold cross-validation to optimize the alpha and L1/L2 ratio of the Elastic Net model. The converged alpha value was \(0.0001\), and the converged L1/L2 ratio was 0.85, which strongly leans toward the L1 (Least Absolute Shrinkage and Selection Operator (LASSO)) regularization penalty. Finally, it uses an ExtraTreesRegressor and averages over 100 decision trees to estimate the target property. Each tree in this final step can use 80% of the features, with each leaf having at least two samples and each internal node requiring at least 14 samples to be split.
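The first two stages can be illustrated with a few lines of plain Python. The sketch below shows max-absolute-value scaling of one feature column and the Elastic Net penalty in the scikit-learn parameterization, \(\alpha \rho \lVert w \rVert_1 + \tfrac{1}{2}\alpha (1-\rho) \lVert w \rVert_2^2\), using the converged \(\alpha = 0.0001\) and \(\rho = 0.85\) quoted above. The feature column and weight vector are invented for illustration.

```python
def max_abs_scale(column):
    """Divide a feature column by its largest absolute value."""
    largest = max(abs(v) for v in column)
    return [v / largest for v in column] if largest else list(column)


def elastic_net_penalty(weights, alpha=0.0001, l1_ratio=0.85):
    """Elastic Net regularization term for a given weight vector."""
    l1 = sum(abs(w) for w in weights)      # LASSO part
    l2_sq = sum(w * w for w in weights)    # ridge part
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_sq


scaled = max_abs_scale([-4.0, 2.0, 1.0])    # invented feature column
penalty = elastic_net_penalty([1.0, -2.0])  # invented weights
```

With an L1/L2 ratio of 0.85, most of the penalty comes from the L1 term, which tends to push the weights of uninformative features to exactly zero.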

We can also extract feature importances from the XGBoost model, and we report the 10 highest-ranked features in Supporting Information Figure 6. Similar to the perovskite results, the XGBoost model is dominated by a single feature, namely the bandgap of the parent-3D material (importance score 0.44). This feature is also very important for the SISSO models, as shown in Supporting Information Table 7, signaling that the similarity of the performance of these three models could be attributed to this feature. In fact, the selected SISSO model is

$$\begin{aligned} E^{2D}_{Bandgap} \approx {}& c_0 + a_0 \cdot \frac{Period^{ave} \cdot r^{ave}_{cov, slater}}{\left( r^{min}_{cov, cordero}\right) ^3} \\ & + a_1 \cdot \frac{E_{Bandgap}^{3D, parent}}{r^{min}_{rahm}} \left( r^{min}_{vdw} + r^{min}_{cov, cordero}\right) \end{aligned}$$
(2)

where \(c_0=-0.3296\), \(a_0=1.69\times 10^{3}\), \(a_1=6.59\times 10^{-1}\), \(r^{min}_{vdw}\) is the minimum Van der Waals radius of the atoms in the material, \(r^{ave}_{cov, slater}\) is the average Slater covalent radius of an atom in the material, \(r^{min}_{cov, cordero}\) is the minimum Cordero covalent radius of an atom in the material, \(E_{Bandgap}^{3D, parent}\) is the bandgap of the 3D-parent material, \(r^{min}_{rahm}\) is the minimum Rahm radius of an atom in the material, and \(Period^{ave}\) is the average period of the elements in the material. This descriptor primarily represents a simple rescaling and shifting of the bandgap of the 3D-parent material, further implying the dominant role this feature plays in describing the bandgap of the 2D material.
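The "rescaling and shifting" reading of Eq. 2 is easy to see when the equation is written as a function: for fixed radii, the prediction is linear in the parent bandgap with slope \(a_1 (r^{min}_{vdw} + r^{min}_{cov, cordero}) / r^{min}_{rahm}\). The sketch below transcribes Eq. 2. The argument names are our own and the inputs are placeholder numbers, not real descriptor values.

```python
def bandgap_2d(period_ave, r_cov_slater_ave, r_cov_cordero_min,
               e_gap_3d_parent, r_rahm_min, r_vdw_min,
               c0=-0.3296, a0=1.69e3, a1=6.59e-1):
    """Evaluate the SISSO descriptor of Eq. 2 for one material."""
    # Composition-only term (independent of the parent bandgap).
    term0 = period_ave * r_cov_slater_ave / r_cov_cordero_min ** 3
    # Parent bandgap multiplied by a dimensionless ratio of radii.
    term1 = e_gap_3d_parent / r_rahm_min * (r_vdw_min + r_cov_cordero_min)
    return c0 + a0 * term0 + a1 * term1


# Placeholder radii and period (not real descriptor values).
fixed = dict(period_ave=1.0, r_cov_slater_ave=1.0, r_cov_cordero_min=10.0,
             r_rahm_min=2.0, r_vdw_min=2.0)
gap_low = bandgap_2d(e_gap_3d_parent=1.0, **fixed)
gap_high = bandgap_2d(e_gap_3d_parent=2.0, **fixed)
slope = gap_high - gap_low  # change in prediction per unit of parent bandgap
```

For these placeholder inputs the slope equals \(a_1 \cdot 12 / 2\), which shows how the radius ratio acts purely as a material-dependent scale factor on the parent bandgap.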

Figure 2

Parity plots for the XGBoost, TPOT, Roost, and SISSO models on the 2D material bandgap problem. Included are the training and testing sets. A diagonal line indicating parity is drawn as a guide to the eye. Regression statistics for the models shown on this plot can be found in Table 2.

2D material exfoliation energy

In the case of the 2D material exfoliation energy problem, the training and test-set statistics for the XGBoost, TPOT, Roost, and SISSO models can be found in Table 3. In this case, our feature selection methodology down-selected the 6351 rows of our dataset into 3388 rows. The train/test split further divided this into a training set of 3049 entries, and a test set of 339 entries. Generally, we see the worst performance of the models in this problem, compared to the perovskite volume and 2D material bandgap problems.
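As a quick consistency check, the split sizes reported for all three problems (129/15, 1270/142, and 3049/339) match a 90/10 split in which the test-set size is rounded up. That ratio and rounding rule are our inference from the reported numbers, not something stated in this section; the sketch below simply reproduces the arithmetic.

```python
def split_sizes(n_rows, test_percent=10):
    """Return (n_train, n_test) with the test size rounded up."""
    # Integer ceiling of n_rows * test_percent / 100, avoiding float error.
    n_test = (n_rows * test_percent + 99) // 100
    return n_rows - n_test, n_test


# Dataset sizes after filtering: perovskites, bandgaps, exfoliation energies.
sizes = {n: split_sizes(n) for n in (144, 1412, 3388)}
```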

TABLE 3 Performance metrics for the XGBoost, TPOT, Roost, and SISSO models on the 2D material exfoliation energy problem.

A set of parity plots for all four models is presented in Figure 3. To facilitate easier comparison at experimentally relevant energy ranges, we have zoomed the plot in such that the highest exfoliation energy is 2 eV. Plots showing the entire energy range explored can be found in Supporting Information Figure 8. Here, we find that all models perform generally poorly, with the largest errors occurring at higher exfoliation energies in the case of XGBoost, TPOT, and SISSO (see Figure 3). This time, only TPOT achieves the best test-set \(R^2\) and RMSE, although both are still relatively poor, with a test-set \(R^2\) of only 0.603. Roost displays the best test-set MAE, although the model seems to have overfit, as it displays drastically better performance on the training set than it does on the test set. The XGBoost model performs slightly worse than either TPOT or Roost, and the SISSO approach did not perform well for this problem.

The TPOT algorithm results in a relatively complicated model pipeline, with multiple scaling and estimation steps. The pipeline first standardizes the features and then creates a linear model using ridge regression. From here it counts the number of zero and non-zero feature values for each sample and adds these counts to the feature set, and then rescales the features by the maximum absolute value of each feature. It then adds an ExtraTreesRegressor using 100 trees, 70% of the features, a minimum samples per leaf of 6, and a minimum number of samples per split of 15. The next stage is a SelectFwe unit, which down-selects the features according to the FWE [23]. An alpha value of \(0.011\) is selected for this purpose. This is then fed into a linear support vector regressor with a C value of 0.5 and rescaled such that each feature is between 0 and 1. Finally, a last ExtraTreesRegressor unit is used with 100 trees, a maximum of 10% of all features used and a minimum of 5 samples at each split.
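The zero-counting step is the least familiar stage in this pipeline, so a minimal plain-Python sketch is given here. It appends the number of non-zero and zero entries of each sample as two extra features. The column order of the two counts is an assumption made for illustration, and the toy rows are invented.

```python
def add_zero_counts(rows):
    """Append per-sample counts of non-zero and zero feature values."""
    augmented = []
    for row in rows:
        n_zero = sum(1 for value in row if value == 0)
        n_nonzero = len(row) - n_zero
        # Assumed order: non-zero count first, then zero count.
        augmented.append(list(row) + [n_nonzero, n_zero])
    return augmented


rows = [[0.0, 2.0, 0.0], [1.5, 2.0, 3.0]]  # invented toy samples
rows_with_counts = add_zero_counts(rows)
```

For sparse descriptor sets, these counts give later estimators a compact summary of how many descriptors are populated for a given material.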

We again extract feature importances from the XGBoost model (Supporting Information Figure 7), and find the Mendeleev Number again appears as an important feature (importance score 0.08), albeit as the maximum instead of the minimum. Additionally, we see descriptors related to bond strengths in the corresponding elemental systems: average melting points (importance score 0.05), and average heats of evaporation (importance score 0.05).

The list of preselected features can be found in the Supporting Information Table 8. The best SISSO model found for this problem is

$$E_{{Exf}} \approx c_{0} + a_{0} BP^{{\max }} n_{p}^{{ave}} + a_{1} \left( {EA^{{\text{var} }} + EA^{{ave}} } \right) + a_{2} \frac{{q_{{ev}}^{{\min }} }}{{V_{{ICSD}}^{{\max }} }}$$
(3)

where \(c_0=8.29\times 10^{-1}\), \(a_0=-1.48\times 10^{-4}\), \(a_1=-8.56\times 10^{-2}\), \(a_2=1.16\times 10^{-1}\), \(BP^{max}\) is the maximum boiling point of an elemental solid of an atom in the material, \(q_{ev}^{min}\) is the minimum atomic evaporation heat of each element in the material, \(EA\) is the atomic electron affinity of an atom in the material, \(V_{ICSD}^{ave}\) is the average atomic volume in the ICSD database, and \(n_{p}^{ave}\) is the average number of p valence electrons. Examining this equation gives insight into why the SISSO model’s performance is as poor as it is. The \(V^{ave}_{ICSD}\) of graphene is 5.67, lower than any other data point in the dataset. Because the ICSD volume enters the denominator of the final term of Eq. 3, an unusually small value makes that term unusually large. Removing this single data point increases the test \(R^2\) to 0.329 and reduces the MAE, RMSE, and max error to 0.26, 0.40, and 1.88, respectively, all of which are in line with the training results.
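The sensitivity to the ICSD volume can be seen by evaluating Eq. 3 directly. The function below is a minimal transcription of Eq. 3 as printed, with a single `v_icsd` argument standing for the volume statistic in the denominator. The argument names are our own, and the inputs are placeholder numbers chosen only to show the trend, not real descriptor values.

```python
def exfoliation_energy(bp_max, n_p_ave, ea_var, ea_ave, q_ev_min, v_icsd,
                       c0=8.29e-1, a0=-1.48e-4, a1=-8.56e-2, a2=1.16e-1):
    """Evaluate the SISSO descriptor of Eq. 3 for one material."""
    return (c0
            + a0 * bp_max * n_p_ave          # boiling point times p electrons
            + a1 * (ea_var + ea_ave)         # electron affinity statistics
            + a2 * q_ev_min / v_icsd)        # evaporation heat over volume


# Placeholder inputs (not real descriptor values), varying only the volume.
common = dict(bp_max=1000.0, n_p_ave=2.0, ea_var=1.0, ea_ave=1.0, q_ev_min=10.0)
e_small_volume = exfoliation_energy(v_icsd=5.0, **common)
e_large_volume = exfoliation_energy(v_icsd=20.0, **common)
```

With everything else held fixed, shrinking the volume by a factor of four quadruples the last term, which is the mechanism by which a single small-volume entry can dominate the error statistics.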

Figure 3

Parity plots for the XGBoost, TPOT, Roost, and SISSO models on the 2D material exfoliation energy problem. Included are the training and testing sets. A diagonal line indicating parity is drawn as a guide to the eye. Regression statistics for the models shown on this plot can be found in Table 3; the values presented here are only for the data shown. To facilitate comparison at energy ranges that are more experimentally relevant, we have zoomed in the plot to study energies no higher than 2 J/m\(^2\). The full data range is plotted in Supporting Information Figure 8.

Discussion

We have developed a series of models which are capable of generating predictions for (1) the volume per formula unit of a series of ABX3 perovskites, (2) the DFT-calculated bandgap of several 2D materials, and (3) several 2D material exfoliation energies. These problems encompass a variety of outcomes that one may find when training predictive models of materials properties.

Perovskite volume per formula unit

In the case of the volume per formula unit of ABX3 perovskites, we observe all four model types perform well. Overall, we find that the volume per formula unit for ABX3 perovskites can be predicted using only compositional descriptors (i.e., with no structural descriptors). The likely reason all four models perform well despite having no structural information is the general similarity in crystal structure between these systems: they are all perovskites, and therefore all possess very similar crystal structures. Supporting this is that the Roost model, which only leverages the chemical formula as an input, and for which we did not optimize the hyperparameters or architecture, performed just as well on this problem, albeit with some systematic deviation from parity at intermediate volumes (see Figure 1). Although interpretability is reduced by virtue of being a neural network, we can still achieve an important insight from this model: just by knowing the chemical formula of the system, we can achieve accurate predictions of perovskite volumes, which further justifies our use of compositional descriptors (see Sect. “Compositional Descriptors”) on this problem as we move to the SISSO, XGBoost, and TPOT models. Additionally, we note this performance was achieved with a dataset containing only 129 entries, compared to the original Roost paper [34] that leveraged approximately 275,000 entries from the OQMD dataset [77].

Like the Roost model, we have difficulty in interpreting the pipeline generated by TPOT. The TPOT model delivers the best performance, which is clearly visible from the parity plot in Figure 1. This performance came at a price, however: the rather complex pipeline contains multiple feature selection steps, three estimators stacked together (the predictions from the previous are added as a new feature to the next), and a LASSO model fit using Least Angle Regression.

Entering into the realm of interpretability, although the XGBoost model does not produce a direct formula for perovskite volumes, we can still gain some insight using it. It is still, however, relatively accurate, and allows us access to a feature importance metric (see Supporting Information Figure 5). In this case, we see the five most important features are the average Rahm [73, 74] atomic radius, average UFF [75] atomic radius, sum of elemental velocities of sound in the material, average Ghosh [78] electronegativity, and the sum of the Pyykko [79] triple bond covalent radii. Overall, we see a strong reliance on descriptors of atomic radius, which as we noted in the TPOT discussion makes intuitive sense.

Finally, the SISSO model (Eq. 1) offers the most direct interpretation, as it is simply an equation. Immediately, we see that a variety of descriptors related to volume are important. This result is highly intuitive and is not surprising when we consider that we are predicting volume.

The overall good performance of SISSO for this application is promising, as it is one of the most accurate models, while being by far the most interpretable. This represents a key advantage of symbolic regression: if an accurate model can be found, its results will be easy to understand and analyze. Moreover, we note that we are not alone in the literature when it comes to leveraging SISSO to generate models of perovskite properties; the last several years have seen success in the creation of models of perovskite properties with this tool. The work of Xie et al. [61] achieved good success in predicting the octahedral tilt in ABO3 perovskites, the work of Bartel et al. [60] resulted in the creation of a new tolerance factor for ABX3 perovskite formation, and Ihalage and Hao [59] leveraged descriptors generated by SISSO to predict the formation of quaternary perovskites with formula \((\text {A}_{1-x}\text {A}^\prime _x)\text {BO}_3\) and \(\text {A}(\text {B}_{1-x}\text {B}^\prime _x)\text {O}_3\).

2D material bandgap

The 2D material bandgap models did not achieve the same performance as for the perovskite systems (see Figure 2). Even in the case of TPOT, SISSO, and XGBoost, which still had the best test-set performance by most metrics, there were a few outliers. Specifically, we find that the test-set MAE for the models ranged from 0.273 eV to 0.89 eV (Roost) relative to the PBE DFT calculations reported by the 2DMatPedia. Putting this number in perspective, we note the recent work of Tran et al. [80], which benchmarked the bandgap predictions of several popular DFT functionals for many of the systems in the C2DB; the work identified that the PBE functional exhibited an MAE of 1.50 eV relative to the G0W0 method. Other investigators have studied the prediction of 2D material bandgaps: Rajan et al. [81] achieved a test-set MAE of 0.11 eV on a dataset of 23,870 MXene systems (which, as far as we are aware, has not been made publicly available) using a Gaussian Process regression approach, with DFT-calculated properties including the average M–X bond length, volume per atom, MXene phase, and heat of formation, and compositional properties including the mean Van der Waals radius, standard deviation of periodic table group number, standard deviation of the ionization energy, and standard deviation of the melting temperature. Zhang et al. [82] improved on this error slightly, achieving a test-set MAE of 0.10 eV on the C2DB dataset (around 4000 entries) [14] with both Support-Vector Regression and Random Forest approaches, albeit using descriptors such as the Fermi-energy density of states and total energy of the system (requiring further DFT work for additional predictions). In contrast to both approaches, which used DFT-calculated values that would need to be obtained for new systems to be predicted, the only DFT-calculated value we leverage in our feature set is a bulk bandgap tabulated on the Materials Project [18]. Thus, although our TPOT model had a slightly higher MAE, we note that it would not require further DFT work to generate new predictions.

In addition, we note that although we considered the bandgap of the corresponding bulk material, we did not consider the crystal structure of the corresponding bulk material. As the electronic properties are heavily influenced by the structure, future work should evaluate the effect of crystal structure on bandgap models in order to ensure robustness across a wide array of structures.

As 2D systems are still relatively novel, we note that much more work has been performed in the 3D materials space, particularly in the leveraging of neural networks to predict bandgaps. The recent Atomistic LIne Graph Neural Network (ALIGNN) [83] reported a test-set MAE of 0.218 eV for the prediction of bulk materials hosted by the Materials Project [18] (which as of October 2021 has over 144,000 inorganic systems). The Materials Graph Network (MEGNet) architecture [84] achieved a test-set MAE of 0.32 eV on the bulk systems of the Materials Project. Although these neural network models are trained on 3D systems, we note that they do not leverage DFT properties (which, we reiterate, would cause any resulting model to require a DFT calculation for future prediction) and had access to much larger datasets than the training set we obtained after filtering the 2DMatPedia entries (see Sect. “Data Filtering”). Overall, although the systems we investigate are not 3D bulk systems, we believe this puts the TPOT MAE for the bandgap of 2D systems in perspective.

In all 4 models we trained, many of the incorrect predictions occur where the DFT bandgap is 0 eV (which represented 27% of the training set values). Because of this, we tried simplifying the bandgap problem by training an XGBoost model to predict whether the system was a metal (see Supporting Information Section 6.5.3), and showed that we could achieve good results. For the sake of trying a variety of approaches, we also incorporated a purely structural fingerprint, the Sine Matrix Eigenspectrum (see Supporting Information Section 6.5.1). As this descriptor resulted in some rather large vectors (of length 40, the maximum number of atoms in any system) with little direct physical intuition, we do not directly include it for the purposes of this section. Ultimately, the fact that the Sine Matrix Eigenspectrum provides a useful model indicates that incorporating structural features can provide useful information for predicting the bandgap.

If we take a closer look at the Roost model, we can see a poor generalization to the test set (see Table 3). This indicates that we have likely caused it to overfit (which could have been mitigated, for example, through the use of early stopping). Given that Roost is a purely compositional model, this reinforces our conclusion that structural descriptors are necessary for the prediction of the bandgap of these systems.

Future work on this problem may achieve better performance on the bandgap problem by incorporating other structural features (e.g., investigating the bond strengths of the different elements in the system). We also note the very good performance that recent neural network approaches have had on the 3D bandgap problem [83, 84], likely due to their choices in representation of the structure of the 3D systems. Similar to how the Roost model achieves good success when compositional descriptors are appropriate, we may find good success in leveraging neural network approaches when structural features are required. We note here that Deng et al. [31] achieved good results on a variety of molecule properties by incorporating various graph representations from different neural network architectures. Hence, future work in this domain may benefit from the incorporation of the information-dense structural fingerprints that may be obtained from neural network-based approaches.

2D material exfoliation energy

We observed some of the worst model performance (across all models) in the case of the 2D material exfoliation energy. Although this dataset is larger than either the perovskite (144 total, 129 in the training set) or bandgap (1,412 total, 1,270 in the training set) datasets, the 3,049 entries in the training set (out of 3,388 total) for the exfoliation energy proved insufficient to achieve good results for any of the models. Moreover, neither the compositional nor the structural features were sufficient to adequately describe the system.

When we predict exfoliation energies, we are predicting the interaction between layers in an exfoliable material. Overall, finding better methods of cheaply approximating these weak interactions may provide better results in the prediction of exfoliation. Additionally, as the number of datasets which contain exfoliation energies increases (such as the 2DMatPedia [11], C2DB [14, 15], and JARVIS [17]), further insight into this problem will be possible, and more-complex (albeit less interpretable) models will become feasible.

Additionally, in order to obtain more-accurate predictions of exfoliation energy, data generated via a more thorough computational treatment may be required. We illustrate this by examining an outlier in the training set at 9.9 J / m2, which all four models heavily under-predicted (by 7–8 eV in the case of XGBoost and TPOT, and by over 9 eV in the case of Roost and SISSO; see Supporting Information Figure 8). Upon closer examination of this system, we find that it is actually a pair of layers containing N atoms (Figure 4A). The 2DMatPedia [11] reports that this system (2dm-id 5985) was not directly sourced by a simulated exfoliation from a bulk structure, but instead was obtained by substituting the atoms in a hypothetical 2D Sb structure (Figure 4B). The Sb structure (2dm-id 4275) was obtained by a simulated exfoliation from a structure obtained from the Materials Project (Figure 4C). The parent bulk material (mp-567409) is reported by the Materials Project [18] to be a monoclinic crystal which undergoes a favorable decomposition (energy above hull is reported as 0.121 eV/atom) to a triclinic system. That being said, as this is a hypothetical 2D system, comparison with the hypothetical 3D bulk system was necessary for the calculation of exfoliation energy. As the prediction of crystal structure is a very challenging field with few easy approximations [85], this may have contributed further to the extreme value of the exfoliation energy. As Zhou et al. report [11], the decomposition energy lends itself better to assessing whether a material is truly stable. Indeed, despite the extremely high exfoliation energy of this hypothetical 2D N system, it is reported by the 2DMatPedia to have a decomposition energy of 0 eV/atom. This too seems somewhat high, as systems containing N-N bonds tend to be high-energy materials, typically undergoing strongly exothermic decomposition to inert, gaseous N2 [86]. Taken in conjunction with the observation that our models all predict exfoliation energies significantly lower than the tabulated value, this gives us reason to believe that this system would be far easier to exfoliate than the 10 eV exfoliation energy implies. Moreover, this system may have a strong energetic preference to decompose further into N2, which additional DFT work could reveal. Overall, this underscores the importance of obtaining high-quality data, and filtering that high-quality data, for the training of interpretable models.

Figure 4 Illustrations of (A) a N-containing system (2dm-id 5985) which persisted as a large outlier across all exfoliation models in the training set, (B) the Sb structure (2dm-id 4275) the N-containing system was derived from, and (C) the bulk structure from Materials Project (mp-567409) from which the exfoliation of the Sb system was simulated.

Future outlook

As ML is further integrated into materials discovery workflows, we anticipate that the numerous successes neural networks have presented [1,2,3, 32,33,34,35,36,37,38,39,40,41, 87,88,89] will continue to propel them onto the cutting edge of chemical property prediction. This comes with the challenge of honing our techniques for their interpretation, an area which has seen much interest in recent years, and where there is still plenty of opportunity for further development [90, 91]. We also expect AutoML techniques such as TPOT will continue gaining traction in materials discovery, due to the amount of success and attention they have recently had [43,44,45,46,47,48]. This too presents the challenge of interpretability if highly complex pipelines are generated (see Sect. 2.1 and 3.1). We note here that part of the value that AutoML techniques bring is the ability to make advanced techniques accessible to a wider audience of researchers by lowering the barrier to entry. Hence, we expect that the problem of interpretation may be compounded for AutoML (and especially NAS) systems: the ability to automatically extract some level of interpretation from the generated pipelines is important for automation to make ML truly accessible to non-experts. Overall, we expect that as neural network models and AutoML algorithms continue to grow in capability and complexity, work in developing the tools and techniques needed to interpret them will see greater attention.

In contrast with the challenge of interpreting neural networks or the pipelines found by AutoML systems, symbolic regression tools like Eureqa and SISSO yield an exact equation describing the model and are thus easier to interpret. This makes it easier to achieve key insights with physical interpretations, such as the very intuitive way in which SISSO is able to describe the systems. Overall, despite its reduced ability to predict the exfoliation energy of a material when compared to the models of TPOT, XGBoost, and Roost, we note the mathematical equations returned by SISSO provide a direct relationship between the target properties and model predictions. Additionally, in the case of the exfoliation energy, we believe that we may see further improvements by including richer structural information. We base this on the observation that the Roost model performed poorly on both the bandgap and exfoliation energy problems; recalling that Roost is only provided the chemical formula of the system, this could indicate that compositional descriptors alone are insufficient to describe these properties. Indeed, it is well known that structure and energy are intimately related (the fundamental assumption of geometry optimization techniques is that energy is a function of atomic position), hence it can be inferred that exfoliation energy and structure are similarly related. In the case of bandgaps, we note that there is also a strong dependence on structure; Chaves et al. [68] note that the number of layers in a 2D material can strongly influence the bandgap, reporting that differences of up to several eV can occur between the bulk and monolayer form of a material.

Interoperability is still a challenge in the materials discovery ecosystem. Although it is possible to easily convert between different chemical file formats (e.g., by OpenBabel [92]), and packages such as Pymatgen [93], Atomic Simulation Environment (ASE) [94], and RDKit [95] can easily convert to each other’s formats, we note that there is a challenge in calculating features using a variety of different packages. Some tools expect Pymatgen objects (e.g., XenonPy), others expect ASE objects, and still others require RDKit objects (e.g., all of the descriptors in the RDKit library) to calculate features; creating some standard for the interoperability of these packages would therefore be beneficial. Additionally, further efforts should be made to report the sources of data used by featurization packages. We note that MatMiner [96] is exemplary in this regard: each of the featurization classes it defines has a “citation” method returning the appropriate source to credit. Mendeleev [97] is another good example of this; within its documentation, a table lists citations for many (though not all) of the elemental properties it can return. Overall, by placing a stronger focus on i) interoperability and ii) data provenance, the Python materials modeling ecosystem can be made stronger and can therefore help accelerate materials discovery.

Moreover, we note that as models continue to grow in complexity, it will become increasingly important to evaluate that complexity if they are to be used for practical applications. Although we did not benchmark the models in this study for time, if one were to deploy a model in a production setting, it would be important to understand the CPU / GPU and memory requirements of training and inference.

All of the models we have investigated in this work required sufficient training data to avoid overfitting. Although techniques such as cross-validation, early stopping (in the case of neural networks and XGBoost), and train/test splitting can help guard against (and detect) overfitting, having a sufficiently large dataset is of the utmost importance to achieve truly generalizable models. As a result, there is a critical need for data management approaches that satisfy the set of FAIR principles. This crucial need for effective data management has led to the incorporation of data storage tooling in popular chemistry packages including Pymatgen [93], ASE [94], and RDKit [95]. Moreover, advances in both computational capacity and techniques have given rise to studies performing the high-throughput screening of chemical systems [98,99,100]. This has resulted in the development of tools focusing on the provenance of data, such as the Automated Interactive Infrastructure and Database for Computational Science (AiiDA) system [101, 102].

Overall, we have identified a series of key issues that should see more attention as the digital ecosystem surrounding materials modeling continues to develop. First, interpretability of models allows us to derive physical understanding from the available data. This is a key benefit of symbolic regression tools like SISSO, which result in the creation of human-readable equations describing the model. Additionally, increasing the accessibility of ML techniques through automation (such as in the field of AutoML) will allow a wider range of researchers to benefit from advances in modeling techniques. Data management and data provenance are other major issues; addressing them allows us to better understand which datasets can be combined (e.g., when combining DFT datasets, the methodologies should be consistent between them), and helps us understand if something intrinsic to the training data is affecting model performance. These data management goals are a core focus of platforms such as Exabyte [103], which provides an all-in-one solution for i) storing material data and metadata, ii) storing the methodology required to derive a property from a material, iii) providing the means to automatically perform calculations, and iv) automatically extracting calculation results and storing them for the user. This focus on providing a tool that manages materials, workflows, and calculations has allowed Exabyte to be a highly successful platform, which has led to studies involving automated phonon calculations [104] and high-throughput screening of materials for their band structure [105, 106]. Future capabilities of the platform are slated to include a categorization scheme for computational models to provide even more metadata to track the provenance of calculated material properties [107].

Conclusion

In this work, we have performed a series of benchmarks on a diverse set of ML algorithms: gradient boosting (XGBoost), AutoML (TPOT), deep learning (Roost), and symbolic regression (SISSO). These models were used to predict (i) the volume of perovskites, (ii) the DFT bandgap of 2D materials, and (iii) the exfoliation energy of 2D materials. We identify that TPOT, SISSO, and XGBoost tend to produce more-accurate models than Roost, but Roost works well in systems where compositional descriptors are enough to predict the target property. Finally, although SISSO was unable to find an accurate model for the exfoliation energy, it provides a human-readable equation describing the model, facilitating an easier interpretation compared to the other algorithms. We believe that interpretability will remain a key challenge to address as complex techniques (i.e., neural networks and AutoML) become more mainstream within the digital materials modeling ecosystem. Overall, as tools improving the accessibility of machine learning continue to be developed, data provenance and model interpretability will become even more important, as both are critical to ensuring the accessibility of these techniques. By working to ensure that a wider audience of researchers can achieve insight from the rich digital ecosystem of materials design, materials discovery can be accelerated.

Methodology

Data sources

Crystal structures for the perovskite systems were obtained from the “Stable Inorganic Perovskites” dataset published by Körbel, Marques, and Botti [70], as hosted by NOMAD [19]. This dataset contains a total of 144 DFT-relaxed inorganic perovskites identified via a high-throughput screening strategy. Using this dataset, we develop a model of perovskite volume. As we rely on the use of compositional descriptors for these systems, we have scaled the volume of the perovskite unit cell by the number of formula units, such that the volume has units of Å3 / formula unit.

Structures for 2D materials were obtained from the 2DMatPedia [11], a large database containing a mixture of 6,351 real and hypothetical 2D systems. This database was generated via a DFT-based high-throughput screening approach, which investigated bulk structures hosted by the Materials Project database [18] to find systems which may plausibly form 2D structures. Among other things, the 2DMatPedia provides DFT-calculated exfoliation energies and bandgaps, along with a DFT-optimized structure for each material. We use this dataset to develop models for the bandgap and exfoliation energy of 2D materials. Although the dataset reports exfoliation energies in units of eV, to facilitate comparison with other works focusing on 2D material exfoliation energy, we have converted these into units of J / m2. Bandgaps are reported in units of eV.
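
The unit conversion mentioned above can be sketched as follows. This is a minimal illustration rather than the exact procedure used in this work: it assumes that the tabulated exfoliation energy is a total energy per repeating unit (in eV) and that the relevant area is the in-plane area spanned by the first two lattice vectors (in Å). The function names and the example cell are hypothetical.

```python
import math

EV_TO_JOULE = 1.602176634e-19  # exact value under the 2019 SI definition
ANGSTROM2_TO_M2 = 1e-20        # 1 Å^2 = 1e-20 m^2


def in_plane_area(a, b):
    """Area (Å^2) of the parallelogram spanned by lattice vectors a and b."""
    cx = a[1] * b[2] - a[2] * b[1]
    cy = a[2] * b[0] - a[0] * b[2]
    cz = a[0] * b[1] - a[1] * b[0]
    return math.sqrt(cx * cx + cy * cy + cz * cz)


def ev_to_j_per_m2(energy_ev, area_angstrom2):
    """Convert an energy per cell (eV) into an energy per area (J/m^2)."""
    return energy_ev * EV_TO_JOULE / (area_angstrom2 * ANGSTROM2_TO_M2)


# Hypothetical square cell with 3 Å in-plane lattice vectors.
area = in_plane_area((3.0, 0.0, 0.0), (0.0, 3.0, 0.0))
energy_j_per_m2 = ev_to_j_per_m2(0.5, area)
```

If the source data were instead normalized per atom, the energy would first need to be multiplied by the number of atoms in the repeating unit.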

Because datasets may change and evolve over time, we note that all datasets used in this work were accessed during the time period between June and December of 2021. Further details on the datasets can be found in Supporting Information Section 6.8.

Feature engineering

To facilitate the development of ML algorithms capable of rapidly predicting material properties, we focus primarily on features that do not require further (computationally intensive) DFT calculations. A variety of chemical featurization libraries were used to generate compositional and structural descriptors for the systems we investigated, and they are listed in Sects. “Compositional descriptors” and “Structural descriptors,” respectively. Features with values of NaN (which occurred when a feature could not be calculated) were assigned a value of 0.
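
The treatment of missing values described above can be sketched in a few lines. This is an illustrative snippet only; the function name and the feature names in the example are hypothetical.

```python
import math


def fill_missing(features, fill_value=0.0):
    """Return a copy of a feature dictionary with NaN entries replaced."""
    cleaned = {}
    for name, value in features.items():
        if isinstance(value, float) and math.isnan(value):
            cleaned[name] = fill_value
        else:
            cleaned[name] = value
    return cleaned


# Hypothetical feature vector in which one descriptor could not be calculated.
row = {"ave_atomic_radius": 1.42, "sum_sound_velocity": float("nan")}
clean_row = fill_missing(row)
```

Note that replacing NaN with 0 makes a missing value indistinguishable from a true zero, which is a simplification that tree-based models tolerate better than linear ones.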

In the case of the 2D material bandgap, we include the DFT-calculated bandgap of the respective bulk material; we note that these values are tabulated on the Materials Project and can be looked up, thus circumventing the need for further DFT work. We also note that the 2D and 3D material bandgaps are highly correlated with one another. In effect, our model becomes a correction on top of the 3D bandgap term, and its applicability is restricted to systems that have a corresponding 3D parent. We acknowledge this produces a slightly less interesting result, but ultimately included it given the difficulty of the bandgap prediction problem.

Compositional descriptors

Compositional (i.e., chemical formula-based) descriptors were calculated via the open-source XenonPy package developed by Yamada et al. [108]. XenonPy uses tabulated elemental data from Mendeleev [97], Pymatgen [93], the CRC Handbook of Chemistry and Physics [109], and Magpie [110] in order to calculate compositional features. XenonPy does this by combining the elemental descriptors (e.g., atomic weight, ionization potential, etc.) in various ways to form a single composition-weighted value. For example, three compositional descriptors may be obtained with XenonPy by taking the composition-weighted average, sum, or maximum elemental value of the atomic weight. Leveraging the full list of compositional features implemented in XenonPy results in 290 compositional descriptors, which are explained in greater detail within their publication [108].
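
To make the idea of composition-weighted descriptors concrete, the sketch below computes an average, sum, and maximum of one elemental property for a single formula. It does not use the XenonPy API, and XenonPy's exact definitions may differ; the function name, the dictionary keys, and the rounded atomic weights are assumptions made for illustration.

```python
# Illustrative (rounded) atomic weights in g/mol.
ATOMIC_WEIGHT = {"Sr": 87.62, "Ti": 47.867, "O": 15.999}


def compositional_descriptors(composition, elemental_values):
    """Combine one elemental property into formula-level descriptors.

    composition maps each element to its count in the formula,
    e.g. {"Sr": 1, "Ti": 1, "O": 3} for SrTiO3.
    """
    total_atoms = sum(composition.values())
    weighted_sum = sum(
        count * elemental_values[element]
        for element, count in composition.items()
    )
    return {
        "ave": weighted_sum / total_atoms,
        "sum": weighted_sum,
        "max": max(elemental_values[element] for element in composition),
    }


srtio3 = compositional_descriptors({"Sr": 1, "Ti": 1, "O": 3}, ATOMIC_WEIGHT)
```

Repeating this for every tabulated elemental property and every combination rule is what produces a fixed-length feature vector for an arbitrary chemical formula.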

The 290 compositional descriptors were used for the perovskite volume prediction, 2D material bandgap, and 2D material exfoliation energy prediction problems. We note that these descriptors were not used in the Roost model, as it directly takes the composition for its input.

Structural descriptors

Some structural descriptors were calculated using MatMiner [96], an open-source Python package geared toward data-mining material properties. Leveraging MatMiner, the following 9 descriptors were calculated: average bond length, average bond angle, Global Instability Index (GII) [111], Ewald Summation Energy [112], a Shannon Information Entropy-based Structural Complexity (both per atom and per cell), and the number of symmetry operations available to the system. In the case of the average bond length and average bond angle, bonds were determined using Pymatgen’s implementation of the JMol [113] AutoBond algorithm. This list of bonds was also used to calculate an average Coordination Number (CN) over all atoms in the unit cell. Finally, we also took the perimeter:area ratio of the 2D material’s repeating unit.
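
As an illustration of the last descriptor, the sketch below computes a perimeter:area ratio under the assumption that the repeating unit is the parallelogram spanned by the two in-plane lattice vectors. That geometric definition and the function name are assumptions; the text does not specify how the ratio was evaluated.

```python
import math


def perimeter_area_ratio(a, b):
    """Perimeter:area ratio (1/Å) of the parallelogram spanned by a and b."""
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(x * x for x in b))
    cx = a[1] * b[2] - a[2] * b[1]
    cy = a[2] * b[0] - a[0] * b[2]
    cz = a[0] * b[1] - a[1] * b[0]
    area = math.sqrt(cx * cx + cy * cy + cz * cz)
    perimeter = 2.0 * (length_a + length_b)
    return perimeter / area


# Hypothetical rectangular cell with 3 Å and 4 Å in-plane lattice vectors.
ratio = perimeter_area_ratio((3.0, 0.0, 0.0), (0.0, 4.0, 0.0))
```

Because the ratio scales inversely with cell size, it carries information about both the shape and the extent of the repeating unit.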

The structural descriptors were used for the 2D material bandgap and 2D material exfoliation energy problems. We did not use these descriptors in the case of the perovskite volume prediction problem, and we note that they were not used as inputs to the Roost model.

Data filtering

The data filtering methodology was chosen based on the problem at hand. The perovskite volume prediction problem did not utilize any data filtering. In the case of the 2D material bandgap and exfoliation energy prediction problems, the data obtained from the 2DMatPedia were required to satisfy all of the following criteria:

  1. No elements from the f-block, larger than U, or noble gases were allowed.
  2. Decomposition energy must be below 0.5 eV/atom.
  3. Exfoliation energy must be strictly positive.

Additionally, in the case of the 2D material bandgap, data were required to have a parent material defined on the Materials Project. This was done because we use the Materials Project’s tabulated DFT bandgap of the bulk system as a descriptor for the bandgap of the corresponding 2D system.
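
The filtering criteria above can be expressed as a simple predicate. The sketch below is illustrative: the dictionary keys are hypothetical and do not correspond to the 2DMatPedia schema, and the f-block is taken here to mean the lanthanides and actinides (Z = 57 to 71 and 89 to 103), which is an assumption about the convention used.

```python
NOBLE_GASES = {2, 10, 18, 36, 54, 86, 118}
F_BLOCK = set(range(57, 72)) | set(range(89, 104))  # assumed convention
URANIUM_Z = 92


def allowed_elements(atomic_numbers):
    """Criterion 1: no f-block elements, nothing heavier than U, no noble gases."""
    return all(
        z not in NOBLE_GASES and z not in F_BLOCK and z <= URANIUM_Z
        for z in atomic_numbers
    )


def passes_filter(entry, require_bulk_parent=False):
    """Apply criteria 1 to 3, plus the bulk-parent check used for bandgaps."""
    if not allowed_elements(entry["atomic_numbers"]):
        return False
    if entry["decomposition_energy"] >= 0.5:  # eV/atom
        return False
    if entry["exfoliation_energy"] <= 0.0:
        return False
    if require_bulk_parent and entry.get("bulk_parent_id") is None:
        return False
    return True


# Hypothetical MoS2-like entry with no parent material recorded.
example = {
    "atomic_numbers": [42, 16],
    "decomposition_energy": 0.1,
    "exfoliation_energy": 0.2,
    "bulk_parent_id": None,
}
keep_for_exfoliation = passes_filter(example)
keep_for_bandgap = passes_filter(example, require_bulk_parent=True)
```

Keeping the bulk-parent requirement as an optional flag allows the same predicate to serve both the exfoliation energy and bandgap datasets.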

ML models

For each dataset investigated, 10% of the given dataset was randomly selected to be held out as a testing set. The same train/test split was used for all 4 models considered (XGBoost, TPOT, Roost, and SISSO). To facilitate a transparent comparison between models, in all cases we report the MAE, RMSE, Maximum Error, and R2 score of the test set.
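
The hold-out split and the four reported metrics can be sketched with the standard library alone. This is a minimal illustration, not the code used in this work: the function names and the random seed are hypothetical, and the test-set size is rounded up, which is an assumption.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.1, seed=0):
    """Randomly split sample indices into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)  # rounding up is assumed
    return indices[n_test:], indices[:n_test]


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, maximum error, and R2 for paired values."""
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    n = len(errors)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(ss_res / n),
        "MaxError": max(abs(e) for e in errors),
        "R2": 1.0 - ss_res / ss_tot,
    }


train_idx, test_idx = split_indices(144)
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

With the test-set size rounded up, a 10% hold-out of 144, 1,412, and 3,388 entries leaves 129, 1,270, and 3,049 training entries, matching the training-set sizes quoted in the text. Reusing one stored split for every model keeps the test-set metrics directly comparable.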

Gradient boosting with XGBoost

For details on how XGBoost works, we refer the reader to the XGBoost publication by Chen and Guestrin [24] and to the package’s documentation located at the following URL: https://xgboost.readthedocs.io/en/stable/. When training XGBoost models, 20% of randomly selected data were held out as an internal validation set. This was used to adopt an early-stopping strategy, where if the model RMSE did not improve after 50 consecutive rounds, training was halted early. When training, XGBoost was configured to optimize its RMSE.
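
The early-stopping rule described above can be illustrated independently of XGBoost. The sketch below applies a patience criterion to a precomputed list of validation RMSE values; it is not XGBoost's implementation, and the function name is hypothetical.

```python
def early_stopping_round(validation_rmse, patience=50):
    """Return (best_round, best_score), stopping once patience is exhausted.

    Training is considered halted when the validation RMSE has not improved
    for `patience` consecutive rounds after the best round seen so far.
    """
    best_round = 0
    best_score = float("inf")
    for round_index, score in enumerate(validation_rmse):
        if score < best_score:
            best_score = score
            best_round = round_index
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_score


# With a patience of 3, the late improvement at the end is never reached.
history = [5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 1.0]
stopped = early_stopping_round(history, patience=3)
full = early_stopping_round(history, patience=50)
```

The example shows the trade-off inherent in the patience setting: a small value saves training time but can miss a later improvement, whereas a large value (such as the 50 rounds used here) is more conservative.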

Hyperparameters were optimized via the open-source Optuna [114] framework. The hyperparameter space was sampled using the Tree-structured Parzen Estimator (TPE) approach [115, 116]. To accelerate the hyperparameter search, we leveraged the Hyperband [117] approach for model pruning, using the validation set RMSE to determine whether to prune a model. Hyperband’s budget for the number of trees in the ensemble was set to range between 1 and 256 (corresponding with the maximum number of estimators we allowed an XGBoost model to have). The search space for hyperparameters is found in Table 4.

Table 4 Ranges of hyperparameters screened with Optuna for all XGBoost runs. The search was inclusive of the listed minima and maxima. Hyperparameters use the same variable naming convention as in the XGBoost documentation (e.g., learning_rate).

Additionally, Optuna was used to select a standardization strategy, choosing between Z-score normalization (i.e., subtracting the mean and dividing by the standard deviation) or Min/Max scaling (i.e., scaling the data such that it has minimum 0 and maximum 1). To prevent test-set leakage, the chosen standardizer was fit only with the internal training set, i.e., the portion of the training set that was not held out as an internal validation set. Optuna performed 1000 trials to minimize the validation set RMSE. We report the results of the final optimized model.
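
The two standardization strategies, and the practice of fitting them only on the internal training set, can be sketched as follows. The function names are hypothetical, the population standard deviation is assumed for the Z-score, and constant features are left unscaled to avoid division by zero (also an assumption).

```python
import math


def fit_zscore(train_values):
    """Fit Z-score normalization on training data; return a transform."""
    mean = sum(train_values) / len(train_values)
    variance = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    std = math.sqrt(variance) or 1.0  # constant features are left unscaled
    return lambda value: (value - mean) / std


def fit_minmax(train_values):
    """Fit Min/Max scaling on training data; return a transform."""
    low, high = min(train_values), max(train_values)
    span = (high - low) or 1.0  # constant features are left unscaled
    return lambda value: (value - low) / span


internal_train = [0.0, 5.0, 10.0]
internal_validation = [2.5, 20.0]

# Statistics come from the internal training set only, then are reused.
scale = fit_minmax(internal_train)
scaled_validation = [scale(v) for v in internal_validation]
```

Because the scaler never sees the validation or test data, held-out values may fall outside the range [0, 1] under Min/Max scaling, as the second validation value does in this example.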

AutoML with TPOT

The AutoML tool TPOT was leveraged with a population size of 100 pipelines, with training proceeding for a total of 100 generations. The default maximum evaluation time of 5 minutes per model was used. As TPOT is an actively maintained open-source repository, for the purposes of future replication we enumerate this configuration’s set of allowable components in Table 5. The models listed in this table could be combined in any order any number of times. Models were selected such that their 10-fold cross-validated RMSE was optimized. TPOT also conducts its own internal optimization of model hyperparameters, thus we did not perform our own hyperparameter optimization of the TPOT pipelines.

Neural networks with Roost

The Roost Neural Network (NN) architecture was leveraged using the “example.py” script provided with its source code. Roost is a message-passing graph neural network which leverages the material stoichiometry instead of the material structure for its inputs. For details on the specific architecture (e.g., the number of message-passing layers, activation functions, etc.), we refer the reader to the original paper by Goodall et al. [34]. Our work used the reference implementation found on the GitHub page reported by the original Roost publication.

Models were trained for a total of 512 epochs with the default settings. In the case of Roost models, the only feature provided is the composition of the system, given through the chemical formula.

Symbolic regression with SISSO

The first step of using SISSO is reducing the number of primary features from a list of hundreds down to the tens. This is done due to the exponential computational cost of SISSO with respect to the number of features and the number of rungs being considered. To perform this down-selection, we first generate a rung 1 feature space including all of the primary features and operators that are used in the SISSO calculation. We then check how often each of the primary features appears in the ten thousand generated features that are most correlated to the target property, and preselect primary features on that basis. Additionally, we add units to all of the preselected primary features to ensure all generated expressions are valid.

In many cases, it was easy to infer what the abstract units are for the XenonPy descriptors. In a few cases where the units were not as clear, we compared the reported elemental values of those units to those of known sources (e.g., the NIST WebBook [118] or the CRC Handbook [109]) in order to determine the units. Finally, although it was generally easy to determine the source of a feature, sometimes we were unable to do so. In these cases, we refer to the features as a "XenonPy" feature (for example, “\(r_{XenonPy}\)”).

The optimal number of terms (up to 3) and rung (up to 2), i.e., the number of times operators are recursively applied to the feature space, are determined using a five-fold cross-validation scheme. For all models, we allow for an external bias term to be non-zero and use a SIS selection size of 500. The resulting descriptors were then evaluated using the same external test set used for each of the other methods. To take advantage of SISSO’s ability to generate new composite descriptors and operate in large feature spaces, additional features were included in the SISSO calculations. A full list of features used in the SISSO work can be found in the linked GitHub repository.