1 Introduction

An essential step in data science is selecting a machine learning model that maximizes the measured performance for a given task. The traditional approach trains different models, evaluates their performance on a validation set, and chooses the best model. However, this method is time-consuming and resource-intensive. Automated machine learning is an active area of research that aims to select a suitable model for a given task automatically. Researchers have tried to address the model selection problem through various approaches such as meta-learning [1,2,3], deep reinforcement learning [4], Bayesian optimization [5, 6], evolutionary algorithms [7,8,9], and budget-based evaluation [10].

Analyzing the data characteristics is essential for selecting an appropriate classification model and for feature engineering. However, if we can estimate a model class's empirical classification performance on a dataset a priori, with explainability, it becomes straightforward to pick a suitable classifier model class for the problem. This setup is advantageous when working with large datasets, as evaluating different classifier model classes for model selection is laborious and time-consuming.

Clustering methods group the data points having similar characteristics into neighborhoods or disjuncts of different sizes. Clustering indices [11] are cluster evaluation metrics used to assess the quality of the clusters induced by a clustering algorithm. Clustering indices measure the ability of a clustering algorithm to induce good-quality neighborhoods with similar data characteristics. We hypothesize that the clustering indices provide a low-dimensional vector representation of the dataset characteristics with respect to a clustering method. When we use different clustering methods to compute the clustering indices, we can generate different views of the dataset characteristics. Combining multiple views of the data characteristics in terms of clustering indices gives us a rich feature space representation of the dataset characteristics.
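As a concrete illustration of this representation, the sketch below computes a small clustering-indices vector for a dataset using a single clustering method. It is a minimal sketch, assuming scikit-learn is available; the three internal indices and one external index used here are placeholders for the much larger index set configured later (Table 3), and the function name is ours.

```python
# Minimal sketch: represent a dataset by a few clustering indices computed
# under one clustering method (k-means). The paper uses a far larger set of
# internal and external indices; these are the ones readily available in
# scikit-learn and serve only as placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score, adjusted_rand_score)

def clustering_index_view(X, y, n_clusters=2, seed=0):
    """Return a small clustering-indices feature vector for the dataset (X, y)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    internal = [silhouette_score(X, labels),
                calinski_harabasz_score(X, labels),
                davies_bouldin_score(X, labels)]
    # External index: compares the induced clusters with the class labels y.
    external = [adjusted_rand_score(y, labels)]
    return np.array(internal + external)
```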

We use the term model fitness to denote the ability of a model class to learn a classification task on a given binary class dataset. The empirical model fitness of a dataset can be measured based on the expected classification performance of a model class on a dataset. We use the \(F_1\) score as the classification performance metric in our experiments, but the idea is agnostic to any metric. A classifier’s empirical performance depends on the classifier hypothesis’s ability to model the data characteristics [12]. We hypothesize that the dataset characteristics represented by the clustering indices correlate strongly with the empirical model fitness.

In this paper, we propose CIAMS, a novel Clustering Indices-based Automatic Model Selection method from a set of model classes by estimating the empirical model fitness for a given binary class dataset from only its clustering indices representation of the data characteristics. We model the relationship between the clustering indices of a dataset and the empirical model fitness of a model class as a regression problem. We learn a regressor for each model class from a set of model classes on several datasets by randomly drawing subsamples with replacement. Constructing multiple subsamples allows us to increase the number of data points to train our regression model. Another advantage of using subsamples is to provide broader coverage of the dataset variance characteristic for regression modeling.

We train independent regressors for each model class with the best achievable classification performance for each dataset as the output and its respective estimated clustering indices as the input predictors. We tune every candidate classifier model for maximum classification performance with respect to the dataset subsample. This way, the regressors learn the mapping between clustering indices and the maximum achievable classification performance for every model class. The automatic model selection process, implemented as a prediction task, can then estimate the model fitness directly in terms of the expected classification performance. Using the estimated model fitness, we rank the candidate model classes to suggest the top model classes for a given dataset as the recommendation. We refrain from suggesting the model class hyper-parameters. We believe mapping the hyper-parameters to the dataset characteristics is a separate problem, which we mark as one of the future extensions of CIAMS. We validate our model selection regressor through cross-validation using 60 (sixty) public domain binary class datasets and observe that our model recommendation is accurate for over three-fourths of the datasets.

We extend our automatic model classification method to an end-to-end Automated Machine Learning platform to offer binary classification modeling as a service. We use the top3 model classes predicted by our model selection method to build tuned classifiers using the labeled portion of the given production dataset. We define production dataset as the input provided by an end-user containing labeled and unlabeled portions, from which we learn these classifiers, followed by predicting the labels for the unlabeled data points. The best-performing model, chosen through cross-validation, among the top3 tuned classifier models is offered as a service to predict labels for the unlabeled portion of the production dataset. We validate our platform against other commercial and noncommercial automated machine learning systems using a different set of 25 (twenty-five) public domain binary class datasets of varied sizes. The comparison experiment shows that our platform outperforms other systems with an excellent average rank of 1.68, proving its viability in building practical applications.

The main contributions of this paper are:

  • A novel hypothesis that the classification performance of a model class for a binary class dataset is a function of the dataset's clustering indices.

  • A novel method to estimate the expected classification performance of a model class for a binary class dataset without building the classification model.

  • A novel application of clustering indices for automatic model selection from a list of model classes for a given binary class dataset.

  • A novel automated machine learning platform (Automated ML) for learning and deploying a classifier model as a service.

We organize the remainder of the paper as follows. Section 2 lists the related techniques and approaches for model selection and model fitness assessment. Section 3 summarizes our approach to automatic model selection. Section 3.2 gives a detailed explanation of our proposed model selection system. Section 4 describes the entire experimental setup and parameter configuration. Section 5 validates our system and narrates the results obtained from the experimental study. Section 6 provides the concluding remarks and next steps.

2 Related work

In this section, we summarize various approaches from the literature for automatic model selection organized into different categories.

  • Random search: early research cast automated model selection as a Combined Algorithm Selection and Hyperparameter optimization (CASH) problem. Amazon's Sagemaker [13, 14] is an example of a commercial Automated ML platform that follows the CASH paradigm. H\(_2\)O AutoML [15, 16], an open-source Automated ML platform, uses fast random search and ensemble methods like stacking to achieve competitive results.

  • Bayesian optimization: Auto-Weka [5] and Auto-sklearn [6] are Automated ML extensions of the popular Weka and Scikit-learn libraries, respectively. Auto-Weka [5] uses a state-of-the-art Bayesian optimization method, random-forest-based Sequential Model-based Algorithm Configuration (SMAC), for automated model selection. Auto-sklearn [6] builds on top of the Bayesian optimization solution in Auto-Weka by including meta-learning to initialize the Bayesian optimizer and ensembling to provide high predictive performance. Microsoft Azure Automated ML [17, 18] uses Bayesian optimization and collaborative filtering for automatic model selection and tuning.

  • Evolutionary algorithms: TPOT [7] is a Python-based framework that uses the Genetic Programming algorithm to evolve and optimize tree-based machine learning pipelines. Autostacker [8] is similar to TPOT, but stacked layers represent the machine learning pipeline. AutoML-Zero [9] uses basic mathematical operations as building blocks to discover complete machine learning algorithms through evolutionary algorithms. FLAML [19] uses an Estimated Cost for Improvement (ECI)-based prioritization to find the optimal learning algorithm in low-cost environments.

  • Deep reinforcement learning: AlphaD3M [4] uses deep reinforcement learning to synthesize various components in the machine learning pipeline to obtain maximum performance measures.

  • Meta-learning: Brazdil et al. [1, 20] use a k-nearest neighbor approach on the dataset characteristics to provide a ranked list of classifiers, with ranking methods based on accuracy and time information. AutoDi [2] uses word-embedding features and dataset meta-features for automatic model selection. AutoGRD [3] represents the datasets as graphs to extract features for training the meta-learner. AutoClust [21] uses clustering indices as meta-features to automatically select suitable clustering algorithms and hyper-parameters. Sahni et al. [22] developed a meta-feature approach to automatically select a sampling method for imbalanced data classification. Santhiappan et al. [23] propose a method that uses clustering indices as meta-features to estimate the empirical binary classification complexity of a dataset.

Our method follows the meta-learning paradigm, wherein we learn the relationship between the extracted meta-features of the dataset in terms of clustering indices and the expected classification performance of a model class. The trained meta-learner predicts the classification performance of a model class for an unseen dataset without building a classifier model. The differentiation among various meta-learning methods pivots on the choice of the meta-features extracted from the dataset. The following list presents the meta-features from the literature, organized into different categories.

  • Statistical and information-theoretic [24, 25]: these measures include the number of data points in the dataset, number of classes, number of variables with a numeric and symbolic data type, average and variance of every feature, the entropy of individual features, and more. These metrics capture important meta information about the dataset.

    • Class boundary: the nature of the class margin of a dataset is an essential characteristic reflecting its classifiability. Measures such as inter-class and intra-class nearest-neighbor distance, error rate, and non-linearity of the nearest-neighbor classifier try to capture the underlying class margin properties such as shape and narrowness between classes.

    • Class imbalance: machine learning methods in their default settings are biased toward learning the majority class due to a lack of data points representing the minority class. Features such as entropy of class proportions and class-imbalance ratio strongly reflect dataset characteristics such as the classification complexity of a dataset.

    • Data sparsity [26]: sparse regions in the dataset affect the classifier's learning ability, leading to poor performance. The average number of features per dimension, the average number of PCA dimensions per point, and the ratio of the PCA dimension to the original dimension capture sparsity in the dataset.

  • Feature-based [27, 28]: the learning ability of methods is highly correlated with the features' discriminatory power. Fisher's discriminant ratio, overlap region volume, and feature efficiency are among many measures from the literature that try to capture this ability to learn.

  • Model based [29,30,31]: the hyper-parameters of a model directly affect the model performance. For instance, hyper-parameters such as the number of leaf nodes, maximum depth, and average gain-ratio difference serve as meta-features in a tree-based model. Likewise, the number of support vectors required in SVM modeling is a meta-feature.

    • Linearity [27, 32]: most classifiers perform well when the dataset is linearly separable. To capture the inherent linearity present in the data, we use model-based measures, such as the error rate of a linear SVM and the non-linearity of a linear classifier, as meta-features.

    • Landmarking: Bensusan et al. [33, 34] noted that the performance measures of simple learning algorithms (baselines), provided as meta-features, correlate strongly with the classification performance of the considered algorithm. Fürnkranz et al. [35] explored different landmarking variants, such as relative landmarking and sub-sample landmarking, and their effectiveness in several learning tasks, such as decision tree pruning.

      • Landmarking is one of the most effective methods for meta-learning-based automatic model selection [35]. Landmarking uses simple classifiers’ performance on a dataset to capture the underlying characteristics. Landmarking requires building several simple classifier models on the dataset to extract features. In comparison, our proposed model selection approach requires building dataset clusters to extract clustering indices features. We compare the performance of landmarking and clustering indices features through an extrinsic regression task in Sect. 5.2.

      • The computational cost of extracting the clustering indices as meta-features for big datasets is mitigated through subsampling, similar to Landmarkers [35]. Petrak et al. [36] establish the "Similarity of regions of Expertise" property, which says that the meta-features from several subsamples of the dataset collectively represent the characteristics of the full dataset. We also empirically validate in Sect. 3.1 that clustering indices uphold this property.

  • Graph-based [37, 38]: graph representation of a dataset can help extract useful meta-features such as mean network density, coefficient of clustering, and hub score for several meta-learning tasks.

There are several contributions in the literature to benchmark AutoML methods. Zöller et al. [39] introduce a mathematical formulation covering the complete procedure of automatic ML pipeline synthesis and compare it with existing problem formulations. Their benchmark encompasses eight Hyper-Parameter Optimization (HPO) methods and six popular AutoML frameworks on real datasets. Santu et al. [40] introduce a new classification scheme for AutoML systems, using a seven-tiered schematic to distinguish these systems based on their level of autonomy. The authors describe what an end-to-end machine learning pipeline looks like and which subtasks of the machine learning pipeline have already been automated. Their paper defines each level of the taxonomy according to the scope of automation support provided. He et al. [41] focus only on deep-learning-based AutoML and present a survey of methods for HPO and Neural Architecture Search (NAS).

In this work, we do not aim to perform an extensive benchmark, as our goal is only to propose a new method for automatic model selection, which we also extend to an Automated ML system for binary classification tasks. Therefore, we limit our benchmarking to a few popular AutoML methods to establish the validity of our approach.

Data characterization is a crucial step in understanding the nuances of a dataset. When specific dataset properties are known, they help in choosing a suitable method or algorithm to solve the task. Several meta-features discussed in the literature target specific dataset characteristics. Computing an exhaustive list of meta-features for a dataset is a laborious process that is costly in terms of time and computing power. We choose clustering indices as the meta-features to represent dataset characteristics. Clustering indices are evaluation metrics that estimate how well a clustering algorithm groups data with similar characteristics. Clustering indices are scalar values that indicate the nuances in a dataset under different clustering assumptions. Computing the clustering indices is a parallelizable process whose time complexity is proportional to the size of a dataset subsample. We get more comprehensive coverage of the data characteristics when we generate clustering indices under different clustering assumptions.

In principle, the clustering indices approach to represent the data characteristics is similar to the landmarking approach. Despite requiring more computing power for dataset clustering, our experiments in Sect. 5 empirically show that the clustering indices capture a richer dataset characteristic representation for providing better generalization.

3 Our approach to automatic model selection

Data characterization techniques extract meaningful dataset properties that the downstream machine learning tasks and applications could use to improve performance. We hypothesize that the clustering indices computed from dataset clustering represent dataset characteristics concerning a specific clustering method. Clustering algorithms make different clustering assumptions for grouping the data points into neighborhoods. Clustering indices are quality measures for validating the clusters induced by a clustering algorithm. When we use clustering indices to measure the performance of such clustering algorithms, they inherently capture different properties of the datasets. When a clustering index is independent of any external information, such as data labels, the index becomes an internal index, or quality index [11]. On the contrary, when the clustering index uses data point labels, it becomes an external index.

Table 1 Notations

Table 1 lists the notations used for representing different entities. Given a binary labeled dataset \(D = \{\langle X_{i}, y_{i}\rangle \}_{i=1}^{n}\), where the data instance vector \(X_{i}~\in ~{\mathbb {R}}^p\) and the binary class label \(y_{i} \in \{{-1},1\}\), the objective of our automatic model selection system is to determine the best model class \(C_{best}\) from a set of model classes \({\mathcal {C}} = \{C_{1}, C_{2}, \cdots , C_{m}\}\) that provides the best classification performance. We hypothesize that the clustering indices representing the characteristics of a binary class dataset D shall strongly correlate with the expected classification performance of a model class \(C_i\in {\mathcal {C}}\) for the dataset D.

Let \({\mathcal {I}}=\{I_1, I_2,\cdots , I_t\}\) be the selected clustering indices containing internal and external measures. Let \({\mathbb {F}}=\{{\mathcal {F}}_1,{\mathcal {F}}_2,\ldots , {\mathcal {F}}_t\}\) be the set of functions that map a given dataset D to a clustering index I, defined as \({\mathcal {F}}_j(D;A):D\rightarrow I_j\), where the data instance matrix \(\textbf{X}\) of \(D=\langle \textbf{X}, \textbf{y}\rangle \) transforms to a scalar cluster index value \(I_j\). Each function \({\mathcal {F}}_j(D; A)\) represents running a clustering algorithm A on the dataset D, followed by extracting the clustering index \(I_j\in {\mathcal {I}}\). The function \({\mathbb {F}}(D; A)\) represents processing the dataset D independently by all the functions \({\mathcal {F}}_j\in {\mathbb {F}}\) as a Multiple Instruction Single Data (MISD) operation.

$$\begin{aligned} {\mathbb {F}}(D;A)\equiv \left[ {\mathcal {F}}_1(D;A), {\mathcal {F}}_2(D;A),\ldots ,{\mathcal {F}}_t(D;A)\right] ^\textrm{T} \end{aligned}$$
(1)

Let the dataset transformation to the cluster indices feature space be defined as \({\mathbb {F}}(D;A):D\rightarrow \textbf{I}\), where \(\textbf{I}=\left[ I_1,I_2,\ldots ,I_t\right] ^\textrm{T}\). Let the average \(F_1\) score be the performance metric for evaluating the model fitness of a model class \(C_i\). Let R be a regression task that learns the mapping between the clustering indices and the expected classification performance of the model class \({C}_i\) defined as \(R(\textbf{I};{C}_i):\textbf{I}\rightarrow [0,1]\). We define individual regressors \(R_i\) for each model class \(C_i\) as a set \({\mathcal {R}}=\{R_1,R_2,\ldots ,R_m\}\). Now, the objective of the automatic model selection method is to find the best-performing model class \(C_{best}\in {\mathcal {C}}\) for a dataset D based on the maximum output from each of the regressors in \({\mathcal {R}}\).

$$\begin{aligned} i^*&=\arg \max _{1\le i\le m} R_i({\mathbb {F}}(D;A)) \end{aligned}$$
(2)
$$\begin{aligned} C_{best}&= C_{i^*} \end{aligned}$$
(3)
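A minimal sketch of Eqs. (2) and (3) is given below, assuming a dictionary of already-trained per-model-class regressors with a scikit-learn-style predict() method; the function and variable names are ours, not from the paper.

```python
# Pick the model class whose fitness regressor predicts the highest expected
# F1 score from the clustering-indices vector I of dataset D (Eqs. 2-3).
import numpy as np

def select_best_model_class(I, regressors):
    names = list(regressors)
    scores = np.array([regressors[c].predict(I.reshape(1, -1))[0] for c in names])
    best = names[int(np.argmax(scores))]          # C_best = C_{i*}
    return best, dict(zip(names, scores))
```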

Training a regressor \(R_i\) for a model class \(C_i\) requires several samples of the form \(\langle \textbf{I}, O\rangle \), where \(\textbf{I}={\mathbb {F}}(D;A)\) and O is the maximum classification performance score achievable (for instance, the \(F_1\) metric) for the tuned model class \(C_i\), computed by the function \({\mathcal {Q}}(D; C_i)\).

We understand that the cluster indices feature vector \(\textbf{I}\) is computed for each dataset D. Assuming we have a collection of datasets \({\mathcal {D}}=\{D_1, D_2,\ldots , D_N\}\) for training the regressor \(R_i\), if we consider each dataset as a single instance vector of clustering indices, the number of training samples gets limited by the size of the dataset collection \({\mathcal {D}}\). It becomes hard to train the regressor model due to the shortage of training samples. We train the regression functions \(R_i\in {\mathcal {R}}\) using the dataset subsamples instead of the full dataset to overcome the data shortage problem. In this process, every dataset \(D_i\) undergoes random subsampling with replacement to generate b subsamples of constant size h as \({\mathcal {B}}_i = \{B_{i1},B_{i2},\cdots , B_{ib}\}\), where \(B_{ij} = \{d \Vert d\sim D_i\}_{k=1}^{h}, \Vert B_{ij}\Vert =h\).

An advantage of using subsamples instead of the full dataset is generating more variability in the datasets used for training the regressors, making it robust to the dataset variance. Another advantage is the ease of generating clustering indices from subsamples compared to working with large datasets in a single shot. In the single shot mode, we run the clustering algorithms on the full dataset population to estimate the clustering indices. Usually, running the clustering algorithms on the dataset samples (subsamples mode) is significantly faster than running on the full population (single-shot mode).
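A sketch of the subsample generation step is shown below, assuming per-class sampling with replacement to keep the class proportions roughly intact (Sect. 3.1 notes that the sampling is stratified); the helper name and the rounding scheme are our own.

```python
# Draw b subsamples of (approximately) size h with replacement from a
# labeled dataset, stratified by class so both classes stay represented.
import numpy as np

def draw_subsamples(X, y, b, h, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    proportions = counts / counts.sum()
    subsamples = []
    for _ in range(b):
        idx = []
        for c, p in zip(classes, proportions):
            pool = np.flatnonzero(y == c)
            k = max(1, int(round(p * h)))          # per-class quota
            idx.extend(rng.choice(pool, size=k, replace=True))
        idx = np.asarray(idx)
        subsamples.append((X[idx], y[idx]))
    return subsamples
```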

3.1 Clustering indices of subsamples vs. population

We use the stratified random sampling method to create subsamples of a dataset. In this section, we analyze how the subsamples represent the characteristics of the data population. We attempt to establish that the cluster indices estimated for a whole population are similar to that of the subsamples by visualizing the cluster indices in a lower-dimensional space through t-SNE [42]-based visualization.

Figure 1 shows the lower-dimensional representation of the dataset's cluster indices for the data population and its subsamples. We observe from the figure that the subsample cluster indices lie near and around the whole-population cluster indices in most cases. The closeness of the cluster indices implies that the subsamples are collectively representative of the dataset population.

Fig. 1

Low dimensional t-SNE visualization of the clustering indices estimated for the whole dataset population and dataset subsamples. The BLACK points represent the cluster indices of the whole dataset population, the GREEN points represent the cluster indices of dataset subsamples that pass Hotelling's \(T^2\) test, and the RED points represent the subsamples that fail the test (colour figure online)

We use Hotelling's two-sample \(T^2\) test [43], a multivariate extension of the two-sample t-test, to check whether a subsample is representative of the whole dataset population in terms of the dataset's predictor variables. Figure 1 shows the subsamples that fail Hotelling's \(T^2\) test (RED dots) along with the passing subsamples (GREEN dots). The subsamples that pass Hotelling's test appear closer to the whole dataset population, shown in BLACK. A few subsamples appear relatively far from the whole population, and these are appropriately flagged as failures by Hotelling's test. This observation confirms the agreement between Hotelling's test responses and the t-SNE visualization of the clustering indices.

One possible reason for some subsamples drifting away from the whole-population cluster indices is the randomness in sampling. The deviant subsamples benefit the training phase because our dataset construction pipeline, described in Sect. 3.3.3, considers each subsample as an independent dataset. Therefore, the far-away subsamples offer additional dataset variance to our regressor \({\mathcal {R}}\) training, which in turn should help the regressors achieve better generalization over unseen datasets.

On the other hand, the deviant subsamples are problematic in the recommendation pipeline, as they might skew the estimated classification performance of the model classes in \({\mathcal {C}}\). Removing such deviant subsamples from the experiment requires the whole population cluster indices that might not be available during testing. To overcome this limitation, we propose to remove subsamples that fail the Hotelling’s \(T^{2}\) test from the recommendation pipeline described in Sect. 3.4 to limit the skew due to random subsampling.
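A hedged sketch of this filter is given below. It computes the classical two-sample Hotelling's \(T^2\) statistic with its F approximation; the pooled-covariance ridge term and the helper names are implementation choices of ours, not from the paper.

```python
# Keep only the subsamples whose mean vector is statistically indistinguishable
# from that of the full dataset (two-sample Hotelling's T^2 test).
import numpy as np
from scipy import stats

def hotelling_t2_pvalue(X1, X2, ridge=1e-6):
    n1, n2, p = len(X1), len(X2), X1.shape[1]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    S += ridge * np.eye(p)                         # numerical stabilizer
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S, d)
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return 1.0 - stats.f.cdf(f_stat, p, n1 + n2 - p - 1)

def keep_representative(subsamples, X_full, alpha=0.05):
    return [(Xs, ys) for Xs, ys in subsamples
            if hotelling_t2_pvalue(Xs, X_full) > alpha]
```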

3.2 Automatic model selection system architecture

The Automatic Model Selection system consists of two independent pipelines, one for training the underlying regression models and another pipeline for recommending the top-performing model classes for a given dataset. Figure 2 illustrates the overall architecture of the proposed automatic model selection system. The upper section is the Training pipeline, and the bottom is the model Recommendation pipeline. The Mapper module of the training pipeline contains the regressors for each model class that learn the mapping between clustering indices and model fitness. The model recommendation pipeline uses the learned regressors (Mapper modules) to predict the expected model fitness in terms of \(F_1\) score (but not limited to) for each model class. The model classes that score the top3 \(F_1\) scores become the best-fit classifier candidates for the given production dataset. We pick the best-performing classifier among the recommended top models through cross-validation to predict the labels of the unlabeled data.

Fig. 2

Architecture of the automatic model selection system

3.3 Training pipeline

The training phase has three subparts, namely (A) Preprocessing, (B) Data construction, and (C) Mappers as regression models, shown as the Training pipeline enclosed in the top dotted rectangle region in Fig. 2.

3.3.1 Preprocessing

Data preprocessing involves multiple sub-tasks to transform a single dataset \(D_i\) into a set \({\mathcal {B}}_i\) of several subsamples generated by stratified random sampling with replacement. The dataset \(D_i\) is divided into 70:30 training and validation partitions for cross-validating the regressor model training. The training and validation partitions undergo random sampling independently to generate their respective subsamples. The constructed subsamples \(\{B_{ij}\}_{j=1}^b\) from a dataset \(D_i\) undergo a cleansing process involving scaling and standardization. At the end of the preprocessing stage, we have a set \({\mathcal {B}}_i=\{B_{i1}, B_{i2},\ldots , B_{ib}\}\) of randomly sampled subsamples from both the training and validation partitions of the dataset \(D_i\). The training process uses the training subsamples for learning and tuning the regressor functions \({\mathcal {R}}\) through cross-validation. We use the validation partitions for reporting the performance through the \(R^2\) score.

Table 2 Hyper-parameters for tuning the classifiers while estimating the model-fitness

3.3.2 Estimating model-fitness

The model-fitness score indicates the classification performance to expect from a model class when applied to a dataset. We set up the function \({\mathcal {Q}}(B;C)\) to measure the model fitness of a classifier model class C for the dataset B. We measure the model fitness by estimating the maximum achievable classification performance, measured using the \(F_1\) metric (but not limited to it), by building tuned classifiers for the given dataset. Table 2 lists the tuning parameters we use for each model class to achieve optimal performance. We tune the classifiers for each given dataset to find the maximum achievable \(F_1\) score and use the estimate as a surrogate measure for model fitness. Our hyper-parameter list is not exhaustive but only indicates the need to optimize the model performance. We do this exercise through cross-validation to avoid an overfitting scenario that would skew the model-fitness score.
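A minimal sketch of \({\mathcal {Q}}(B;C)\) for one candidate model class is shown below, assuming scikit-learn's grid search; the illustrative grid is ours and not the exact grid of Table 2.

```python
# Model fitness Q(B; C): tune a candidate classifier on subsample B via
# cross-validated grid search and report the best mean F1 score.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

def model_fitness(X, y, estimator=None, param_grid=None, cv=5):
    estimator = estimator or RandomForestClassifier(random_state=0)
    param_grid = param_grid or {"n_estimators": [100, 300],
                                "max_depth": [None, 5, 10]}
    search = GridSearchCV(estimator, param_grid, scoring="f1", cv=cv, n_jobs=-1)
    search.fit(X, y)
    return search.best_score_     # surrogate for the model fitness of this class
```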

3.3.3 Data construction

Given a set of subsamples \({\mathcal {B}}_i=\{B_{i1},B_{i2},\ldots ,B_{ib}\}\) for training and validation drawn from each dataset \(D_i\in {\mathcal {D}}\), the objective of the data construction phase is to generate a set of tuples \(\{\langle \mathbf {I_{ij},O_{ij}}\rangle \}_{j=1}^b\) from all the subsamples in \({\mathcal {B}}_i\), where \(\mathbf {I_{ij}}={\mathbb {F}}(B_{ij})\) as per Eq. (1) and \(\mathbf {O_{ij}}\) is the model-fitness score for a dataset \(B_{ij}\) for every model class in \({\mathcal {C}}\) given by:

$$\begin{aligned} \mathbf {O_{ij}}\leftarrow \left[ {\mathcal {Q}}(B_{ij};C_1), {\mathcal {Q}}(B_{ij};C_2),\ldots , {\mathcal {Q}}(B_{ij};C_m)\right] ^\textrm{T} \end{aligned}$$
(4)

At the end of the data construction phase, we get a matrix of clustering indices feature vectors \(\hat{\textbf{I}}_i=[\textbf{I}_{i1}, \textbf{I}_{i2},\ldots ,\textbf{I}_{ib}]^\textrm{T}\) generated for every subsample from the set \({\mathcal {B}}_i\) from dataset \(D_i\) with the corresponding matrix of model-fitness scores \(\hat{\textbf{O}}_i=[\textbf{O}_{i1}, \textbf{O}_{i2},\ldots ,\textbf{O}_{ib}]^\textrm{T}\) estimated for each model class in \({\mathcal {C}}\). Then, we combine the data generated for individual training datasets \(D_i\in {\mathcal {D}}\) into a jumbo dataset \(\langle \hat{\textbf{I}},\hat{\textbf{O}}\rangle \), such that:

$$\begin{aligned}&\hat{\textbf{I}}\leftarrow \left[ \hat{\textbf{I}}_1,\hat{\textbf{I}}_2, \ldots ,\hat{\textbf{I}}_N\right] ^\textrm{T} \end{aligned}$$
(5)
$$\begin{aligned}&\hat{\textbf{O}}\leftarrow \left[ \hat{\textbf{O}}_1,\hat{\textbf{O}}_2, \ldots ,\hat{\textbf{O}}_N\right] ^\textrm{T} \end{aligned}$$
(6)
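The data construction stage can be summarized with the sketch below, which stacks one row of clustering indices and one row of per-model-class fitness scores per subsample into the jumbo matrices of Eqs. (5) and (6); the callable interfaces are assumptions made for illustration.

```python
# Build the jumbo training matrices I_hat and O_hat from all subsamples.
import numpy as np

def build_training_matrices(all_subsamples, index_fn, fitness_fns):
    # all_subsamples: list of (X, y) pairs pooled across datasets
    # index_fn(X, y)  -> clustering-indices vector (Eq. 1)
    # fitness_fns     : one Q(.; C_k) callable per model class (Eq. 4)
    I_rows, O_rows = [], []
    for Xs, ys in all_subsamples:
        I_rows.append(index_fn(Xs, ys))
        O_rows.append([q(Xs, ys) for q in fitness_fns])
    return np.vstack(I_rows), np.array(O_rows)
```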

3.3.4 Mapper for model selection

In the Mapper phase, we learn a multiple regression function \({\mathcal {R}}:\hat{\textbf{I}}\rightarrow \hat{\textbf{O}}\) using the dataset \(\langle \hat{\textbf{I}},\hat{\textbf{O}}\rangle \) generated from the Data Construction stage. We tune the hyper-parameters of the multiple regression model using the evaluation set through cross-validation. Alternatively, instead of a single multiple-output regressor \({\mathcal {R}}\), we can build individual regressors \(R\in {\mathcal {R}}\) per model class \(C\in {\mathcal {C}}\) as \(R_k:\hat{\textbf{I}}\rightarrow \hat{\textbf{O}}^{(k)}\), where \(\hat{\textbf{O}}^{(k)}\) is the \(k^{th}\) column-vector of the matrix \(\hat{\textbf{O}}\). The mapper module constitutes the resulting set of regressors \({\mathcal {R}} = \{R_1, R_2,\ldots , R_m\}\), which we use to predict the expected classification performance of different model classes \(C_k\in {\mathcal {C}}\) for a given test dataset \(D'\).

During prediction, the mapper estimates the expected classification performance of a dataset from its clustering indices features. The expected classification performance is what an optimized/tuned classifier may achieve for a given dataset. We assume that the representation of the dataset samples in the clustering indices space follows the i.i.d assumption. This means that the parameter setting required for achieving higher performance for the training sample shall be similar to that of the validation sample. As the mapper learns to map the clustering indices to the tuned classifier performance during training, we expect the mapper to predict the closest estimate of optimized performance during validation.
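A sketch of the per-model-class Mapper training is given below. XGBoost is the regressor the paper eventually settles on (Sect. 5.1.1); the hyper-parameter values here are placeholders rather than the tuned settings.

```python
# Train one regressor per model class, mapping the clustering-indices matrix
# I_hat to the k-th column of the fitness matrix O_hat.
from xgboost import XGBRegressor

def train_mapper(I_hat, O_hat, model_class_names):
    mapper = {}
    for k, name in enumerate(model_class_names):
        reg = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1,
                           objective="reg:squarederror")
        reg.fit(I_hat, O_hat[:, k])
        mapper[name] = reg
    return mapper
```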

3.4 Recommendation pipeline

The automatic model selection system’s recommendation pipeline is a simple process of invoking the tuned regressor models \({\mathcal {R}}\) for predicting the expected model fitness measured in terms of expected classification performance for a given test dataset \(D'\). The test dataset undergoes the same data transformation and cleansing stages as the training pipeline for consistency.

$$\begin{aligned} \textbf{I}'\leftarrow {\mathbb {F}}(D') \end{aligned}$$
(7)

The transformed data is input to each regressor function \(R\in {\mathcal {R}}\) to predict the expected classification performance (model-fitness) score for all the model classes \(C\in {\mathcal {C}}\).

$$\begin{aligned} \textbf{O}'\leftarrow {\mathcal {R}}(\textbf{I}') \end{aligned}$$
(8)

We recommend the best model class \(C_{best}\) that scores the highest model-fitness score for the given test dataset \(D'\).

$$\begin{aligned}&i^* \leftarrow \arg \max _{\forall R_i\in {\mathcal {R}}} R_i(\textbf{I}')\equiv \arg \max _{1\le i\le m} \textbf{O}'_i \end{aligned}$$
(9)
$$\begin{aligned}&C_{best} \leftarrow C_{i^*} \end{aligned}$$
(10)

Alternatively, the prediction for the test dataset can also be run using dataset subsamples, where we generate several subsamples \({\mathcal {B}}'\) by random sampling with replacement from the input test dataset \(D'\) as \({\mathcal {B}}'=\{B'_1,B'_2,\ldots ,B'_b\}\), where \(B'_i = \{d\ \Vert \ d\sim D'\}_{k=1}^h,\forall B'_i\in {\mathcal {B}}'\).

We then transform the subsamples to the clustering index feature space.

$$\begin{aligned} \textbf{I}'_j\leftarrow \left[ {\mathcal {F}}_1(B'_j), {\mathcal {F}}_2(B'_j),\ldots ,{\mathcal {F}}_t(B'_j)\right] ^\textrm{T},\quad \forall B'_j\in {\mathcal {B}}' \end{aligned}$$
(11)

We input these vectors of cluster indices to the regressors \({\mathcal {R}}\) to make predictions of expected model-fitness scores for each model class \(C\in {\mathcal {C}}\).

$$\begin{aligned} \textbf{O}'_j\leftarrow {\mathcal {R}}(\textbf{I}'_j), 1\le j\le b \end{aligned}$$
(12)

We compute the expected model-fitness scores for all the model classes for the dataset \(D'\) by averaging the estimated fitness scores for each subsample \(B'\in {\mathcal {B}}'\).

$$\begin{aligned} \textbf{O}'\leftarrow \frac{1}{b}\sum _{j=1}^b \textbf{O}'_j \end{aligned}$$
(13)

With the availability of the estimated \(F_1\) scores, the expected classification performance per model class from Eq. (13), we use Eqs. (9) and (10) to recommend the best model class \(C_{best}\) for the test dataset \(D'\).
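The subsamples-mode recommendation of Eqs. (11)-(13) reduces to the short sketch below, assuming the regressors and index function from the earlier sketches; the top-k extension mirrors the top3 recommendation used by the service.

```python
# Average the predicted model fitness over test subsamples and recommend
# the model classes with the highest averaged scores (Eqs. 11-13, 9-10).
import numpy as np

def recommend(subsamples, index_fn, mapper, top_k=3):
    I_prime = np.vstack([index_fn(Xs, ys) for Xs, ys in subsamples])
    avg = {name: float(np.mean(reg.predict(I_prime)))   # Eq. (13)
           for name, reg in mapper.items()}
    ranked = sorted(avg, key=avg.get, reverse=True)
    return ranked[:top_k], avg
```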

Table 3 Set of internal and external clustering indices \({\mathcal {I}}\)
Table 4 List of classifiers \(C_i\in {\mathcal {C}}\) representing different model classes and the set of clustering methods \({\mathcal {A}}\) to generate clustering indices \({\mathcal {I}}\)

3.5 Configuration

The Automatic Model Selection system requires the configuration of the clustering indices set \({\mathcal {I}}\) and the list of representative model classes \({\mathcal {C}}\). Table 3 lists the set of clustering indices [44] with which we configure the system to represent the dataset characteristics. The clustering indices are greatly influenced by the clustering assumptions made on the dataset. To make the clustering assumptions comprehensive, we configure multiple clustering algorithms representing different clustering assumptions on the dataset. Table 4 lists the different clustering algorithms \(A\in {\mathcal {A}}\) that we use to generate the respective clustering indices. We concatenate the clustering indices generated by these 4 (four) clustering methods to form a broader clustering index feature space of 4t dimensions.

The transformation function set becomes \({\mathbb {F}}=\{{\mathcal {F}}_1,{\mathcal {F}}_2,\ldots , {\mathcal {F}}_{4t}\}\) to cover the clustering indices from four different families of dataset clustering assumptions. We aim to use as many clustering indices as possible to build the dataset's feature space representation. By scaling up the 40 (forty) dimensional clustering indices features from Table 3 with four different clustering algorithms from Table 4, we get a total of \(4\times 40=160\) clustering indices features to represent the dataset characteristics. The mapper modules use an appropriate subset of the features while learning the association between clustering indices and the expected classification performance of a model class.

$$\begin{aligned} {\mathcal {I}}&= \left[ {\mathcal {I}}_{A_1:kmeans}, {\mathcal {I}}_{A_2:agglomerative}, \right. \nonumber \\&\qquad \left. {\mathcal {I}}_{A_3:spectral}, {\mathcal {I}}_{A_4:hdbscan}\right] \end{aligned}$$
(14)
$$\begin{aligned} {\mathcal {I}}&= \left[ \underbrace{I_1,I_2,\ldots ,I_t}_{A_1:kmeans}, \underbrace{I_{t+1},I_{t+2},\ldots ,I_{2t}}_{A_2:agglomerative},\right. \nonumber \\&\quad \left. \underbrace{I_{2t+1},I_{2t+2},\ldots ,I_{3t}}_{A_3:spectral}, \underbrace{I_{3t+1},I_{3t+2},\ldots ,I_{4t}}_{A_4:hdbscan}\right] \end{aligned}$$
(15)
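A sketch of this four-view concatenation is shown below, assuming an index function that accepts precomputed cluster labels and returns the t indices of Table 3; HDBSCAN comes from the separate hdbscan package, and the specific constructor arguments are illustrative.

```python
# Compute the same clustering indices under four clustering algorithms and
# concatenate them into one 4t-dimensional feature vector (Eqs. 14-15).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
import hdbscan

def multi_view_indices(X, y, index_fn, n_clusters=2):
    algorithms = [KMeans(n_clusters=n_clusters, n_init=10),
                  AgglomerativeClustering(n_clusters=n_clusters),
                  SpectralClustering(n_clusters=n_clusters,
                                     assign_labels="discretize"),
                  hdbscan.HDBSCAN(min_cluster_size=5)]
    views = []
    for algo in algorithms:
        labels = algo.fit_predict(X)          # HDBSCAN may emit -1 noise labels,
        views.append(index_fn(X, y, labels))  # which index_fn must tolerate
    return np.concatenate(views)              # 4t-dimensional representation
```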

In our choice of clustering indices, we are fully aware of feature redundancy due to correlation among the indices. Despite that, we would like to leverage the perspectives a clustering index can provide when combined with different clustering methods, forming unique tuples of the form \({\mathcal {I}}\times {\mathcal {A}}\). To overcome possible shortcomings, we expect the mapper (regressor) module that learns the relationship between clustering indices and empirical model fitness to automatically choose predictors (tuple features) based on their ability to maximize the regression task performance. For instance, tree-based regressors can eliminate redundant or correlated features during tree growth [45].

Table 5 List of binary class datasets used for training and validating the proposed model selection

3.6 Classification modeling as a service

In an Automated ML setting, a user uploads a production dataset containing labeled and unlabeled parts to the service endpoint. The user then expects label predictions for the unlabeled part of the dataset without thinking about the underlying machine-learning pipeline. As an enterprise extension to the Automatic Model Selection system, we expose the underlying data classification modeling as a service (an Automated ML SaaS offering) by abstracting the model selection, tuning, and training processes on a production dataset. Here, we rank the model classes by their estimated model-fitness scores and pick the best-performing model among the top3 models (\(C_A, C_B, C_C\in {\mathcal {C}}\)). We use the given labeled portion of the production dataset \(\langle \textbf{X},\textbf{y}\rangle \) to cross-validate the top3 models with the best parameter settings. Using the best-performing model among the top3, we expose an API for predicting the class labels for the unlabeled portion of the production dataset. When the user invokes the API with an unlabeled data instance, the best-performing model among the top3 classifiers outputs the class label \(\hat{y}\) for the data instance X from the test dataset \(D'\).

We describe the detailed validation of the end-to-end Automated ML system in Sect. 5.5 by comparing it against a few of the popular commercial and noncommercial Automated ML solutions.

4 Experimental setup

In this section, we test our Clustering Indices-based Automatic classification Model Selection (CIAMS) hypothesis by cross-validating with several classification datasets collected from multiple public domain sources. We set up the experiment by choosing 6 (six) different classification model families listed in Table 4 representing various model classes.

4.1 Training phase

We use sixty (60) binary class datasets listed in Table 5 to build our Automatic Model Selection system. We divide the datasets into train and test partitions. We use the train partition to build and tune our model selection system and the test partition to report the performance results. We further divide the train partition into trainset and evalset partitions in a k-fold cross-validation setting to tune our model selection system. Each dataset from the trainset and evalset partitions undergoes subsampling independently. This ensures that we do not reuse any training data points during validation. Dataset subsampling increases the number of data points for training the mapper module (regressors) and exposes the underlying regressor models to more data variance. In total, we create 11,190 subsamples from all 60 datasets.

4.2 Evaluation phase

We evaluate the performance of our Automatic Model Selection system in two modes: (A) subsamples mode and (B) single-shot mode. The subsamples mode mimics the dataset preparation strategy of the Training Phase, where we use only the subsamples of the datasets in the test partition for running the performance evaluation. Each dataset from the test partition gives out several test data points constructed from the subsamples drawn from it.

We set up cross-validation on the 60 datasets using their respective subsamples. Traditionally, the sizes of the training and evaluation splits remain constant across all the cross-validation folds. In our setting, the split sizes vary proportionally to the population sizes of the datasets picked for the training and evaluation splits. We explain this proportionality in Sect. 4.3. We construct the folds without any dataset spilling across folds; a fold may contain one or more datasets, but a dataset is restricted to only one fold.

An ideal configuration for running the evaluation is Leave-One-(dataset)-Out; in our setting, the equivalent is a 60-fold cross-validation. We observe that 60-fold cross-validation on the subsamples of 60 datasets is time-consuming. To speed up the evaluation, we instead repeat a sixfold cross-validation exercise several times to approximate the 60-fold results. Repeating the cross-validation several times also allows us to test our method with different combinations of datasets in the training and testing folds.

In contrast, we evaluate the single-shot mode through Leave-One-Out 60-fold cross-validation. The single-shot mode uses the entire dataset from the test partition as one test record. The single-shot mode evaluation helps assess the system's usefulness in enterprise deployment settings. An advantage of the subsamples mode is the ability to work with large datasets, as the full-dataset single-shot approach might become resource-intensive for generating the clustering indices.

4.3 Hyper parameters

A few hyperparameters control the system’s behavior, which we tune by trial and error.

4.3.1 Number of clusters

The number of clusters is a critical parameter for most clustering algorithms. We set the number-of-clusters parameter to 2 for the clustering algorithms listed in Table 4. We initially perform a linear search over the number-of-clusters hyper-parameter and achieve the best results when the count is 2.

4.3.2 Subsamples

We set the subsample size as \(h=500\) after doing a linear search with different subsample sizes when the size of the dataset n is beyond 2000 points. We choose the following h-value for smaller datasets depending on the dataset size range.

$$\begin{aligned} h \leftarrow \left\{ \begin{array}{ll} 100 &{} \quad n\le 500\\ 300 &{} \quad 500<n\le 2000\\ 500 &{} \quad n>2000\\ \end{array}\right. \end{aligned}$$

The number of subsamples b is set proportional to the size of the dataset n. The average bootstrap subsample contains 63.2% of the original observations and omits 36.8% [46]; when the subsample size is h, we therefore get about 0.63h unique data points from the original dataset. We choose the number of subsamples to draw from the dataset so as to ensure maximum coverage of the dataset variance and more training data for our internal models. The hyper-parameter \(\alpha \) is the oversampling constant, which we set to 5 based on trial and error.
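The subsample-size rule above translates directly into the short helper below; the number of subsamples b is not reproduced here because the paper's exact expression involves the oversampling constant \(\alpha \) and is omitted from this sketch.

```python
# Subsample size h as a function of the dataset size n (Sect. 4.3.2).
def subsample_size(n):
    if n <= 500:
        return 100
    if n <= 2000:
        return 300
    return 500
```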

4.4 Scaling up

Our present design of the CIAMS system uses four (4) clustering models, six (6) classification model classes, and sixty (60) binary class datasets. We can train the underlying mapper module with more datasets \({\mathcal {D}}\) seamlessly to make the regressors robust against dataset variance. We can also add more model classes to \({\mathcal {C}}\) to increase the choices of classification methods for a given dataset. For comparison, the Azure AutoML platform uses about 12 classifiers from 8 model classes; Neural Networks and Bayesian methods are the two extra model classes beyond those listed in \({\mathcal {C}}\). Likewise, we can further expand the clustering indices feature space by adding other clustering methods to the set \({\mathcal {A}}\). The CIAMS system is scalable and extensible by design.

5 Evaluation

In this section, we evaluate the CIAMS system at different granularity levels in a bottom-up style to answer the following questions that may challenge our proposed scheme’s usefulness.

  1. Is the dataset model-fitness a function of the dataset clustering indices for a model class?

  2. Does the Mapper module, implemented using regression models, efficiently learn the relationship between the dataset clustering indices and the expected classification performance of a selected model class?

  3. What is the correctness of the recommendation given by the CIAMS system?

  4. What is the performance of the end-to-end Classification Modeling as a Service platform for enterprise deployments?

5.1 Mapper module evaluation

Firstly, we evaluate the performance of the mapper module constructed using Regression models. Evaluating the regressors helps us understand the validity of our model selection hypothesis, which assumes that the clustering indices of a dataset strongly correlate with the expected classification performance of a chosen model class. We evaluate our idea using the estimated \(R^2\) metric for every regressor built for every classifier model class. We also study the usefulness of the regressors through \(L_1\)-norm, or Mean Absolute Error (MAE) analysis between the predicted and actual dataset model-fitness measured using \(F_1\) score. We run a small-sample statistical significance test on the \(L_1\)-norm estimates to check if the system contains the error margin within \(\pm 10\)%.

Table 6 Performance of the tuned Regressor functions \(R_i\in {\mathcal {R}}\) for each model class \(C_i\in {\mathcal {C}}\) measured using \(R^2\) score in Single-Shot and Subsamples modes

5.1.1 Regressor performance

We designate a regressor \(R_i\) for every classifier \(C_i\) from the set of model classes \({\mathcal {C}}\). Each regressor function \(R_i:\hat{\textbf{I}}\rightarrow \hat{\textbf{O}}^{(i)}\) learns a mapping between the cluster indices \(\hat{\textbf{I}}\) and the tuned classifier \(C_i\) performance \(\hat{\textbf{O}}^{(i)}\), measured with the \(F_1\) metric, for all the training datasets \({\mathcal {D}}\). We experiment with several regression models such as SVR, Random Forest, XGBoost, k-NN, and Decision Tree. Through \(R^2\) analysis, we select XGBoost as the best regression model to learn the mapper. We train the regressor models in two configurations, namely (A) single-shot and (B) subsamples mode, as explained in Sect. 4.2. In the single-shot mode, we run Leave-One-(dataset)-Out cross-validation, and in the subsamples mode, we run sixfold cross-validation with six repeats to report the performance of the regressor models measured using the \(R^2\) score. We tune the XGBoost regressor models for the best parameters by cross-validating on the train partition, which is split further into trainset and evalset partitions per fold. We repeat this exercise for all the classifiers in \({\mathcal {C}}\) to find the XGBoost regression model parameters that maximize the \(R^2\) score. Table 6 reports the \(R^2\) scores of the tuned regressors \(R_i\in {\mathcal {R}}\) that we build for every classifier \(C_i\in {\mathcal {C}}\) in Single-Shot and Subsamples modes. We observe an average \(R^2\) score of 84% for both modes, which confirms the feasibility of learning a mapping between clustering indices and model fitness. Moreover, the Validation Performance column confirms that the regressors are near-optimally fit, as the scores align closely with the test performance numbers.

5.1.2 Prediction correctness

We validate the regressors by checking if the predicted classification performance is similar to a model class’s expected classification performance on a dataset. We measure the similarity between the predicted and the expected classification performance measured as the model-fitness score (\(F_1\)) using Mean Absolute Error (MAE) or \(L_1\)-norm.

$$\begin{aligned} MAE \leftarrow \left\| F_1^{\;expected}-F_1^{\;predicted} \right\| \end{aligned}$$
(16)

For every model class \(C_i\), we consider the prediction of the mapper module (regressor \(R_i\)) to be a PASS if the absolute difference between the predicted and the actual classification performance is within 10% margin.

$$\begin{aligned} \texttt {Correctness} \leftarrow \left\{ \begin{array}{ll} {\texttt {PASS}} &{} \quad MAE \le 10\% \\ {\texttt {FAIL}} &{} \quad MAE >10\% \end{array}\right. \end{aligned}$$
Table 7 Mapper module performance in terms of prediction correctness using MAE

Table 7 summarizes the MAE and PASS performance of the mapper module. It is apparent from the table that the MAE is contained within \(10\%\) for the majority of the datasets. In the Subsamples mode, the average MAE is slightly off for KNN and SVM because of higher variance in the predicted \(F_1\) scores. It is interesting to observe that the Mapper module achieves 10% compliance for over \(75\%\) of the datasets for the ensembling methods, which show lower variance. The large-margin model class seems to have trouble with stability in performance across different subsamples. An average of 37 datasets satisfy the \(10\%\) error margin constraint. When the average MAE for a model class is within the \(10\%\) margin, our prediction is accurate for over \(61\%\) of the datasets. Ignoring the low-performing SVM classifier, the truncated average PASS count stands at 40, which is two-thirds of the datasets. The subsamples-mode results motivate further work on achieving higher performance with better subsampling and model class tuning. In the Single-shot mode, on the other hand, the average MAE estimates are generally above the \(10\%\) margin. Although the error margin is higher than \(10\%\), the Mapper module makes accurate predictions for around \(41\%\) of the 60 datasets. It is clear from Table 7 that the Subsamples mode is more appropriate than the Single-shot mode for making predictions because the Mapper module inherently uses only subsamples for training.

We perform a statistical significance test on the MAE estimate to check whether the mapper module indeed restricts the MAE within the 10% margin. We use the following version of the t- and Z-statistics [47] to perform the significance test. Subscripts e and p denote the expected and predicted model-fitness (\(F_1\)) scores, and \(\Delta \) is the hypothesized difference between the population means \(\bar{x}_e,\bar{x}_p\) (we set \(\Delta =0.1\)).

$$\begin{aligned} Z\text { or }t\text {-statistic} \rightarrow \frac{\left\| \bar{x}_e - \bar{x}_p\right\| - \Delta }{\sqrt{\frac{\sigma ^2_e}{n_e}+\frac{\sigma ^2_p}{n_p}}} \end{aligned}$$
(17)

We run a t-test when the number of samples is \(\le 30\) and a Z-test when the sample count is \(>30\), at \(p=0.05\) significance. Table 8 summarizes the performance of the Mapper module tested against the 60 datasets, where the second column lists the number of datasets for which the MAE between the expected and predicted model-fitness (\(F_1\)) scores is within the \(10\%\) margin at the statistical significance level of \(p=0.05\). The t/Z-test confirms our regressors' ability to accurately predict the expected classification performance for at least 60% of the datasets.
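A hedged sketch of this significance check is given below; the one-sided p-value computation and the degrees-of-freedom choice for the small-sample case are simplifications of ours rather than details given in the paper.

```python
# Test whether the mean gap between expected and predicted F1 stays within
# the Delta = 0.1 margin (Eq. 17): t-test for <= 30 samples, Z-test otherwise.
import numpy as np
from scipy import stats

def margin_test(f1_expected, f1_predicted, delta=0.1, alpha=0.05):
    xe, xp = np.asarray(f1_expected), np.asarray(f1_predicted)
    n_e, n_p = len(xe), len(xp)
    statistic = (abs(xe.mean() - xp.mean()) - delta) / np.sqrt(
        xe.var(ddof=1) / n_e + xp.var(ddof=1) / n_p)
    if min(n_e, n_p) <= 30:
        p_value = 1.0 - stats.t.cdf(statistic, df=min(n_e, n_p) - 1)
    else:
        p_value = 1.0 - stats.norm.cdf(statistic)
    return statistic, p_value, p_value > alpha   # True: within the 10% margin
```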

Table 8 Performance evaluation of the Mapper module using two-sample Z-test & t-test in the Subsamples mode at \(p=0.05\) significance

An average PASS rate above 60% in Tables 7 and 8 is interesting because, in the subsamples mode, we test the ability of the mappers to generalize to ten unseen datasets per cross-validation fold. Out of 10 test-fold datasets, the mappers correctly predict the model fitness for at least six datasets on average. A recall of 61% demonstrates excellent promise in the technique and increases the motivation to improve this idea further to achieve even higher recall. It is evident from the results in Tables 6, 7, and 8 that the Mapper module is efficient in learning the relationship between the clustering indices of a dataset and the expected classification performance of a model class. The results also empirically validate our hypothesis that a binary class dataset's model fitness is indeed a function of the dataset's clustering indices for a model class.

5.2 Comparing with equivalent methods

We argue that the clustering indices features represent the data characteristics. It is necessary to compare our approach against similar methods from the literature, such as Landmarking [34]. Landmarking determines the location of a specific learning problem in the space of all learning problems by measuring the performance of some simple and efficient learning algorithms. Landmarking attempts to characterize the data properties by building simple classifier models. Similarly, classic statistical and information-theoretic features directly represent the data characteristics. We compare the ability of the Landmarking and classic statistical and information-theoretic features against clustering indices features to represent the data characteristics concerning a classification task.

Table 9 Statistical and information-theoretic and landmarking meta-features

There is no straightforward method to compare two data characteristics unless we use a downstream task to evaluate extrinsically. We describe our experiment below in steps, which compares the effectiveness of clustering indices against landmarking, statistical and information-theoretic features through the performance evaluation of an extrinsic regression task.

  • Pick a subset of datasets from our experiment and frame a fivefold cross-validation on the datasets.

  • Every dataset in the fold shall undergo subsampling individually.

  • Prepare the model-fitness scores (dependent variable) for each subsample.

  • Clustering indices.

    • Prepare the clustering indices (features) from Table 3 for each subsample.

    • Learn regressors for each model class from clustering indices to model-fitness.

  • Classic statistical features.

    • Collect a list of statistical and information-theoretic features [48, 49] as listed in Table 9.

    • Generate the statistical features for each subsample.

    • Learn regressors for each model class from statistical features to model-fitness.

  • Landmarking features.

    • Collect a list of landmarking features [50, 51] as listed in Table 9.

    • Generate the landmarking features for each subsample.

    • Learn regressors for each model class from landmarking features to model-fitness.

  • Measure the average cross-validation performance using \(R^2\) score for the testing and training folds across model classes.

Table 10 Regressor \(R^2\) performance comparison

Table 10 summarizes the performance of the extrinsic regression task through (i) clustering indices features, (ii) classic statistical features, and (iii) landmarking. It is apparent from the table that the clustering indices features are better than the other two strategies for capturing the dataset characteristics, at least with respect to an extrinsic regression task. The performance numbers show that the generalization error is higher for classic statistical features, although the training performance is reasonable. Landmarking, on the other hand, is unable to capture the dataset characteristics during training. As a result, the landmarking features generalize poorly during validation.

Table 11 Performance of the Mapper module for different configurations of clustering indices

5.3 Ablation study

Clustering indices reflect the ability of a clustering algorithm to form meaningful clusters over a dataset. A clustering algorithm may suffer inherent shortcomings due to its hypothesis and algorithmic limitations. For example, K-means cannot form non-convex clusters, but methods such as Spectral clustering overcome this shortcoming. Likewise, Hierarchical clustering is very sensitive to outliers, whereas density-based methods such as HDBSCAN are resilient to them. This observation sets the premise for extracting the clustering indices of a dataset using different clustering methods and concatenating them into an extended feature representation.

We perform our ablation study by dropping the clustering indices contributed by one or more specific clustering methods. We ablate the clustering indices to assess their relative contribution to an optimal build of the Mapper module. We also extend our ablation study to measure the contribution of the internal vs. external clustering indices listed in Table 3. The performance of the Mapper module for different configurations of clustering indices is summarized in Table 11.

In Table 11, the first column lists the groups of clustering indices that we ablate for the study. Each group is removed during the performance evaluation to observe the change in the regression \(R^2\) score, and the table reports the \(R^2\) scores for the different ablation configurations. The first group of rows shows the performance when we ablate all the internal indices or all the external indices together as one large group. The second group of rows ablates the clustering indices from one clustering method at a time to study the performance variation. The third group of rows ablates pairs of clustering methods, and the fourth uses only one clustering method to generate the clustering indices. The last row reports the performance when all the clustering indices are used. The column-wise high scores are bold-faced. Interestingly, all the scores in the last row, with nothing ablated, are indeed the high scores for each model class; every ablation, whether of internal vs. external indices or based on the choice of clustering method(s), performs worse than the non-ablated configuration. In summary, we conclude that the Mapper module performs best only when we use all the internal and external clustering indices from all four families of clustering methods.
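As an illustration of the ablation loop, the sketch below drops the indices contributed by one or more clustering methods before re-fitting the regressor. The column-prefix convention (e.g. "kmeans_", "spectral_", "hdbscan_", "agglo_") is an assumption made for the example and not the paper's actual feature naming.

from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def ablated_r2(X_df, y, drop_prefixes):
    """Cross-validated R^2 of the Mapper regressor after removing the
    clustering-indices columns whose names start with the ablated prefixes."""
    keep = [c for c in X_df.columns
            if not any(c.startswith(p) for p in drop_prefixes)]
    reg = XGBRegressor(n_estimators=200, max_depth=4)
    return cross_val_score(reg, X_df[keep].values, y, scoring="r2", cv=5).mean()

# e.g. ablate the indices of one clustering method at a time
# for prefix in ["kmeans_", "spectral_", "hdbscan_", "agglo_"]:
#     print(prefix, ablated_r2(X_df, y, [prefix]))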

Table 12 Top 10 critical features ranked using the Spearman correlation between the clustering indices and the estimated classification performance on a dataset
Table 13 Confusion matrix of the true and predicted top3 model classes in Single-shot and Subsampling modes
Table 14 Performance comparison of the top3 classifiers recommended by CIAMS

We further verify the result of the ablation study by observing the correlation between feature importance and model fitness. We extract the feature importance in terms of the gain score [59] from the tree-based regressor (XGBoost) that learns the mapping between the clustering indices features and the model fitness. We then analyze the correlation between the extracted high-gain clustering indices and the expected classification performance to estimate the magnitude and direction of their influence. Table 12 lists the critical clustering indices (features) that influence the dataset model fitness, estimated using the Spearman correlation measure. Reassuringly, the critical features are distributed across all the variations of clustering indices, as suggested by the ablation study.
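A minimal sketch of this importance-and-correlation check is shown below, assuming the XGBoost regressor was fitted on a pandas DataFrame whose column names are the clustering-index names; the variable names are illustrative.

from scipy.stats import spearmanr

def top_indices_with_spearman(reg, X_df, y, k=10):
    """Rank clustering indices by XGBoost gain and report each top-k index's
    Spearman correlation with the observed model fitness y."""
    gain = reg.get_booster().get_score(importance_type="gain")   # {feature name: gain}
    ranked = sorted(gain.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # each returned tuple: (index name, gain, Spearman rho, p-value)
    return [(feat, g, *spearmanr(X_df[feat], y)) for feat, g in ranked]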

5.4 Evaluating end-to-end CIAMS system

We successfully validate our hypothesis that classification performance is a function of the binary class dataset clustering indices for every model class by evaluating the performance of the Mapper module of CIAMS. Given a test dataset, we now study the end-to-end CIAMS system for the correctness of its model-class recommendations. We use a 20–80 train-test split to mimic the real-world production scenario where we get a limited labeled dataset to build models. This split also gives us insight into the model’s ability to generalize when limited labeled data is available. As explained in Sect. 4.2, we validate the CIAMS system in both Single-shot and Subsampling modes. Given a test dataset \(D'\), we generate the feature vector \(\textbf{I}'\) in the clustering indices feature space \({\mathcal {I}}\) using the procedure in Sect. 3.3.3 as \({\mathbb {F}}(D';{\mathcal {A}}):D'\rightarrow \textbf{I}'\). We supply the clustering indices feature vector \(\textbf{I}'\) as input to the regressors \(R_j \in {\mathcal {R}}\) to predict the classification performance \({F_1}_j^{pred}\) as \(\textbf{O}'^{(j)}\leftarrow R_j(\textbf{I}';C_j)\) for every model class \(C_j\). In the Subsampling mode, the resulting \(F_1^{pred}\) scores are of dimension \(b\times c\), where \(b\) is the number of subsamples drawn from the dataset \(D'\) and \(c\) is the number of model classes; in the Single-shot mode, the dimension is \(1\times c\). The Subsampling-mode output is collapsed using Eq. (13) into a \(1\times c\) vector.

We now have the predicted model-fitness \(F_1\) scores for the test dataset \(D'\) for each model class \(C_j\in {\mathcal {C}}\). Sorting the model-fitness \(F_1\) scores in descending order gives us the ranked recommendation of suitable model classes for the given test dataset \(D'\). We denote the top3 model classes as \(C_A^{pred}, C_B^{pred}, C_C^{pred}\), ordered from higher to lower classification performance (\(C_A \ge C_B \ge C_C\)). We compare the predicted rank order \(C_A^{pred}, C_B^{pred}, C_C^{pred}\) with the true rank order \(C_A^{true}, C_B^{true}, C_C^{true}\). To get the “true” rank order of the model classes, we build tuned classifier models for all the model classes using the same dataset \(D'\) that we feed to the Recommendation pipeline, tuning all the classification models through cross-validation. We use the evaluation scores of the model classes to rank-order the classifiers, which in turn yields \(C_A^{true}, C_B^{true}, C_C^{true}\).
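The following sketch illustrates the Subsampling-mode recommendation step under stated assumptions: regressors is a dict of per-model-class regressors trained on clustering indices, clustering_indices produces the feature vector for a subsample, and a simple mean is used as a stand-in for the aggregation of Eq. (13).

import numpy as np

def recommend_top3(subsamples, regressors, clustering_indices):
    """Predict each model class's fitness on every subsample of D' and return
    the model classes ranked by the aggregated predicted F1 score."""
    I = np.vstack([clustering_indices(s) for s in subsamples])             # b x d features
    preds = np.column_stack([R.predict(I) for R in regressors.values()])   # b x c predictions
    collapsed = preds.mean(axis=0)   # 1 x c; the paper collapses via Eq. (13)
    ranked = sorted(zip(regressors.keys(), collapsed),
                    key=lambda kv: kv[1], reverse=True)
    return ranked[:3]                # [(C_A, F1_A), (C_B, F1_B), (C_C, F1_C)]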

Table 15 A limited list of commercial and open-source Automated ML platforms
Table 16 Weighted \(F_1\) score achieved by different automated ML methods for various public domain binary class datasets (as production datasets) in the Single-shot setting

5.4.1 Validation

We validate the performance of CIAMS using the corpus of 60 datasets. Table 13 presents the confusion matrix of the predicted and actual top3 model-class recommendations by the CIAMS system in both the Single-shot and Subsampling modes. Although the ranks are not exact matches, we observe a strong overlap between the true and the predicted top3 model classes. From the table, we observe that CIAMS recalls (highlighted in YELLOW) the true top1 model class among the predicted top3 model classes with recall scores of 78% and 74% in the Single-shot and Subsampling modes, respectively. Likewise, the top1 prediction from CIAMS recalls the true top3 model classes with recall scores of 74% and 70% in the Single-shot and Subsampling modes, respectively. The ability to recall the top1 model class in three-fourths of the test datasets is remarkable.

Table 17 Student’s t-test results with a significance level of 0.02 for the comparison of CIAMS against other automated ML methods

5.4.2 Testing

We test the top3 model classes recommended by CIAMS using a separate hold-out set of public domain binary class datasets listed in the first column of Table 14. We compare the actual classification performance of the top3 model classes and tabulate the results in Table 14, reporting the fivefold cross-validation performance of the top3 classifiers on the test datasets. The top3 recommended classifiers win on 22 of 25 datasets. By running cross-validation, we pick the best of the top3 recommended classifiers for a given test dataset. Essentially, we bring the complexity down from running an exhaustive model search to choosing the best from only three candidates. We use the best classifier from the recommended list as the underlying model of the end-to-end Automated ML system to expose a SaaS API for automatic dataset classification.

As an exercise, we also ensemble the top3 classifiers into a Weighted Voting Classifier \({C}_{voting}\). We list the ensemble’s performance in the last column of Table 14. The voting classifier works best for 9 of 25 datasets; when we consider the top2 scores from the column, it performs well for 21 of 25 datasets. The voting classifier abstracts the top3 models, simplifying the Automated ML pipeline to one final prediction model. Among the nine best scores, we also observe that the voting classifier ensemble is marginally better than its constituent classifiers on three datasets, highlighted in YELLOW in Table 14. As the performance improvement is insignificant, we skip ensembling the top3 model classes.
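As a sketch of this ensembling exercise, a weighted voting classifier over three recommended model classes could be assembled with scikit-learn as below; the estimator choices and weights are illustrative assumptions, not the exact top3 returned by CIAMS for any particular dataset.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

def build_voting_classifier(top3_estimators, weights):
    """Weighted soft-voting ensemble; the weights could be the predicted
    model-fitness (F1) scores of the three recommended model classes."""
    return VotingClassifier(estimators=top3_estimators, voting="soft", weights=weights)

# Illustrative usage with hypothetical top3 classes and predicted-F1 weights:
# C_voting = build_voting_classifier(
#     [("lr", LogisticRegression(max_iter=1000)),
#      ("rf", RandomForestClassifier()),
#      ("xgb", XGBClassifier())],
#     weights=[0.84, 0.82, 0.79])
# C_voting.fit(X_train, y_train); C_voting.predict(X_test)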

Table 18 Comparison of time taken (in seconds) by CIAMS and other Automated ML methods for different binary class datasets

5.5 Validating CIAMS-based end-to-end automated ML system

Automated Machine Learning platforms provide significant cost savings to businesses, allowing them to focus on higher-value activities such as product innovation, market penetration, and client satisfaction. Automated ML platforms decrease the effort spent on time- and resource-consuming processes such as model selection, feature engineering, and hyper-parameter tuning. Major cloud players such as Microsoft and Amazon have their own Automated ML platforms. In response to the demand for accessible and affordable automated machine learning, open-source frameworks are also available that put the data to use as quickly and with as little effort as possible. Table 15 lists a limited set of commercial and open-source Automated ML platforms against which we compare the performance of the CIAMS-based Automated ML system.

Automated machine learning frameworks generally apply standardized techniques, developed over the years, for feature selection, feature transformation, and data imputation. However, the underlying methods used to automate the machine learning tasks differ. We experimentally assess these methods in an end-to-end fashion across various datasets. We perform a quantitative comparison of the performance of CIAMS, measured using the \(F_1\) score, with the other Automated ML candidate methods listed in Table 15.

Commercial and noncommercial Automated ML platforms apply different model ensembling techniques to boost performance. In our design, we take the top3 CIAMS-recommended model classes, build the classifiers using the labeled part of the Production dataset, and tune the constituent classifiers using fivefold cross-validation. We compare the evaluation (or test) set performance of the best-performing model among the top3 CIAMS recommendations against the performance of the other Automated ML methods in Table 16.

While comparing the performance of CIAMS against the other Automated ML methods, we provide the full test dataset as input to all the systems evaluated in this experiment. We observe from Table 16 that CIAMS wins against the other methods with an average rank of 1.68, followed by Auto-WEKA with an average rank of 2.4. It is interesting to observe that CIAMS secures a top3 position on all 25 test datasets; the clear win is also reflected in the number of top2 positions, 22 of 25. This observation strongly validates that the CIAMS-based end-to-end Automated ML system is on par with, if not better than, the other commercial and open-source automated machine learning methods, even without the explicit feature engineering incorporated by those methods.

We further study the statistical significance of the performance of CIAMS against the other Automated ML methods using a two-sample Student’s t-test in Table 17. It is evident from the table that CIAMS wins over the other methods on an average of roughly three-fourths (78%) of the test datasets. The closest contenders are TPOT and H2O, against which CIAMS wins on roughly two-thirds (64%) of the test datasets. CIAMS outperforms FLAML and Auto-WEKA with 92% and 100% wins, respectively. From Tables 16 and 17, we conclude that CIAMS is a strong contender as an Automated ML system for dataset classification in production settings.
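For reference, the significance check can be sketched with SciPy as below, assuming the per-dataset weighted F1 scores of CIAMS and a competing method are available as two arrays; a two-sample test is shown, matching the test named above, and the array names are placeholders.

from scipy.stats import ttest_ind

def significant_win(f1_ciams, f1_other, alpha=0.02):
    """Two-sample Student's t-test on per-dataset F1 scores at the 0.02 level."""
    stat, pvalue = ttest_ind(f1_ciams, f1_other, equal_var=True)
    return stat, pvalue, bool(pvalue < alpha)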

Table 18 shows the time taken by each of the automated machine learning methods to build models and make predictions end-to-end. Amazon SageMaker and Auto-sklearn are the slowest methods, consuming over an hour per dataset. Azure Automated ML is the next slowest, consuming over 15 minutes per dataset on average. TPOT is reasonably fast, with an average time of under a minute. FLAML and H2O AutoML are the next fastest, which makes CIAMS the fastest automated machine learning method in our limited-dataset experiment. CIAMS is the fastest method on 12 of 25 datasets and among the two fastest on 14.

6 Conclusion

CIAMS is a scalable and extensible method for automatic model selection using the clustering indices estimated for a given dataset. We build an end-to-end pipeline for recommending the best classification model class for a given production dataset based on the dataset’s characteristics as represented in the clustering indices feature space. Our experimental setup with 60 different binary class datasets confirms the validity of our hypothesis that the classification performance on a dataset is a function of the dataset’s clustering indices, with an \(R^2\) score above 80% for all the model classes included in the setup. We also observe that our Mapper module predicts the expected classification performance within a 10% error margin for an average of two-thirds of the 60 datasets in the Subsampling mode. While evaluating the rank-order prediction, we observe that our automatic model selection method recovers the true best model class within its top3 predictions for three-fourths of the 60 datasets. We also develop an end-to-end automatic machine learning system for data classification, through which a user can submit a test dataset and acquire the classification labels without worrying about the classification model selection and building processes. When we compare against popular commercial and open-source automatic machine learning platforms on another set of 25 binary class datasets, we outperform the others with an average rank of 1.68 in classification performance, even in the absence of the explicit feature engineering performed by the other platforms. Regarding running time, we show that CIAMS is significantly faster than the other methods.

The next step for CIAMS is to extend the platform to multi-class classification and regression tasks to make it a complete Automated ML suite. While we successfully and objectively establish the relationship between clustering indices and model fitness, it is also compelling to study how such relationships translate into human-understandable interpretations that enable story-telling about dataset-model fitness. We therefore envision building the next generation of the platform with explainability that provides the reasoning for why a model class is best suited for a given dataset. The codebase for this work is available on GitHub (see Footnote 3).