1 Introduction

An essential step in data science is selecting a machine learning model that maximizes the measured performance for a given task. The traditional approach trains different models, evaluates their performance on a validation set, and chooses the best model. However, this method is time-consuming and resource-intensive. Automated machine learning is an active area of research that aims to select a suitable model for a given task automatically. Researchers have tried to address the model selection problem through various approaches such as meta-learning [1,2,3], deep reinforcement learning [4], Bayesian optimization [5, 6], evolutionary algorithms [7,8,9], and budget-based evaluation [10].

Analyzing the data characteristics is essential for selecting an appropriate classification model and for feature engineering. However, if we can estimate a model class's empirical classification performance on a dataset a priori, with explainability, it becomes straightforward to pick a suitable classifier model class for the problem. This setup is advantageous when working with large datasets, as evaluating different classifier model classes for model selection is laborious and time-consuming.

Clustering methods group the data points having similar characteristics into neighborhoods or disjuncts of different sizes. Clustering indices [11] are cluster evaluation metrics used to assess the quality of the clusters induced by a clustering algorithm. Clustering indices measure the ability of a clustering algorithm to induce good-quality neighborhoods with similar data characteristics. We hypothesize that the clustering indices provide a low-dimensional vector representation of the dataset characteristics with respect to a clustering method. When we use different clustering methods to compute the clustering indices, we can generate different views of the dataset characteristics. Combining multiple views of the data characteristics in terms of clustering indices gives us a rich feature space representation of the dataset characteristics.
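As a concrete illustration of this representation, the sketch below computes a small clustering-indices vector for a dataset using a single clustering method. It is a minimal sketch, assuming scikit-learn is available; the three internal indices and one external index used here are placeholders for the much larger index set configured later (Table 3), and the function name is ours.

```python
# Minimal sketch: represent a dataset by a few clustering indices computed
# under one clustering method (k-means). The paper uses a far larger set of
# internal and external indices; these are the ones readily available in
# scikit-learn and serve only as placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score, adjusted_rand_score)

def clustering_index_view(X, y, n_clusters=2, seed=0):
    """Return a small clustering-indices feature vector for the dataset (X, y)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    internal = [silhouette_score(X, labels),
                calinski_harabasz_score(X, labels),
                davies_bouldin_score(X, labels)]
    # External index: compares the induced clusters with the class labels y.
    external = [adjusted_rand_score(y, labels)]
    return np.array(internal + external)
```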

We use the term model fitness to denote the ability of a model class to learn a classification task on a given binary class dataset. The empirical model fitness of a dataset can be measured based on the expected classification performance of a model class on a dataset. We use the \(F_1\) score as the classification performance metric in our experiments, but the idea is agnostic to any metric. A classifier’s empirical performance depends on the classifier hypothesis’s ability to model the data characteristics [12]. We hypothesize that the dataset characteristics represented by the clustering indices correlate strongly with the empirical model fitness.

In this paper, we propose CIAMS, a novel Clustering Indices-based Automatic Model Selection method from a set of model classes by estimating the empirical model fitness for a given binary class dataset from only its clustering indices representation of the data characteristics. We model the relationship between the clustering indices of a dataset and the empirical model fitness of a model class as a regression problem. We learn a regressor for each model class from a set of model classes on several datasets by randomly drawing subsamples with replacement. Constructing multiple subsamples allows us to increase the number of data points to train our regression model. Another advantage of using subsamples is to provide broader coverage of the dataset variance characteristic for regression modeling.

We train independent regressors for each model class with the best achievable classification performance for each dataset as the output and its respective estimated clustering indices as the input predictors. We tune every candidate classifier model for maximum classification performance with respect to the dataset subsample. This way, the regressors learn the mapping between clustering indices and the maximum achievable classification performance for every model class. The automatic model selection process, implemented as a prediction task, can then estimate the model fitness directly in terms of the expected classification performance. Using the estimated model fitness, we rank the candidate model classes to suggest the top model classes for a given dataset as the recommendation. We refrain from suggesting the model class hyper-parameters. We believe mapping the hyper-parameters to the dataset characteristics is a separate problem, which we mark as one of the future extensions of CIAMS. We validate our model selection regressor through cross-validation using 60 (sixty) public domain binary class datasets and observe that our model recommendation is accurate for over three-fourths of the datasets.

We extend our automatic model classification method to an end-to-end Automated Machine Learning platform to offer binary classification modeling as a service. We use the top3 model classes predicted by our model selection method to build tuned classifiers using the labeled portion of the given production dataset. We define production dataset as the input provided by an end-user containing labeled and unlabeled portions, from which we learn these classifiers, followed by predicting the labels for the unlabeled data points. The best-performing model, chosen through cross-validation, among the top3 tuned classifier models is offered as a service to predict labels for the unlabeled portion of the production dataset. We validate our platform against other commercial and noncommercial automated machine learning systems using a different set of 25 (twenty-five) public domain binary class datasets of varied sizes. The comparison experiment shows that our platform outperforms other systems with an excellent average rank of 1.68, proving its viability in building practical applications.

The main contributions of this paper are:

  • A novel hypothesis that the classification performance of a model class for a binary class dataset is a function of the dataset's clustering indices.

  • A novel method to estimate the expected classification performance of a model class for a binary class dataset without building the classification model.

  • A novel application of clustering indices for automatic model selection from a list of model classes for a given binary class dataset.

  • A novel automated machine learning platform (Automated ML) for learning and deploying a classifier model as a service.

We organize the remainder of the paper as follows. Section 2 lists the related techniques and approaches for model selection and model fitness assessment. Section 3 summarizes our approach to automatic model selection. Section 3.2 gives a detailed explanation of our proposed model selection system. Section 4 describes the entire experimental setup and parameter configuration. Section 5 validates our system and narrates the results obtained from the experimental study. Section 6 provides the concluding remarks and next steps.

2 Related work

In this section, we summarize various approaches from the literature for automatic model selection organized into different categories.

  • Random search: early research cast automated model selection as a Combined Algorithm Selection and Hyperparameter optimization (CASH) problem. Amazon's Sagemaker [13, 14] is an example of a commercial Automated ML platform that follows the CASH paradigm. H\(_2\)O AutoML [15, 16], an open-source Automated ML platform, uses fast random search and ensemble methods like stacking to achieve competitive results.

  • Bayesian optimization: Auto-Weka [5] and Auto-sklearn [6] are Automated ML extensions of the popular Weka and Scikit-learn libraries, respectively. Auto-Weka [5] uses a state-of-the-art Bayesian optimization method, random-forest-based Sequential Model-based Algorithm Configuration (SMAC), for automated model selection. Auto-sklearn [6] builds on top of the Bayesian optimization solution in Auto-Weka by including meta-learning to initialize the Bayesian optimizer and ensembling to provide high predictive performance. Microsoft Azure Automated ML [17, 18] uses Bayesian optimization and collaborative filtering for automatic model selection and tuning.

  • Evolutionary algorithms: TPOT [7] is a Python-based framework that uses the Genetic Programming algorithm to evolve and optimize tree-based machine learning pipelines. Autostacker [8] is similar to TPOT, but stacked layers represent the machine learning pipeline. AutoML-Zero [9] uses basic mathematical operations as building blocks to discover complete machine learning algorithms through evolutionary algorithms. FLAML [19] uses an Estimated Cost for Improvement (ECI)-based prioritization to find the optimal learning algorithm in low-cost environments.

  • Deep reinforcement learning: AlphaD3M [4] uses deep reinforcement learning to synthesize various components in the machine learning pipeline to obtain maximum performance measures.

  • Meta-learning: Brazdil et al. [1, 20] use a k-nearest neighbor approach on the dataset characteristics to provide a ranked list of classifiers, with ranking methods based on accuracy and time information. AutoDi [2] uses word-embedding features and dataset meta-features for automatic model selection. AutoGRD [3] represents the datasets as graphs to extract features for training the meta-learner. AutoClust [21] uses clustering indices as meta-features to automatically select suitable clustering algorithms and hyper-parameters. Sahni et al. [22] developed a meta-feature approach to automatically select a sampling method for imbalanced data classification. Santhiappan et al. [23] propose a method that uses clustering indices as meta-features to estimate the empirical binary classification complexity of a dataset.

Our method follows the meta-learning paradigm, wherein we learn the relationship between the extracted meta-features of the dataset in terms of clustering indices and the expected classification performance of a model class. The trained meta-learner predicts the classification performance of a model class for an unseen dataset without building a classifier model. The differentiation among various meta-learning methods pivots on the choice of the meta-features extracted from the dataset. The following list presents the meta-features from the literature, organized into different categories.

  • Statistical and information-theoretic [24, 25]: these measures include the number of data points in the dataset, number of classes, number of variables with a numeric and symbolic data type, average and variance of every feature, the entropy of individual features, and more. These metrics capture important meta information about the dataset.

    • Class boundary: the nature of the class margin of a dataset is an essential characteristic reflecting its classifiability. Measures such as inter-class and intra-class nearest-neighbor distance, error rate, and non-linearity of the nearest-neighbor classifier try to capture the underlying class margin properties such as shape and narrowness between classes.

    • Class imbalance: machine learning methods in their default settings are biased toward learning the majority class due to a lack of data points representing the minority class. Features such as entropy of class proportions and class-imbalance ratio strongly reflect dataset characteristics such as the classification complexity of a dataset.

    • Data sparsity [26]: sparse regions in the dataset affect the classifier's learning ability, leading to poor performance. The average number of features per dimension, the average number of PCA dimensions per point, and the ratio of the PCA dimension to the original dimension capture sparsity in the dataset.

  • Feature-based [27, 28]: the learning ability of methods is highly correlated with the features' discriminatory power. Fisher's discriminant ratio, overlap region volume, and feature efficiency are among many measures from the literature that try to capture this ability to learn.

  • Model based [29,30,31]: the hyper-parameters of a model directly affect the model performance. For instance, hyper-parameters such as the number of leaf nodes, maximum depth, and average gain-ratio difference serve as meta-features in a tree-based model. Likewise, the number of support vectors required in SVM modeling is a meta-feature.

    • Linearity [27, 32]: most classifiers perform well when the dataset is linearly separable. To capture the inherent linearity present in the data, we use model-based measures, such as the error rate of a linear SVM and the non-linearity of a linear classifier, as meta-features.

    • Landmarking: Bensusan et al. [33, 34] noted that the performance measures of simple learning algorithms (baselines), provided as meta-features, correlate strongly with the classification performance of the considered algorithm. Fürnkranz et al. [35] explored different landmarking variants, such as relative landmarking and sub-sample landmarking, and their effectiveness in several learning tasks, such as decision tree pruning.

      • Landmarking is one of the most effective methods for meta-learning-based automatic model selection [35]. Landmarking uses simple classifiers’ performance on a dataset to capture the underlying characteristics. Landmarking requires building several simple classifier models on the dataset to extract features. In comparison, our proposed model selection approach requires building dataset clusters to extract clustering indices features. We compare the performance of landmarking and clustering indices features through an extrinsic regression task in Sect. 5.2.

      • The computational cost of extracting the clustering indices as meta-features for big datasets is mitigated through subsampling, similar to Landmarkers [35]. Petrak et al. [36] establish the "Similarity of regions of Expertise" property, which says that the meta-features from several subsamples of the dataset collectively represent the characteristics of the full dataset. We also empirically validate in Sect. 3.1 that clustering indices uphold this property.

  • Graph-based [37, 38]: graph representation of a dataset can help extract useful meta-features such as mean network density, coefficient of clustering, and hub score for several meta-learning tasks.

There are several contributions in the literature to benchmark AutoML methods. Zöller et al. [39] introduce a mathematical formulation covering the complete procedure of automatic ML pipeline synthesis and compare it with existing problem formulations. Their benchmark encompasses eight Hyper-Parameter Optimization (HPO) methods and six popular AutoML frameworks on real datasets. Santu et al. [40] introduce a new classification scheme for AutoML systems, using a seven-tiered schematic to distinguish these systems based on their level of autonomy. The authors describe what an end-to-end machine learning pipeline looks like and which subtasks of the machine learning pipeline have already been automated. Their paper defines each level of the taxonomy according to the scope of automation support provided. He et al. [41] focus only on deep-learning-based AutoML and present a survey of methods for HPO and Neural Architecture Search (NAS).

In this work, we do not aim to perform an extensive benchmark, as our goal is only to propose a new method for automatic model selection, which we also extend to an Automated ML system for binary classification tasks. Therefore, we limit our benchmarking to a few popular AutoML methods to establish the validity of our approach.

Data characterization is a crucial step in understanding the nuances of a dataset. When specific dataset properties are known, they help in choosing a suitable method or algorithm to solve the task. Several meta-features discussed in the literature target specific dataset characteristics. Computing an exhaustive list of meta-features for a dataset is a laborious process that is costly in terms of time and computing power. We choose clustering indices as the meta-features to represent dataset characteristics. Clustering indices are evaluation metrics that estimate how well a clustering algorithm groups data with similar characteristics. Clustering indices are scalar values that indicate the nuances in a dataset under different clustering assumptions. Computing the clustering indices is a parallelizable process whose time complexity is proportional to the size of a dataset subsample. We get more comprehensive coverage of the data characteristics when we generate clustering indices under different clustering assumptions.

In principle, the clustering indices approach to represent the data characteristics is similar to the landmarking approach. Despite requiring more computing power for dataset clustering, our experiments in Sect. 5 empirically show that the clustering indices capture a richer dataset characteristic representation for providing better generalization.

3 Our approach to automatic model selection

Data characterization techniques extract meaningful dataset properties that the downstream machine learning tasks and applications could use to improve performance. We hypothesize that the clustering indices computed from dataset clustering represent dataset characteristics concerning a specific clustering method. Clustering algorithms make different clustering assumptions for grouping the data points into neighborhoods. Clustering indices are quality measures for validating the clusters induced by a clustering algorithm. When we use clustering indices to measure the performance of such clustering algorithms, they inherently capture different properties of the datasets. When a clustering index is independent of any external information, such as data labels, the index becomes an internal index, or quality index [11]. On the contrary, when the clustering index uses data point labels, it becomes an external index.

Table 1 Notations

Table 1 lists the notations used for representing different entities. Given a binary labeled dataset \(D = \{\langle X_{i}, y_{i}\rangle \}_{i=1}^{n}\), where the data instance vector \(X_{i}~\in ~{\mathbb {R}}^p\) and the binary class label \(y_{i} \in \{{-1},1\}\), the objective of our automatic model selection system is to determine the best model class \(C_{best}\) from a set of model classes \({\mathcal {C}} = \{C_{1}, C_{2}, \cdots , C_{m}\}\) that provides the best classification performance. We hypothesize that the clustering indices representing the characteristics of a binary class dataset D shall strongly correlate with the expected classification performance of a model class \(C_i\in {\mathcal {C}}\) for the dataset D.

Let \({\mathcal {I}}=\{I_1, I_2,\cdots , I_t\}\) be the selected clustering indices containing internal and external measures. Let \({\mathbb {F}}=\{{\mathcal {F}}_1,{\mathcal {F}}_2,\ldots , {\mathcal {F}}_t\}\) be the set of functions that map a given dataset D to a clustering index I, defined as \({\mathcal {F}}_j(D;A):D\rightarrow I_j\), where the data instance matrix \(\textbf{X}\) of \(D=\langle \textbf{X}, \textbf{y}\rangle \) transforms to a scalar cluster index value \(I_j\). Each function \({\mathcal {F}}_j(D; A)\) represents running a clustering algorithm A on the dataset D, followed by extracting the clustering index \(I_j\in {\mathcal {I}}\). The function \({\mathbb {F}}(D; A)\) represents processing the dataset D independently by all the functions \({\mathcal {F}}_j\in {\mathbb {F}}\) as a Multiple Instruction Single Data (MISD) operation.

$$\begin{aligned} {\mathbb {F}}(D;A)\equiv \left[ {\mathcal {F}}_1(D;A), {\mathcal {F}}_2(D;A),\ldots ,{\mathcal {F}}_t(D;A)\right] ^\textrm{T} \end{aligned}$$
(1)

Let the dataset transformation to the cluster indices feature space be defined as \({\mathbb {F}}(D;A):D\rightarrow \textbf{I}\), where \(\textbf{I}=\left[ I_1,I_2,\ldots ,I_t\right] ^\textrm{T}\). Let the average \(F_1\) score be the performance metric for evaluating the model fitness of a model class \(C_i\). Let R be a regression task that learns the mapping between the clustering indices and the expected classification performance of the model class \({C}_i\) defined as \(R(\textbf{I};{C}_i):\textbf{I}\rightarrow [0,1]\). We define individual regressors \(R_i\) for each model class \(C_i\) as a set \({\mathcal {R}}=\{R_1,R_2,\ldots ,R_m\}\). Now, the objective of the automatic model selection method is to find the best-performing model class \(C_{best}\in {\mathcal {C}}\) for a dataset D based on the maximum output from each of the regressors in \({\mathcal {R}}\).

$$\begin{aligned} i^*&=\arg \max _{1\le i\le m} R_i({\mathbb {F}}(D;A)) \end{aligned}$$
(2)
$$\begin{aligned} C_{best}&= C_{i^*} \end{aligned}$$
(3)
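A minimal sketch of Eqs. (2) and (3) is given below, assuming a dictionary of already-trained per-model-class regressors with a scikit-learn-style predict() method; the function and variable names are ours, not from the paper.

```python
# Pick the model class whose fitness regressor predicts the highest expected
# F1 score from the clustering-indices vector I of dataset D (Eqs. 2-3).
import numpy as np

def select_best_model_class(I, regressors):
    names = list(regressors)
    scores = np.array([regressors[c].predict(I.reshape(1, -1))[0] for c in names])
    best = names[int(np.argmax(scores))]          # C_best = C_{i*}
    return best, dict(zip(names, scores))
```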

Training a regressor \(R_i\) for a model class \(C_i\) requires several samples of the form \(\langle \textbf{I}, O\rangle \), where \(\textbf{I}={\mathbb {F}}(D;A)\) and O is the maximum classification performance score achievable (for instance, the \(F_1\) metric) for the tuned model class \(C_i\), computed by the function \({\mathcal {Q}}(D; C_i)\).

We understand that the cluster indices feature vector \(\textbf{I}\) is computed for each dataset D. Assuming we have a collection of datasets \({\mathcal {D}}=\{D_1, D_2,\ldots , D_N\}\) for training the regressor \(R_i\), if we consider each dataset as a single instance vector of clustering indices, the number of training samples gets limited by the size of the dataset collection \({\mathcal {D}}\). It becomes hard to train the regressor model due to the shortage of training samples. We train the regression functions \(R_i\in {\mathcal {R}}\) using the dataset subsamples instead of the full dataset to overcome the data shortage problem. In this process, every dataset \(D_i\) undergoes random subsampling with replacement to generate b subsamples of constant size h as \({\mathcal {B}}_i = \{B_{i1},B_{i2},\cdots , B_{ib}\}\), where \(B_{ij} = \{d \Vert d\sim D_i\}_{k=1}^{h}, \Vert B_{ij}\Vert =h\).

An advantage of using subsamples instead of the full dataset is generating more variability in the datasets used for training the regressors, making it robust to the dataset variance. Another advantage is the ease of generating clustering indices from subsamples compared to working with large datasets in a single shot. In the single shot mode, we run the clustering algorithms on the full dataset population to estimate the clustering indices. Usually, running the clustering algorithms on the dataset samples (subsamples mode) is significantly faster than running on the full population (single-shot mode).
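A sketch of the subsample generation step is shown below, assuming per-class sampling with replacement to keep the class proportions roughly intact (Sect. 3.1 notes that the sampling is stratified); the helper name and the rounding scheme are our own.

```python
# Draw b subsamples of (approximately) size h with replacement from a
# labeled dataset, stratified by class so both classes stay represented.
import numpy as np

def draw_subsamples(X, y, b, h, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    proportions = counts / counts.sum()
    subsamples = []
    for _ in range(b):
        idx = []
        for c, p in zip(classes, proportions):
            pool = np.flatnonzero(y == c)
            k = max(1, int(round(p * h)))          # per-class quota
            idx.extend(rng.choice(pool, size=k, replace=True))
        idx = np.asarray(idx)
        subsamples.append((X[idx], y[idx]))
    return subsamples
```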

3.1 Clustering indices of subsamples vs. population

We use the stratified random sampling method to create subsamples of a dataset. In this section, we analyze how the subsamples represent the characteristics of the data population. We attempt to establish that the cluster indices estimated for a whole population are similar to that of the subsamples by visualizing the cluster indices in a lower-dimensional space through t-SNE [42]-based visualization.

Figure 1 shows the lower-dimensional representation of the dataset's cluster indices for the data population and its subsamples. We observe from the figure that the subsample cluster indices lie near and around the whole-population cluster indices in most cases. The closeness of the cluster indices implies that the subsamples are collectively representative of the dataset population.

Fig. 1

Low dimensional t-SNE visualization of the clustering indices estimated for the whole dataset population and dataset subsamples. The BLACK points represent the cluster indices of the whole dataset population, the GREEN points represent the cluster indices of dataset subsamples that pass Hotelling's \(T^2\) test, and the RED points represent the subsamples that fail the test (colour figure online)

We use Hotelling's two-sample \(T^2\) test [43], a multivariate extension of the two-sample t-test, to check whether a subsample is representative of the whole dataset population in terms of the dataset's predictor variables. Figure 1 shows the subsamples that fail Hotelling's \(T^2\) test (RED dots) along with the passing subsamples (GREEN dots). The subsamples that pass Hotelling's test appear closer to the whole dataset population, shown in BLACK. A few subsamples appear relatively far from the whole population, and these are appropriately flagged as failures by Hotelling's test. This observation confirms the agreement between Hotelling's test responses and the t-SNE visualization of the clustering indices.

One possible reason for some subsamples drifting away from the whole-population cluster indices is the randomness in sampling. The deviant subsamples benefit the training phase because our dataset construction pipeline, described in Sect. 3.3.3, considers each subsample as an independent dataset. Therefore, the far-away subsamples offer additional dataset variance to our regressor \({\mathcal {R}}\) training, which in turn should help the regressors achieve better generalization over unseen datasets.

On the other hand, the deviant subsamples are problematic in the recommendation pipeline, as they might skew the estimated classification performance of the model classes in \({\mathcal {C}}\). Removing such deviant subsamples from the experiment requires the whole population cluster indices that might not be available during testing. To overcome this limitation, we propose to remove subsamples that fail the Hotelling’s \(T^{2}\) test from the recommendation pipeline described in Sect. 3.4 to limit the skew due to random subsampling.
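A hedged sketch of this filter is given below. It computes the classical two-sample Hotelling's \(T^2\) statistic with its F approximation; the pooled-covariance ridge term and the helper names are implementation choices of ours, not from the paper.

```python
# Keep only the subsamples whose mean vector is statistically indistinguishable
# from that of the full dataset (two-sample Hotelling's T^2 test).
import numpy as np
from scipy import stats

def hotelling_t2_pvalue(X1, X2, ridge=1e-6):
    n1, n2, p = len(X1), len(X2), X1.shape[1]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    S += ridge * np.eye(p)                         # numerical stabilizer
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S, d)
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return 1.0 - stats.f.cdf(f_stat, p, n1 + n2 - p - 1)

def keep_representative(subsamples, X_full, alpha=0.05):
    return [(Xs, ys) for Xs, ys in subsamples
            if hotelling_t2_pvalue(Xs, X_full) > alpha]
```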

3.2 Automatic model selection system architecture

The Automatic Model Selection system consists of two independent pipelines, one for training the underlying regression models and another pipeline for recommending the top-performing model classes for a given dataset. Figure 2 illustrates the overall architecture of the proposed automatic model selection system. The upper section is the Training pipeline, and the bottom is the model Recommendation pipeline. The Mapper module of the training pipeline contains the regressors for each model class that learn the mapping between clustering indices and model fitness. The model recommendation pipeline uses the learned regressors (Mapper modules) to predict the expected model fitness in terms of \(F_1\) score (but not limited to) for each model class. The model classes that score the top3 \(F_1\) scores become the best-fit classifier candidates for the given production dataset. We pick the best-performing classifier among the recommended top models through cross-validation to predict the labels of the unlabeled data.

Fig. 2

Architecture of the automatic model selection system

3.3 Training pipeline

The training phase has three subparts, namely (A) Preprocessing, (B) Data construction, and (C) Mappers as regression models, shown as the Training pipeline enclosed in the top dotted rectangle region in Fig. 2.

3.3.1 Preprocessing

Data preprocessing involves multiple sub-tasks to transform a single dataset \(D_i\) into a set \({\mathcal {B}}_i\) of several subsamples generated by stratified random sampling with replacement. The dataset \(D_i\) is divided into 70:30 training and validation partitions for cross-validating the regressor model training. The training and validation partitions undergo random sampling independently to generate their respective subsamples. The constructed subsamples \(\{B_{ij}\}_{j=1}^b\) from a dataset \(D_i\) undergo a cleansing process involving scaling and standardization. At the end of the preprocessing stage, we have a set \({\mathcal {B}}_i=\{B_{i1}, B_{i2},\ldots , B_{ib}\}\) of randomly sampled subsamples from both the training and validation partitions of the dataset \(D_i\). The training process uses the training subsamples for learning and tuning the regressor functions \({\mathcal {R}}\) through cross-validation. We use the validation partitions for reporting the performance through the \(R^2\) score.

Table 2 Hyper-parameters for tuning the classifiers while estimating the model-fitness

3.3.2 Estimating model-fitness

The model-fitness score indicates the classification performance to expect from a model class when applied to a dataset. We set up the function \({\mathcal {Q}}(B;C)\) to measure the model fitness of a classifier model class C for the dataset B. We measure the model fitness by estimating the maximum achievable classification performance, measured using the \(F_1\) metric (but not limited to it), by building tuned classifiers for the given dataset. Table 2 lists the tuning parameters we use for each model class to achieve optimal performance. We tune the classifiers for each given dataset to find the maximum achievable \(F_1\) score and use the estimate as a surrogate measure for model fitness. Our hyper-parameter list is not exhaustive but only indicates the need to optimize the model performance. We do this exercise through cross-validation to avoid an overfitting scenario that would skew the model-fitness score.
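A minimal sketch of \({\mathcal {Q}}(B;C)\) for one candidate model class is shown below, assuming scikit-learn's grid search; the illustrative grid is ours and not the exact grid of Table 2.

```python
# Model fitness Q(B; C): tune a candidate classifier on subsample B via
# cross-validated grid search and report the best mean F1 score.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

def model_fitness(X, y, estimator=None, param_grid=None, cv=5):
    estimator = estimator or RandomForestClassifier(random_state=0)
    param_grid = param_grid or {"n_estimators": [100, 300],
                                "max_depth": [None, 5, 10]}
    search = GridSearchCV(estimator, param_grid, scoring="f1", cv=cv, n_jobs=-1)
    search.fit(X, y)
    return search.best_score_     # surrogate for the model fitness of this class
```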

3.3.3 Data construction

Given a set of subsamples \({\mathcal {B}}_i=\{B_{i1},B_{i2},\ldots ,B_{ib}\}\) for training and validation drawn from each dataset \(D_i\in {\mathcal {D}}\), the objective of the data construction phase is to generate a set of tuples \(\{\langle \mathbf {I_{ij},O_{ij}}\rangle \}_{j=1}^b\) from all the subsamples in \({\mathcal {B}}_i\), where \(\mathbf {I_{ij}}={\mathbb {F}}(B_{ij})\) as per Eq. (1) and \(\mathbf {O_{ij}}\) is the model-fitness score for a dataset \(B_{ij}\) for every model class in \({\mathcal {C}}\) given by:

$$\begin{aligned} \mathbf {O_{ij}}\leftarrow \left[ {\mathcal {Q}}(B_{ij};C_1), {\mathcal {Q}}(B_{ij};C_2),\ldots , {\mathcal {Q}}(B_{ij};C_m)\right] ^\textrm{T} \end{aligned}$$
(4)

At the end of the data construction phase, we get a matrix of clustering indices feature vectors \(\hat{\textbf{I}}_i=[\textbf{I}_{i1}, \textbf{I}_{i2},\ldots ,\textbf{I}_{ib}]^\textrm{T}\) generated for every subsample from the set \({\mathcal {B}}_i\) from dataset \(D_i\) with the corresponding matrix of model-fitness scores \(\hat{\textbf{O}}_i=[\textbf{O}_{i1}, \textbf{O}_{i2},\ldots ,\textbf{O}_{ib}]^\textrm{T}\) estimated for each model class in \({\mathcal {C}}\). Then, we combine the data generated for individual training datasets \(D_i\in {\mathcal {D}}\) into a jumbo dataset \(\langle \hat{\textbf{I}},\hat{\textbf{O}}\rangle \), such that:

$$\begin{aligned}&\hat{\textbf{I}}\leftarrow \left[ \hat{\textbf{I}}_1,\hat{\textbf{I}}_2, \ldots ,\hat{\textbf{I}}_N\right] ^\textrm{T} \end{aligned}$$
(5)
$$\begin{aligned}&\hat{\textbf{O}}\leftarrow \left[ \hat{\textbf{O}}_1,\hat{\textbf{O}}_2, \ldots ,\hat{\textbf{O}}_N\right] ^\textrm{T} \end{aligned}$$
(6)
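The data construction stage can be summarized with the sketch below, which stacks one row of clustering indices and one row of per-model-class fitness scores per subsample into the jumbo matrices of Eqs. (5) and (6); the callable interfaces are assumptions made for illustration.

```python
# Build the jumbo training matrices I_hat and O_hat from all subsamples.
import numpy as np

def build_training_matrices(all_subsamples, index_fn, fitness_fns):
    # all_subsamples: list of (X, y) pairs pooled across datasets
    # index_fn(X, y)  -> clustering-indices vector (Eq. 1)
    # fitness_fns     : one Q(.; C_k) callable per model class (Eq. 4)
    I_rows, O_rows = [], []
    for Xs, ys in all_subsamples:
        I_rows.append(index_fn(Xs, ys))
        O_rows.append([q(Xs, ys) for q in fitness_fns])
    return np.vstack(I_rows), np.array(O_rows)
```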

3.3.4 Mapper for model selection

In the Mapper phase, we learn a multiple regression function \({\mathcal {R}}:\hat{\textbf{I}}\rightarrow \hat{\textbf{O}}\) using the dataset \(\langle \hat{\textbf{I}},\hat{\textbf{O}}\rangle \) generated from the Data Construction stage. We tune the hyper-parameters of the multiple regression model using the evaluation set through cross-validation. Alternatively, instead of a single multiple-output regressor \({\mathcal {R}}\), we can build individual regressors \(R\in {\mathcal {R}}\) per model class \(C\in {\mathcal {C}}\) as \(R_k:\hat{\textbf{I}}\rightarrow \hat{\textbf{O}}^{(k)}\), where \(\hat{\textbf{O}}^{(k)}\) is the \(k^{th}\) column-vector of the matrix \(\hat{\textbf{O}}\). The mapper module constitutes the resulting set of regressors \({\mathcal {R}} = \{R_1, R_2,\ldots , R_m\}\), which we use to predict the expected classification performance of different model classes \(C_k\in {\mathcal {C}}\) for a given test dataset \(D'\).

During prediction, the mapper estimates the expected classification performance of a dataset from its clustering indices features. The expected classification performance is what an optimized/tuned classifier may achieve for a given dataset. We assume that the representation of the dataset samples in the clustering indices space follows the i.i.d assumption. This means that the parameter setting required for achieving higher performance for the training sample shall be similar to that of the validation sample. As the mapper learns to map the clustering indices to the tuned classifier performance during training, we expect the mapper to predict the closest estimate of optimized performance during validation.
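A sketch of the per-model-class Mapper training is given below. XGBoost is the regressor the paper eventually settles on (Sect. 5.1.1); the hyper-parameter values here are placeholders rather than the tuned settings.

```python
# Train one regressor per model class, mapping the clustering-indices matrix
# I_hat to the k-th column of the fitness matrix O_hat.
from xgboost import XGBRegressor

def train_mapper(I_hat, O_hat, model_class_names):
    mapper = {}
    for k, name in enumerate(model_class_names):
        reg = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1,
                           objective="reg:squarederror")
        reg.fit(I_hat, O_hat[:, k])
        mapper[name] = reg
    return mapper
```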

3.4 Recommendation pipeline

The automatic model selection system’s recommendation pipeline is a simple process of invoking the tuned regressor models \({\mathcal {R}}\) for predicting the expected model fitness measured in terms of expected classification performance for a given test dataset \(D'\). The test dataset undergoes the same data transformation and cleansing stages as the training pipeline for consistency.

$$\begin{aligned} \textbf{I}'\leftarrow {\mathbb {F}}(D') \end{aligned}$$
(7)

The transformed data is input to each regressor function \(R\in {\mathcal {R}}\) to predict the expected classification performance (model-fitness) score for all the model classes \(C\in {\mathcal {C}}\).

$$\begin{aligned} \textbf{O}'\leftarrow {\mathcal {R}}(\textbf{I}') \end{aligned}$$
(8)

We recommend the best model class \(C_{best}\) that scores the highest model-fitness score for the given test dataset \(D'\).

$$\begin{aligned}&i^* \leftarrow \arg \max _{\forall R_i\in {\mathcal {R}}} R_i(\textbf{I}')\equiv \arg \max _{1\le i\le m} \textbf{O}'_i \end{aligned}$$
(9)
$$\begin{aligned}&C_{best} \leftarrow C_{i^*} \end{aligned}$$
(10)

Alternatively, the prediction for the test dataset can also be run using dataset subsamples, where we generate several subsamples \({\mathcal {B}}'\) by random sampling with replacement from the input test dataset \(D'\) as \({\mathcal {B}}'=\{B'_1,B'_2,\ldots ,B'_b\}\), where \(B'_i = \{d\ \Vert \ d\sim D'\}_{k=1}^h,\forall B'_i\in {\mathcal {B}}'\).

We then transform the subsamples to the clustering index feature space.

$$\begin{aligned} \textbf{I}'_j\leftarrow \left[ {\mathcal {F}}_1(B'_j), {\mathcal {F}}_2(B'_j),\ldots ,{\mathcal {F}}_t(B'_j)\right] ^\textrm{T},\quad \forall B'_j\in {\mathcal {B}}' \end{aligned}$$
(11)

We input these vectors of cluster indices to the regressors \({\mathcal {R}}\) to make predictions of expected model-fitness scores for each model class \(C\in {\mathcal {C}}\).

$$\begin{aligned} \textbf{O}'_j\leftarrow {\mathcal {R}}(\textbf{I}'_j), 1\le j\le b \end{aligned}$$
(12)

We compute the expected model-fitness scores for all the model classes for the dataset \(D'\) by averaging the estimated fitness scores for each subsample \(B'\in {\mathcal {B}}'\).

$$\begin{aligned} \textbf{O}'\leftarrow \frac{1}{b}\sum _{j=1}^b \textbf{O}'_j \end{aligned}$$
(13)

With the availability of the estimated \(F_1\) scores, the expected classification performance per model class from Eq. (13), we use Eqs. (9) and (10) to recommend the best model class \(C_{best}\) for the test dataset \(D'\).
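The subsamples-mode recommendation of Eqs. (11)-(13) reduces to the short sketch below, assuming the regressors and index function from the earlier sketches; the top-k extension mirrors the top3 recommendation used by the service.

```python
# Average the predicted model fitness over test subsamples and recommend
# the model classes with the highest averaged scores (Eqs. 11-13, 9-10).
import numpy as np

def recommend(subsamples, index_fn, mapper, top_k=3):
    I_prime = np.vstack([index_fn(Xs, ys) for Xs, ys in subsamples])
    avg = {name: float(np.mean(reg.predict(I_prime)))   # Eq. (13)
           for name, reg in mapper.items()}
    ranked = sorted(avg, key=avg.get, reverse=True)
    return ranked[:top_k], avg
```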

Table 3 Set of internal and external clustering indices \({\mathcal {I}}\)
Table 4 List of classifiers \(C_i\in {\mathcal {C}}\) representing different model classes and the set of clustering methods \({\mathcal {A}}\) to generate clustering indices \({\mathcal {I}}\)

3.5 Configuration

The Automatic Model Selection system requires the configuration of the clustering indices set \({\mathcal {I}}\) and the list of representative model classes \({\mathcal {C}}\). Table 3 lists the set of clustering indices [44] with which we configure the system to represent the dataset characteristics. The clustering indices are greatly influenced by the clustering assumptions made on the dataset. To make the clustering assumptions comprehensive, we configure multiple clustering algorithms representing different clustering assumptions on the dataset. Table 4 lists the different clustering algorithms \(A\in {\mathcal {A}}\) that we use to generate the respective clustering indices. We concatenate the clustering indices generated by these 4 (four) clustering methods to form a broader clustering index feature space of 4t dimensions.

The transformation function set becomes \({\mathbb {F}}=\{{\mathcal {F}}_1,{\mathcal {F}}_2,\ldots , {\mathcal {F}}_{4t}\}\) to cover the clustering indices from four different families of dataset clustering assumptions. We aim to use as many clustering indices as possible to build the dataset's feature space representation. By scaling up the 40 (forty) dimensional clustering indices features from Table 3 with four different clustering algorithms from Table 4, we get a total of \(4\times 40=160\) clustering indices features to represent the dataset characteristics. The mapper modules use an appropriate subset of the features while learning the association between clustering indices and the expected classification performance of a model class.

$$\begin{aligned} {\mathcal {I}}&= \left[ {\mathcal {I}}_{A_1:kmeans}, {\mathcal {I}}_{A_2:agglomerative}, \right. \nonumber \\&\qquad \left. {\mathcal {I}}_{A_3:spectral}, {\mathcal {I}}_{A_4:hdbscan}\right] \end{aligned}$$
(14)
$$\begin{aligned} {\mathcal {I}}&= \left[ \underbrace{I_1,I_2,\ldots ,I_t}_{A_1:kmeans}, \underbrace{I_{t+1},I_{t+2},\ldots ,I_{2t}}_{A_2:agglomerative},\right. \nonumber \\&\quad \left. \underbrace{I_{2t+1},I_{2t+2},\ldots ,I_{3t}}_{A_3:spectral}, \underbrace{I_{3t+1},I_{3t+2},\ldots ,I_{4t}}_{A_4:hdbscan}\right] \end{aligned}$$
(15)
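A sketch of this four-view concatenation is shown below, assuming an index function that accepts precomputed cluster labels and returns the t indices of Table 3; HDBSCAN comes from the separate hdbscan package, and the specific constructor arguments are illustrative.

```python
# Compute the same clustering indices under four clustering algorithms and
# concatenate them into one 4t-dimensional feature vector (Eqs. 14-15).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
import hdbscan

def multi_view_indices(X, y, index_fn, n_clusters=2):
    algorithms = [KMeans(n_clusters=n_clusters, n_init=10),
                  AgglomerativeClustering(n_clusters=n_clusters),
                  SpectralClustering(n_clusters=n_clusters,
                                     assign_labels="discretize"),
                  hdbscan.HDBSCAN(min_cluster_size=5)]
    views = []
    for algo in algorithms:
        labels = algo.fit_predict(X)          # HDBSCAN may emit -1 noise labels,
        views.append(index_fn(X, y, labels))  # which index_fn must tolerate
    return np.concatenate(views)              # 4t-dimensional representation
```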

In our choice of clustering indices, we are fully aware of feature redundancy due to correlation among the indices. Despite that, we would like to leverage the perspectives a clustering index can provide when combined with different clustering methods, forming unique tuples of the form \({\mathcal {I}}\times {\mathcal {A}}\). To overcome possible shortcomings, we expect the mapper (regressor) module that learns the relationship between clustering indices and empirical model fitness to automatically choose predictors (tuple features) based on their ability to maximize the regression task performance. For instance, tree-based regressors can eliminate redundant or correlated features during tree growth [45].

Table 5 List of binary class datasets used for training and validating the proposed model selection

3.6 Classification modeling as a service

In an Automated ML setting, a user uploads a production dataset containing labeled and unlabeled parts to the service endpoint. The user then expects label predictions for the unlabeled part of the dataset without thinking about the underlying machine-learning pipeline. As an enterprise extension to the Automatic Model Selection system, we expose the underlying data classification modeling as a service (an Automated ML SaaS offering) by abstracting the model selection, tuning, and training processes on a production dataset. Here, we rank the model classes by their estimated model-fitness scores and pick the best-performing model among the top3 models (\(C_A, C_B, C_C\in {\mathcal {C}}\)). We use the given labeled portion of the production dataset \(\langle \textbf{X},\textbf{y}\rangle \) to cross-validate the top3 models with the best parameter settings. Using the best-performing model among the top3, we expose an API for predicting the class labels for the unlabeled portion of the production dataset. When the user invokes the API with an unlabeled data instance, the best-performing model among the top3 classifiers outputs the class label \(\hat{y}\) for the data instance X from the test dataset \(D'\).

We describe the detailed validation of the end-to-end Automated ML system in Sect. 5.5 by comparing it against a few of the popular commercial and noncommercial Automated ML solutions.

4 Experimental setup

In this section, we test our Clustering Indices-based Automatic classification Model Selection (CIAMS) hypothesis by cross-validating with several classification datasets collected from multiple public domain sources. We set up the experiment by choosing 6 (six) different classification model families listed in Table 4 representing various model classes.

4.1 Training phase

We use sixty (60) binary class datasets listed in Table 5 to build our Automatic Model Selection system. We divide the datasets into train and test partitions. We use the train partition to build and tune our model selection system and the test partition to report the performance results. We further divide the train partition into trainset and evalset partitions in a k-fold cross-validation setting to tune our model selection system. Each dataset from the trainset and evalset partitions undergoes subsampling independently. This ensures that we do not reuse any training data points during validation. Dataset subsampling increases the number of data points for training the mapper module (regressors) and exposes the underlying regressor models to more data variance. In total, we create 11,190 subsamples from all 60 datasets.

4.2 Evaluation phase

We evaluate the performance of our Automatic Model Selection system in two modes: (A) subsamples mode and (B) single-shot mode. The subsamples mode mimics the dataset preparation strategy of the Training Phase, where we use only the subsamples of the datasets in the test partition for running the performance evaluation. Each dataset from the test partition gives out several test data points constructed from the subsamples drawn from it.

We set up cross-validation on the 60 datasets using their respective subsamples. Traditionally, the sizes of the training and evaluation splits remain constant across all the cross-validation folds. In our setting, the split sizes vary proportionally to the population sizes of the datasets picked for the training and evaluation splits. We explain this proportionality in Sect. 4.3. We construct the folds without any dataset spilling across folds; a fold may contain one or more datasets, but a dataset is restricted to only one fold.

An ideal configuration for running the evaluation is Leave-One-(dataset)-Out; in our setting, the equivalent is a 60-fold cross-validation. We observe that 60-fold cross-validation on the subsamples of 60 datasets is time-consuming. To speed up the evaluation, we instead repeat a sixfold cross-validation exercise several times to approximate the 60-fold results. Repeating the cross-validation several times also allows us to test our method with different combinations of datasets in the training and testing folds.

In contrast, we evaluate the single-shot mode through Leave-One-Out 60-fold cross-validation. The single-shot mode uses the entire dataset from the test partition as one test record. The single-shot mode evaluation helps assess the system's usefulness in enterprise deployment settings. An advantage of the subsamples mode is the ability to work with large datasets, as the full-dataset single-shot approach might become resource-intensive for generating the clustering indices.

4.3 Hyper parameters

A few hyperparameters control the system’s behavior, which we tune by trial and error.

4.3.1 Number of clusters

The number of clusters is a critical parameter for most clustering algorithms. We set the number-of-clusters parameter to 2 for the clustering algorithms listed in Table 4. We initially perform a linear search over the number-of-clusters hyper-parameter and achieve the best results when the count is 2.

4.3.2 Subsamples

We set the subsample size as \(h=500\) after doing a linear search with different subsample sizes when the size of the dataset n is beyond 2000 points. We choose the following h-value for smaller datasets depending on the dataset size range.

$$\begin{aligned} h \leftarrow \left\{ \begin{array}{ll} 100 &{} \quad n\le 500\\ 300 &{} \quad 500<n\le 2000\\ 500 &{} \quad n>2000\\ \end{array}\right. \end{aligned}$$

The number of subsamples b is set proportional to the size of the dataset n. The average bootstrap subsample contains 63.2% of the original observations and omits 36.8% [46]; when the subsample size is h, we therefore get about 0.63h unique data points from the original dataset. We choose the number of subsamples to draw from the dataset so as to ensure maximum coverage of the dataset variance and more training data for our internal models. The hyper-parameter \(\alpha \) is the oversampling constant, which we set to 5 based on trial and error.
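The subsample-size rule above translates directly into the short helper below; the number of subsamples b is not reproduced here because the paper's exact expression involves the oversampling constant \(\alpha \) and is omitted from this sketch.

```python
# Subsample size h as a function of the dataset size n (Sect. 4.3.2).
def subsample_size(n):
    if n <= 500:
        return 100
    if n <= 2000:
        return 300
    return 500
```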

4.4 Scaling up

Our present design of the CIAMS system uses four (4) clustering models, six (6) classification model classes, and sixty (60) binary class datasets. We can train the underlying mapper module with more datasets \({\mathcal {D}}\) seamlessly to make the regressors robust against dataset variance. We can also add more model classes to \({\mathcal {C}}\) to increase the choices of classification methods for a given dataset. For comparison, the Azure AutoML platform uses about 12 classifiers from 8 model classes; Neural Networks and Bayesian methods are the two extra model classes beyond those listed in \({\mathcal {C}}\). Likewise, we can further expand the clustering indices feature space by adding other clustering methods to the set \({\mathcal {A}}\). The CIAMS system is scalable and extensible by design.

5 Evaluation

In this section, we evaluate the CIAMS system at different granularity levels in a bottom-up style to answer the following questions that may challenge our proposed scheme’s usefulness.

  1. Is the dataset model-fitness a function of the dataset clustering indices for a model class?

  2. Does the Mapper module, implemented using regression models, efficiently learn the relationship between the dataset clustering indices and the expected classification performance of a selected model class?

  3. What is the correctness of the recommendation given by the CIAMS system?

  4. What is the performance of the end-to-end Classification Modeling as a Service platform for enterprise deployments?

5.1 Mapper module evaluation

Firstly, we evaluate the performance of the mapper module constructed using Regression models. Evaluating the regressors helps us understand the validity of our model selection hypothesis, which assumes that the clustering indices of a dataset strongly correlate with the expected classification performance of a chosen model class. We evaluate our idea using the estimated \(R^2\) metric for every regressor built for every classifier model class. We also study the usefulness of the regressors through \(L_1\)-norm, or Mean Absolute Error (MAE) analysis between the predicted and actual dataset model-fitness measured using \(F_1\) score. We run a small-sample statistical significance test on the \(L_1\)-norm estimates to check if the system contains the error margin within \(\pm 10\)%.

Table 6 Performance of the tuned Regressor functions \(R_i\in {\mathcal {R}}\) for each model class \(C_i\in {\mathcal {C}}\) measured using \(R^2\) score in Single-Shot and Subsamples modes

5.1.1 Regressor performance

We designate a regressor \(R_i\) for every classifier \(C_i\) from the set of model classes \({\mathcal {C}}\). Each regressor function \(R_i:\hat{\textbf{I}}\rightarrow \hat{\textbf{O}}^{(i)}\) learns a mapping between the cluster indices \(\hat{\textbf{I}}\) and the tuned classifier \(C_i\) performance \(\hat{\textbf{O}}^{(i)}\), measured with the \(F_1\) metric, for all the training datasets \({\mathcal {D}}\). We experiment with several regression models such as SVR, Random Forest, XGBoost, k-NN, and Decision Tree. Through \(R^2\) analysis, we select XGBoost as the best regression model to learn the mapper. We train the regressor models in two configurations, namely (A) single-shot and (B) subsamples mode, as explained in Sect. 4.2. In the single-shot mode, we run Leave-One-(dataset)-Out cross-validation, and in the subsamples mode, we run sixfold cross-validation with six repeats to report the performance of the regressor models measured using the \(R^2\) score. We tune the XGBoost regressor models for the best parameters by cross-validating on the train partition, which is split further into trainset and evalset partitions per fold. We repeat this exercise for all the classifiers in \({\mathcal {C}}\) to find the XGBoost regression model parameters that maximize the \(R^2\) score. Table 6 reports the \(R^2\) scores of the tuned regressors \(R_i\in {\mathcal {R}}\) that we build for every classifier \(C_i\in {\mathcal {C}}\) in Single-Shot and Subsamples modes. We observe an average \(R^2\) score of 84% for both modes, which confirms the feasibility of learning a mapping between clustering indices and model fitness. Moreover, the Validation Performance column confirms that the regressors are near-optimally fit, as the scores align closely with the test performance numbers.

5.1.2 Prediction correctness

We validate the regressors by checking if the predicted classification performance is similar to a model class’s expected classification performance on a dataset. We measure the similarity between the predicted and the expected classification performance measured as the model-fitness score (\(F_1\)) using Mean Absolute Error (MAE) or \(L_1\)-norm.

$$\begin{aligned} MAE \leftarrow \left\| F_1^{\;expected}-F_1^{\;predicted} \right\| \end{aligned}$$
(16)

For every model class \(C_i\), we consider the prediction of the mapper module (regressor \(R_i\)) to be a PASS if the absolute difference between the predicted and the actual classification performance is within 10% margin.

$$\begin{aligned} \texttt {Correctness} \leftarrow \left\{ \begin{array}{ll} {\texttt {PASS}} &{} \quad MAE \le 10\% \\ {\texttt {FAIL}} &{} \quad MAE >10\% \end{array}\right. \end{aligned}$$
Table 7 Mapper module performance in terms of prediction correctness using MAE

Table 7 summarizes the MAE and PASS performance of the mapper module. It is apparent from the table that the MAE is contained within \(10\%\) for the majority of the datasets. In the Subsamples mode, the average MAE is slightly off for KNN and SVM because of higher variance in the predicted \(F_1\) scores. It is interesting to observe that the Mapper module achieves 10% compliance for over \(75\%\) of the datasets for the ensembling methods, which show lower variance. The large-margin model class seems to have trouble with stability in performance across different subsamples. An average of 37 datasets satisfy the \(10\%\) error margin constraint. When the average MAE for a model class is within the \(10\%\) margin, our prediction is accurate for over \(61\%\) of the datasets. Ignoring the low-performing SVM classifier, the truncated average PASS count stands at 40, which is two-thirds of the datasets. The subsamples-mode results motivate further work on achieving higher performance with better subsampling and model class tuning. In the Single-shot mode, on the other hand, the average MAE estimates are generally above the \(10\%\) margin. Although the error margin is higher than \(10\%\), the Mapper module makes accurate predictions for around \(41\%\) of the 60 datasets. It is clear from Table 7 that the Subsamples mode is more appropriate than the Single-shot mode for making predictions because the Mapper module inherently uses only subsamples for training.

We perform a statistical significance test on the MAE estimate to check whether the mapper module indeed restricts the MAE within the 10% margin. We use the following version of the t- and Z-statistics [47] to perform the significance test. Subscripts e and p denote the expected and predicted model-fitness (\(F_1\)) scores, and \(\Delta \) is the hypothesized difference between the population means \(\bar{x}_e,\bar{x}_p\) (we set \(\Delta =0.1\)).

$$\begin{aligned} Z\text { or }t\text {-statistic} \rightarrow \frac{\left\| \bar{x}_e - \bar{x}_p\right\| - \Delta }{\sqrt{\frac{\sigma ^2_e}{n_e}+\frac{\sigma ^2_p}{n_p}}} \end{aligned}$$
(17)

We run a t-test when the number of samples is \(\le 30\) and a Z-test when the sample count is \(>30\), at \(p=0.05\) significance. Table 8 summarizes the performance of the Mapper module tested against the 60 datasets, where the second column lists the number of datasets for which the MAE between the expected and predicted model-fitness (\(F_1\)) scores is within the \(10\%\) margin at the statistical significance level of \(p=0.05\). The t/Z-test confirms our regressors' ability to accurately predict the expected classification performance for at least 60% of the datasets.
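A hedged sketch of this significance check is given below; the one-sided p-value computation and the degrees-of-freedom choice for the small-sample case are simplifications of ours rather than details given in the paper.

```python
# Test whether the mean gap between expected and predicted F1 stays within
# the Delta = 0.1 margin (Eq. 17): t-test for <= 30 samples, Z-test otherwise.
import numpy as np
from scipy import stats

def margin_test(f1_expected, f1_predicted, delta=0.1, alpha=0.05):
    xe, xp = np.asarray(f1_expected), np.asarray(f1_predicted)
    n_e, n_p = len(xe), len(xp)
    statistic = (abs(xe.mean() - xp.mean()) - delta) / np.sqrt(
        xe.var(ddof=1) / n_e + xp.var(ddof=1) / n_p)
    if min(n_e, n_p) <= 30:
        p_value = 1.0 - stats.t.cdf(statistic, df=min(n_e, n_p) - 1)
    else:
        p_value = 1.0 - stats.norm.cdf(statistic)
    return statistic, p_value, p_value > alpha   # True: within the 10% margin
```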

Table 8 Performance evaluation of the Mapper module using two-sample Z-test & t-test in the Subsamples mode at \(p=0.05\) significance

An average PASS rate above 60% in Tables 7 and 8 is interesting because, in the subsamples mode, we test the ability of the mappers to generalize to ten unseen datasets per cross-validation fold. Out of 10 test-fold datasets, the mappers correctly predict the model fitness for at least six datasets on average. A recall of 61% demonstrates excellent promise in the technique and increases the motivation to improve this idea further to achieve even higher recall. It is evident from the results in Tables 6, 7, and 8 that the Mapper module is efficient in learning the relationship between the clustering indices of a dataset and the expected classification performance of a model class. The results also empirically validate our hypothesis that a binary class dataset's model fitness is indeed a function of the dataset's clustering indices for a model class.

5.2 Comparing with equivalent methods

We argue that the clustering indices features represent the data characteristics. It is necessary to compare our approach against similar methods from the literature, such as Landmarking [34]. Landmarking determines the location of a specific learning problem in the space of all learning problems by measuring the performance of some simple and efficient learning algorithms. Landmarking attempts to characterize the data properties by building simple classifier models. Similarly, classic statistical and information-theoretic features directly represent the data characteristics. We compare the ability of the Landmarking and classic statistical and information-theoretic features against clustering indices features to represent the data characteristics concerning a classification task.

Table 9 Statistical and information-theoretic and landmarking meta-features

There is no straightforward method to compare two data characteristics unless we use a downstream task to evaluate extrinsically. We describe our experiment below in steps, which compares the effectiveness of clustering indices against landmarking, statistical and information-theoretic features through the performance evaluation of an extrinsic regression task.

  • Pick a subset of datasets from our experiment and frame a fivefold cross-validation on the datasets.

  • Every dataset in the fold shall undergo subsampling individually.

  • Prepare the model-fitness scores (dependent variable) for each subsample.

  • Clustering indices.

    • Prepare the clustering indices (features) from Table 3 for each subsample.

    • Learn regressors for each model class from clustering indices to model-fitness.

  • Classic statistical features.

    • Collect a list of statistical and information-theoretic features [48, 49] as listed in Table 9.

    • Generate the statistical features for each subsample.

    • Learn regressors for each model class from statistical features to model-fitness.

  • Landmarking features.

    • Collect a list of landmarking features [50, 51] as listed in Table 9.

    • Generate the landmarking features for each subsample.

    • Learn regressors for each model class from landmarking features to model-fitness.

  • Measure the average cross-validation performance using \(R^2\) score for the testing and training folds across model classes.

Table 10 Regressor \(R^2\) performance comparison

Table 10 summarizes the performance of the extrinsic regression task through (i) clustering indices features, (ii) classic statistical features, and (iii) landmarking. It is apparent from the table that the clustering indices features are better than the other two strategies for capturing the dataset characteristics, at least with respect to an extrinsic regression task. The performance numbers show that the generalization error is higher for classic statistical features, although the training performance is reasonable. Landmarking, on the other hand, is unable to capture the dataset characteristics during training. As a result, the landmarking features generalize poorly during validation.

Table 11 Performance of the Mapper module for different configurations of clustering indices

5.3 Ablation study

Clustering indices reflect the ability of a clustering algorithm to form meaningful clusters over a dataset. A clustering algorithm may suffer inherent shortcomings due to its hypothesis and algorithmic limitations. For example, K-means cannot form non-convex clusters, but methods such as Spectral clustering overcome this shortcoming. Likewise, Hierarchical clustering is very sensitive to outliers, whereas density-based methods such as HDBSCAN are resilient to them. This observation sets the premise for extracting the clustering indices of a dataset using different clustering methods and concatenating them into an extended feature representation.

We perform our ablation study by dropping the clustering indices contributed by one or more specific clustering methods. We ablate the clustering indices to assess their relative contribution to an optimal build of the Mapper module. We also extend our ablation study to measure the contribution of the internal vs. external clustering indices listed in Table 3. The performance of the Mapper module for different configurations of clustering indices is summarized in Table 11.

In Table 11, the first column lists the groups of clustering indices that we ablate for the study. Each group is removed during the performance evaluation to observe the change in the regression \(R^2\) score, and the table reports the \(R^2\) scores for the different ablation configurations. The first group of rows shows the performance when we ablate all the internal indices or all the external indices together as one large group. The second group of rows ablates the clustering indices from one clustering method at a time to study the performance variation. The third group of rows ablates pairs of clustering methods, and the fourth uses only one clustering method to generate the clustering indices. The last row reports the performance when all the clustering indices are used. The column-wise high scores are bold-faced. Interestingly, all the scores in the last row, with nothing ablated, are indeed the high scores for each model class; every ablation, whether of internal vs. external indices or based on the choice of clustering method(s), performs worse than the non-ablated configuration. In summary, we conclude that the Mapper module performs best only when we use all the internal and external clustering indices from all four families of clustering methods.
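As an illustration of the ablation loop, the sketch below drops the indices contributed by one or more clustering methods before re-fitting the regressor. The column-prefix convention (e.g. "kmeans_", "spectral_", "hdbscan_", "agglo_") is an assumption made for the example and not the paper's actual feature naming.

from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def ablated_r2(X_df, y, drop_prefixes):
    """Cross-validated R^2 of the Mapper regressor after removing the
    clustering-indices columns whose names start with the ablated prefixes."""
    keep = [c for c in X_df.columns
            if not any(c.startswith(p) for p in drop_prefixes)]
    reg = XGBRegressor(n_estimators=200, max_depth=4)
    return cross_val_score(reg, X_df[keep].values, y, scoring="r2", cv=5).mean()

# e.g. ablate the indices of one clustering method at a time
# for prefix in ["kmeans_", "spectral_", "hdbscan_", "agglo_"]:
#     print(prefix, ablated_r2(X_df, y, [prefix]))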

Table 12 Top 10 critical features ranked using the Spearman correlation between the clustering indices and the estimated classification performance on a dataset
Table 13 Confusion matrix of the true and predicted top3 model classes in Single-shot and Subsampling modes
Table 14 Performance comparison of the top3 classifiers recommended by CIAMS

We further verify the result of the ablation study by observing the correlation between feature importance and model fitness. We extract the feature importance in terms of the gain score [59] from the tree-based regressor (XGBoost) that learns the mapping between the clustering indices features and the model fitness. We then analyze the correlation between the extracted high-gain clustering indices and the expected classification performance to estimate the magnitude and direction of their influence. Table 12 lists the critical clustering indices (features) that influence the dataset model fitness, estimated using the Spearman correlation measure. Reassuringly, the critical features are distributed across all the variations of clustering indices, as suggested by the ablation study.
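A minimal sketch of this importance-and-correlation check is shown below, assuming the XGBoost regressor was fitted on a pandas DataFrame whose column names are the clustering-index names; the variable names are illustrative.

from scipy.stats import spearmanr

def top_indices_with_spearman(reg, X_df, y, k=10):
    """Rank clustering indices by XGBoost gain and report each top-k index's
    Spearman correlation with the observed model fitness y."""
    gain = reg.get_booster().get_score(importance_type="gain")   # {feature name: gain}
    ranked = sorted(gain.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # each returned tuple: (index name, gain, Spearman rho, p-value)
    return [(feat, g, *spearmanr(X_df[feat], y)) for feat, g in ranked]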

5.4 Evaluating end-to-end CIAMS system

We successfully validate our hypothesis that classification performance is a function of the binary class dataset clustering indices for every model class by evaluating the performance of the Mapper module of CIAMS. Given a test dataset, we now study the end-to-end CIAMS system for the correctness of its model-class recommendations. We use a 20–80 train-test split to mimic the real-world production scenario where we get a limited labeled dataset to build models. This split also gives us insight into the model’s ability to generalize when limited labeled data is available. As explained in Sect. 4.2, we validate the CIAMS system in both Single-shot and Subsampling modes. Given a test dataset \(D'\), we generate the feature vector \(\textbf{I}'\) in the clustering indices feature space \({\mathcal {I}}\) using the procedure in Sect. 3.3.3 as \({\mathbb {F}}(D';{\mathcal {A}}):D'\rightarrow \textbf{I}'\). We supply the clustering indices feature vector \(\textbf{I}'\) as input to the regressors \(R_j \in {\mathcal {R}}\) to predict the classification performance \({F_1}_j^{pred}\) as \(\textbf{O}'^{(j)}\leftarrow R_j(\textbf{I}';C_j)\) for every model class \(C_j\). In the Subsampling mode, the resulting \(F_1^{pred}\) scores are of dimension \(b\times c\), where \(b\) is the number of subsamples drawn from the dataset \(D'\) and \(c\) is the number of model classes; in the Single-shot mode, the dimension is \(1\times c\). The Subsampling-mode output is collapsed using Eq. (13) into a \(1\times c\) vector.

We now have the predicted model-fitness \(F_1\) scores for the test dataset \(D'\) for each model class \(C_j\in {\mathcal {C}}\). Sorting the model-fitness \(F_1\) scores in descending order gives us the ranked recommendation of suitable model classes for the given test dataset \(D'\). We denote the top3 model classes as \(C_A^{pred}, C_B^{pred}, C_C^{pred}\), ordered from higher to lower classification performance (\(C_A \ge C_B \ge C_C\)). We compare the predicted rank order \(C_A^{pred}, C_B^{pred}, C_C^{pred}\) with the true rank order \(C_A^{true}, C_B^{true}, C_C^{true}\). To get the “true” rank order of the model classes, we build tuned classifier models for all the model classes using the same dataset \(D'\) that we feed to the Recommendation pipeline, tuning all the classification models through cross-validation. We use the evaluation scores of the model classes to rank-order the classifiers, which in turn yields \(C_A^{true}, C_B^{true}, C_C^{true}\).
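The following sketch illustrates the Subsampling-mode recommendation step under stated assumptions: regressors is a dict of per-model-class regressors trained on clustering indices, clustering_indices produces the feature vector for a subsample, and a simple mean is used as a stand-in for the aggregation of Eq. (13).

import numpy as np

def recommend_top3(subsamples, regressors, clustering_indices):
    """Predict each model class's fitness on every subsample of D' and return
    the model classes ranked by the aggregated predicted F1 score."""
    I = np.vstack([clustering_indices(s) for s in subsamples])             # b x d features
    preds = np.column_stack([R.predict(I) for R in regressors.values()])   # b x c predictions
    collapsed = preds.mean(axis=0)   # 1 x c; the paper collapses via Eq. (13)
    ranked = sorted(zip(regressors.keys(), collapsed),
                    key=lambda kv: kv[1], reverse=True)
    return ranked[:3]                # [(C_A, F1_A), (C_B, F1_B), (C_C, F1_C)]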

Table 15 A limited list of commercial and open-source Automated ML platforms
Table 16 Weighted \(F_1\) score achieved by different automated ML methods for various public domain binary class datasets (as production datasets) in the Single-shot setting

5.4.1 Validation

We validate the performance of CIAMS using the corpus of 60 datasets. Table 13 presents the confusion matrix of the predicted and actual top3 model-class recommendations by the CIAMS system in both the Single-shot and Subsampling modes. Although the ranks are not exact matches, we observe a strong overlap between the true and the predicted top3 model classes. From the table, we observe that CIAMS recalls (highlighted in YELLOW) the true top1 model class among the predicted top3 model classes with recall scores of 78% and 74% in the Single-shot and Subsampling modes, respectively. Likewise, the top1 prediction from CIAMS recalls the true top3 model classes with recall scores of 74% and 70% in the Single-shot and Subsampling modes, respectively. The ability to recall the top1 model class in three-fourths of the test datasets is remarkable.

Table 17 Student’s t-test results with a significance level of 0.02 for the comparison of CIAMS against other automated ML methods

5.4.2 Testing

We test the top3 model classes recommended by CIAMS using a separate hold-out set of public domain binary class datasets listed in the first column of Table 14. We compare the actual classification performance of the top3 model classes and tabulate the results in Table 14, reporting the fivefold cross-validation performance of the top3 classifiers on the test datasets. The top3 recommended classifiers win on 22 of 25 datasets. By running cross-validation, we pick the best of the top3 recommended classifiers for a given test dataset. Essentially, we bring the complexity down from running an exhaustive model search to choosing the best from only three candidates. We use the best classifier from the recommended list as the underlying model of the end-to-end Automated ML system to expose a SaaS API for automatic dataset classification.

As an exercise, we also ensemble the top3 classifiers into a Weighted Voting Classifier \({C}_{voting}\). We list the ensemble’s performance in the last column of Table 14. The voting classifier works best for 9 of 25 datasets; when we consider the top2 scores from the column, it performs well for 21 of 25 datasets. The voting classifier abstracts the top3 models, simplifying the Automated ML pipeline to one final prediction model. Among the nine best scores, we also observe that the voting classifier ensemble is marginally better than its constituent classifiers on three datasets, highlighted in YELLOW in Table 14. As the performance improvement is insignificant, we skip ensembling the top3 model classes.
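As a sketch of this ensembling exercise, a weighted voting classifier over three recommended model classes could be assembled with scikit-learn as below; the estimator choices and weights are illustrative assumptions, not the exact top3 returned by CIAMS for any particular dataset.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

def build_voting_classifier(top3_estimators, weights):
    """Weighted soft-voting ensemble; the weights could be the predicted
    model-fitness (F1) scores of the three recommended model classes."""
    return VotingClassifier(estimators=top3_estimators, voting="soft", weights=weights)

# Illustrative usage with hypothetical top3 classes and predicted-F1 weights:
# C_voting = build_voting_classifier(
#     [("lr", LogisticRegression(max_iter=1000)),
#      ("rf", RandomForestClassifier()),
#      ("xgb", XGBClassifier())],
#     weights=[0.84, 0.82, 0.79])
# C_voting.fit(X_train, y_train); C_voting.predict(X_test)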

Table 18 Comparison of time taken (in seconds) by CIAMS and other Automated ML methods for different binary class datasets

5.5 Validating CIAMS-based end-to-end automated ML system

Automated Machine Learning platforms provide significant cost savings to businesses, allowing them to focus on higher-value activities such as product innovation, market penetration, and client satisfaction. Automated ML platforms decrease the effort spent on time- and resource-consuming processes such as model selection, feature engineering, and hyper-parameter tuning. Major cloud players such as Microsoft and Amazon have their own Automated ML platforms. In response to the demand for accessible and affordable automated machine learning, open-source frameworks are also available that put the data to use as quickly and with as little effort as possible. Table 15 lists a limited set of commercial and open-source Automated ML platforms against which we compare the performance of the CIAMS-based Automated ML system.

Automated machine learning frameworks generally apply standardized techniques, developed over the years, for feature selection, feature transformation, and data imputation. However, the underlying methods used to automate the machine learning tasks differ. We experimentally assess these methods in an end-to-end fashion across various datasets. We perform a quantitative comparison of the performance of CIAMS, measured using the \(F_1\) score, with the other Automated ML candidate methods listed in Table 15.

Commercial and noncommercial Automated ML platforms apply different model ensembling techniques to boost performance. In our design, we take the top3 CIAMS-recommended model classes, build the classifiers using the labeled part of the Production dataset, and tune the constituent classifiers using fivefold cross-validation. We compare the evaluation (or test) set performance of the best-performing model among the top3 CIAMS recommendations against the performance of the other Automated ML methods in Table 16.

While comparing the performance of CIAMS against the other Automated ML methods, we provide the full test dataset as input to all the systems evaluated in this experiment. We observe from Table 16 that CIAMS wins against the other methods with an average rank of 1.68, followed by Auto-WEKA with an average rank of 2.4. It is interesting to observe that CIAMS secures a top3 position on all 25 test datasets; the clear win is also reflected in the number of top2 positions, 22 of 25. This observation strongly validates that the CIAMS-based end-to-end Automated ML system is on par with, if not better than, the other commercial and open-source automated machine learning methods, even without the explicit feature engineering incorporated by those methods.

We further study the statistical significance of the performance of CIAMS against the other Automated ML methods using a two-sample Student’s t-test in Table 17. It is evident from the table that CIAMS wins over the other methods on an average of roughly three-fourths (78%) of the test datasets. The closest contenders are TPOT and H2O, against which CIAMS wins on roughly two-thirds (64%) of the test datasets. CIAMS outperforms FLAML and Auto-WEKA with 92% and 100% wins, respectively. From Tables 16 and 17, we conclude that CIAMS is a strong contender as an Automated ML system for dataset classification in production settings.
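For reference, the significance check can be sketched with SciPy as below, assuming the per-dataset weighted F1 scores of CIAMS and a competing method are available as two arrays; a two-sample test is shown, matching the test named above, and the array names are placeholders.

from scipy.stats import ttest_ind

def significant_win(f1_ciams, f1_other, alpha=0.02):
    """Two-sample Student's t-test on per-dataset F1 scores at the 0.02 level."""
    stat, pvalue = ttest_ind(f1_ciams, f1_other, equal_var=True)
    return stat, pvalue, bool(pvalue < alpha)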

Table 18 shows the time taken by each of the automated machine learning methods to build models and make predictions end-to-end. Amazon SageMaker and Auto-sklearn are the slowest methods, consuming over an hour per dataset. Azure Automated ML is the next slowest, consuming over 15 minutes per dataset on average. TPOT is reasonably fast, with an average time of under a minute. FLAML and H2O AutoML are the next fastest, which makes CIAMS the fastest automated machine learning method in our limited-dataset experiment. CIAMS is the fastest method on 12 of 25 datasets and among the two fastest on 14.

6 Conclusion

CIAMS is a scalable and extensible method for automatic model selection using the clustering indices estimated for a given dataset. We build an end-to-end pipeline for recommending the best classification model class for a given production dataset based on the dataset’s characteristics as represented in the clustering indices feature space. Our experimental setup with 60 different binary class datasets confirms the validity of our hypothesis that the classification performance on a dataset is a function of the dataset’s clustering indices, with an \(R^2\) score above 80% for all the model classes included in the setup. We also observe that our Mapper module predicts the expected classification performance within a 10% error margin for an average of two-thirds of the 60 datasets in the Subsampling mode. While evaluating the rank-order prediction, we observe that our automatic model selection method recovers the true best model class within its top3 predictions for three-fourths of the 60 datasets. We also develop an end-to-end automatic machine learning system for data classification, through which a user can submit a test dataset and acquire the classification labels without worrying about the classification model selection and building processes. When we compare against popular commercial and open-source automatic machine learning platforms on another set of 25 binary class datasets, we outperform the others with an average rank of 1.68 in classification performance, even in the absence of the explicit feature engineering performed by the other platforms. Regarding running time, we show that CIAMS is significantly faster than the other methods.

The next step for CIAMS is to extend the platform to multi-class classification and regression tasks to make it a complete Automated ML suite. While we successfully and objectively establish the relationship between clustering indices and model fitness, it is also compelling to study how such relationships translate into human-understandable interpretations that enable story-telling about dataset-model fitness. We therefore envision building the next generation of the platform with explainability that provides the reasoning for why a model class is best suited for a given dataset. The codebase for this work is available on GitHub (see Footnote 3).