
1 Introduction

A classical pattern recognition system maps input data (e.g., an image) to an output label referred to as a class (e.g., “face” or “bicycle” or “stem-cell”). This mapping is accomplished using a classifier such as a neural network or a support vector machine, with each classifier having a number of tunable parameters. Determining the best classifier and the associated optimized parameters for a given pattern recognition task is often relegated to a trial-and-error approach that can be both laborious and sub-optimal. Further, a classification model (i.e., a classifier along with its parameters) that is optimal on one dataset may not work well on other datasets.

To address this issue, there is a need for techniques that can automatically determine the best classifier for a given dataset. Over the years, a suite of such automated machine learning (AutoML) tools has been created and shown to surpass the state-of-the-art performance of previously hand-selected models. One such tool, Auto Tuned Models (ATM), not only surpassed the performance of human-tuned models on 30% of 420 datasets in OpenML, but also completed this work in 1/100th of the time [21].

These tools can produce many different optimized classifier models based on different classifier types and parameter combinations. In this work, we explore the possibility of combining multiple classifier models generated by an AutoML tool in order to further improve classification accuracy. The fusion scheme is tested on three medical imaging datasets corresponding to stem cell detection, brain tumor classification, and prostate cancer classification.

Medical image classification is often challenging due to the cost of imaging machines, the expertise required to annotate images, and patient privacy laws. These challenges often limit data availability, which in turn restricts the ability to train deep neural networks that typically require large annotated datasets. Adaptations of deep neural networks such as transfer learning [4] and self-supervised learning [11] have been proposed to address these challenges. However, these approaches rely on appropriately selecting an existing architecture or creating custom image transformations. Thus, classification techniques that do not require copious amounts of annotated training data can be pertinent in such cases (Fig. 1).

Fig. 1. Overview of the proposed approach, beginning with curating the dataset, producing candidate classifier models, and choosing a subset of models for score fusion.

2 Background

We define a classifier as the type of classification algorithm, such as K-Nearest Neighbors (KNN), support vector machine (SVM), or multi-layer perceptron (MLP). Once a classifier's parameters have been set, either by human intuition or via an AutoML tool, the resulting instance is referred to as a classifier model (see Table 1).

2.1 Pattern Recognition with AutoML

Given labeled data, a supervised classifier learns a model that can accurately map the input data to the correct output class label. The performance of a model depends on many factors such as the amount of training data available, the balance of data across classes, and the distribution of the data itself. A classifier model that performs well on a given dataset may not achieve the same performance on a different dataset for the same task.

Software packages used to generate these models have started to offer automated methods for classifier selection and parameter tuning. These methods are referred to as automated machine learning (AutoML) tools, and provide a way to intelligently find classifier models optimized for specific datasets [12]. Popular AutoML tools include Auto-WEKA [14], Auto-Sklearn [10], and TPOT (Tree-based Pipeline Optimization Tool for Automating Data Science) [19]. In this work, we utilize a recently developed AutoML tool known as Auto Tuned Models (ATM) [21]. We chose ATM for its ability to tune models in parallel and to archive all the generated models.

ATM is built on Python's SciKit-Learn and supports the same classifiers as SciKit-Learn. An example of the SVM classifier, its parameters, and the resulting tuned model is given in Table 1.

Table 1. Example of a classifier, parameters of the classifier, and values of these parameters. A classifier model denotes a classifier with fixed parameter values.
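To make the distinction concrete, the following is a minimal sketch in SciKit-Learn (on which ATM is built); the parameter values are illustrative rather than those reported in Table 1.

```python
# A minimal sketch of the classifier vs. classifier-model distinction in
# SciKit-Learn. The parameter values below are illustrative only.
from sklearn.svm import SVC

# Classifier: the algorithm type, with tunable parameters (C, kernel, gamma, ...).
classifier_type = SVC

# Classifier model: the classifier with fixed parameter values, ready to train.
model = SVC(C=10.0, kernel="rbf", gamma=0.01, probability=True)
```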

The above approaches focus on classical pattern recognition. Newer AutoML tools such as Auto-DeepLab [17] and Auto-Keras [13] focus on tuning deep learning models. However, we do not consider the latter set of tools, as our goal is to demonstrate how simple classification models, trained on small datasets, can be judiciously selected and combined to improve classification performance in the context of medical image classification.

2.2 Fusion

An AutoML tool typically outputs multiple classifier models, including the one with the highest accuracy on the training set. While one model may have the highest accuracy, is it possible to leverage multiple models to achieve higher performance? Combining multiple sources of information to improve performance is generally referred to as fusion, and there are multiple levels at which fusion can be performed. This includes fusion at the raw data level [8], feature level, score level [20], rank level [20], and decision level [16].

This work considers fusion of multiple classifier models generated by ATM at the score level. Since this technique relies only on the score reported, it is possible to use this approach without knowledge of the features used to produce the score. In this work, the score generally denotes the confidence of each classifier model when it renders an output class label based on the input feature vector.

There are many approaches to fusion, such as the simple sum rule where scores are averaged [20]. The fusion in this paper is performed with an SVM, where the scores from multiple models are input to an SVM which then maps this score vector to an output class or label. The question we raise is the following: how do we determine the models whose scores have to be fused?
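As an illustration, the following sketch fuses the scores of three hypothetical models with an SVM; the score values, labels, and variable names are ours, purely for demonstration.

```python
import numpy as np
from sklearn.svm import SVC

# Each row stacks the confidence scores that three candidate models
# assign to one training sample; the values here are illustrative.
train_scores = np.array([[0.9, 0.7, 0.6],
                         [0.2, 0.4, 0.1],
                         [0.8, 0.9, 0.7],
                         [0.3, 0.1, 0.2],
                         [0.7, 0.6, 0.9]])
train_labels = np.array([1, 0, 1, 0, 1])

# The fusion SVM maps a vector of per-model scores to an output class label.
fuser = SVC(probability=True).fit(train_scores, train_labels)
print(fuser.predict([[0.85, 0.80, 0.75]]))  # fused prediction for a new sample
```

In practice, the fusion SVM is trained on score vectors collected from the training set and then applied to the score vectors of test samples.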

3 Datasets

Three different medical imaging datasets are utilized in this work: stem cell detection, brain tumor classification, and prostate cancer grading, all based on magnetic resonance imaging (MRI) scans. In our analysis, each MRI scan consists of a stack of 16-bit TIFF images. These images are generated from DICOM (Digital Imaging and Communications in Medicine), a standard file format produced by medical imaging machines such as MRI scanners.

3.1 MSU Stem Cell Dataset

Afridi et al. created a dataset of 6 in vivo MRI scans of rat brains that were hand-labeled for stem cells by radiologists [1, 2]. This dataset was generated from two MRI machines of different field strengths, and is intended to evaluate the generalizability and robustness of potential stem cell detection algorithms. Each scan is composed of 16-bit TIFF images (Fig. 2, left), which radiologists manually reviewed to identify and label stem cells. Once a stem cell is identified, a 9 \(\times \) 9 pixel patch is extracted around it (Fig. 2, right).

Fig. 2. Visualization of a labeled MRI scan with stem cells marked in green (left). Patches extracted around identified stem cells form the stem-cell patches (right). (Color figure online)

Using handcrafted features and Bayesian classifiers, Afridi et al. [1] obtained an AUC of 89.1%. They further improved performance using a CNN-based approach that incorporated the time radiologists spent identifying stem cells (referred to as labeling latency), achieving an accuracy of 94.6% [2]. For our analysis, we divide the labeled data into 80% training and 20% testing, as summarized in Table 2 (Fig. 3).

Fig. 3. Non-stem-cell patch extraction: after all stem cells have been identified (left), super-pixels are computed (center) and randomly selected such that there is no overlap with stem-cell patches (right).

Table 2. Number of samples for each class in the training, testing, and entire stem-cell dataset
Fig. 4. Example of brain scans (top) and labeled tumors (bottom). (Color figure online)

Table 3. Brain Tumor classes and number of samples for each class in the dataset

3.2 Brain Tumor Classification

Cheng et al. [6] generated a dataset composed of brain scans of 233 patients with identified tumors. Figure 4 provides examples of four slices (top) and the annotated tumors (bottom). In addition to the manually segmented tumor masks (highlighted in yellow on the bottom row), we display the bounding box surrounding each tumor in red. Three types of tumors are identified: Glioma, Meningioma, and Pituitary. For our analysis, we divide the labeled data into 80% training and 20% testing; Table 3 gives the breakdown of the 3,064 MRI slices.

The authors of this dataset explored classification of tumors through region augmentation by including the segmented tumor and the regions surrounding it [7]. They obtained a classification accuracy of 82.3%. Afshar et al. further improved classification performance to 86.56% by designing a Capsule Network [3]. In our work, however, we only use the segmented tumor without surrounding pixels for classification.

3.3 Prostate Cancer Classification

The American Association of Physicists in Medicine, the International Society for Optics and Photonics, and the National Cancer Institute collaborated to create the PROSTATEx Challenges for classifying the aggressiveness of prostate cancer [5]. The PROSTATEx challenge dataset provides information for 182 subjects, including multiple MRI scans with multiple weightings and a cancer diagnosis based on Gleason score grouping. A low Gleason score indicates small, well-formed, and uniform glands, i.e., mostly normal cells; a high score indicates irregular sizes, glands, or masses (abnormal cells). These scores are categorized into the 5 groups defined in Table 4. Additionally, a set of KTrans images (a measure of capillary permeability) is provided for each patient. This data is already divided into training and test sets. The best result of this competition was an AUC of 95% (https://prostatex.grand-challenge.org/evaluation/results/) (Fig. 5).

Table 4. Grade grouping of the Gleason Scores and number of samples for each group in the dataset
Fig. 5. Examples of prostate scans from the PROSTATEx Challenge dataset.

4 Proposed Approach

The contribution of this work is the principled use of fusion to combine the multiple classifier models produced by an AutoML tool such as ATM. Questions we explore include which data representations to use, how many models to fuse, and how to select models for fusion. The proposed approach implements the following steps:

1. Extract Data: For each of the datasets described in Sect. 3, image patches are extracted from the MRI scans and given a class label.

2. Form Feature-Sets: While the pixels in a patch can be directly used as raw input to ATM, we also consider alternate inputs where a patch is subjected to different feature extraction techniques via texture descriptors. These alternate representations, along with the original patch representation, are referred to as Feature-Sets.

3. Generate Models with ATM: ATM [21] generates multiple classifier models from the training set of extracted patches and the corresponding feature-sets.

4. Collect Model Scores: Test patches are given as input to the models to obtain a score (described in Sect. 4.4).

5. Perform Fusion: The scores of multiple classifier models are fused using an SVM classifier.

For the rest of this section, let \(m_{ijk}\) represent the \(i^{th}\) model pertaining to the \(j^{th}\) classifier tuned on the \(k^{th}\) feature-set (\(i\in Z\), \(j \in J\), \(k \in K\)), and let FM represent the subset of models to be used in fusion. Here, K represents the feature-sets (raw, scaled, HOG, LBP, LBP-Uniform, LBP-Rotation Invariant), and J represents the collection of classifiers (Logistic Regression, Support Vector Machine, Linear Classifier with Stochastic Gradient Descent, Decision Tree, Extra Trees, Random Forest, Gaussian Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, Gaussian Process, Passive Aggressive, K Nearest Neighbors, Multi-Layer Perceptron).

4.1 Data Extraction

For the stem cell dataset, we use the same method proposed by Afridi et al. [2] to extract image patches. We center patches around regions of interest (stem cell and non-stem cell), as shown in Fig. 2. The resulting patches are 9 \(\times \) 9 pixels.

For the brain tumor dataset, we first find the smallest width and smallest height over all bounding boxes in the dataset; both worked out to be 14 pixels. We then locate the center pixel of the tumor's bounding box and extract a 14 \(\times \) 14 pixel patch around it. This ensures that the patches contain only tumor information and nothing of the surrounding brain tissue.

For the prostate dataset, locations of lesions were provided in the same manner as the stem cell data: an expert-identified point of interest is labeled within an MRI scan. We extract a 15 \(\times \) 15 pixel patch of the slice around the point of interest.
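As a sketch, the patch extraction common to all three datasets can be written as follows; boundary handling is omitted, and the function and variable names are our own.

```python
import numpy as np

def extract_patch(mri_slice: np.ndarray, row: int, col: int, size: int) -> np.ndarray:
    """Cut a size x size patch centered on (row, col); a sketch that
    omits boundary handling at the image edges."""
    half = size // 2
    return mri_slice[row - half : row - half + size,
                     col - half : col - half + size]

# 9x9 for stem cells, 14x14 for brain tumors, 15x15 for prostate lesions, e.g.:
# patch = extract_patch(slice_2d, poi_row, poi_col, 15)
```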

4.2 Data Descriptors Through Feature-Sets

In addition to the raw pixel values within the patches, scaled pixel intensities and texture descriptors are also considered. These methods produce the following feature-sets: Raw pixel intensities (in the range [0, 65,535]), Scaled Intensities (in the range of [0, 255]), Histogram of Oriented Gradients (HOG) [9], DAISY feature descriptors [22], and Local Binary Pattern (Classical, Rotation-Invariant, Uniform) [18].
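These descriptors are available in the scikit-image library; the following is a minimal sketch of computing several of the feature-sets for a single patch (DAISY is omitted, and parameter choices such as HOG cell size and LBP radius are illustrative assumptions, not necessarily those used in our experiments).

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern

def build_feature_sets(patch: np.ndarray) -> dict:
    """Compute alternate representations of one 16-bit patch; returns a
    dict mapping feature-set name to a 1-D feature vector."""
    raw = patch.astype(np.float64)       # intensities in [0, 65535]
    scaled = raw / 65535.0 * 255.0       # intensities in [0, 255]
    return {
        "raw": raw.ravel(),
        "scaled": scaled.ravel(),
        # Small cells so HOG fits patches as small as 9x9 pixels.
        "hog": hog(scaled, orientations=9,
                   pixels_per_cell=(3, 3), cells_per_block=(1, 1)),
        # The three LBP variants: classical, rotation-invariant, uniform.
        "lbp": local_binary_pattern(scaled, 8, 1, method="default").ravel(),
        "lbp_ror": local_binary_pattern(scaled, 8, 1, method="ror").ravel(),
        "lbp_uniform": local_binary_pattern(scaled, 8, 1, method="uniform").ravel(),
    }
```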

4.3 Model Generation

Once all the feature-sets are generated, the labeled training data for each feature-set in K is run through ATM. We set ATM's model budget, B, to 600 models for each feature-set in K. With \(|K| = 6\), this means ATM produces up to 3,600 tuned models. Since the number of tuned models is very large, we would like a principled way to form a subset of candidate models, FM, that reliably improves performance while efficiently selecting only the models necessary to achieve good results. This selection process is described in Sect. 4.5.
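A sketch of this step is given below, assuming each feature-set has been exported to its own CSV file; the file naming and the budget keyword are our assumptions about ATM's run configuration, not the exact setup used in our experiments.

```python
# Sketch: one ATM run per feature-set. The `budget` keyword (ATM's per-run
# model budget) and the file layout are assumptions for illustration.
from atm import ATM

FEATURE_SETS = ["raw", "scaled", "hog", "lbp", "lbp_uniform", "lbp_ror"]

atm = ATM()
results = {}
for name in FEATURE_SETS:
    # Up to B = 600 tuned models per feature-set, 3,600 in total.
    results[name] = atm.run(train_path=f"features/{name}.csv", budget=600)
```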

4.4 Model Scores

Given an input feature-set, each tuned classifier model predicts an output class label. Additionally, each model produces a score, which in many cases corresponds to a probability value. For example, a multi-layer perceptron (MLP) model classifies the input using a softmax score, the probability that the input belongs to class y. If scores are not readily available from a tuned model, other representations of prediction strength are used to estimate a score. To obtain scores for the KNN classifier, for example, we average the Euclidean distances to the K nearest neighbors; the smaller the distance, the stronger the evidence that the point is correctly classified. For the Extra Trees and Random Forest classifiers, the score denotes the percentage of trees that vote for the predicted label, and for the single decision tree, the score corresponds to the percentage of nodes in agreement.
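The following sketch shows one generic way to recover such scores from fitted SciKit-Learn models; the inverse-distance mapping for KNN and the fallback ordering are illustrative choices of our own.

```python
import numpy as np

def model_score(model, X):
    """Return a per-sample confidence score from a fitted SciKit-Learn model.

    A sketch mirroring the strategies described above; the distance-to-score
    mapping for KNN is one illustrative choice among many.
    """
    if hasattr(model, "kneighbors"):
        # KNN: average Euclidean distance to the K neighbors; smaller
        # distances are mapped to higher scores.
        dist, _ = model.kneighbors(X)
        return 1.0 / (1.0 + dist.mean(axis=1))
    if hasattr(model, "predict_proba"):
        # Probability of the predicted class (MLP softmax, vote share
        # of the trees in a forest, etc.).
        return model.predict_proba(X).max(axis=1)
    # SVM-style models: squash the decision margin into (0, 1).
    return 1.0 / (1.0 + np.exp(-np.abs(model.decision_function(X))))
```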

4.5 Model Selection for Fusion

With so many optimized models generated, a question arises: should all models be included in fusion? It is possible that several optimized models are only slight variations of a dominant classifier's parameters, resulting in redundant score data and biased classification predictions. Before proposing methods to automatically select models for fusion, we first explore various methods of partitioning the models to form the subset FM and address potential bias.

Forming FM: The most obvious way to subset M into FM is to select the top n models reporting the highest classification accuracy (Top). This approach is straightforward and drastically cuts the size of M. However, if a certain classifier performs particularly well on a dataset, a large portion of the n models in FM will belong to that same classifier and make similar classification errors. Therefore, we next explore forming FM by partitioning M based on the classifier type (CS), the feature-set (FS), and the classifier-feature-set combination (CFS). To form FM, we select the highest performing model from each partition; for example, in the CS stratification approach we select the model with the highest reported AUC for each classifier type. Lastly, we can form FM from every optimized model produced by ATM (FM = M). In summary, the approaches we apply to form FM are: Top, CS, FS, CFS, and the entire set of models.
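The following sketch illustrates the Top and CFS strategies, assuming the ATM output has been summarized in a pandas DataFrame with hypothetical columns classifier, feature_set, and auc.

```python
import pandas as pd

def form_fm_cfs(models: pd.DataFrame) -> pd.DataFrame:
    """CFS: keep the best model for each classifier/feature-set pair."""
    best = models.groupby(["classifier", "feature_set"])["auc"].idxmax()
    return models.loc[best]

def form_fm_top(models: pd.DataFrame, n: int) -> pd.DataFrame:
    """Top: keep the n models with the highest reported accuracy."""
    return models.nlargest(n, "auc")
```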

Once the models in FM have been identified using one of the aforementioned methods, the next step is to design an algorithm to effectively select the minimal number of classifier models to fuse. Intelligently selecting models for fusion begins with deciding which model best complements the models already selected. To facilitate this, we compute the pair-wise score correlation between all pairs of candidate models; combining models which are least correlated is typically beneficial [15]. In this work, we develop two techniques for selecting models to fuse. In both approaches, we start with the best performing model generated by ATM and then use pairwise correlation values to guide model selection. We then compare the performance of the two approaches, examining how accuracy and efficiency are impacted by the number of models fused.

Static Selection: The first selected model is the best model produced by ATM. The remaining models in FM form the unselected set. The next model chosen is the unselected model least correlated with the most recently selected model. This process repeats until no unselected models remain. By using the Pearson correlation between models' scores across training samples, we select models whose scores carry information not already captured by the previously selected models.
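A minimal sketch of static selection follows, assuming scores is an (n_samples, n_models) matrix of training-set scores whose first column belongs to the best ATM model.

```python
import numpy as np

def static_selection(scores: np.ndarray) -> list:
    """Order models for fusion by score decorrelation; column 0 of `scores`
    is assumed to hold the best ATM model. Returns column indices in
    selection order."""
    corr = np.abs(np.corrcoef(scores, rowvar=False))  # pairwise Pearson |r|
    selected = [0]
    unselected = set(range(1, scores.shape[1]))
    while unselected:
        # Pick the model least correlated with the last selected model.
        nxt = min(unselected, key=lambda j: corr[selected[-1], j])
        selected.append(nxt)
        unselected.remove(nxt)
    return selected
```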


Dynamic Selection: We next explore dynamically updating the correlation values after each model is selected for fusion. Once the first two candidate models are selected as in the static selection above, we immediately fuse them to obtain a new model with a new set of fused scores. We then update the correlation coefficients of the unselected models to account for the scores of this newly fused model. This selection method is dynamic, allowing the next model to be chosen based on the models already selected for fusion.
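A corresponding sketch of dynamic selection for the binary-class case follows; using the fusion SVM's positive-class probability as the committee's score is one illustrative choice.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVC

def dynamic_selection(scores: np.ndarray, labels: np.ndarray, n_select: int) -> list:
    """Dynamic variant (binary case): after each selection the committee
    is re-fused, and correlations are measured against the fused scores
    rather than the last individual model."""
    selected = [0]                  # start from the best ATM model
    fused = scores[:, 0]
    while len(selected) < n_select:
        rest = [j for j in range(scores.shape[1]) if j not in selected]
        # Least-correlated candidate w.r.t. the current fused scores.
        nxt = min(rest, key=lambda j: abs(pearsonr(fused, scores[:, j])[0]))
        selected.append(nxt)
        # Re-fuse with an SVM and take its positive-class probability
        # as the committee's new score.
        fuser = SVC(probability=True).fit(scores[:, selected], labels)
        fused = fuser.predict_proba(scores[:, selected])[:, 1]
    return selected
```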

5 Results and Analysis

The proposed method improves AUC accuracy both over any individual AutoML-generated model and over the previous state-of-the-art accuracies described in Sect. 3. We focus on the Prostate and Brain datasets, as ATM already produced highly optimized models for the Stem Cell dataset, with individual AUCs over 99%. While no single stratification approach universally performed best, using a stratification approach yields higher AUC than randomly selecting the same number of models (repeated 5 times), as shown in Table 5. The classifier-feature-set stratification (CFS) technique provides the best results with the fewest models for the prostate dataset, while selecting models from the top-performing stratification (Top) provides the best results for the brain dataset. Figure 6 shows how the AUC changes as models are added to the fusion.

Table 5. Summary of the best accuracies (%) achieved through fusion compared to previously published performances. Entries in parentheses indicate the number of models fused and the stratification approach. Note that ATM alone produced models achieving strong results on the stem cell dataset, so fusing additional models is not considered there.
Fig. 6. Plots illustrating the changes in accuracy as an increasing number of models are included in fusion. Left: Brain dataset. Right: Prostate dataset.

6 Summary

We applied score-level fusion to combine multiple classification models generated by an AutoML tool (ATM), improving accuracy on three medical imaging datasets. Further, we developed two methods to select models for fusion based on Pearson's correlation coefficient. We observe that fusing models improves accuracy beyond the best individual classification model produced by ATM, and that the highest accuracy achieved through fusion surpasses even that of deep learning approaches on these datasets. Furthermore, these results are achieved with small training sets and without auxiliary information such as labeling behavior or pixels surrounding the area of interest. This underscores the importance of judiciously selecting models for fusion in order to improve classification accuracy.