Introduction

The rapid rise of artificial intelligence (AI) in recent years has been accompanied (and enabled) by staggering advances in both software and hardware technologies. Tools such as PyTorch [1] for deep learning (DL) and scikit-learn [2, 3] for machine learning (ML), together with graphics processing unit (GPU) hardware, enable faster and better prototyping and deployment of AI systems than was possible a mere half-decade ago.

While deep learning has taken the world by storm, often, it would seem, at the expense of other computational paradigms, these (plentiful) other paradigms remain very much alive and kicking. We propose herein to revisit stacking-based modeling [4], but within a comprehensive framework enabled by modern state-of-the-art software packages and hardware platforms.

As previously argued in Refs. [5, 6], significant improvement can be attained by making use of models we already possess anyway, through what they termed “conservation machine learning”: conserve models across runs, users, and experiments, and make use of all of them. Herein, focusing on image-classification tasks, we ask whether, given a (possibly haphazard) collection of deep neural networks (DNNs), the tools at our disposal (specifically, “good old-fashioned” ML algorithms, many of which have been around for quite some time) can help improve prediction accuracy.

To wit, can we combine DL and ML in a manner that improves DL performance? We answer in the affirmative, with a major novelty being the use of the largest number of DL and ML models to date within a single, comprehensive framework.

The section “Previous Work” discusses related previous work. The section “Deep GOld: Algorithmic Setup” describes Deep GOld—Deep Learning and Good Old-Fashioned Machine Learning—which employs 51 deep networks and 10 ML algorithms. The section “Results” presents the results of 120 experiments over four image-classification datasets: Fashion MNIST, CIFAR10, CIFAR100, and Tiny ImageNet. We end with a discussion in the section “Discussion” and concluding remarks in the section “Concluding Remarks”.

Previous Work

Many works involve some form or other of ensembling several models. This section does not serve as a full review, but focuses on those papers we found most relevant to our topic.

In an early work, [7] presented a technique called Addemup that uses a genetic algorithm to search for an accurate and diverse set of trained networks. Addemup works by creating an initial population of networks, then evolving new ones, keeping the set of networks that are as accurate as possible while disagreeing with each other as much as possible. The technique was tested on three DNA datasets of about 1000 samples.

A few years later, [8] presented an approach named Genetic Algorithm-based Selective ENsemble (GASEN) to select some neural networks from a pool of candidates, and assign weights to their contributions to the resultant ensemble. The networks had one hidden layer with five hidden units. The efficacy of this method was shown for regression and classification problems over structured (non-image) datasets of a few thousand samples. Another work by [9] studied financial-decision applications, wherein a neural-network ensemble prediction was similarly reached by weighting the decision of each ensemble member.

A more recent example (one of many) of straightforward ensembling is given in [10], who presented an ensemble neural-network model for real-time prediction of urban floods. Their ensemble approach used a number of artificial neural networks with identical topology, trained with different initial weights. The final result of maximum water level was the ensemble mean. Ensemble sizes examined were 1, 5, and 10.

In a similar vein, [11] trained multiple neural networks and combined their outputs using three combination strategies: simple average, weighted average, and what they termed a meta-learner, which applied a Bayesian regulation algorithm to the network outputs. The application field considered was real-time production monitoring in the oil and gas industry, specifically, virtual flow meters, which infer multiphase flow rates from ancillary measurements and offer attractive, cost-effective solutions for meeting monitoring demands, reducing operational costs, and improving oil recovery efficiency.

Ref. [12] trained five convolutional neural networks (CNNs) to detect ankle fractures in radiographic views. Model outputs were evaluated using both one and three radiographic views. Ensembles were created from a combination of CNNs after training. They implemented a simple voting method to consolidate the output from the three views and ensemble of models.

Ref. [13] presented a malware detection method called MalNet, which uses a stacking ensemble with two deep networks, a CNN and a long short-term memory (LSTM) network, as first-level learners, and logistic regression as a second-level learner.

Ref. [14] examined neural-network ensemble classification for lung cancer disease diagnosis. They proposed an ensemble of Weight Optimized Neural Network with Maximum-Likelihood Boosting (WONN-MLB), which essentially seeks to find optimal weights for a weighted (linear) majority vote. Ref. [15] applied a neural-network ensemble to intrusion detection, again using weighted majority voting.

Ref. [16] recently presented a cogent case for the use of XGBoost for tabular data, demonstrating that it outperformed deep models. They also showed that an ensemble comprising four deep models and XGBoost, predicting through weighted voting, worked best for the tabular datasets considered.

Ref. [17] proposed an ensemble DNN for tumor detection in colorectal histology images. Weights derived from the individual models, based on their performance metrics, are assigned within the ensemble DNN, which is then trained. The model is subsequently retrained with all layers frozen except the fully connected and dense layers.

Ref. [18] presented an ensemble DL method to detect retinal disorders. Their method comprised three pretrained architectures—DenseNet, VGG16, InceptionV3—and a fourth Custom CNN of their own design. The individual results obtained from the four architectures were then combined to form an ensemble network that yielded superb performance over a dataset of retinal images.

Ref. [19] examined Deep Q-learning, presenting an ensemble approach that improved stability during training, resulting in improved average performance.

As noted above, [5, 6] presented conservation machine learning, which conserves models across runs, users, and experiments, and makes use of them. They showed that significant improvement could be attained by employing ML models already available anyway.

Deep GOld: Algorithmic Setup

Stacking (or Stacked Generalization) [4] is an ensemble method that uses multiple models to tackle classification or regression problems. The main idea is to first train different models on the original problem. The outputs of these models form a first level, and are then passed on to a second level to perform the final prediction. The inputs to the second-level model are thus the outputs of the first-level models.
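To make the two-level idea concrete, the following minimal sketch uses scikit-learn’s built-in StackingClassifier on a toy dataset; it illustrates generic stacking only, not the Deep GOld pipeline described below.

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import RidgeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Level 1: base models, whose outputs become the features
    # seen by the level-2 (final) estimator.
    level1 = [("rf", RandomForestClassifier()), ("knn", KNeighborsClassifier())]
    stack = StackingClassifier(estimators=level1, final_estimator=RidgeClassifier())
    stack.fit(X_train, y_train)
    print(stack.score(X_test, y_test))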

Our framework involves deep networks as first-level models and ML methods as second-level models. For the former we used PyTorch, one of the top-two most popular and performant deep-learning software tools [1]. The module torchvision.models contains 59 deep-network models that were pretrained on the large-scale (over 1 million images), 1000-class ImageNet dataset [20].
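Instantiating any of these pretrained networks is a one-liner; a small sketch follows (the exact constructor argument depends on the installed torchvision version: older releases use pretrained=True, newer ones a weights argument).

    import torchvision.models as models

    # ImageNet-pretrained weights are downloaded on first use.
    resnet = models.resnet18(pretrained=True)
    densenet = models.densenet121(pretrained=True)
    vgg = models.vgg16(pretrained=True)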

Of the 59 models, we retained 51 (8 models proved somewhat unwieldy or evoked a “not implemented” error). Each of the models was first retrained over the four datasets we experimented with in this paper: Fashion MNIST, CIFAR10, CIFAR100, and Tiny ImageNet. As seen in Table 1, these datasets contain between 50,000 and 100,000 greyscale or color images in the training set, 10,000 images in the test set, with number of classes ranging between 10 and 200. Retraining was necessary, since the datasets contain images that differ in size and number of classes from ImageNet.

Table 1 Datasets

For retraining, we replaced the last fully connected (FC) 1000-class layer with a sequence of blocks comprising three layers: {FC, batchnorm, leaky ReLU}, denoted FBL. The final number of features of the original network was reduced to the dataset’s number of classes by halving the number of nodes at each layer, starting with the closest power of 2. Consider an example: if the original network ended with 600 features and the dataset contains 100 classes, then our modified network’s final layers comprised a 512-node, 3-layer FBL block (512 being the power of 2 closest to 600), followed by a 256-node FBL, followed by a 128-node FBL, and ending with the 100 classes. In addition, the first convolutional layer of the original network needed adjustment in some cases. The retraining phase is detailed in Algorithm 1.

[Algorithm 1: network retraining]
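As an illustration of the head replacement described above, here is a minimal sketch; fbl_block and build_fbl_head are hypothetical helpers of ours, not the paper’s released code.

    import math
    import torch.nn as nn

    def fbl_block(in_features, out_features):
        # FBL = {fully connected, batch norm, leaky ReLU}
        return nn.Sequential(
            nn.Linear(in_features, out_features),
            nn.BatchNorm1d(out_features),
            nn.LeakyReLU(),
        )

    def build_fbl_head(in_features, num_classes):
        # Start at the power of 2 closest to in_features and halve the
        # width until reaching num_classes, ending with a plain FC layer.
        width = 2 ** round(math.log2(in_features))
        blocks, prev = [], in_features
        while width > num_classes:
            blocks.append(fbl_block(prev, width))
            prev, width = width, width // 2
        blocks.append(nn.Linear(prev, num_classes))
        return nn.Sequential(*blocks)

    # E.g., 600 original features and 100 classes yield
    # FBL(600->512), FBL(512->256), FBL(256->128), Linear(128->100).

For a torchvision network whose classifier head is an fc attribute (e.g., ResNet), the replacement then amounts to model.fc = build_fbl_head(model.fc.in_features, num_classes).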

Once Algorithm 1 is run for all four datasets, we are in possession of 51 trained models per dataset. We can now proceed to perform the two-level prediction, as detailed in Algorithm 2. Our interest herein was to study what one can do with models one has at hand. Towards this end, we first selected from the 51 retrained models three random ensembles of networks, of sizes 3, 7, and 11. Each network of an ensemble was then run over both the training and test sets of the dataset in question (without any training—only feed-forward output computation). These first-level outputs were then concatenated to form an input dataset for the second level. For example, if the ensemble contains seven networks, and the dataset in question is CIFAR100, then the first level creates two datasets: a training set with 50,000 samples and 701 features, and a test set with 10,000 samples and 701 features (701: 7 networks \(\times\) 100 classes + 1 target class).

[Algorithm 2: two-level prediction]
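The construction of a level-1 dataset can be sketched as follows, assuming the ensemble’s networks are already trained, moved to the target device, and set to eval(), and that the data loader yields image batches with integer labels (a simplified rendition of this step, not Algorithm 2’s exact code).

    import numpy as np
    import torch

    @torch.no_grad()
    def level1_dataset(ensemble, loader, device="cpu"):
        # Feed-forward only: run every network over the data and
        # concatenate their per-class outputs, column-wise, per sample.
        features, targets = [], []
        for x, y in loader:
            x = x.to(device)
            outs = [net(x).cpu().numpy() for net in ensemble]  # each: (batch, n_classes)
            features.append(np.hstack(outs))
            targets.append(y.numpy())
        X = np.vstack(features)      # n_samples x (n_networks * n_classes)
        y = np.concatenate(targets)  # the target class, Algorithm 2's final column
        return X, y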

After the first level produced its output datasets, we passed them along to the second level, wherein we employed ten ML algorithms (a brief second-level usage sketch follows the list):

1. sklearn.linear_model.SGDClassifier: Linear classifiers with SGD training.

2. sklearn.linear_model.PassiveAggressiveClassifier: Passive Aggressive Classifier [21].

3. sklearn.linear_model.RidgeClassifier: Classifier using Ridge regression.

4. sklearn.linear_model.LogisticRegression: Logistic Regression classifier.

5. sklearn.neighbors.KNeighborsClassifier: Classifier implementing the k-nearest neighbors vote.

6. sklearn.ensemble.RandomForestClassifier: A random forest classifier.

7. sklearn.neural_network.MLPClassifier: Multi-layer Perceptron classifier, with five hidden layers of 64 neurons each.

8. xgboost.XGBClassifier: XGBoost classifier [22].

9. lightgbm.LGBMClassifier: LightGBM classifier [23].

10. catboost.CatBoostClassifier: CatBoost classifier [24].
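Each of these second-level learners then consumes the level-1 datasets in the standard scikit-learn fashion; a minimal sketch using RidgeClassifier, assuming the hypothetical level1_dataset helper from the earlier sketch and in-scope ensemble and DataLoader objects:

    from sklearn.linear_model import RidgeClassifier

    # Level-1 datasets built from the ensemble's feed-forward outputs
    # over the original training and test sets.
    X_train, y_train = level1_dataset(ensemble, train_loader)
    X_test, y_test = level1_dataset(ensemble, test_loader)

    clf = RidgeClassifier()
    clf.fit(X_train, y_train)
    print("level-2 test accuracy:", clf.score(X_test, y_test))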

Results

Unsurprisingly, we found significant differences in the runtime of the level-2 ML algorithms (Algorithm 2). While some methods, such as RidgeClassifier and KNeighborsClassifier, were very fast, usually finishing within minutes, others proved slow (notably, XGBClassifier and CatBoostClassifier, which took several hours). While the number of samples of the generated ML datasets for the four problems studied is similar (identical to the original datasets—Table 1), the number of features differs by an order of magnitude: with ten classes for Fashion MNIST and CIFAR10, 100 classes for CIFAR100, and 200 classes for Tiny ImageNet, the latter two have 10 and 20 times more features than the former two, respectively. Some ML methods are known to scale less well with number of features.

ML runtimes for Fashion MNIST and CIFAR10 proved sufficiently fast to afford the use of hyperparameter tuning. Towards this end, we used Optuna, a state-of-the-art, automatic, hyperparameter-optimization software framework [25], which we previously used successfully [26, 27]. Optuna offers a define-by-run user API, through which one can dynamically construct the search space, along with efficient sampling and pruning algorithms. Moreover, our experience has shown it to be fairly easy to set up. Optuna formulates hyperparameter optimization as the process of minimizing or maximizing an objective function that receives a set of hyperparameters as input. The hyperparameter ranges and sets are given in Table 2. With CIFAR100 and Tiny ImageNet, we did not use Optuna, but rather ran the ML algorithms with their default values.
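As an illustration of the define-by-run style, the following minimal Optuna sketch tunes RidgeClassifier’s regularization strength on synthetic stand-in data; the search range here is hypothetical, the ranges actually used being those of Table 2.

    import optuna
    from sklearn.datasets import make_classification
    from sklearn.linear_model import RidgeClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for a level-1 dataset, to keep the sketch runnable.
    X, y = make_classification(n_samples=2000, n_features=71, n_informative=30,
                               n_classes=10, random_state=0)

    def objective(trial):
        alpha = trial.suggest_float("alpha", 1e-3, 1e2, log=True)
        return cross_val_score(RidgeClassifier(alpha=alpha), X, y, cv=3).mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    print(study.best_params)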

Table 2 Hyperparameter value ranges and sets used by Optuna

Table 3 presents our results (we set a 10-h limit on an ML algorithm’s run for a row of the table, i.e., the level-2 loop of Algorithm 2). A total of 120 experiments were performed: 4 datasets \(\times\) 3 ensemble sizes (3, 7, and 11) \(\times\) 10 complete runs per dataset. In each experiment, we generated level-1 datasets and then executed the ML algorithms, as delineated in Algorithm 2. We then compared three values: (1) the test score of the top network amongst the random ensemble (known from Algorithm 1); (2) the test score of majority prediction, wherein the predicted class is determined through a majority vote amongst the ensemble’s networks’ outputs; (3) the test score of the top ML method. The code is available at https://github.com/moshesipper.
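For reference, the majority-prediction baseline in (2) can be computed along the following lines; ensemble_outputs is a hypothetical array holding each network’s class scores on the test set, and ties are broken here by the lowest class index.

    import numpy as np

    def majority_prediction(ensemble_outputs):
        # ensemble_outputs: (n_networks, n_samples, n_classes) class scores.
        n_classes = ensemble_outputs.shape[2]
        votes = ensemble_outputs.argmax(axis=2)  # each network's predicted class
        counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
        return counts.argmax(axis=0)             # majority class per sample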

Table 3 Results for ensembles of 3, 7, and 11 random networks

Discussion

As observed in Table 3, of the total of 120 experiments, an ML algorithm won in all but ten experiments (four were won by the retrained network, and six by majority prediction).

We note that classical algorithms, notably Ridge regression and k-nearest neighbors, worked best (they account for 104 of the wins). They are also fast, scalable, and amenable to quick hyperparameter tuning. If one wishes to focus on a smaller batch of ML algorithms, these two seem like an excellent choice.

As noted in the section “Introduction”, we often find ourselves in possession of a plethora of models, either collected by us through many experiments, or by others (witness our use of pretrained models herein). Benefiting from current state-of-the-art technology, Deep GOld leverages this wealth of models to attain better performance. One can of course tailor the framework to available deep networks and to a personal predilection for any ML algorithm(s).

Concluding Remarks

We presented Deep GOld, a comprehensive, stacking-based framework for combining deep learning with machine learning. Our framework selects ensembles from 51 pretrained deep networks, retrained on the datasets at hand, as first-level models, and employs 10 machine-learning algorithms as second-level models. We demonstrated the unequivocal benefits of the approach over four image-classification datasets.

We suggest a number of paths for future research:

  • Further analysis of ML algorithms whose inputs are the outputs of deep networks. Do some ML methods inherently work better with such datasets?

  • Currently, the features for level 2 comprise only the level-1 outputs. We might enhance this setup through automatic feature construction.

  • Train (or retrain) the level-1 networks alongside a level-2 ML model: (1) after each training epoch of the networks in the ensemble, generate a dataset from the network outputs; (2) fit a level-2 ML model to this level-1 dataset; (3) use the ML model’s class probabilities to ascribe loss values to the networks-in-training.