Introduction

The rapid rise of artificial intelligence (AI) in recent years has been accompanied (and enabled) by staggering advances in both software and hardware technologies. Tools such as PyTorch [1] for deep learning (DL) and scikit-learn [2, 3] for machine learning (ML), together with graphics processing unit (GPU) hardware, enable faster and better prototyping and deployment of AI systems than was possible a mere half-decade ago.

While deep learning has taken the world by storm, often, it would seem, at the expense of other computational paradigms, these (plentiful) other paradigms remain very much alive and kicking. We propose herein to revisit stacking-based modeling [4], but within a comprehensive framework enabled by modern state-of-the-art software packages and hardware platforms.

As previously argued in Refs. [5, 6], significant improvement can be attained by making use of models we already possess anyway, through what they termed “conservation machine learning”: conserve models across runs, users, and experiments, and make use of all of them. Herein, focusing on image-classification tasks, we ask whether, given a (possibly haphazard) collection of deep neural networks (DNNs), the tools at our disposal (specifically, “good old-fashioned” ML algorithms, many of which have been around for quite some time) can help improve prediction accuracy.

To wit, can we combine DL and ML in a manner that improves DL performance? We answer in the affirmative, with a major novelty being the use of the largest number of DL and ML models to date within a single, comprehensive framework.

The section “Previous Work” discusses related previous work. The section “Deep GOld: Algorithmic Setup” describes Deep GOld—Deep Learning and Good Old-Fashioned Machine Learning—which employs 51 deep networks and 10 ML algorithms. The section “Results” presents the results of 120 experiments over four image-classification datasets: Fashion MNIST, CIFAR10, CIFAR100, and Tiny ImageNet. We end with a discussion in the section “Discussion” and concluding remarks in the section “Concluding Remarks”.

Previous Work

Many works involve some form or other of ensembling several models. This section does not serve as a full review, but focuses on those papers we found most relevant to our topic.

In an early work, [7] presented a technique called Addemup that uses a genetic algorithm to search for an accurate and diverse set of trained networks. Addemup works by creating an initial population of networks, then evolving new ones, keeping the set of networks that are as accurate as possible while disagreeing with each other as much as possible. The technique was tested on three DNA datasets of about 1000 samples.

A few years later, [8] presented an approach named Genetic Algorithm-based Selective ENsemble (GASEN) to select some neural networks from a pool of candidates, and assign weights to their contributions to the resultant ensemble. The networks had one hidden layer with five hidden units. The efficacy of this method was shown for regression and classification problems over structured (non-image) datasets of a few thousand samples. Another work by [9] studied financial-decision applications, wherein a neural-network ensemble prediction was similarly reached by weighting the decision of each ensemble member.

A more recent example (one of many) of straightforward ensembling is given in [10], who presented an ensemble neural-network model for real-time prediction of urban floods. Their ensemble approach used a number of artificial neural networks with identical topology, trained with different initial weights. The final result of maximum water level was the ensemble mean. Ensemble sizes examined were 1, 5, and 10.

In a similar vein, [11] trained multiple neural networks and combined their outputs using three combination strategies: simple average, weighted average, and what they termed a meta-learner, which applied a Bayesian regulation algorithm to the network outputs. The application field considered was real-time production monitoring in the oil and gas industry, specifically, virtual flow meters, which infer multiphase flow rates from ancillary measurements and offer attractive, cost-effective solutions for meeting monitoring demands, reducing operational costs, and improving oil recovery efficiency.

Ref. [12] trained five convolutional neural networks (CNNs) to detect ankle fractures in radiographic views. Model outputs were evaluated using both one and three radiographic views. Ensembles were created from a combination of CNNs after training. They implemented a simple voting method to consolidate the output from the three views and ensemble of models.

Ref. [13] presented a malware detection method called MalNet, which uses a stacking ensemble with two deep networks, a CNN and a long short-term memory (LSTM) network, as first-level learners, and logistic regression as a second-level learner.

Ref. [14] examined neural-network ensemble classification for lung cancer disease diagnosis. They proposed an ensemble of Weight Optimized Neural Network with Maximum-Likelihood Boosting (WONN-MLB), which essentially seeks to find optimal weights for a weighted (linear) majority vote. Ref. [15] applied a neural-network ensemble to intrusion detection, again using weighted majority voting.

Ref. [16] recently presented a cogent case for the use of XGBoost for tabular data, demonstrating that it outperformed deep models. They also showed that an ensemble comprising four deep models and XGBoost, predicting through weighted voting, worked best for the tabular datasets considered.

Ref. [17] proposed an ensemble DNN for tumor detection in colorectal histology images. Weights derived from the individual models, based on their performance metrics, are assigned within the ensemble DNN, which is then trained. The model is subsequently retrained with all layers frozen except the fully connected and dense layers.

Ref. [18] presented an ensemble DL method to detect retinal disorders. Their method comprised three pretrained architectures—DenseNet, VGG16, InceptionV3—and a fourth Custom CNN of their own design. The individual results obtained from the four architectures were then combined to form an ensemble network that yielded superb performance over a dataset of retinal images.

Ref. [19] examined Deep Q-learning, presenting an ensemble approach that improved stability during training, resulting in improved average performance.

As noted above, [5, 6] presented conservation machine learning, which conserves models across runs, users, and experiments, and makes use of them. They showed that significant improvement could be attained by employing ML models already available anyway.

Deep GOld: Algorithmic Setup

Stacking (or Stacked Generalization) [4] is an ensemble method that uses multiple models to tackle classification or regression problems. The main idea is to first train different models on the original problem. The outputs of these models form a first level, and are then passed on to a second level to perform the final prediction. The inputs to the second-level model are thus the outputs of the first-level models.
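To make the two-level idea concrete, the following minimal sketch uses scikit-learn’s built-in StackingClassifier on a toy dataset; it illustrates generic stacking only, not the Deep GOld pipeline described below.

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import RidgeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Level 1: base models, whose outputs become the features
    # seen by the level-2 (final) estimator.
    level1 = [("rf", RandomForestClassifier()), ("knn", KNeighborsClassifier())]
    stack = StackingClassifier(estimators=level1, final_estimator=RidgeClassifier())
    stack.fit(X_train, y_train)
    print(stack.score(X_test, y_test))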

Our framework involves deep networks as first-level models and ML methods as second-level models. For the former we used PyTorch, one of the top-two most popular and performant deep-learning software tools [1]. The module torchvision.models contains 59 deep-network models that were pretrained on the large-scale (over 1 million images), 1000-class ImageNet dataset [20].
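Instantiating any of these pretrained networks is a one-liner; a small sketch follows (the exact constructor argument depends on the installed torchvision version: older releases use pretrained=True, newer ones a weights argument).

    import torchvision.models as models

    # ImageNet-pretrained weights are downloaded on first use.
    resnet = models.resnet18(pretrained=True)
    densenet = models.densenet121(pretrained=True)
    vgg = models.vgg16(pretrained=True)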

Of the 59 models, we retained 51 (8 models proved somewhat unwieldy or evoked a “not implemented” error). Each of the models was first retrained over the four datasets we experimented with in this paper: Fashion MNIST, CIFAR10, CIFAR100, and Tiny ImageNet. As seen in Table 1, these datasets contain between 50,000 and 100,000 greyscale or color images in the training set, 10,000 images in the test set, with number of classes ranging between 10 and 200. Retraining was necessary, since the datasets contain images that differ in size and number of classes from ImageNet.

Table 1 Datasets

For retraining, we replaced the last fully connected (FC) 1000-class layer with a sequence of blocks comprising three layers: {FC, batchnorm, leaky ReLU}, denoted FBL. The final number of features of the original network was reduced to the dataset’s number of classes by halving the number of nodes at each layer, starting with the closest power of 2. Consider an example: if the original network ended with 600 features and the dataset contains 100 classes, then our modified network’s final layers comprised a 512-node, 3-layer FBL block (512 being the power of 2 closest to 600), followed by a 256-node FBL, followed by a 128-node FBL, and ending with the 100 classes. In addition, the first convolutional layer of the original network needed adjustment in some cases. The retraining phase is detailed in Algorithm 1.

[Algorithm 1: network retraining]
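As an illustration of the head replacement described above, here is a minimal sketch; fbl_block and build_fbl_head are hypothetical helpers of ours, not the paper’s released code.

    import math
    import torch.nn as nn

    def fbl_block(in_features, out_features):
        # FBL = {fully connected, batch norm, leaky ReLU}
        return nn.Sequential(
            nn.Linear(in_features, out_features),
            nn.BatchNorm1d(out_features),
            nn.LeakyReLU(),
        )

    def build_fbl_head(in_features, num_classes):
        # Start at the power of 2 closest to in_features and halve the
        # width until reaching num_classes, ending with a plain FC layer.
        width = 2 ** round(math.log2(in_features))
        blocks, prev = [], in_features
        while width > num_classes:
            blocks.append(fbl_block(prev, width))
            prev, width = width, width // 2
        blocks.append(nn.Linear(prev, num_classes))
        return nn.Sequential(*blocks)

    # E.g., 600 original features and 100 classes yield
    # FBL(600->512), FBL(512->256), FBL(256->128), Linear(128->100).

For a torchvision network whose classifier head is an fc attribute (e.g., ResNet), the replacement then amounts to model.fc = build_fbl_head(model.fc.in_features, num_classes).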

Once Algorithm 1 is run for all four datasets, we are in possession of 51 trained models per dataset. We can now proceed to perform the two-level prediction, as detailed in Algorithm 2. Our interest herein was to study what one can do with models one has at hand. Towards this end, we first selected from the 51 retrained models three random ensembles of networks, of sizes 3, 7, and 11. Each network of an ensemble was then run over both the training and test sets of the dataset in question (without any training—only feed-forward output computation). These first-level outputs were then concatenated to form an input dataset for the second level. For example, if the ensemble contains seven networks, and the dataset in question is CIFAR100, then the first level creates two datasets: a training set with 50,000 samples and 701 features, and a test set with 10,000 samples and 701 features (701: 7 networks \(\times\) 100 classes + 1 target class).

[Algorithm 2: two-level prediction]
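The construction of a level-1 dataset can be sketched as follows, assuming the ensemble’s networks are already trained, moved to the target device, and set to eval(), and that the data loader yields image batches with integer labels (a simplified rendition of this step, not Algorithm 2’s exact code).

    import numpy as np
    import torch

    @torch.no_grad()
    def level1_dataset(ensemble, loader, device="cpu"):
        # Feed-forward only: run every network over the data and
        # concatenate their per-class outputs, column-wise, per sample.
        features, targets = [], []
        for x, y in loader:
            x = x.to(device)
            outs = [net(x).cpu().numpy() for net in ensemble]  # each: (batch, n_classes)
            features.append(np.hstack(outs))
            targets.append(y.numpy())
        X = np.vstack(features)      # n_samples x (n_networks * n_classes)
        y = np.concatenate(targets)  # the target class, Algorithm 2's final column
        return X, y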

After the first level produced its output datasets, we passed them along to the second level, wherein we employed ten ML algorithms (a brief second-level usage sketch follows the list):

1. sklearn.linear_model.SGDClassifier: Linear classifiers with SGD training.

2. sklearn.linear_model.PassiveAggressiveClassifier: Passive Aggressive Classifier [21].

3. sklearn.linear_model.RidgeClassifier: Classifier using Ridge regression.

4. sklearn.linear_model.LogisticRegression: Logistic Regression classifier.

5. sklearn.neighbors.KNeighborsClassifier: Classifier implementing the k-nearest neighbors vote.

6. sklearn.ensemble.RandomForestClassifier: A random forest classifier.

7. sklearn.neural_network.MLPClassifier: Multi-layer Perceptron classifier, with five hidden layers of 64 neurons each.

8. xgboost.XGBClassifier: XGBoost classifier [22].

9. lightgbm.LGBMClassifier: LightGBM classifier [23].

10. catboost.CatBoostClassifier: CatBoost classifier [24].
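Each of these second-level learners then consumes the level-1 datasets in the standard scikit-learn fashion; a minimal sketch using RidgeClassifier, assuming the hypothetical level1_dataset helper from the earlier sketch and in-scope ensemble and DataLoader objects:

    from sklearn.linear_model import RidgeClassifier

    # Level-1 datasets built from the ensemble's feed-forward outputs
    # over the original training and test sets.
    X_train, y_train = level1_dataset(ensemble, train_loader)
    X_test, y_test = level1_dataset(ensemble, test_loader)

    clf = RidgeClassifier()
    clf.fit(X_train, y_train)
    print("level-2 test accuracy:", clf.score(X_test, y_test))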

Results

Unsurprisingly, we found significant differences in the runtime of the level-2 ML algorithms (Algorithm 2). While some methods, such as RidgeClassifier and KNeighborsClassifier, were very fast, usually finishing within minutes, others proved slow (notably, XGBClassifier and CatBoostClassifier, which took several hours). While the number of samples of the generated ML datasets for the four problems studied is similar (identical to the original datasets—Table 1), the number of features differs by an order of magnitude: with ten classes for Fashion MNIST and CIFAR10, 100 classes for CIFAR100, and 200 classes for Tiny ImageNet, the latter two have 10 and 20 times more features than the former two, respectively. Some ML methods are known to scale less well with number of features.

ML runtimes for Fashion MNIST and CIFAR10 proved sufficiently fast to afford the use of hyperparameter tuning. Towards this end, we used Optuna, a state-of-the-art, automatic, hyperparameter-optimization software framework [25], which we previously used successfully [26, 27]. Optuna offers a define-by-run user API, through which one can dynamically construct the search space, along with efficient sampling and pruning algorithms. Moreover, our experience has shown it to be fairly easy to set up. Optuna formulates hyperparameter optimization as the process of minimizing or maximizing an objective function that receives a set of hyperparameters as input. The hyperparameter ranges and sets are given in Table 2. With CIFAR100 and Tiny ImageNet, we did not use Optuna, but rather ran the ML algorithms with their default values.
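As an illustration of the define-by-run style, the following minimal Optuna sketch tunes RidgeClassifier’s regularization strength on synthetic stand-in data; the search range here is hypothetical, the ranges actually used being those of Table 2.

    import optuna
    from sklearn.datasets import make_classification
    from sklearn.linear_model import RidgeClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for a level-1 dataset, to keep the sketch runnable.
    X, y = make_classification(n_samples=2000, n_features=71, n_informative=30,
                               n_classes=10, random_state=0)

    def objective(trial):
        alpha = trial.suggest_float("alpha", 1e-3, 1e2, log=True)
        return cross_val_score(RidgeClassifier(alpha=alpha), X, y, cv=3).mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    print(study.best_params)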

Table 2 Hyperparameter value ranges and sets used by Optuna

Table 3 presents our results (we set a 10-h limit on an ML algorithm’s run for a row of the table, i.e., the level-2 loop of Algorithm 2). A total of 120 experiments were performed: 4 datasets \(\times\) 3 ensemble sizes (3, 7, and 11) \(\times\) 10 complete runs per dataset. In each experiment, we generated level-1 datasets and then executed the ML algorithms, as delineated in Algorithm 2. We then compared three values: (1) the test score of the top network amongst the random ensemble (known from Algorithm 1); (2) the test score of majority prediction, wherein the predicted class is determined through a majority vote amongst the ensemble’s networks’ outputs; (3) the test score of the top ML method. The code is available at https://github.com/moshesipper.
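For reference, the majority-prediction baseline in (2) can be computed along the following lines; ensemble_outputs is a hypothetical array holding each network’s class scores on the test set, and ties are broken here by the lowest class index.

    import numpy as np

    def majority_prediction(ensemble_outputs):
        # ensemble_outputs: (n_networks, n_samples, n_classes) class scores.
        n_classes = ensemble_outputs.shape[2]
        votes = ensemble_outputs.argmax(axis=2)  # each network's predicted class
        counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
        return counts.argmax(axis=0)             # majority class per sample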

Table 3 Results for ensembles of 3, 7, and 11 random networks

Discussion

As observed in Table 3, of the total of 120 experiments, an ML algorithm won in all but ten experiments (four were won by the retrained network, and six by majority prediction).

We note that classical algorithms, notably Ridge regression and k-nearest neighbors, worked best (they account for 104 of the wins). They are also fast, scalable, and amenable to quick hyperparameter tuning. If one wishes to focus on a smaller batch of ML algorithms, these two seem like an excellent choice.

As noted in the section “Introduction”, we often find ourselves in possession of a plethora of models, either collected by us through many experiments, or by others (witness our use of pretrained models herein). Benefiting from current state-of-the-art technology, Deep GOld leverages this wealth of models to attain better performance. One can of course tailor the framework to available deep networks and to a personal predilection for any ML algorithm(s).

Concluding Remarks

We presented Deep GOld, a comprehensive, stacking-based framework for combining deep learning with machine learning. Our framework selects ensembles from 51 pretrained deep networks, retrained on the datasets at hand, as first-level models, and employs 10 machine-learning algorithms as second-level models. We demonstrated the unequivocal benefits of the approach over four image-classification datasets.

We suggest a number of paths for future research:

  • Further analysis of ML algorithms whose inputs are the outputs of deep networks. Do some ML methods inherently work better with such datasets?

  • Currently, the features for level 2 comprise only the level-1 outputs. We might enhance this setup through automatic feature construction.

  • Train (or retrain) the level-1 networks alongside a level-2 ML model: (1) after each training epoch of the networks in the ensemble, generate a dataset from the network outputs; (2) fit a level-2 ML model to this level-1 dataset; (3) use the ML model’s class probabilities to ascribe loss values to the networks-in-training.