1 Introduction

Choosing a suitable neural network for a dataset is challenging. In the case of underfitting, the learning model is too simple and cannot learn the data relations (Dietterich 1995), while with overfitting, the model is too complex and only memorizes the training data, with limited generalizability (Dietterich 1995; Nowlan and Hinton 1992; Hawkins 2004). In both cases, the model cannot recognize unseen data. For example, consider a regression problem with one independent variable x and one dependent variable y that takes the values in Table 1. The following three polynomial models and a neural network model are considered for predicting y from x:

  • Model 1: \(y=0.8549x\)

  • Model 2: \(y=0.5x^2-2.2x+7.25\)

  • Model 3: \(y=0.001157546x^5+0.000444516x^4+1.969512896\)

  • Model 4: A Multi-Layer Perceptron (MLP) with a hidden layer including 100 neurons

As shown in Fig. 1, the first model is underfitted, while the third and the fourth models are overfitted. The second model is the best fit, with small errors. The same issue occurs in classification problems.

Table 1 Data of a simple regression example with one independent variable (x) and one dependent variable (y)
Fig. 1 Comparison between an underfitted, a fit, and two overfitted regression models

The simplest way to solve the underfitting problem is to extend the nonlinearity of the model. However, overfitting cannot be solved as simply (Lawrence et al. 1997). Overfitting needs model simplification by decreasing the number of free parameters, weight sharing, stopping the training before overlearning, removing excess weights, decreasing redundant parameters using second-order gradient information, or penalizing the model complexity through a loss function (Nowlan and Hinton 1992; LeCun et al. 1990). For instance, low-rank factorization is a powerful method to simplify the learning model when the overfitting is high (Bejani and Ghatee 2020). Pattern deformation is another widely used approach in deep networks to improve the generalization power (Schmidhuber 2015).

On the other hand, overfitting depends on the approximation scheme (Girosi et al. 1995). For example, in Li et al. (2017), a regularized version of reinforcement learning has been used for function approximation. Classification models also approximate separators between classes. To minimize the losses of these models on training data, one can increase the number of the separators’ parameters. Nevertheless, a great number of parameters causes overfitting (Cawley and Talbot 2007; Tzafestas et al. 1996); this is the most important reason for overfitting, as clearly shown in Fig. 2. In this figure, the training and the testing losses of 8 MLP networks with a single hidden layer of 1–8 hidden neurons are presented. Besides, the separators made by these models are illustrated in Fig. 3. Simply speaking, a model with a small number of neurons cannot learn the data, and both training and testing losses are substantial. By increasing the number of hidden neurons, the model complexity increases, and both training and testing losses diminish. When the number of neurons becomes significantly large, the training loss may keep decreasing, but the testing loss grows because of the freedom of the model parameters. In Fig. 2, this divergence starts beyond the MLP with 4 hidden neurons; as can be seen in Fig. 3, that network classifies the training and the testing samples of the two classes with the greatest accuracy. Thus, a learning model’s best complexity occurs at the point where the testing loss starts to diverge from the training loss. This condition is useful to terminate the training process (Sarle 1995) before overfitting. Later, Sect. 3.2 defines the model complexity precisely. Then, it is possible to define a model with proper complexity for any dataset.
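
A minimal sketch of this early-stopping rule follows (in Python; the model interface `fit_one_epoch`, `evaluate`, `get_weights`, and `set_weights` is a hypothetical placeholder, not an API from the cited works):

```python
import numpy as np

def train_with_early_stopping(model, train_data, val_data,
                              max_epochs=200, patience=10):
    """Stop when the validation loss stops improving, i.e. when it starts
    to diverge from the training loss; keep the best weights seen so far."""
    best_val, best_weights, wait = np.inf, model.get_weights(), 0
    for epoch in range(max_epochs):
        model.fit_one_epoch(train_data)        # hypothetical one-epoch update
        val_loss = model.evaluate(val_data)    # hypothetical validation loss
        if val_loss < best_val:
            best_val, best_weights, wait = val_loss, model.get_weights(), 0
        else:
            wait += 1
            if wait >= patience:               # sustained divergence: stop
                break
    model.set_weights(best_weights)
    return model
```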

In the literature, the following reasons are also stated for the overfitting problem:

  1. Noise of the training samples (Liu and Castagna 1999),

  2. Lack of training samples (under-sampled training data) (Martín-Félez and Xiang 2014; Leung and Leung 2011; Zhao et al. 2020),

  3. Biased or disproportionate training samples (Erhan et al. 2010),

  4. Non-negligible variance of the estimation errors (Cawley and Talbot 2010),

  5. Multiple patterns with different non-linearity levels that need different learning models (Caruana et al. 2001),

  6. Biased predictions using different selections of variables (Reunanen 2003),

  7. Stopping the training procedure before convergence or dropping into a local minimum (Srivastava et al. 2014),

  8. Different distributions for training and testing samples (Dai et al. 2007).

Fig. 2 Overfitting and best complexity of a learning model

Fig. 3 Learning models for a classification problem with 2 features and 2 classes: MLP networks with a single hidden layer including 1–8 neurons (from left to right, hidden nodes are added and the networks change from underfitting to overfitting; the fourth model is the best fit)

Any learning model should limit these overfitting sources by some overfitting controlling methods. These controllers can be categorized into three schemes, namely passive, active, and semi-active. They are defined as follows:

  • Passive schemes search for a suitable configuration of the network before training. Sometimes, they are referred to as model selection methods or hyper-parameter optimization techniques. After designing a suitable model, its hyper-parameters remain fixed throughout the training steps.

  • Active schemes impose a dynamic noise on the learning model or the training algorithm through the training steps, such that the model cannot memorize the details of data and the relationship between the features and outputs. These methods do not change the model architecture but activate some model components in each training step. They are also referred to as regularization schemes.

  • Semi-active schemes, similar to passive schemes, change the model architecture, but they do so throughout the training steps. Sometimes, they are referred to as dynamic-architecture methods because they reconstruct the network with respect to the training status. They follow two approaches. The first is network construction, which starts training with a simple network and incrementally adds hidden units during training. The second is network pruning, which simplifies a complex network along the training process.

Figure 4 illustrates some methods that follow these schemes. Note that the boundaries between these schemes are not rigid. For example, when an active scheme imposes a dynamic noise on some network weights, it implicitly changes the model through training, similar to a semi-active method. Further, in some cases, a combination of different controllers should be applied to solve overfitting; for example, in Heidari et al. (2020), embedding-specific dropout, batch normalization, weight decay, and shuffled terms are used to generalize the results. Furthermore, in some references, architecture selection is used instead of overfitting control. Architecture selection algorithms balance the complexity of a model with the required performance; otherwise, the produced model fits almost all training samples but cannot successfully generalize to unseen data. In Engelbrecht (2001), the architecture selection approaches are categorized into four groups: brute-force, regularization, network construction, and pruning. Brute-force approaches are close to passive schemes, regularization matches the active schemes, and the others belong to the semi-active methods.

Fig. 4 The categorization of the different methods to control overfitting in neural networks

Fig. 5 The most frequent words in the titles, abstracts, and keywords of the related papers on neural network overfitting

In continuation of these works, this paper reviews the different controllers for overfitting along with their advantages and limitations. To the best of our knowledge, there is no comprehensive review that categorizes these controllers. Although there are many review papers on shallow and deep neural networks (Ding et al. 2013; Liu et al. 2017; Li et al. 2018; Jaafra et al. 2019; Zhang et al. 2018; Darwish et al. 2020), the focus on overfitting is limited to Amer and Maul (2019) on modularization techniques, Cheng et al. (2018) on model compression, NarasingaRao et al. (2018) on some definitions of overfitting, and Bejani and Ghatee (2019) on some regularization methods. The contributions include the following:

  • A comprehensive review of overfitting controlling schemes in three branches, such that one can select an efficient overfitting controller for any shallow or deep neural network.

  • Summarization of the strengths and weaknesses of different methods with an in-depth discussion of their characteristics and relationships.

The rest of the paper is organized as follows. In Sect. 2, the searching methodology is presented. Section 3 presents the topics related to overfitting. Sections 4–6 include the passive, active, and semi-active overfitting controlling methods, respectively. The final section ends the paper with some lessons.

2 Searching Methodology

The recommendations of the systematic review approach (Moher et al. 2009) were used to ensure the reproducibility of this review. Google Scholar and Scopus were used to search for papers, and the keywords were looked for in the papers’ titles, abstracts, and keywords. The initial keywords were chosen based on an influential paper written by Nowlan and Hinton (1992), one of the first papers on neural network simplification and overfitting avoidance. Based on this paper, “overfitting”, “overlearning”, “generalization”, “regularization”, “simplifying neural network”, “neural network simplification”, “weight sharing”, “model selection”, “pruning of neural network”, and “neural network pruning” were considered as the initial keywords. Then, the most cited papers since 2010 were taken into account, and their references were examined deeply to find the most related papers. Next, the titles, abstracts, and keywords of the found papers were tokenized by the Keras tokenization tool (Chollet et al. 2015). After a stemming process and the removal of stop-words, the most frequent words were extracted; after post-processing, the results are presented in Fig. 5. By searching these 40 keywords, a complementary review was performed. As one can see, regularization, dropout, augmentation, deep, shakeout, image, random, generalization, redundant, and convolutional are the 10 most frequent words. These led to some important keywords related to overfitting. They also showed the importance of overfitting in deep networks, convolutional networks, and image processing problems. Finally, the words with the least frequencies were obtained and used as the emerging words in the relevant literature. The first 40 meaningful emerging words were as follows:

[figure a: the list of the first 40 emerging words]

By searching on the related papers, the following emerging keywords were extracted:

[figure b: the list of the extracted emerging keywords]

The last search on all of the obtained keywords was done in August 2020, with no starting date. All of the obtained papers were categorized into passive, active, and semi-active schemes.

3 Related Topics to Overfitting

Overfitting happens in the gap between data complexity and model complexity. The model complexity can be stated in terms of the VC-dimension (Vapnik 2006) and the bias-variance trade-off (Geman et al. 1992). Some overviews have also been given in Hawkins (2004) and NarasingaRao et al. (2018). To control the complexity of a model, there are different methods; see, e.g., (Chen and Yao 2009; Li et al. 2018; Zhu et al. 2018; Liu et al. 2014; Eigenmann and Nossek 1999; Finnoff et al. 1993).

3.1 Dataset Complexity

In Ho and Basu (2002) and Ho et al. (2006), three main measures have been considered for computing data complexity: the overlap of individual features, the separability of classes, and the geometry of the manifold. Their details are stated in Table 2. The last method in this table simplifies the nonlinear space; such methods are known as nonlinear dimensionality reduction (Roweis and Saul 2000). Overfitting can be limited by removing unrelated features, after which the remaining features improve accuracy (Abpeykar and Ghatee 2019). Furthermore, in Abpeikar et al. (2020), the learning model is adapted to the data complexity.

Table 2 Different criteria to evaluate the complexity of datasets

On the other hand, some data visualization techniques make it possible to decrease the data complexity. Instances include t-SNE (Maaten and Hinton 2008), LLE (Roweis and Saul 2000), and the Laplacian Eigenmap (Belkin and Niyogi 2002), which usually preserve the distances between the samples. Knowledge representation is simpler in a lower-dimensional space, where the unnecessary features can be neglected (Bejani and Ghatee 2018). In a wider scope, one can refer to feature extraction methods, including Principal Component Analysis (PCA), nonnegative matrix decomposition, binary decomposition, and other matrix decomposition techniques (Eldén 2019). Instead of feature extraction, a subset of features can be selected; for a review, one can refer to (Guyon and Elisseeff 2003). Feature clustering, which defines clusters of features with maximum relevancy and minimum redundancy, is another approach to simplifying the data space; for more details, see (Abpeykar et al. 2019).

3.2 Model Complexity

Reducing the data complexity is usually not possible, but the complexity of the model is adjustable. To compute the complexity of a learning model f with parameter vector \(\theta\), the VC-dimension (Vapnik and Chervonenkis 2015) can be used. On the training data \(\{x_1,x_2,\ldots ,x_n\},\) \(\theta\) minimizes the error function of model f. When \(E_{Train}(f)\) and \(E_{Test}(f)\) are the errors of model f on the training and the testing data, for each \(\eta \in [0,1],\) the following probabilistic inequality holds (Wittek 2014):

$$\begin{aligned} P\Bigg (E_{Test}(f) \le E_{Train}(f) + \sqrt{\frac{h(\log {(2n/h)} + 1)-\log {(\eta /4)}}{n}}\Bigg )=1-\eta \end{aligned}$$
(1)

where h is VC-dimension of model f. The VC-dimension of a feed-forward neural network with ReLU activation function satisfies (Harvey et al. 2017):

$$\begin{aligned} h \ge WL \frac{\log (W/L)}{640}, \end{aligned}$$
(2)

where W is the number of weights, and L is the number of layers. For large W and L (as in deep networks), h becomes large. Thus, the bound in (1) indicates that the testing error bound grows for any \(\eta\), and the generalization becomes poor. Hence, controlling both training and testing errors is needed. In practice, the testing data is not available during training; it is common to set aside a part of the training data, called the validation data, to measure the generalization power. For any training step t, the following ratio of the errors on the validation and training data can be evaluated as the overfitting level (Bejani and Ghatee 2020a):

$$\begin{aligned} v(t) = \frac{E_{Validation}(t)}{E_{Train}(t)}. \end{aligned}$$
(3)

On the other hand, the complexity of neural network models can be evaluated by the condition number of the weight matrix W (Bejani and Ghatee 2020b):

$$\begin{aligned} cond_p(W) = \Vert W\Vert _p\Vert W^{-1}\Vert _p, \end{aligned}$$
(4)

where \(p\in \{1,2,\ldots ,\infty \}\) indicates the base for the norm. For nonlinear learning models, the condition number can also be extended (Bejani and Ghatee 2020, 2021).
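
As a small illustration (a NumPy sketch, not code from the cited references), the overfitting level of Eq. (3) and the condition number of Eq. (4) can be computed as follows:

```python
import numpy as np

def overfitting_level(train_error, validation_error):
    """v(t) = E_Validation(t) / E_Train(t), Eq. (3); values far above 1 suggest overfitting."""
    return validation_error / train_error

def condition_number(W, p=2):
    """cond_p(W) = ||W||_p ||W^{-1}||_p, Eq. (4), for a square invertible weight matrix;
    for rectangular matrices, the ratio of extreme singular values is used instead."""
    if W.shape[0] == W.shape[1]:
        return np.linalg.cond(W, p=p)
    s = np.linalg.svd(W, compute_uv=False)
    return s.max() / s.min()

W = np.random.randn(64, 64)              # a layer's weight matrix
print(overfitting_level(0.05, 0.21))     # v(t) = 4.2: strong overfitting signal
print(condition_number(W))               # large values indicate an ill-conditioned layer
```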

4 Overfitting Control by Passive Methods

To select a model with suitable complexity, we can use model selection methods. They compare the different learning models on the validation data to choose the best model. When they use the validation data in the training process, the model probably overfits the validation data. Ng (1997) explains how overfitting occurs on validation data: when the set of hypotheses (learning models) is large, the cross-validation data itself can be overfitted. Besides, the NAS method (Zoph and Le 2017), which trains a controller to achieve a model with the maximum validation accuracy on the last five epochs, needs regularization methods on some small datasets; for such datasets, this reference uses a combination of embedding dropout and recurrent dropout techniques. To show what happens when the validation data is used for model selection, Table 3 presents a learning model selection example. In this example, some approximation models \(p_N(x)=\sum _{n=1}^Nc_nx^n\) estimate y from x, where the training, validation, and testing datasets include 7, 4, and 4 samples, respectively. As the second part of this table shows, the best validation errors E(V) are obtained for \(N=5\) or \(N=6,\) and their training errors E(Tr) are zero. Thus, a model selection method chooses \(p_5(x)\) or \(p_6(x).\) However, as the last column shows, the testing error E(Te) increases when N grows; the worst testing errors happen for \(N=5\) and 6. Thus, the best model based on the validation data cannot necessarily recognize unseen data. Meanwhile, if the testing data follow the same distribution as the training and validation data, and their sizes are large, the generalization results are reliable; see, e.g., (Nannen 2003) for more details.

Table 3 Approximation models \(p_N(x)=\sum _{n=1}^Nc_nx^n\) that are overfitted on the validation data

In a model selector, the estimated error can be decomposed into bias and variance. A model qualifies when both its bias and variance are small. To avoid overfitting in model selection, see (Cawley and Talbot 2010). Since training deep models is time-consuming, model selection on deep models is not very practical. For some model selection research, see Table 4; it covers search methods, modularization, ensembling, and adaptive statistics.

Table 4 Major algorithms to find optimal hyper-parameters of neural networks (Passive methods)

4.1 Search Methods

When a learning model is already in hand and can be simplified without losing performance (Asadi and Abbe 2020), simplification is useful. In other cases, a model generator produces a set of models with different architectures to learn the training samples. The different searching methods evaluate the models’ performances on the validation dataset, and the best one(s) are selected. The following searching methods have been widely considered in the literature. In grid search, the candidate models are created from a grid of hyper-parameter values before the searching process, while the other methods define models iteratively by a particular strategy during the search.

4.1.1 Grid Search

It uses a grid of parameters \({\varGamma }= \{\gamma _1,\ldots ,\gamma _S\}\) with a discrete number of values for each hyper-parameter of the learning model. When S increases, the chance of finding better hyper-parameters increases, and obviously, the model quality grows. In a grid search proposed by Liu et al. (2006), an evolutionary algorithm is used to search, but it can be substituted by other meta-heuristic algorithms directly. Jeng (2005) applies a competitive agglomeration clustering algorithm with a grid search to improve a regression model. Monari and Dreyfus (2002) uses a leave-one-out score for nonlinear model selection. It estimates the leverages and confidence intervals of samples during the training process and selects the best model among various models with and without equivalent complexities.

A similar approach to grid search for deep neural networks is given in Salman and Liu (2019), which trains N deep neural networks with different parameters. It also takes a pre-defined threshold to determine whether or not the class recognition probability is sufficient. For each sample x and model \(n\in \{1,\ldots ,N\}\), these probabilities are retained for all classes, and their minimum is denoted by \(P_n(x)\). If \(max_{n=1,\ldots ,N}P_n(x)\) exceeds the threshold, the corresponding model classifies x; otherwise, x is rejected.

It is worth noting that random search is similar to grid search but takes the samples of \({\varGamma }\) randomly (Bergstra and Bengio 2012). When the random search takes samples using prior knowledge of the hyper-parameters, it can outperform grid search. For a random search on hyper-parameter optimization, see (Bergstra and Bengio 2012). Besides, (Bergstra and Bengio 2012) presents a model selection under uncertainty independent of the dataset, but this model causes overfitting in some cases.
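
The following sketch contrasts the two searches, assuming scikit-learn is available; the grid \({\varGamma }\) of hidden-layer sizes and weight-decay values is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
grid = {"hidden_layer_sizes": [(4,), (8,), (16,)],   # the grid of hyper-parameters
        "alpha": [1e-4, 1e-3, 1e-2]}                 # weight-decay strength

search = GridSearchCV(MLPClassifier(max_iter=1000), grid, cv=3)
search.fit(X, y)                        # exhaustive search over the grid
print(search.best_params_)

rand = RandomizedSearchCV(MLPClassifier(max_iter=1000), grid,
                          n_iter=5, cv=3, random_state=0)
rand.fit(X, y)                          # samples the grid randomly (Bergstra and Bengio 2012)
print(rand.best_params_)
```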

4.1.2 Gradient Search

It applies a gradient descent algorithm to optimize a set of hyper-parameters of a learning model. To this end, the model selector states the learning performance in terms of model hyper-parameters implicitly or explicitly. Different linear and quadratic approximation functions are useful in this step. For example, (Bengio 2000) minimizes the following approximation loss function:

$$\begin{aligned} \min _{\theta (\gamma )} a(\gamma ) + b(\gamma )^T \theta (\gamma ) + \frac{1}{2} \theta ^T(\gamma ) H(\gamma ) \theta (\gamma ), \end{aligned}$$
(5)

where \(\gamma\) shows the vector of the model’s hyper-parameters, and \(\theta (\gamma )\) indicates the model parameter that is a function of \(\gamma\). An approximation algorithm obtains the real function \(a(\gamma )\), the vector function \(b(\gamma )\), and the matrix function \(H(\gamma )\). However, the complexity of the Hessian matrix calculation is high. Maclaurin et al. (2015) uses a reverse-mode gradient descent instead of the Hessian matrix. Franceschi et al. (2017) also states a forward gradient-based algorithm. Larsen et al. (2012) uses a gradient method to find the weight decay method’s parameters. Also, (Larsen et al. 2012) applies the gradient of the average of validation errors in k-fold cross-validation for the same purpose. Getting the gradient is not limited to supervised learning. For example, (Li et al. 2015) uses a gradient method to regularize the Least Squares Temporal Difference (LSTD) algorithm for unsupervised learning.

4.1.3 Bayesian Optimization

It finds the global solution of a learning performance function using a Gaussian Process (GP) on all available information about the samples (Mockus et al. 1978). For example, assume a learning model with a hyper-parameter taking different values \(\theta _1,\ldots ,\theta _n\). Commonly, the learning model is trained with these values separately. One can model the achieved training errors \(E_1(\theta _1),\ldots ,E_n(\theta _n)\) with the following multivariate Gaussian distribution with respect to the vector of hyper-parameters \(\theta =(\theta _1,\ldots ,\theta _n)\):

$$\begin{aligned} E(\theta ) = \begin{pmatrix} E_1(\theta _1) \\ \vdots \\ E_n(\theta _n) \end{pmatrix} \sim {\mathcal {N}}\Bigg ( \begin{pmatrix} \mu _1 \\ \vdots \\ \mu _n \end{pmatrix}, \begin{pmatrix} K_{1,1} & K_{1,2} & \cdots & K_{1,n} \\ K_{2,1} & K_{2,2} & \cdots & K_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ K_{n,1} & K_{n,2} & \cdots & K_{n,n} \end{pmatrix} \Bigg ), \end{aligned}$$
(6)

where, for \(i,j=1,\ldots ,n\), \(\mu _i\) indicates the mean of \(E_i(\theta _i)\) on the training samples, and \(K_{i,j}\) is the covariance of \(E_i\) and \(E_j\) on the same samples. Bayesian optimization techniques try to find a minimum of \(E(\theta )\) by using acquisition functions that construct \(E(\theta )\) from the model posterior. More practically, to simplify the computation of \(K_{i,j}\), some kernel functions are presented; for more details, see (Snoek et al. 2012). Besides, (Snoek et al. 2015) applies a neural network instead of a GP to accelerate the search in the solution space. Zhao et al. (2019) proposes an adaptive and data-dependent regularization based on the empirical Bayes method. For more applications of Bayesian techniques for hyper-parameter optimization, including evidence maximization, Bayesian model averaging, slice sampling, and empirical Bayes, see (Feurer and Hutter 2019). Furthermore, (Crammer et al. 2013) models the network parameters with a normal distribution. In the learning process, it updates the mean and the standard deviation of this distribution such that the probability of getting correct solutions increases. Sampling the network weights from this distribution adds uncertainty and avoids overfitting, but the process is very time-consuming.
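
As a self-contained illustration, the following NumPy sketch works under simplifying assumptions: one hyper-parameter, an RBF kernel with a fixed length-scale, a lower-confidence-bound acquisition, and a synthetic `validation_error` function standing in for an expensive training run:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    """RBF kernel matrix between two 1-D arrays of hyper-parameter values."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(theta_obs, e_obs, theta_grid, noise=1e-6):
    """GP posterior mean/std of the validation error E(theta), cf. Eq. (6)."""
    K = rbf(theta_obs, theta_obs) + noise * np.eye(len(theta_obs))
    Ks = rbf(theta_grid, theta_obs)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ e_obs
    var = 1.0 - np.sum((Ks @ Kinv) * Ks, axis=1)
    return mu, np.sqrt(np.maximum(var, 0.0))

def validation_error(theta):
    """Stand-in for an expensive train-and-validate run."""
    return (theta - 0.6) ** 2 + 0.01 * np.random.randn()

theta_obs = np.array([0.1, 0.5, 0.9])
e_obs = np.array([validation_error(t) for t in theta_obs])
grid = np.linspace(0.0, 1.0, 200)
for _ in range(10):
    mu, sd = gp_posterior(theta_obs, e_obs, grid)
    t_next = grid[np.argmin(mu - 1.96 * sd)]   # lower-confidence-bound acquisition
    theta_obs = np.append(theta_obs, t_next)
    e_obs = np.append(e_obs, validation_error(t_next))
print(theta_obs[np.argmin(e_obs)])             # best hyper-parameter found
```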

4.1.4 Meta-Heuristic Algorithms

Meta-heuristic algorithms such as evolutionary algorithms can be used to design a network or find network parameters that prevent overfitting (Ding et al. 2013; Abpeykar et al. 2019; Abpeikar et al. 2020). (Sharma et al. 2013) optimizes SVM parameters and regularizes the results with the help of the firefly algorithm, Particle Swarm Optimization (PSO), and accelerated PSO. Suganuma et al. (2017) proposes a genetic algorithm to design a Convolutional Neural Network (CNN) architecture, where the fitness is the network accuracy. Darwish et al. (2020) incorporates a dropout method in evolutionary deep networks. Besides, (Fong et al. 2018) reviews meta-heuristic algorithms for deep learning models in the context of big data analytics; this reference also points out some research directions to cover the gaps between meta-heuristics and deep learning models.

4.1.5 Network Architecture Search

Network Architecture Search (NAS) (Zoph and Le 2017) uses a recurrent neural network as a controller to find an architecture of a neural network with the highest accuracy on the validation dataset. This controller is trained by a reinforcement learning method and includes some similar blocks of common CNN layers (convolution layers and pooling layers). This method is very time-consuming and probably overfits on the validation data. For advanced versions of NAS, refer to (Real et al. 2019; Jaafra et al. 2019). Real et al. (2019) obtains the best architecture by using an iterative algorithm that searches a queue of different networks and applies two different mutations to generate new networks.

Besides, in NAS (Zoph and Le 2017), an LSTM network with two layers is considered. Every five time-steps, the LSTM outputs determine the architecture of one layer: the filter height, filter width, stride height, stride width, and the number of filters for a convolution layer. Based on the number of layers, the LSTM network generates an architecture. The designed architecture is trained on the dataset, and the performance on the validation dataset is considered as a reward signal, R, to train the LSTM network by reinforcement learning. In this process, the expectation of R should be maximized. Since the reward signal R is not differentiable, the policy gradient method of Williams (1992) is used to compute the gradient. For a deep Q-learning approach, one can refer to (Mnih et al. 2013).

4.2 Modularization

A modular neural network is a neural network that breaks down into several relatively independent, replicable, and composable networks (or modules). Topological modularization acts as a kind of regularization; on the other hand, tightly coupled modules lead to overfitting. As an instance, (Kirsch et al. 2018) proposes a modularization scheme to design a network with high performance. It uses a two-step algorithm: the first step defines the compositions of the modules, and the second step trains all modules and a controller to find better performance. Formally, this approach seeks the best composition of the trained modules by maximizing the following log-likelihood:

$$\begin{aligned} {\mathcal {L}}(\theta , \phi ) = \sum _{n=1}^N \log P(y_n | x_n, \theta , \phi ), \end{aligned}$$
(7)

where \(\theta\) is the parameters of the used modules, \(\phi\) is the parameters of the controller, and \(\{x_i, y_i\}_{i=1}^N\) is the training dataset. For a comprehensive review, see (Amer and Maul 2019).

4.3 Ensembling

By training multiple neural networks and aggregating their results, one can improve the overall performance significantly. Such ensembles of neural networks can predict unknown samples more accurately than the individual networks (Li et al. 2018; Bejani and Ghatee 2018; Abbasi et al. 2016). In Abbasi et al. (2016), an ensemble of Radial Basis Function (RBF) networks is presented. In Abpeykar and Ghatee (2019) and Abpeykar et al. (2019), different ensembles of neural networks in decision tree structures are investigated. In Abbasi et al. (2016), a regularization term is added to the ensemble method. In Abpeykar et al. (2019) and Abpeykar and Ghatee (2019), clustering algorithms are used to decompose the features into subsets to reduce redundancy and increase dependency; this approach simplifies the data space and controls overfitting in high-dimensional datasets. Also, in Abpeykar and Ghatee (2019), a cut point is used in a neural tree to decrease the computation time and prevent overfitting. Besides, in Abpeikar et al. (2020), an expert system is developed to control overfitting: located in each node of the neural tree, it evaluates the data complexity and selects a suitable neural network. A meta-heuristic algorithm is also used to define a reasonable set of features to avoid overfitting when it is high.
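
A minimal sketch of output averaging over several independently trained MLPs (assuming scikit-learn; the dataset and the number of ensemble members are illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)

# Train five MLPs that differ only in their random initialization.
members = [MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                         random_state=seed).fit(X, y) for seed in range(5)]

def ensemble_predict(models, X):
    """Average the class-probability outputs of the individual networks."""
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1)

print((ensemble_predict(members, X) == y).mean())   # ensemble accuracy
```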

4.4 Conclusions and Future Works on the Passive Schemes

The passive methods are meta-models that create learning models. After training, they evaluate the performance of the learning models on the validation data and, based on these evaluations, update the learning models. One can repeat this procedure until a suitable model for the learning task is achieved. The problems of passive methods are their long computation time and overfitting on the validation data. To solve the first problem, one can use prior knowledge to initialize the model parameters or to modify the searching process (Yam and Chow 2000; Ivanova and Kubat 1995). To solve the second, it is possible to use random sampling in the training process (Bergstra and Bengio 2012). The following topics are important when choosing a passive method:

  • They add some hyper-parameters to the learning models, and finding hyper-parameters is hard.

  • These methods are usually applicable to shallow models, which contain few parameters.

  • These methods are very time-consuming for applying on deep models.

  • They produce models possibly overfitted to the validation data.

  • By combining randomness in the passive methods, the results probably improve, although the processing time remains long.

Using the uncertainty and prior knowledge in the process of model selection can be considered in future works. To accelerate the processing time of passive schemes, one can decrease the number of hyper-parameters and use the previous results to limit the searching space. Considering multi-criteria decision-making techniques for model selection can also be followed, but it seems useful for shallow models. To date, no effective model selection method has been proposed for deep learning models.

5 Overfitting Control by Active Methods

Regularization methods simplify complex models along the training process. They impose dynamic noise on the model parameters or the training parameters (Blanc et al. 2020), which controls overfitting. Figure 6 displays the trend of published papers on the regularization methods. Because of the importance of these methods, the related references are compared in detail in what follows.

Fig. 6 The number of published papers on “regularization” + “neural network” based on the Google-Scholar database

5.1 Imposing Noise to the Learning Model

Tikhonov and Arsenin (1977) proposed this method to find stable solutions for ill-posed systems. In Natterer (1984), the error bound of Tikhonov regularization was determined. In Scherzer et al. (1993), a posterior strategy was presented for choosing the regularization parameters of the Tikhonov model. Some applications have also been given in Golub et al. (1999), Beck and Ben-Tal (2006), Calvetti and Reichel (2003), Nashed (1986). Because the loss function of a neural network can be minimized by solving a nonlinear system, the Tikhonov method can be extended to construct a well-fitted neural network: any stable solution of a nonlinear system leads to a learning model with low complexity. Based on (Bishop 1995), Tikhonov regularization for neural network training is equivalent to training with noise augmentation. Generalized Tikhonov methods are also considered in Vauhkonen et al. (1998). In Bejani and Ghatee (2020b), the effect of the Tikhonov term in CNN training is analyzed, and it is combined with the Singular Value Decomposition (SVD) of the weights. In Bejani and Ghatee (2020), the Tikhonov term is incorporated with low-rank matrix decomposition. To present some details, consider the following empirical error for a learning model f on the training samples \(\{x_i,y_i\}_{i=1}^D\):

$$\begin{aligned} \min _{f} E(f;\{x_i,y_i\}_{i=1}^D). \end{aligned}$$
(8)

To learn these D samples, the empirical error should be minimized to obtain the parameters (weights) of f. Meanwhile, the error of a model on unseen data is important to judge the quality of the learning model. As Fig. 1 shows, sometimes the complexity of the model is not proportional to the data complexity. To adapt these complexities, one can add a regularization term to the empirical error as follows:

$$\begin{aligned} \min _{f}E^*(f;\{x_i,y_i\}_{i=1}^D) = E(f;\{x_i,y_i\}_{i=1}^D) + \lambda R(f), \end{aligned}$$
(9)

where the nonnegative function R(f) measures the complexity of the learning model f. The closer f is to a linear function, the closer R(f) is to 0; conversely, for a highly non-linear learning model f, the value of R(f) is high. This function can be stated as follows:

$$\begin{aligned} R(f) = \int \left| \frac{\partial ^m f}{\partial x^m}\right| dx \end{aligned}$$
(10)

By setting \(m=1\) \((m=2)\) in Eq. 10, R(f) measures the total magnitude of the first (second) derivative of the learner model f. The coefficient \(\lambda\) adapts the regularization effect and depends on the complexity of the data. In many cases, the data complexity cannot be defined precisely; instead, \(\lambda R(f)\) changes randomly to add a noisy effect on the loss function of the learning model. The optimal solution of Problem 9 leads to a simple learning model with the minimum empirical error. Such a model does not memorize the training data, and hence the overfitting diminishes. In Table 5, some research works on Tikhonov regularization that impose noise on the learning model are described. This table covers three main approaches: the first minimizes the magnitudes of the network weights, the second increases the correlation between the network weights, and the third re-frames the inputs with respect to the outputs.

Furthermore, imposing noise on the learning model can also be defined independently of the Tikhonov term. The Optimally Pruned Extreme Learning Machine (OP-ELM) (Miche et al. 2011) with a Gaussian kernel is an example. In Belkin et al. (2006), a family of learning algorithms based on manifold regularization has been developed by exploiting the geometry of the marginal distribution. Also, in Abbasi et al. (2016), a weight decay regularization term is added to an ensemble learning model.
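
For concreteness, the following sketch takes R(f) in Eq. (9) as the squared Frobenius norm of the weights (the classical weight-decay choice) for a linear model trained by gradient descent; it is an illustration, not an implementation from the cited works:

```python
import numpy as np

def ridge_gradient_step(W, X, Y, lam=1e-2, lr=0.1):
    """One gradient step on E*(f) = ||XW - Y||_F^2 / D + lam * ||W||_F^2,
    i.e. Eq. (9) with R(f) taken as the squared weight norm."""
    D = X.shape[0]
    grad = 2.0 * X.T @ (X @ W - Y) / D + 2.0 * lam * W   # data term + Tikhonov term
    return W - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # D = 100 training samples, 5 features
Y = rng.normal(size=(100, 1))
W = rng.normal(size=(5, 1))
for _ in range(200):
    W = ridge_gradient_step(W, X, Y)   # lam > 0 keeps the weights small
```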

Table 5 Active methods with imposing noise to learning models (The Tikhonov regularization schemes)

5.2 Imposing Noise to the Learning Algorithm

These schemes implicitly affect the models and the training dataset. They can be referred to as non-Tikhonov regularization and are categorized as follows.

5.2.1 Dropout Family

Dropout (Srivastava et al. 2014) is the most popular scheme to control overfitting in the training process. In each training iteration, this method keeps each neuron with a Bernoulli probability p, so the dropped neurons do not participate in that iteration's forward pass and weight updates. The dropout scheme can be represented as a Tikhonov method by using the following complexity function (Demyanov 2015, P. 66):

$$\begin{aligned} R(f) = \frac{1-p}{p}||{\varLambda }w||_F^2 \end{aligned}$$
(11)

where p is the keep probability, \({\varLambda }\) is a diagonal matrix of the eigenvalues of the covariance matrix of the dataset, and w is the weight vector of the linear model. In recent works, Low-Rank matrix Factorization (LRF) has been used to drop out some parameters of a learning model along the training process when the complexity of a layer is high (Bejani and Ghatee 2020). This method, entitled “Adaptive LRF”, simplifies the network only when needed. In Bejani and Ghatee (2021), Laliga is defined, which applies least auxiliary loss functions together with adaptive weights to regularize VGG neural networks.
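
A minimal NumPy sketch of the standard (inverted) dropout forward pass described above; `p_keep` corresponds to the keep probability p of Eq. (11):

```python
import numpy as np

def dropout_forward(x, p_keep=0.7, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: keep each unit with Bernoulli probability p_keep and
    rescale the survivors so the expected activation is unchanged."""
    if not training:
        return x                            # the full network is used at test time
    mask = rng.random(x.shape) < p_keep     # Bernoulli(p_keep) per unit
    return x * mask / p_keep

h = np.ones((4, 8))                         # a batch of hidden activations
print(dropout_forward(h, p_keep=0.7))
```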

In another approach, noise is imposed on the inputs. A constant noise leads to overfitting because, during training, the model learns this noise as a principal part of the input data. By imposing random noise on the input data in each iteration of the learning process, the model cannot learn any constant noise pattern. As an instance, the whiteout scheme (Li and Liu 2016) adds the following noise to the inputs of layers:

$$\begin{aligned} {\tilde{x}}_i^{(l)} = x_i^{(l)} + e_i \end{aligned}$$
(12)

where the components of \(e_i=(e_{i,j})\) are stated with the following normal distribution:

$$\begin{aligned} e_{i,j} \sim N(0, \frac{\sigma ^2}{|w_{i,j}^{(l+1)}|^\gamma } + \lambda ) \end{aligned}$$
(13)

where \(\gamma \in [0,2]\), \(\sigma > 0\), and \(\lambda > 0\) are the parameters of this method. This regularization method changes the inputs in each epoch and does not allow the network to memorize the training data. For other extensions, see (Wan et al. 2013; Kang et al. 2018; Khan et al. 2018; Krueger et al. 2017; Larsson et al. 2017; Khan et al. 2019; Liu et al. 2020). In Zhao et al. (2020), wavelet transforms of the data are incorporated into the training process to produce a set of diverse data when the training data are limited. Discrete wavelet transforms are also used by Eftekhari and Ghatee (2018) to extract features and improve generalization. In Heidari et al. (2020), the SVD of the weights is used to produce diverse patterns and improve generalization.
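
The whiteout noise of Eqs. (12)–(13) can be sketched as follows (NumPy; the layer activations and outgoing weights are illustrative):

```python
import numpy as np

def whiteout_noise(x, w_next, sigma=1.0, gamma=1.0, lam=0.1,
                   rng=np.random.default_rng(0)):
    """Additive Gaussian noise of Eqs. (12)-(13): the noise variance shrinks
    as the magnitude of the outgoing weight w_{i,j} grows."""
    std = np.sqrt(sigma ** 2 / np.abs(w_next) ** gamma + lam)
    return x + rng.normal(0.0, std)

x = np.ones(5)                                   # activations of one layer
w_next = np.array([0.1, 0.5, 1.0, 2.0, 5.0])     # illustrative outgoing weights
print(whiteout_noise(x, w_next))
```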

In another approach, reweighting algorithms are used on the samples to control overfitting. In Ren et al. (2018), a meta-learning gradient-based algorithm is defined to assign weights to training samples of a mini-batch to minimize the loss on a clean, unbiased validation dataset.

5.2.2 Augmentation Family

The augmentation of a dataset is a popular method to increase the number of samples and the generalization power. In some instances (Kwasigroch et al. 2017; Wąsowicz et al. 2017; Galdran et al. 2017), the experimental results of affine transformation augmentation have been presented. Also, in Jin et al. (2015), the effect of noise on the input of models has been discussed. In Yang et al. (2020), gradient augmentation is defined, which works on randomly transformed training samples to regularize a set of sub-networks to learn well-generalized and more diverse representations. Besides, overfitting in graph-structured convolutional neural networks has been considered by Kipf and Welling (2017). In Verma et al. (2019), a regularized training scheme for graph neural networks is developed by using data augmentation. In Table 6, some important augmentation schemes are expressed.

On the other hand, a Generative Adversarial Network (GAN) (Goodfellow et al. 2014) makes it possible to increase the number of samples. GAN models include convolution-based GANs (Yu et al. 2017), condition-based GANs (Mirza and Osindero 2014), and autoencoder-based GANs (Donahue et al. 2017); see (Pan et al. 2019) for more details. The application of GANs for data augmentation can be seen in Antoniou et al. (2017). Also, in Perez and Wang (2017), a GAN has been trained to generate different styles in a dataset.
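
A minimal affine-augmentation sketch in the spirit of the methods above (assuming NumPy and SciPy; the rotation and translation ranges are illustrative choices):

```python
import numpy as np
from scipy.ndimage import rotate, shift

def random_affine(img, rng=np.random.default_rng(0)):
    """Random small rotation, translation, and horizontal flip of one image."""
    out = rotate(img, angle=rng.uniform(-15, 15), reshape=False, mode="nearest")
    out = shift(out, shift=rng.uniform(-2, 2, size=2), mode="nearest")
    if rng.random() < 0.5:
        out = out[:, ::-1]                        # horizontal flip
    return out

batch = np.random.rand(8, 32, 32)                 # e.g. 8 grayscale images
augmented = np.stack([random_affine(img) for img in batch])
```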

Table 6 The augmentation schemes for increasing the samples and decreasing overfitting

5.2.3 Normalization Family

A simple but effective regularization scheme is batch normalization (Ioffe and Szegedy 2015). This scheme has been proposed to solve the internal covariate shift that occurs in neural networks: each batch of training samples produces a new distribution at the output of each layer of the network, and each layer has to learn the new distribution per batch. In Luo et al. (2019), it is shown that this scheme can control overfitting. More research is reported in Arpit et al. (2016), Ioffe (2017), Wu and He (2018), Ba et al. (2016), Heidari et al. (2020). In a different approach, the weights of a network are normalized to increase the generalization power (Miyato et al. 2018; Salimans and Kingma 2016). However, normalization schemes decrease the speed of learning.
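
For reference, the training-time forward pass of batch normalization can be sketched as follows (NumPy; the running statistics used at test time are omitted for brevity):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization: normalize each feature over the
    mini-batch, then rescale and shift with the learnable gamma and beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(64, 16)                       # batch of 64 samples, 16 features
y = batch_norm_forward(x, gamma=np.ones(16), beta=np.zeros(16))
```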

5.2.4 Activation Function Family

The activation function is the part of a neural network that offers non-linearity power to the network; if all activation functions are linear, the entire network is just a linear model. In shallow networks, sigmoid or tanh are usually applied. In deep networks, exploding and vanishing gradients (Bengio et al. 1994) are common problems. To address the vanishing gradient, the Rectified Linear Unit (ReLU) is utilized as the activation function (Nair et al. 2010); its derivative equals one when the corresponding neuron is active. In Glorot and Bengio (2010), the role of the activation function in the learning performance of deep models has been discussed. Extensions of this activation function have also been used for regularization; see (Clevert et al. 2016; Xu et al. 2015; He et al. 2015; Jie et al. 2020).

5.3 Conclusions and Future Works on the Active Schemes

Since active methods impose a dynamic noise on the learning model or learning algorithm, the model does not learn the noise of the training data. A model that learns the noise of the training patterns fails to perform well on the testing dataset (Salman and Liu 2019). To investigate the effect of different noise types on the training process, consider an example on the Yale dataset (Lee et al. 2005), including 165 images from 15 classes with input size \(32\times 32\). To train a multilayer perceptron (MLP) neural network, we consider the following six scenarios:

  • (S1): Training on data set without noise and regularization,

  • (S2): Training on data set without noise and with Dropout regularization,

  • (S3): Training on data set with fixed noise and without regularization,

  • (S4): Training on data set with fixed noise and Dropout regularization,

  • (S5): Training on data set with fixed noise and applying PCA on data for noise smoothing and without regularization,

  • (S6): Training on data set with fixed noise and using PCA on data for noise smoothing and Dropout regularization.

Figure 7 compares the training and testing errors for these scenarios. As expected, the worst case happens when the learning model tries to learn the noise (S3). The (S5) scenario also converges rapidly, similar to (S3), while its testing error is better than that of (S3); this result shows that smoothing the noise by PCA improves the testing error. Comparing (S3) and (S5) with (S1) shows that imposing noise accelerates the training process, while it cannot solve overfitting completely. The best testing result is obtained from (S2), which trains with regularization. The results of (S6) and (S4) are also close to (S2). This shows that the active methods can rarely remove the effect of data noise entirely.

Fig. 7 Comparison between training and testing errors (in the top and bottom sub-figures) by considering the noise on samples and training process

In Table 7, the results of active methods are compared, considering the learning components. The following results can be summarized:

  • They increase the training time but probably less than passive methods do.

  • If the imposed noise is excessive, the model learns slowly, while if the noise is too limited, the model is likely to overfit. As evidence, consider a VGG-16 network trained on CIFAR-10 with different values of p for Dropout regularization; let p belong to \(\{0.9, 0.3, 0.01\}\). Figure 8 presents the training results. Since \(p=0.9\) is very high, a tremendous amount of noise is imposed on the learning model, and the convergence speed is very low. \(p=0.01\) is very small; it seldom affects the training process, and the testing loss increases after some iterations. Finally, \(p=0.3\) controls overfitting well, although in the final steps the testing error rises slightly. This experiment shows that regularization hyper-parameters are fundamental, and the training is highly dependent on their values.

  • The level of the overfitting changes in each training iteration and the magnitude of noise should be adapted.

  • Typically, the active regularization methods do not adapt their parameters with respect to the level of overfitting. For an instance of an adaptable version, see (Bejani and Ghatee 2020a) which uses a small parameter for overfitting control in the first steps of the training, as the model is underfitted. Conversely, it increases the regularization parameters when v(t) becomes large.

  • The combination of data augmentation and adaptive regularization can also be investigated in future studies.

Table 7 The regularization schemes for neural networks (Active methods with imposing noise on learning algorithms)
Fig. 8 Comparison between training and testing errors (in the top and bottom sub-figures) by considering different values of p for the Dropout regularization method

6 Overfitting Control by Semi-active Methods

These methods do not affect the entire architecture of the network and do not train several models simultaneously. They update networks by adding or removing neurons or connections. They are designed with bottom-up and top-down paradigms. The methods that add neurons or connections are called bottom-up, while other methods that remove neurons or connections are called top-down.

6.1 Pruning Method (Top-Down)

To simplify a model that learns the samples but falls into overfitting, the pruning procedure removes unnecessary parameters (Han et al. 2015). Han et al. (2015) prunes a network in three steps: the first trains the network and finds the essential connections; the second removes all links whose weights are less than a pre-defined threshold; the last retrains the network to recover the performance. It uses different regularization schemes, including weight decay or dropout, to improve the results. The dropout parameters are adjusted for each layer in the retraining step by the following:

$$\begin{aligned} D_r = D_o \sqrt{\frac{C_{ir}}{C_{io}}}, \end{aligned}$$
(14)

where \(D_o\) is initialized randomly, and \(C_{io}\) and \(C_{ir}\) are the numbers of connections of the ith layer before and after pruning, respectively.

Hu et al. (2016) introduces a network trimming method for deep networks that iteratively prunes unimportant neurons based on output analysis. For a convolution layer, the Average Percentage of Zeros (APoZ) is evaluated as follows:

$$\begin{aligned} APoZ_c^{(i)} = \frac{\sum _{n=1}^N \sum _{m=1}^M f(O_{c,m}^{(i)}(n) = 0)}{N \times M}, \end{aligned}$$
(15)

where \(O_c^{(i)}\) is the output of the cth channel in the ith layer, \(f(x) = 1\) if x is true and \(f(x)=0\) otherwise, N is the number of validation samples, and M is the dimension of the output \(O_c^{(i)}\). When \(APoZ_c^{(i)}\) is large, the cth channel in the ith layer is eliminated, and then the network is trained again. These steps are repeated while the validation error improves.
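
The APoZ score of Eq. (15) reduces to counting zeros in post-ReLU activations over the validation set, as in this NumPy sketch (the shapes and the 0.9 pruning threshold are illustrative):

```python
import numpy as np

def apoz(outputs):
    """Average Percentage of Zeros, Eq. (15): `outputs` has shape (N, C, M)
    with N validation samples, C channels, and M output positions per channel;
    the result is one score per channel."""
    return (outputs == 0).mean(axis=(0, 2))

acts = np.maximum(np.random.randn(100, 8, 49), 0)   # post-ReLU activations
scores = apoz(acts)
to_prune = scores > 0.9                             # channels flagged for elimination
```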

Matrix computation methods are also useful to prune a network and avoid overfitting. For example, (Hassibi et al. 1993) uses the inverse of the Hessian matrix to prune the network. Schittenkopf et al. (1997) applies principal components to reduce the dimension of both the inputs and the internal representations, yielding better generalization and less overfitting. There are different criteria to select the connections for pruning; Cottrell et al. (1994) removes weights with low relevancy. The details of the pruning schemes are summarized in Table 8. For more study, one can also see (Reed 1993).

Table 8 The pruning schemes for neural networks simplification (Semi-active methods)

6.2 Network Construction Method (Bottom-Up)

Assume a simple network that cannot learn the samples. By adding new parameters, such as neurons or a hidden layer, the network complexity increases, and the new network is retrained to improve the results. This iterative method can be executed several times.

Moody and Antsaklis (1996) uses a dependency identification algorithm to add a hidden layer when the performance is poor. In each iteration, if the error of the trained network exceeds a pre-defined threshold, the following optimization problem is solved for every data-batch \(i\in B\) to find the weights vector \(w_i\):

$$\begin{aligned} \min _{w_i} trace(({\varPhi }(w_iu) - d)^T ({\varPhi }(w_iu) - d)), \end{aligned}$$
(16)

where m is the batch-size, n is the number of output features, \(u \in {\mathbb {R}}^{m\times n}\) is the output of the last layer, \({\varPhi }(\cdot )\) is an activation function, and d is the desired output. The function trace(\(\cdot\)) returns the trace of a matrix. After solving this optimization problem, the matrix W whose columns are the vectors \(w_i\) is considered as the weights of a new hidden layer.

Fahlman and Lebiere (1990) adds the candidate hidden neuron that maximizes the following sum of correlations over the output units \(o\in O\) and samples \(p\in P\):

$$\begin{aligned} \max \left\{ \sum _{o\in O} \Big | \sum _{p\in P} (V_{p,o} - {\bar{V}}_o)(E_{p,o} - {\bar{E}}_o) \Big | \right\} , \end{aligned}$$
(17)

where \(V_{p,o}\) is the output of neuron o for the sample p, \(E_{p,o}\) is the corresponding residual error, and \(\bar{V_o}\) and \({\bar{E}}_o\) are the means of \(V_{p,o}\) and \(E_{p,o}\) over all samples.

Setiono (2001) uses cross-validation and trains a small network. If adding a new hidden neuron improves accuracy, it continues; otherwise, it returns to the former network. Similarly, to add multiple hidden nodes, consider two networks with H and \(H + h\) hidden neurons. If the accuracy of the first network exceeds that of the second, the algorithm terminates; otherwise, it trains a network with \(H + 2h\) hidden neurons. This process continues while the accuracy improves. To avoid overfitting during the training of these networks, one can use the following regularization term (Setiono 1997):

$$\begin{aligned} R(w,v) = \epsilon _1 \left (\sum _{i,j}\frac{\beta w_{i,j}^2}{1 + \beta w_{i,j}^2} + \sum _{i,j}\frac{\beta v_{i,j}^2}{1 + \beta v_{i,j}^2}\right) + \epsilon _2 \left(\sum _{i,j}w_{i,j}^2 + \sum _{i,j}v_{i,j}^2\right), \end{aligned}$$
(18)

where \(\epsilon _1\), \(\epsilon _2\), and \(\beta\) are positive coefficients, w is the weight matrix between the input layer and the hidden layer, and v is the weight matrix between the hidden layer and the output layer.
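
The incremental construction loop of Setiono (2001) can be sketched as follows (assuming scikit-learn; the growth step h, the dataset, and the cross-validation setting are illustrative, and the regularizer of Eq. (18) is replaced by scikit-learn's default penalty):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

best_score, H, h = -1.0, 0, 2
while True:                                 # grow the hidden layer by h neurons
    score = cross_val_score(MLPClassifier(hidden_layer_sizes=(H + h,),
                                          max_iter=2000), X, y, cv=3).mean()
    if score <= best_score:                 # no improvement: keep the former network
        break
    best_score, H = score, H + h
print(H, best_score)                        # selected hidden-layer size
```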

Zhang et al. (2014) constructs a neural network architecture by a two-stage orthogonal least-squares method. It determines model terms, model coefficients, and model size by minimizing the following Akaike Information Criterion (AIC):

$$\begin{aligned} AIC = N\log \Big (\frac{1}{N} \langle E(M), E(M) \rangle \Big ) + 2M_s, \end{aligned}$$
(19)

where E(M) is the residual error of model M, \(\langle \cdot ,\cdot \rangle\) is the inner product, N is the size of the dataset, and \(M_s\) is the model size. Let model \(M^*\) minimize the AIC. To determine the necessary terms of \(M^*\) in each step, the Error Reduction Ratio (ERR) is used: the term that maximally reduces the error is selected.

The bottom-up methods for network construction are compared based on their termination criteria and the construction methods in Table 9.

Table 9 The construction schemes for neural networks designing (Semi-active methods)

6.3 Conclusions and Future Works on the Semi-active Schemes

The construction methods suffer from some problems. First, they use the validation dataset to construct the next layer or block. Second, there are few efforts to construct deep neural networks for a specific task. Meanwhile, to construct a network, a subset of network parameters can be optimized, but this is an NP-hard problem (He et al. 2017). Usually, network construction is used for simple tasks and shallow networks.

On the other hand, compression techniques simplify the network (Serra et al. 2020; Yang et al. 2020). They belong to the semi-active methods, but they are somewhat different. The many parameters of a deep network make computing the output difficult, especially on mobile devices with power limitations (Bejani and Ghatee 2020a). To solve this problem, MobileNet uses depth-wise convolution, so the number of operations decreases while its accuracy remains acceptable (Howard et al. 2017; Sandler et al. 2018). There are four major approaches to compressing deep networks: parameter pruning and sharing, low-rank factorization, compact convolutional filters, and knowledge distillation. In this regard, (Cheng et al. 2018) reviews different compression methods. They seek and eliminate unnecessary details of deep models; thus, the models resulting from compression schemes are smoother, and the corresponding models are regularized and can avoid overfitting. However, compression and overfitting controlling methods are different and play different roles, so they cannot replace each other. The compression methods should decrease the complexity of the model and may lower the model performance. Conversely, the semi-active methods do not disturb the model performance and improve the generalization power. The following conclusions on semi-active methods are useful:

  • The computation time of semi-active methods is long, but they still outperform passive methods in this respect.

  • Using statistical tests for semi-active methods improves their performance.

  • Semi-active methods are widely applicable in the shallow networks, including RBF networks, Extreme Learning Machines (ELM), and other feed-forward networks.

  • Semi-active methods can construct the networks or prune them in parallel.

However, the application of semi-active methods in deep networks is challenging. For deep network construction, trial-and-error schemes are tedious, and usually, expertise plays a more critical role. Possibly, a recommender system could collect such expertise to generate new semi-active methods to control overfitting in deep networks. For example, (Abpeikar et al. 2020) develops an expert node in the neural tree to prevent overfitting. Similarly, one can create a new block in deep learning models to prune or re-construct the network.

Table 10 A general comparison between different methods to control overfitting
Table 11 Continue of Table 10

7 Lessons from Literature

Overfitting occurs in both shallow and deep networks, but it is not easy to control its effect; obviously, this phenomenon affects deep networks more broadly. To describe the effect of overfitting controllers in deep networks, an example is given by Bejani and Ghatee (2020b) on a GPS dataset, where the differences between the training and testing accuracies for different regularization schemes, including ASR, early stopping, dropout, dropconnect, shakeout, spectral dropout, bridgeout, noise injection, and weight decay, are 11.2, 13.1, \(-2.6\), 18.1, 4.7, 0.4, \(-0.5\), 14.9, and 17.1, respectively. This difference for the learning model without regularization is 15.6. This example shows the necessity of defining an effective overfitting controller for any learning task. On the other hand, using some general measures such as complexity, efficiency, convergence speed, and scalability in different networks, the effectiveness of each overfitting controller can be evaluated. Tables 10 and 11 compare these measures. Generally, we conclude the following points.

Lesson 1: Overfitting controllers usually decrease the convergence speed of the training algorithms. For online learning, the methods proposed in Clevert et al. (2016), Ba et al. (2016), Ioffe and Szegedy (2015), Bejani and Ghatee (2020b), Bejani and Ghatee (2020a) are useful. The low-speed methods can only be used for off-line applications on very complex datasets. For a better understanding, the processing times of different learning models with and without overfitting control can be compared; unfortunately, many references have not reported such a comparison. To quantitatively measure the processing time, the number of FLoating-point OPerations (FLOPs) can be considered; a lower FLOP count indicates the superiority of the model and low computational complexity. Generally speaking, the overfitting controllers do not cause a heavy computational overhead. Meanwhile, in some cases, such as pruning methods (Han et al. 2015; Vu et al. 2019), the computational burden can be reduced, although the accuracy may slightly diminish. Table 12 reports more information on FLOPs.

Table 12 Comparison between FLOP of the different schemes with and without overfitting control on different datasets
Table 13 Comparison between percent of errors of the different schemes with and without overfitting control on famous image datasets

Lesson 2: To study the effect of overfitting controllers on accuracy, their results on famous standard image datasets are analyzed. Table 13 presents the top-1 error on CIFAR-10, CIFAR-100, SVHN, and MNIST, together with the top-5 error on ImageNet. Although not all of the references have performed the same analysis, the reported results are sufficiently insightful. Comparing these results with baseline networks without overfitting control, auto-augmentation provides the results with minimum errors. Also, when augmentation and other kinds of overfitting controllers are used together, the generalization power grows significantly (Bejani and Ghatee 2020a). To reach a fair comparison, the Average of Relative Improvement (ARI) is defined as follows:

$$\begin{aligned} ARI=\frac{\sum _{d\in DataSets}max\{(Error_{WC}-Error_{OC}),0\}/Error_{WC}}{|DataSets|} \end{aligned}$$
(20)

where |DataSets| shows the number of datasets, and \(Error_{WC}\) and \(Error_{OC}\) indicate the errors of the models without and with overfitting control, respectively. Based on the last column of Table 13, the overfitting controllers can be ranked. As can be seen, dropout (Srivastava et al. 2014), adaptive dropout (Bejani and Ghatee 2020a), adaptive weight decay (Bejani and Ghatee 2020a), auto-augmentation (Cubuk et al. 2019), and enhancing the diversity of features (Ayinde et al. 2019) show the best results.
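
Equation (20) is straightforward to evaluate; a small sketch follows (the error values are illustrative, not taken from Table 13):

```python
import numpy as np

def average_relative_improvement(err_wc, err_oc):
    """Eq. (20): mean relative error reduction over datasets; cases where the
    controller increases the error contribute zero."""
    err_wc, err_oc = np.asarray(err_wc, float), np.asarray(err_oc, float)
    return np.mean(np.maximum(err_wc - err_oc, 0.0) / err_wc)

# illustrative errors (%) without / with one controller on three datasets
print(average_relative_improvement([10.0, 25.0, 4.0], [8.0, 26.0, 3.5]))
```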

Lesson 3: In addition to accuracy, overfitting controllers can be compared by other machine learning measures, including precision, recall, and F-measure (Powers 2011). Such comparisons on regularized deep networks are limited, and sometimes this gap leads to misunderstandings. Table 14 outlines the comparisons found. Some references that report these measures only in figures (Li et al. 2016), as well as references without baseline results (Shekar and Dagnew 2020), have been excluded from this table. Note that since the datasets are different, the results are not directly comparable. However, the methods with great precision and recall are the best candidates to follow in general learning tasks.

Table 14 Comparison between different regularization schemes on deep models based on precision, recall, F-measure

Lesson 4: There is no general priority between active and passive schemes for improving training performance. Nevertheless, active methods are usually less time-consuming than passive methods. Sometimes, an active method provides better accuracy and computational time simultaneously; for example, dropout (Srivastava et al. 2014) does not add a computational burden, while its performance is better than meta-heuristic algorithms (Suganuma et al. 2017).

Lesson 5: When the computational capacity and the training time are not limited, passive methods can be used. However, the resulting model may overfit on the validation data. Random sampling can help solve this problem.

Lesson 6: When the training time is limited, semi-active methods are suggested for shallow networks.

Lesson 7: In cases with limited computational capacity or limited training time, the active methods are the best options. Meanwhile, their hyper-parameters can be adjusted by an expert system; see (Abpeikar et al. 2020) for an initial plan.

Lesson 8: Sometimes, regularization methods decrease the generalization power because they choose a model whose complexity does not match the data complexity. In these cases, regularization is destructive. For example, in Bejani and Ghatee (2020b), there is an experiment on MobileNet without regularization where the training accuracy is 100% and the testing accuracy is 84%. When the regularization methods, including early stopping, dropout, dropconnect, shakeout, spectral dropout, and noise injection, have been used, the testing accuracies decrease to 65.1%, 39.2%, 80.1%, 79.4%, and 75.3%. In such situations, tuning the regularization parameters with respect to the data complexity is essentially required.

Lesson 9: The training procedure is related to a nonlinear optimization problem that can be transformed into a system of linear or nonlinear equations. When these systems are ill-posed (Tikhonov and Arsenin 1977), the solution is unstable, and the corresponding learning model becomes overfitted (Bejani and Ghatee 2020b). Using condition number, it is possible to detect an ill-posed component of the system. Through causality analysis, once a component is discovered as the reason for the ill-posed problem, only this component can be corrected to improve the system performance. Possibly, this type of causality analysis can be furthered to tune-up overfitting controllers of deep models in future works.

8 Conclusion

Learning models predict unseen data based on training with finite seen data. Sometimes the seen data are not adequate for a fair inference. When a learning model overfits these data, it cannot learn some real data details: although the model learns the seen data entirely, it cannot generalize to unseen data. This phenomenon is called overfitting. There are three branches of schemes to control overfitting: passive, active, and semi-active methods. They affect different parts of the learning model, the training algorithm, or the training data. In this paper, we classified the most important schemes to control overfitting in these branches. For each of them, we compared concepts, criteria, advantages, and disadvantages, and extracted the corresponding lessons. We also stated some future works for each branch. This systematic review helps to find a roadmap for future research and to select a reasonable overfitting controller for different shallow and deep neural network models.