1 Introduction

Deep neural networks use multiple layers to progressively extract higher-level features from the raw input. The ability to automatically learn features at multiple levels of abstraction makes them powerful machine learning systems that can learn complex relationships between input and output. Seminal work by Zhang et al. [30] investigates the expressive power of neural networks on finite sample sizes. They show that even when trained on a completely random labeling of the true data, neural networks achieve zero training error, increasing training time and effort by only a constant factor. Such potential for brute-force memorization makes it challenging to explain the generalization ability of deep neural networks. They further illustrate that the phenomenon of neural networks fitting random labelings of the training data is largely unaffected by explicit regularization (such as weight decay, dropout, and data augmentation). They suggest that explicit regularization may improve generalization performance but is neither necessary nor by itself sufficient for controlling generalization error. Moreover, recent works show that the generalization (and test) error of neural networks decreases as the number of parameters increases [22, 23], which contradicts the traditional wisdom that overparameterization leads to overfitting. These observations have given rise to a branch of research that focuses on explaining the neural network’s generalization error rather than just looking at test performance [24].

Fig. 1. Training of AllConv and IOC-AllConv on the CIFAR-10 dataset. (a) Loss curves while training with true labels. AllConv starts overfitting after a few epochs; IOC-AllConv does not exhibit overfitting, and its test loss closely follows the training loss. (b) Accuracy plots while training with randomized labels (labels randomized for all training images). If sufficiently trained, even a simple network like an MLP achieves 100% training accuracy and around 10% test accuracy. IOC-MLP resists any learning on the randomized data and gives a 0% generalization gap. (c) and (d) Loss and accuracy plots on CIFAR-10 when 50% of the training labels are randomized.

We propose a principled and reliable alternative that tries to affirmatively resolve the concerns raised in [30]. More specifically, we investigate a novel constrained family of neural networks called Input Output Convex Neural Networks (IOC-NNs), which learn a convex function between input and output. Convexity in machine learning typically refers to convexity of the loss w.r.t. the parameters [3], which is not the case in our work. We use the IOC prefix to indicate Input Output Convexity explicitly. Amos et al. [1] have previously explored the idea of Input Output Convexity; however, their experiments are limited to Partially Input Convex Neural Networks (PICNNs), where the output is convex w.r.t. only some of the inputs. They deem fully convex networks unnecessary in their studied setting of structured prediction and consider them highly restrictive in the allowable class of models, failing even to represent a simple identity mapping without additional skip (pass-through) connections. Hence, they do not present even a single experiment on fully convex networks.

We wake this sleeping giant up and thoroughly investigate fully convex networks (outputs convex w.r.t. all the inputs) on the task of multi-class classification. Each class in multi-class classification is represented by a convex function, and the resulting decision boundaries are formed as an \(\texttt {argmax}\) of convex functions. Being able to train IOC-NNs with NN-like capacity, we, for the first time, uncover their remarkable underlying properties, especially in terms of generalization ability and robustness to label noise. We investigate IOC-NNs on six commonly used image classification benchmarks and pose them as a preferred alternative to non-convex architectures. Our experiments suggest that IOC-NNs avoid fitting the noisy part of the data, in contrast to typical neural network behavior. Previous work [2] shows that neural networks tend to learn simpler hypotheses first. Our experiments show that IOC-NNs tend to hold on to the simpler hypothesis even in the presence of noise, without overfitting in most settings.

A motivating example is illustrated in Fig. 1, where we train an All Convolutional network (AllConv) [28] and its convex counterpart IOC-AllConv on the CIFAR-10 dataset. AllConv starts overfitting the training data after a few epochs (Fig. 1(a)). In contrast, IOC-AllConv shows no signs of overfitting and flattens out at the end (the test loss values pleasantly follow the training curve). This observation is consistent across all our experiments on IOC-NNs across different datasets and architectures, suggesting that IOC-NNs rely less on explicit regularization such as early stopping. Fig. 1(b) presents the accuracy plots for the randomization test, where we train a Multi-Layer Perceptron (MLP) and an IOC-MLP on a copy of the data in which the true labels were replaced by random labels. MLP achieves 100% accuracy on the train set and gives random-chance performance on the test set (observations coherent with [30]). IOC-MLP resists any learning and gives random-chance performance (10% accuracy) on both train and test sets. As MLP achieves zero training error, its test error equals its generalization error, i.e., 90% (the performance of random guessing on CIFAR-10). In contrast, the IOC-MLP has a near 0% generalization error. We further present an experiment with 50% noisy labels in Fig. 1(c) and (d). The neural network training profile concurs with the observation of Krueger et al. [17], where the network learns a simpler hypothesis first and then starts memorizing. On the other hand, the IOC-NN converges to the simpler hypothesis, showing strong resistance to fitting the noisy labels.

Input Output Convexity offers a promising paradigm, as any feed-forward network can be re-worked into its convex counterpart by choosing a non-decreasing (and convex) activation function and restricting its weights to be non-negative (for all but the first layer). Our experiments suggest that activation functions that allow negative outputs (like leaky ReLU or ELU) are better suited for the task, as they help retain negative values flowing to subsequent layers in the network. We show that IOC-MLPs outperform traditional MLPs in terms of test accuracy on five of the six studied datasets, and IOC-NNs almost recover the performance of the base network in the case of convolutional networks. In almost all studied scenarios, IOC networks achieve multi-fold improvements in generalization error over unconstrained neural networks. Overall, our work makes the following contributions:

  • We bring to light the little-known idea of Input Output Convexity in neural networks. We propose a revised formulation to efficiently train IOC-NNs, retaining adequate capacity (with changes such as using ELU, increasing the number of nodes in the first layer, and applying a whitening transform at the input). To the best of our knowledge, we are the first to explore a usable form of IOC-NNs and to show that they can be trained with NN-like capacity.

  • Through a set of intuitive experiments, we detail their internal functioning, especially in terms of self regularization properties and decision boundaries. We show how sufficiently complex decision boundaries can be learned using an \(\texttt {argmax}\) over a set of convex functions (where each class is represented by a single convex function). We further propose a framework to learn ensembles of IOC-NNs.

  • With a comprehensive set of quantitative and qualitative experiments, we demonstrate the outstanding generalization abilities of IOC-NNs. IOC-MLPs achieve near-zero generalization error on all the studied datasets and a negative generalization error (test accuracy higher than train accuracy) on a couple of them, even at convergence. Such never-before-seen behaviour opens up a promising avenue for future exploration.

  • We explore the robustness of IOC-NNs to label noise and find that they strongly resist fitting random labels. Even during training, IOC-NNs show no signs of fitting the noisy data and efficiently learn patterns from the clean data. Our findings ignite explorations towards tighter generalization bounds for neural networks.

2 Related Work

Simple Convex Models: Our work relates to parameter estimation for models that are guaranteed to be convex by construction. For regression problems, Magnani and Boyd [19] study the problem of fitting a convex piecewise-linear function to a given set of data points. For classification problems, this traditionally translates to polyhedral classifiers. A polyhedral classifier can be described as an intersection of a finite number of hyperplanes. There have been several attempts to address the problem of learning polyhedral classifiers [15, 20]. However, these algorithms require the number of hyperplanes as an input, which is a major constraint. Furthermore, these classifiers do not give completely smooth boundaries (at the intersections of hyperplanes). As another major limitation, these classifiers cannot model boundaries in which a class is distributed over a union of non-intersecting convex regions (e.g., the XOR problem). The proposed IOC-NN (even with a single hidden layer) supersedes this direction of work.

Convex Neural Networks: Amos et al. [1] mention the possibility of fully convex networks but do not present any experiments with them. The focus of their work is structured prediction using a partially convex network (convexity w.r.t. some of the inputs). They propose a specific architecture called FICNN, which is fully convex and has fully connected layers with skip connections. The skip connections are a must because their architecture cannot even achieve an identity mapping without them. In contrast, our work can take any given architecture and derive its convex counterpart (we use the IOC prefix to highlight the model-agnostic nature of our work). Kent et al. [16] analyze the links between polynomial functions and input convex neural networks to understand the trade-offs between model expressiveness and ease of optimization. Chen et al. [7, 8] explore the use of input convex neural networks in a variety of control applications such as voltage regulation. The literature on input convex neural networks has been limited to niche, tailored scenarios. Two key highlights of our work are: (a) the use of activations that allow the flow of negative values (like ELU, leaky ReLU, etc.), which enables a richer representation (retaining fundamental properties like identity mapping that are not achievable using ReLU), and (b) a more in-depth perspective on the functioning of convex networks and the resulting decision boundaries. Consequently, we present IOC-NNs as a preferred option over the base architectures, especially in terms of generalization abilities, using experiments on mainstream image classification benchmarks.

Generalization in Deep Neural Nets: Conventional machine learning wisdom says that overparameterization leads to poor generalization performance owing to overfitting. Counter-intuitively, empirical evidence shows that neural networks give better generalization with an increased number of parameters even without any explicit regularization [25]. Explaining how neural networks generalize despite being overparameterized is an important question in deep learning [22, 25].

Neyshabur et al. [23] study different complexity measures and capacity bounds based on the number of parameters, VC dimension, Rademacher complexity, etc., and conclude that these bounds fail to explain the generalization behavior of neural networks under overparameterization. Neyshabur et al. [24] suggest that restricting the hypothesis class gives a generalization bound that decreases with an increase in the number of parameters. Their experiments show that restricting the spectral norm of the hidden layers leads to tighter generalization bounds.

The above discussion implies that a hypothetical neural network that can fit any hypothesis will generalize worse than practical neural networks, which span a restricted hypothesis class. Inspired by this idea, we propose a principled way of restricting the hypothesis class of neural networks (via convexity constraints) that improves their generalization ability in practice. In previous efforts to train fully input-output convex networks, such networks were shown to have limited capacity compared to their neural network counterparts [1, 3], making their generalization capabilities ineffective in practice. To our knowledge, we are the first to present a method to formulate and efficiently train IOC-NNs, opening an avenue to explore their generalization ability.

3 Input Output Convex Networks

We first consider the case of an MLP with k hidden layers. The output of the \(i^{th}\) neuron in the \(l^{th}\) hidden layer is denoted \(h_{i}^{(l)}\). For an input \(\mathbf {x}=(x_1,\ldots ,x_d)\), \(h_{i}^{(l)}\) is defined as:

$$\begin{aligned} h_{i}^{(l)} = \phi \Big ( \sum _{j} w_{ij}^{(l)} h_j^{(l-1)} + b_{i}^{(l)} \Big ), \end{aligned}$$
(1)

where \(h_{j}^{(0)} = x_j\) (\(j=1,\ldots ,d\)) and \(h_{j}^{(k+1)} = y_j\) (the \(j^{th}\) output). The first hidden layer is an affine mapping of the input and preserves convexity (i.e., each neuron in \(h^{(1)}\) is a convex function of the input). Each subsequent layer is a weighted sum of the neurons from the previous layer followed by an activation function. The final output \(\mathbf {y}\) is convex with respect to the input \(\mathbf {x}\) if two conditions hold: (a) \(w_{ij}^{(2:k+1)} \ge 0 \) and (b) \(\phi \) is a convex and non-decreasing function. The proof follows from the operator properties [5] that a non-negative sum of convex functions is convex and that the composition f(g(x)) is convex if g is convex and f is convex and non-decreasing.
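To make the construction concrete, the following is a minimal PyTorch-style sketch of an IOC-MLP (our illustration, not the authors' released code): the first layer is unconstrained, while all subsequent layers keep their weights non-negative (here by clamping after each update, one of several possible strategies; the variant actually adopted is discussed in Sect. 4) and use a convex, non-decreasing activation (ELU). Layer sizes are illustrative assumptions, and batch normalization is omitted for brevity.

```python
import torch
import torch.nn as nn

class IOCMLP(nn.Module):
    """Sketch of an input-output convex MLP: convex, non-decreasing
    activations and non-negative weights in every layer after the first."""
    def __init__(self, in_dim=3072, hidden=800, n_classes=10, n_hidden_layers=3):
        super().__init__()
        layers = [nn.Linear(in_dim, hidden), nn.ELU()]   # first layer: weights unconstrained
        for _ in range(n_hidden_layers - 1):
            layers += [nn.Linear(hidden, hidden), nn.ELU()]
        layers += [nn.Linear(hidden, n_classes)]         # logits; softmax applied in the loss
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x.flatten(1))

    @torch.no_grad()
    def project_weights(self):
        # Enforce w >= 0 for every linear layer except the first.
        linears = [m for m in self.net if isinstance(m, nn.Linear)]
        for layer in linears[1:]:
            layer.weight.clamp_(min=0.0)

# Illustrative training step: update, then project back onto the convex family.
model = IOCMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
model.project_weights()   # keeps the network input-output convex
```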

A similar intuition follows for convolutional architectures, where each neuron in the next layer is a weighted sum of the previous layer. Convexity can be assured by restricting filter weights to be non-negative and using a convex and non-decreasing activation function. Filter weights in the first convolutional layer can take negative values, as they only represent an affine mapping of the input. The max-pool operation also preserves convexity since a point-wise maximum of convex functions is convex [5]. Skip connections also do not violate Input Output Convexity, since the input to each layer remains a non-negative weighted sum of convex functions.
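The same recipe carries over to convolutional blocks. Below is a hedged sketch (our own illustration, not the exact IOC-AllConv architecture): only the first convolution may carry negative filter weights, later convolutions and the classification head are kept non-negative, ELU is used as the activation, and max-pooling and global average pooling are used freely since both preserve convexity.

```python
import torch
import torch.nn as nn

class IOCConvNet(nn.Module):
    """Illustrative input-output convex convolutional network."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.first = nn.Conv2d(3, 64, 3, padding=1)      # unconstrained affine map of the input
        self.later = nn.ModuleList([
            nn.Conv2d(64, 128, 3, padding=1),            # weights kept >= 0
            nn.Conv2d(128, 128, 3, padding=1),           # weights kept >= 0
        ])
        self.pool = nn.MaxPool2d(2)                      # max of convex functions is convex
        self.head = nn.Linear(128, n_classes)            # weights kept >= 0

    def forward(self, x):
        h = torch.nn.functional.elu(self.first(x))
        for conv in self.later:
            h = self.pool(torch.nn.functional.elu(conv(h)))
        h = h.mean(dim=(2, 3))                           # global average pool: non-negative sum
        return self.head(h)

    @torch.no_grad()
    def project_weights(self):
        for conv in self.later:
            conv.weight.clamp_(min=0.0)
        self.head.weight.clamp_(min=0.0)
```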

We use an ELU activation to allow negative values; this is a minor but key change from previous efforts that rely on ReLU activations. With non-negativity constraints on the weights (\( w_{ij}^{(2:k+1)} \ge 0 \)), ReLU activations prevent the hidden units from representing even an identity mapping, and previous works rely on passthrough/skip connections [1] to address this concern. The use of ELU enables identity mapping and allows us to use the convex counterparts of existing networks without any architectural changes.
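A tiny numerical illustration (our own, under the same non-negative-weight assumption) of why ReLU is problematic here: once a pre-activation is negative, ReLU outputs zero and non-negative downstream weights can never recover that information, whereas ELU passes a bounded negative value forward.

```python
import torch

w_nonneg = torch.tensor([[1.0, 0.5]])                # weights of a later layer, constrained >= 0
h = torch.tensor([-2.0, -3.0])                       # negative pre-activations from the previous layer

relu_out = w_nonneg @ torch.relu(h)                  # tensor([0.]): all information about h is lost
elu_out = w_nonneg @ torch.nn.functional.elu(h)      # ~tensor([-1.34]): negative signal survives
```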

3.1 Convexity as Self Regularizer

We define self regularization as the property in which the network itself imposes some functional constraints. Inducing convexity can be viewed as a self regularization technique. For example, consider a quadratic classifier in \(\mathbb {R}^2\) of the form \(f(x_1,x_2)=w_1x_1^2+w_2x_2^2+w_3x_1x_2+w_4x_1+w_5x_2+w_0\). If we want the function f to be convex, the network must impose the following constraints on the parameters: \(w_1\ge 0,\;w_2\ge 0,\;-2\sqrt{w_1w_2}\le w_3\le 2\sqrt{w_1w_2}\), which essentially means that we are restricting the hypothesis space.
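These constraints follow directly from requiring the Hessian of f to be positive semi-definite (a short derivation added here for completeness):

$$\begin{aligned} \nabla ^2 f = \begin{pmatrix} 2w_1 &{} w_3 \\ w_3 &{} 2w_2 \end{pmatrix} \succeq 0 \;\Longleftrightarrow \; w_1 \ge 0,\quad w_2 \ge 0,\quad 4w_1w_2 - w_3^2 \ge 0, \end{aligned}$$

and the determinant condition is exactly \(-2\sqrt{w_1w_2}\le w_3\le 2\sqrt{w_1w_2}\).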

Fig. 2. Decision boundaries of different networks trained for two-class classification. (a) Original data: one class shown in blue and the other in orange. (b) Decision boundary learnt using MLP. (c) Decision boundary learnt using IOC-MLP with a single node in the output layer. (d) Decision boundary learnt using IOC-MLP with two nodes in the output layer (ground truth as one-hot vectors). (Color figure online)

Fig. 3. (a) Using two simple 1-D functions, we illustrate that an \(\texttt {argmax}\) of two convex functions can result in non-convex decision boundaries. (b) Two convex functions whose \(\texttt {argmax}\) results in the decision boundaries shown in Fig. 2(d). The same plot is shown from two different viewpoints.

Similar inferences can be drawn from the example of polyhedral classifiers. Polyhedral classifiers are a special class of Mixture of Experts (MoE) networks [13, 26]. The VC-dimension of a polyhedral classifier in d dimensions formed by the intersection of m hyperplanes is upper bounded by \(2(d+1)m\log (3m)\) [29]. On the other hand, the VC-dimension of a standard mixture of m binary experts in d dimensions is \(O(m^4d^2)\) [14]. Thus, by imposing convexity, the VC-dimension becomes linear in the data dimension d and \(m\log (m)\) in the number of experts. This is a huge reduction in overall representation capacity compared to the standard mixture of binary experts.

Furthermore, adding non-negativity constraints alone can lead to regularization. For example, the VC dimension of a sign constrained linear classifier in \(\mathbb {R}^d\) reduces from \(d+1\) to d [6, 18]. The proposed IOC-NN uses a combination of sign constraints and restrictions on the family of activation functions for inducing convexity. The representation capacity of the resulting network reduces, and therefore, regularization comes into effect. This effectively helps in improving generalization and controlling overfitting, as clearly observed in our empirical studies (Sect. 4.1).

3.2 IOC-NN Decision Boundaries

Consider a binary classification scenario in 2D space, as presented in Fig. 2(a). We train a three-layer MLP with a single output and a sigmoid activation for the last layer. The network comfortably learns to separate the two classes; the boundaries learned by the MLP are shown in Fig. 2(b). We then train an IOC-MLP with the same architecture. The learned boundary is shown in Fig. 2(c). The IOC-MLP learns a single convex function of the input, and its contour at the value 0.5 defines the decision boundary. The use of a non-convex activation like sigmoid in the last layer does not distort the convexity of the decision boundary: since the sigmoid is monotonic, thresholding its output at 0.5 is equivalent to thresholding the underlying convex function at 0, whose sublevel set is convex (Appendix A).

Fig. 4. (a) Original data. (b) Output of the gating network; each color represents picking a particular expert. (c) Decision boundaries of the individual IOC-MLPs. We mark the correspondences between each expert and the segment for which it was selected. Notice how the V-shape is partitioned and classified using two different IOC-MLPs. (Color figure online)

We further explore an IOC-MLP variant in which the ground truth is presented as a one-hot vector (allowing two outputs). The network learns two convex functions f and g representing the two classes, and their \(\texttt {argmax}\) defines the decision boundary. Thus, if \(g(\mathbf {x})-f(\mathbf {x})>0\), then \(\mathbf {x}\) is assigned to class C1, and to C2 otherwise. The network can therefore learn non-convex decision boundaries, as shown in Fig. 3. Note that \(g-f\) is no longer convex unless \(g''-f''\ge 0\). In the binary classification problem of Fig. 2, using a one-hot output allows the network to learn non-convex boundaries (Fig. 2(d)). The corresponding two output functions (one for each class) are illustrated in Fig. 3(b). Both individual functions are convex; however, their arrangement is such that the \(\texttt {argmax}\) leads to a reasonably complex decision boundary. This happens because the sets \(S_1=\{\mathbf {x}\;|\;g(\mathbf {x})-f(\mathbf {x})>0\}\) and \(S_2=\{\mathbf {x}\;|\;g(\mathbf {x})-f(\mathbf {x})\le 0\}\) can both be non-convex (even though the functions f(.) and g(.) are convex).
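A one-dimensional example (ours) makes this concrete: take the convex functions \(f(x)=x^2\) and \(g(x)=1\). Then

$$\begin{aligned} S_1 = \{x \;|\; g(x)-f(x)>0\} = (-1,1), \qquad S_2 = \{x \;|\; g(x)-f(x)\le 0\} = (-\infty ,-1] \cup [1,\infty ), \end{aligned}$$

so one of the two class regions is a disconnected (hence non-convex) set, even though both f and g are convex.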

3.3 Ensemble of IOC-NN

We further explore ensembles of IOC-NNs for multi-class classification. We explore two different ways to learn the ensembles:

  1. Mixture of IOC-NN Experts: Training a mixture of IOC-NNs and an additional gating network [13]. The gating network can be non-convex and outputs a scalar weight for each expert. The gating network and the multiple IOC-NNs (experts) are trained in an Expectation-Maximization (EM) framework, i.e., training the gating network and the experts iteratively.

  2. Boosting + Gating: In this setup, each IOC-NN is trained individually. The first model is trained on the whole data, and the consecutive models are trained with exaggerated data on the samples on which the previous model performs poorly. For bootstrapping, we use a simple re-weighting mechanism as in [10]. A gating network is then trained over the ensemble of IOC-NNs. The weights of the individual networks are frozen while training the gating network.

We detail the idea of ensembles using a representative experiment for binary classification on the data presented in Fig. 4(a). We train a mixture of \(\mathbf {p}\) IOC-MLPs with a gating network using the EM algorithm. The gating network is an MLP with a single hidden layer whose output is a \(\mathbf {p}\)-dimensional vector. Each IOC-MLP is a three-layer MLP with a single output; we keep a single output to ensure that each IOC-MLP learns a convex decision boundary. The output of the gating network is illustrated in Fig. 4(b): a particular IOC-MLP was selected for each partition, leading to five partitions. The decision boundaries of the individual IOC-MLPs are shown in Fig. 4(c). It is interesting to note that the MoE of binary IOC-MLPs fractures the input space into sub-spaces where a convex boundary is sufficient for classification.
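The following is a hedged sketch (our illustration, with assumed layer sizes) of the gated mixture used in this experiment: p single-output IOC-MLP experts whose sigmoid predictions are averaged with softmax gating weights. The EM-style alternation between gate and experts described above is not shown; only the forward pass and the convexity projection are sketched.

```python
import torch
import torch.nn as nn

class GatedMoE(nn.Module):
    """Sketch of a gated mixture of p single-output IOC-MLP experts (binary case)."""
    def __init__(self, in_dim=2, hidden=64, p=5):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(),
                           nn.Linear(hidden, hidden), nn.ELU(),
                           nn.Linear(hidden, 1)) for _ in range(p)]
        )
        # Gate: a single-hidden-layer MLP, unconstrained (may be non-convex).
        self.gate = nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(), nn.Linear(hidden, p))

    def forward(self, x):
        preds = torch.sigmoid(torch.cat([e(x) for e in self.experts], dim=1))  # (B, p)
        weights = torch.softmax(self.gate(x), dim=1)                            # (B, p)
        return (weights * preds).sum(dim=1)        # gating-weighted average of expert predictions

    @torch.no_grad()
    def project_weights(self):
        # Keep each expert input-output convex; the gate stays unconstrained.
        for expert in self.experts:
            for layer in list(m for m in expert if isinstance(m, nn.Linear))[1:]:
                layer.weight.clamp_(min=0.0)
```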

4 Experiments

Datasets and Architectures: To show the enhanced performance of IOC-MLPs over traditional NNs, we train them on six different datasets: MNIST, FMNIST, STL-10, SVHN, CIFAR-10, and CIFAR-100. We use an MLP with three hidden layers and 800 nodes in each layer. We use batch normalization between consecutive layers, followed by the activation, in all hidden layers. ReLU and ELU are used as activations for NN and IOC respectively, and softmax is used in the last layer. We use the Adam optimizer with an initial learning rate of 0.0001 and use validation accuracy for early stopping.

We perform experiments with two additional architectures to extend the comparative study between IOC and NN on the CIFAR-10 and CIFAR-100 datasets: a fully convolutional architecture [28] and a densely connected architecture [12]. We choose DenseNet with growth rate k = 12 for our experiments. We term the convex counterparts IOC-AllConv and IOC-DenseNet, respectively, and compare against their base neural network counterparts [12, 28]. In all comparative studies, we follow the same training and augmentation strategy to train IOC-NNs as used by the aforementioned neural networks.

Training on Duplicate-Free Data: The test sets of the CIFAR-10 and CIFAR-100 datasets have 3.25% and 10% duplicate images, respectively [4]. Neural networks show higher performance on these datasets due to the bias created by this duplicate data (neural networks have been shown to memorize the data). The CIFAIR-10 and CIFAIR-100 datasets are variants of CIFAR-10 and CIFAR-100, respectively, where all the duplicate images in the test data are replaced with new images. Barz et al. [4] observed that the performance of most neural architectures drops when trained and tested on bias-free CIFAIR data. We train IOC-NNs and their neural network counterparts on CIFAIR-10 data with three different architectures: a fully connected network (MLP), a fully convolutional network (AllConv) [28], and a densely connected network (DenseNet) [12].

Training IOC Architectures: We tried four variations of weight constraints to enforce convexity: clipping negative weights to zero, taking the absolute value of the weights, exponentiating negative weights, and shifting the weights after each iteration. We use the exponentiation strategy in all experiments, as it gave the best results: the negative weights are exponentiated after every update. The IOC-constrained optimization algorithm differs only by a single step from the traditional algorithm (Appendix B).
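As an illustration, the exponentiation step could be implemented as the following post-update projection (a sketch of our reading of the rule, applied to every constrained weight matrix, i.e., all but the first layer's): entries that are negative after the gradient update are replaced by their exponential, which is a small positive number.

```python
import torch

@torch.no_grad()
def exponentiate_negative_weights(weight: torch.Tensor) -> None:
    """Replace negative entries with exp(entry) > 0, leaving the rest untouched.
    Intended to run in-place after every optimizer update."""
    neg = weight < 0
    weight[neg] = torch.exp(weight[neg])

# e.g., for every constrained layer:  exponentiate_negative_weights(layer.weight)
```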

To conserve convexity in the batch-normalization layer, we also constrain the gamma scale parameter with exponentiation. In practice, however, we found that the IOC networks retain all desirable properties without constraining it. We make a few additional modifications to facilitate the training of IOC-NNs; such changes do not affect the performance of the base neural networks. We use ELU as the activation function instead of ReLU in IOC-NNs. We apply a whitening transformation to the input so that it is zero-centered, decorrelated, and spans positive and negative values equally. We also increase the number of nodes in the first layer (the only layer whose parameters can take negative values). We use a slower schedule for learning rate decay than the base counterparts. The IOC-NNs have a softmax at the last layer and are trained with cross-entropy loss (same as the neural networks).
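One standard way to realize the whitening transform mentioned above is ZCA whitening of the flattened training images; the sketch below is our assumption about a reasonable implementation, since the exact variant is not specified here.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten rows of X (n_samples x n_features): zero-center, decorrelate,
    and scale each principal direction to (approximately) unit variance."""
    X = X - X.mean(axis=0, keepdims=True)
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X @ W
```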

Training Ensembles of Binary Experts: We divide the CIFAR-10 dataset into two classes: ‘Animal’ (CIFAR-10 labels ‘Bird’, ‘Cat’, ‘Deer’, ‘Dog’, ‘Frog’, and ‘Horse’) and ‘Not Animal’. We train an ensemble of IOC-MLPs, where each expert is a three-layer MLP with one output (with a sigmoid activation at the output node). The gating network in the EM approach is a one-layer MLP which takes an image as input and predicts the weights by which the individual expert predictions are averaged. We report the test results of the ensemble with each additional expert. This experiment resembles the study shown in Fig. 4.

Training Boosted Ensembles: The lower training accuracy of IOC-NNs makes them suitable for boosting (while training accuracy saturates in the non-convex counterparts). For bootstrapping, we use a simple re-weighting mechanism as in [10]. We train three experts for each experiment. The gating network is a regular neural network that is a shallow version of the actual experts: an MLP with only one hidden layer, a four-layer fully convolutional network, and a DenseNet with two dense blocks serve as the gates for the three respective architectures. We report the accuracy of the ensemble trained in this fashion, as well as the accuracy if we had used an oracle instead of the gating network.
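For intuition, a generic sample re-weighting step of this kind could look as follows; this is a simple up-weighting of misclassified samples and is not claimed to be the exact scheme of [10].

```python
import numpy as np

def reweight_samples(sample_weights, correct, boost=2.0):
    """Up-weight samples the previous expert got wrong, then renormalize."""
    w = np.asarray(sample_weights) * np.where(correct, 1.0, boost)
    return w / w.sum()
```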

Table 1. Train accuracy, test accuracy, and generalization gap for MLP and IOC-MLP on six different datasets.
Table 2. Train accuracy, test accuracy, and generalization gap of three neural architectures and their IOC counterparts.

Partially Randomized Labeling: Here, we investigate IOC-NN’s behavior in the presence of partial label noise. We perform a comparative study between IOC and neural networks using the AllConv architecture, similar to the experiment performed in [30]. We use the CIFAR-10 dataset and make it noisy by systematically randomizing the labels of a selected percentage of the training data. We report the performance of AllConv and its IOC counterpart with 20, 40, 60, 80, and 100% noise in the training data. We report train and test scores at peak performance (the performance if we had used early stopping) and at convergence (when the loss goes below 0.001 or at 2000 epochs).
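The noise injection itself is straightforward; a sketch (ours) of randomizing a fixed fraction of the training labels, where the replacement is drawn uniformly over all classes and may occasionally coincide with the true label:

```python
import numpy as np

def randomize_labels(labels, fraction, n_classes=10, seed=0):
    """Return a copy of `labels` with `fraction` of entries replaced by uniform random classes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    idx = rng.choice(len(labels), size=int(fraction * len(labels)), replace=False)
    labels[idx] = rng.integers(0, n_classes, size=len(idx))
    return labels
```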

4.1 Results

IOC as a Preferred Alternative to Multi-Layer Perceptrons: The MLP is the most basic and earliest explored form of neural network. We compare the train and test scores of MLP and IOC-MLP in Table 1. With a sufficient number of parameters, the MLP (a basic NN architecture) perfectly fits the training data. However, it fails to generalize well on the test data owing to brute-force memorization. The results in Table 1 indicate that IOC-MLP gives a smaller generalization gap (the difference between train and test accuracies) compared to MLP. The generalization gap even goes to negative values on three of the datasets. MLP (being poorly optimized for parameter utilization) is one of the architectures most prone to overfitting, and the IOC constraints help retain test performance while resisting the tendency to overfit. Obtaining negative or almost zero generalization error even at convergence is previously unseen behaviour for deep networks, and the results clearly suggest the profound generalization abilities of Input Output Convexity, especially when applied to fully connected networks.

Furthermore, the IOC constraints significantly boost test accuracy on datasets where the neural network gives a high generalization gap (Table 1). This trend is clearly visible in Fig. 5(b). For the CIFAR-10 dataset, the unconstrained MLP gives a 34.16% generalization gap, while the IOC-NN brings down the generalization gap by more than ten-fold and boosts the test performance by about 6%. Even in scenarios where neural networks give a smaller generalization gap (like MNIST and SVHN), IOC-NN marginally outperforms the regular NN and gives an advantage in generalization. Overall, the results in Table 1 highlight that IOC constraints are extremely beneficial when training Multi-Layer Perceptrons for image classification, giving comprehensive advantages in terms of generalization and test performance.

Better Generalization: We investigate the generalization capability of IOC-NNs on other architectures. The results of the base architectures and their convex counterparts on the CIFAR-10 and CIFAR-100 datasets are presented in Table 2. IOC-NN outperforms the base NN on the MLP architecture and gives comparable test accuracies for the convolutional architectures. The train accuracies are saturated in the base networks (reaching above 99% in most experiments). The lower train accuracy of IOC-NNs suggests that there might still be room for improvement, possibly through better design choices tailored for IOC-NNs. In Table 2, the difference between train and test accuracy across all architectures (the generalization gap) demonstrates the better generalization ability of IOC-NNs. The generalization gap of the base architectures is at least two-fold larger than that of IOC-NNs on the CIFAR-100 dataset. For instance, the generalization error of IOC-AllConv on CIFAR-100 is only 1.99%, in contrast to 28.4% for AllConv. The generalization ability of IOC-NNs is further qualitatively reflected in the training and validation loss profiles (e.g., Fig. 1(a)). We present a table showing the confidence intervals of prediction across all three architectures with repeated runs in Appendix C.

Table 3. Results for systematically randomized labels at peak and at convergence for both IOC-NN and NN. The IOC constraints bring huge improvements in generalization error and test accuracy at convergence.
Table 4. Results comparing FICNN [1] with IOC-NN on CIFAR-10 using the MLP architecture. The first column shows base MLP results, the second column presents results with a convex MLP using ReLU activation, and the third and fourth columns show the accuracies of FICNN and IOC-NN, respectively.

Table 5 shows the train and test performance of the three architectures on the CIFAR-10 dataset and the drop incurred when trained on CIFAIR-10. The drop in test performance of IOC-NNs is smaller than that of the typical neural networks. This further strengthens the claim that IOC-NNs are not memorizing the training data but learning a generic hypothesis.

Comparison with FICNN: Table 4 shows the results of IOC-NN and FICNN [1] on CIFAR-10 data. For comparison, we use a three-layer MLP with 800 nodes in each layer for both IOC-NN and FICNN. FICNN uses a skip connection from the input layer to each of the intermediate layers, which enables each layer to learn an identity mapping in spite of the non-negativity constraint. The number of parameters in the FICNN model is almost twice that of the base MLP and IOC models, yet its test performance drops by more than 10%. The results clearly show that IOC-NN gives better test accuracy and a lower generalization gap compared to FICNN, while using the same number of parameters as the base MLP architecture.

Robustness to Random Label Noise: Robustness of IOC-NNs to partially and fully randomized labels (Fig. 1(b, c, and d)) is one of their key properties. We further investigate this property by systematically randomizing an increasing portion of the labels. We report the results of neural networks and their convex counterparts with the percentage of label noise varying from 20% to 100% in Table 3. The train performance of neural networks at convergence is near 100% across all noise levels. It is interesting to note that IOC-NN gives a large negative generalization gap, where the train accuracy is almost equal to the percentage of true labels in the data. This observation shows that IOC-NNs significantly resist learning label noise compared to neural networks. Both the neural network and its convex counterpart learn the simple hypothesis first; while the IOC-NN holds on to it, in later epochs the neural network starts brute-force memorization of the noisy labels. These observations are coherent with the findings in [17, 27], demonstrating neural networks’ heavy reliance on early stopping. IOC-AllConv outperforms the test accuracy of AllConv + early stopping with a much-coveted generalization behavior. It is clear from this experiment that IOC-NN performs better in the presence of random label noise, in terms of test accuracy both at peak and at convergence.

Table 5. Results on CIFAIR-10 dataset
Fig. 5. (a) Test accuracy of IOC-MLP with an increasing number of experts in the binary classification setting. The average performance of a normal MLP is shown in red, since it does not change with the number of experts. (b) The generalization gap of MLP plotted against the improvement gained by the IOC-MLP for the six different datasets (each point represents one dataset). The performance gain with IOC constraints increases with the generalization gap of the MLP. (Color figure online)

Leverage IOC Properties to Train Ensembles: We train a binary MoE on the modified two-class setting of CIFAR-10 described in Sect. 4. The result is shown in Fig. 5(a). A traditional neural network gives a test accuracy of 89.63% with a generalization gap of 10%. A gated MoE of NNs does not improve the test performance as we increase the number of experts. In contrast, the performance of the ensemble of IOC-NNs goes up with the addition of each expert and moves closer to the performance of the neural networks. It is interesting to note that even in a higher-dimensional space (like CIFAR-10 images), the intuitions derived from Fig. 4 hold. We also note that the gate fractures the space into p partitions (where p is the number of experts). Moreover, in the binary case with a single expert, the generalization gap is almost zero. This can be attributed to the convex-hull-like smooth decision boundary that the network predicts in the binary setting with a single output.

The results with the boosted ensembles of IOC-NNs are presented in Table 6. The boosted ensemble improves the test accuracies of IOC-NNs, matching or outperforming the base architectures. However, this performance gain comes at the cost of increased generalization error (though still lower than that of the base architectures). In the boosted ensemble, the performance improves significantly if the gating network is replaced by an oracle. This observation suggests that there is scope for improving the model selection ability, possibly by using a better gating architecture.

Table 6. Results for a single expert, gated MoE, and MoE with oracle on CIFAR-10 for three architectures
Fig. 6. Reliability diagrams showing expected sample accuracy as a function of confidence [9]. The blue bars show the confidence of each bin and the orange bars show the fraction of correct predictions in that bin. If the model is perfectly calibrated, the bars align to form the identity function; any deviation from the diagonal indicates miscalibration. (Color figure online)

Confidence Calibration of IOC-NNs: In a classification setting, given an input, the neural network predicts probability-like scores for each class. The class with the maximum score is taken as the predicted output, and the corresponding score as the confidence. Having confidence and accuracy correlated is a desirable property, especially in high-risk applications like self-driving cars, medical diagnosis, etc. However, many modern multi-class classification networks are poorly calibrated, i.e., the probability values they associate with their predicted class labels overestimate the likelihood of those labels being correct in the real world [11]. Recent works have explored methods to improve the calibration of neural networks [11, 21].

We observe that adding IOC constraints improves the calibration error of the base NN architectures. We present the reliability diagrams [9] (accuracy as a function of confidence) of the three neural architectures and their convex counterparts in Fig. 6. The weighted sum of the absolute differences between the blue bars (confidence) and the orange bars (accuracy) gives the Expected Calibration Error. IOC constraints show improved calibration in all three architectures (with notable improvements in the case of MLP and AllConv). Better calibration further strengthens the case for IOC-NNs from an application perspective.
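For reference, the Expected Calibration Error summarized by these diagrams can be computed as the confidence-binned, sample-weighted average of |accuracy − confidence|, which is the standard formulation used in calibration studies [9, 11]; the equal-width binning below is our own illustrative choice.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average over confidence bins of |bin accuracy - bin mean confidence|."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```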

5 Conclusions

We present a subclass of neural networks in which the output is a convex function of the input. We show that, with minimal constraints, existing neural networks can be adapted to this subclass, called Input Output Convex Neural Networks. With a set of carefully chosen experiments, we unveil that IOC-NNs show outstanding generalization ability and robustness to label noise while retaining adequate capacity. We show that in scenarios where the neural network gives a large generalization gap, IOC-NN can give better test performance. An alternate interpretation of our work is self regularization (regularization through functional constraints). IOC-NN puts to rest the concerns around brute-force memorization in deep neural networks and opens a promising horizon for the community to explore. We show that in the case of Multi-Layer Perceptrons, IOC constraints improve accuracy, generalization, calibration, and robustness to noise, making them an ideal proposition from a deployment perspective. The improved generalization, calibration, and robustness to noise are also observed in convolutional architectures while retaining accuracy. In future work, we plan to investigate the use of IOC-NNs for recurrent architectures. Furthermore, we plan to explore the interpretability aspects of IOC-NNs and study the effect of convexity constraints on generalization bounds.