1 Introduction

Convolutional neural networks (CNNs) are the state of the art for a wide range of computer vision problems such as object recognition, image classification, and semantic segmentation [28]. CNNs have also proven effective for natural language processing and forecasting problems [12, 18]. A CNN is a representation learning technique: the output of each hidden layer is a representation of the original data, and the network continually changes this representation by modifying its weights during training. In general, a system is said to be modular if it can be divided into several independent subsystems, called modules. A modular convolutional neural network (MCNN) is a CNN that incorporates the concepts and practices of modular design. Although modular design comes at an additional cost, it is still preferred over a monolithic system. Each module in a modular system is designed to solve an isolated sub-problem, and this structure of loosely coupled, coordinating modules enhances the system's fault tolerance. Modular design also facilitates parallelism and scaling the system up to add functionality without interfering with existing functions. In addition to facilitating scaling, modularity assists in the reallocation of functional units to new tasks [41].

Latency and resource consumption are major concerns in deep learning models, and many techniques for deep model compression have been proposed in the literature [6, 16, 17, 25]. Modhej et al. [31] proposed a novel approach to pattern separation based on the computational function of the dentate gyrus of the hippocampus. A prominent feature of their network is the use of two excitation steps and two inhibition steps to activate or deactivate a node. Their numerical simulation results indicate that the network requires fewer training iterations to achieve comparable accuracy because weak nodes are silenced at different steps.

In practice, deep neural network-based solutions have high computational cost and power consumption, which makes them challenging to deploy in real-time applications. Studies have shown that a trained deep neural network (DNN) does not require all of its weights to perform a task [20], and that removing redundant weights can reduce the computational cost of a network and even improve its performance. The process of compressing neural networks by removing one or more of their core structural elements is called pruning. Different pruning algorithms apply different selection criteria to rank the network elements to be pruned. Pruning can be done at runtime or offline.

Runtime pruning temporarily prunes the network for a single iteration and restores it to its original state for the next iteration. In combination with proper domain decomposition, it can induce modularity in a CNN. A modular CNN resembles biological neural networks more closely than a monolithic CNN. Furthermore, a modular structure helps in understanding the overall working of the network and allows it to learn additional tasks without damaging already acquired knowledge. This behavior encourages using a pre-trained network to learn even a heterogeneous task while retaining the old knowledge. Our contributions in this paper are summarized as follows:

  • We induced modularity in a pre-trained CNN model by utilizing the information learned by the network during training.

  • Based on learned knowledge, we decompose the input domain.

  • The decomposed input domain is utilized to achieve modularity.

Based on the literature, we hypothesize that the learned knowledge of a network provides enough information for input domain decomposition, and that this domain decomposition can induce modularity in a neural network when supported by an information routing control mechanism.

The rest of the paper is organized as follows: Section 2 reviews the literature, Section 3 discusses the proposed methodology in detail, Section 4 presents the experimental setup and results, and Section 5 concludes our findings.

2 Literature review

Modularity in artificial neural networks (ANNs) has been a center of interest for researchers since the 1980s [43]. The following decade continued the trend, and several fundamental techniques for modular neural networks (MNNs) were introduced for various machine learning problems [15, 21, 33]. Modularization techniques are commonly described along four categories: domain, topology, formation, and integration, and a given technique can contribute to one or more of them. Domain refers to the input domain of an MNN, which defines the problem being addressed. Topology is the overall architecture of an MNN, while formation corresponds to path selection for the input inside an MNN; it is the process that helps attain modularity in the network. Finally, integration algorithms combine the outputs of the different modules that contribute to the network's final decision.

The domain consists of all the data that a neural network processes and learns from in order to generalize to unseen data; it defines the problem that needs to be addressed. Domain modularization is based on the rationale that a problem can be divided into pieces, each acting on a separate subdomain. Consequently, a module in an MNN, constructed by other techniques at different modularization levels, can process one of the subspaces instead of the entire domain.

Dimensional domain partitioning is another type of partitioning that occurs per data sample: a data instance is decomposed based on its features, and different sets of features are assigned to different modules of an MNN. For example, in [11] dimensional domain partitioning is used by applying additional filters and processing their outputs with separate modules. Manual domain partitioning divides the domain into multiple subspaces based on an analytical solution or expert knowledge, with each subspace corresponding to a different subproblem. Therefore, a manual partitioning scheme designed for one problem does not necessarily work for another. Recently, age recognition algorithms based on facial features have been proposed [4, 5]; for example, a survey on feature selection methods for face recognition problems [8] concludes that feature-wise manual decomposition of a face dataset is analytically infeasible.

Learned domain partitioning uses learning algorithms for data partitioning. It clusters data based on complex representations that are usually invisible to experts, either because of their complex mathematical structure or because of an incomplete understanding of the problem. In the literature, different learning algorithms have been used to cluster input domains: for example, [48] use a three-factor selection method, [34] use fuzzy clustering, and [46] utilize Divide and Conquer Learning (DCL).

A network's topology is the connectivity between its nodes and modules, which produces the overall structure of the network. Modularity in the topology creates clusters of nodes called modules: nodes inside a module are densely connected to each other but sparsely connected to nodes of other modules, and these intra-module and inter-module connectivity patterns define modularity in a modular structure. Structural modularity in deep learning also addresses problems of monolithic neural network architectures, such as overfitting and vanishing gradients, to name a few. The authors of [45] show that inducing structural modularity in a neural network improves its generalization error; their work produces modularity in a CNN by hierarchically clustering the feature vectors of the hidden layers, which yields a short path for gradient flow in the backpropagation phase. Similar effects of modularity are reported in [35], where the flow of information in the network is controlled by specialized units called gating units; the authors refer to their architecture as a highway network. Another widely used group of modularization techniques is repeated blocks. The recurrent neural network (RNN) [30] is a well-known architecture that uses this method, typically composed of several long short-term memory (LSTM) cells [26]. In sequential topology, the structural units of an architecture consist of whole modules. The well-known Inception networks [37] and Xception networks [9] are multi-path networks composed of sequentially arranged convolution modules, and the highway networks described in [35] are also built around sequential composition principles.

Formation in modular neural networks refers to the process that generates modularity by selecting the path an input takes inside the network. Given a set of available modules, the formation technique selects which of them will process the input. Formation can be either manual or automatic: in manual formation, human experts use heuristics and intuition to define modularity in the network, while automated techniques use evolutionary and machine learning algorithms. The authors of [24] effectively use connection cost minimization as a formation technique to produce a modular network, concluding that adding a connection cost to the HyperNEAT network and its variations [42] significantly improves both its performance and its modular structure. Evolutionary algorithms are also used to combine networks for knowledge transfer in [7]. Dropout [36], a widely used regularization technique, is an implicit formation technique that randomly drops nodes during learning, and DropCircuit [32] is a related dropout technique used in parallel circuits, a type of multipath neural network.

Integration defines how the final output is calculated from the different modules, and it can be either cooperative or competitive. In competitive integration, only one module's output is selected to contribute to the final output, whereas cooperative integration lets all modules contribute. To the best of our knowledge, integration techniques combine module outputs either through arithmetic logic or through learning algorithms. [19] use competitive integration with a multi-network architecture. Similarly, in [44] the outputs of two CNNs are integrated logically for character recognition, where one CNN detects characters in an image and the other recognizes the detected characters. Integration through learning algorithms seeks the combination of modules with the best overall performance: for example, [29] uses fuzzy logic to integrate the different neural networks of an MNN trained for image recognition, where the networks are evaluated by a fuzzy inference system and their outputs are integrated using the Sugeno integral. Likewise, [1] adds new modules to a pre-trained network to achieve transfer learning, and a similar approach is adopted in [40], where the network is trained for multi-task learning by adding a new module.

Researchers have proposed various ways to accelerate inference with deep convolutional networks. These methods speed up convolutions without critical degradation of model accuracy and include factorization and decomposition of convolution kernels [39], separable convolutions [10], and pruning [2]. Other techniques and algorithms for accelerating CNNs include [27, 50].

Zhao et al. [50] presented a technique based on the Strassen algorithm and the Winograd minimal filtering algorithm. They showed a large theoretical reduction in computational complexity, and their proposed algorithms also performed well in practice. However, these methods are expensive to implement, cannot run on embedded devices or in real-time systems, and make parallelization for hardware acceleration more difficult. A new class of fast CNN techniques was introduced in [27] using minimal Winograd filtering algorithms; the method operates on small tiles, which reduces computational complexity.

Spatially separable convolutions were later used in MobileNets [23], followed by ShuffleNet [49], a highly computationally efficient CNN architecture designed for mobile devices with limited computing budgets of 10–15 MFLOPs. Zhang et al. [49] introduced the new operations of channel shuffle and pointwise group convolution; ShuffleNet achieved a 13× speedup over AlexNet without affecting accuracy and lowered the top-1 error on the ImageNet classification problem by 7.8% (absolute) compared to [23]. Another slim model designed specifically for mobile phones and embedded systems is EffNet [14], which outperformed MobileNet and ShuffleNet in both accuracy and computational burden.

Pruning is another promising technique for building highly efficient, low-complexity CNNs. Pruning removes less significant neurons, feature maps, kernels, or even entire layers from large networks; random pruning, however, can significantly affect a network's accuracy. In pursuit of deep neural network acceleration, channel pruning was first proposed in [47]. A channel pruning method introduced in [22] applies a two-step iterative algorithm to trained CNN models; it accelerated networks such as Xception [10] and ResNet [38] by up to 2× but reduced model accuracy by 2%. Moreover, the training time of the method is very high, and it relies on off-the-shelf libraries for the networks.

3 Proposed methodology

In this paper, we propose a novel technique to induce modularity in convolutional neural networks for image classification problems. This work contributes to three categories: domain decomposition, formation, and integration. We decompose the input domain of a CNN into multiple groups or clusters using the information learned by the CNN itself. The topology of the network remains the same; however, we utilize runtime pruning for modular topology formation, and the outputs of the modules are integrated to produce the network's final decision.

3.1 Clustering input domain

Clustering the input domain (the classes) is one of the main contributions of this work and a crucial step in achieving modularity in neural networks. Modularity aims to process input data in an organized way, so that similar inputs follow similar paths through the network; a modular neural network therefore needs to group similar data based on some similarity index, which makes input clustering central to our proposed framework. We propose Confusion Matrix driven Centroid Based Clustering (CMCBC), an unsupervised clustering technique that utilizes the k-Medoid algorithm. However, no geometric distance function is involved in the proposed method; instead, it uses a confusion matrix to find similarities between each pair of classes and a medoid for every cluster.

3.1.1 Confusion matrix driven centroid based clustering (CMCBC)

Generally, clustering algorithms start by computing the distance between every pair of units to be clustered. In this work, however, we use a confusion matrix obtained from a trained neural network as the distance matrix. Although a confusion matrix is generally not symmetric, we show later in this section how it can still be used effectively as a distance matrix in combination with the k-Medoid clustering algorithm. The confusion matrix obtained from the CNN model trained on the MNIST dataset is shown in Fig. 1. The confusion matrix indicates how the neural network correlates the different dataset classes: for example, it shows that the CNN confuses three samples of class 0 with class 6 and two samples with class 8. This degree of correlation among samples of different classes can be exploited to cluster the dataset. Before the confusion matrix can be used for clustering, it needs to be preprocessed, which in this work consists of two steps:

  • Normalization

  • Distance calculation

Fig. 1: MNIST dataset confusion matrix

Fig. 2: MNIST model normalized matrix

Normalization The confusion matrix is normalized before it is used for distance calculation. All values in the matrix are transformed into the range 0–1 using (1), where cm is the confusion matrix, i indexes the rows, and j indexes the columns. The sum function returns the sum of the ith row of the confusion matrix, denoted \(cm^{i}\). Figure 2 shows the normalized version of the confusion matrix in Fig. 1, calculated using (1).

$$ cm_{j}^{i} = \frac{cm_{j}^{i}}{\operatorname{sum}(cm^{i})} $$
(1)

Distance calculation The normalized confusion matrix can be viewed as a similarity matrix whose highest values lie on the diagonal. To convert it to a distance matrix, we use (2), where \(x_{j}^{i}\) denotes the value at row i and column j of the normalized confusion matrix.

$$ y_{j}^{i} = 1 - x_{j}^{i} $$
(2)
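
A minimal NumPy sketch of the two preprocessing steps in (1) and (2) is shown below, assuming the confusion matrix has already been extracted from the trained model; the function name is illustrative and not taken from the original implementation.

```python
import numpy as np

def confusion_to_distance(cm):
    """Illustrative helper (not from the paper): row-normalize a confusion
    matrix (Eq. 1) and convert the result to a distance matrix (Eq. 2)."""
    cm = np.asarray(cm, dtype=np.float64)
    # Eq. (1): divide every entry by the sum of its row.
    normalized = cm / cm.sum(axis=1, keepdims=True)
    # Eq. (2): turn similarity into distance.
    distance = 1.0 - normalized
    return distance
```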

Figure 3 depicts the distance matrix for the MNIST dataset calculated from the normalized confusion matrix shown in Fig. 2. The MNIST distance matrix is asymmetric: the values in the lower triangle do not match the corresponding values in the upper triangle, i.e., the distance from x to y is not equal to the distance from y to x. Furthermore, the diagonal values are not exactly 0. Even though this matrix does not satisfy the usual properties of a distance matrix, it can still be used effectively with k-Medoid for clustering. The pseudocode is presented as Algorithm 1.

Fig. 3: MNIST dataset distance matrix

Algorithm 1: Confusion Matrix driven Centroid Based Clustering (CMCBC) pseudocode

k-Medoid is a greedy algorithm that makes a locally optimal choice at each iteration: it compares the distances between a data point and the medoids and assigns the point to the closest medoid. The process repeats until the overall configuration reaches an optimal state. Given the distance matrix in Fig. 3, k-Medoid simply selects the smaller of two competing distances, so the asymmetry of a distance matrix calculated from a confusion matrix does not affect the performance of the k-Medoid clustering algorithm.
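
For illustration, the sketch below shows a plain k-Medoid loop over a precomputed (possibly asymmetric) distance matrix such as the one derived above. It is a simple re-implementation written for clarity under our own naming, not the authors' Algorithm 1.

```python
import numpy as np

def k_medoid(distance, k, n_iter=100, seed=0):
    """Cluster class indices with k-Medoid over a precomputed
    (possibly asymmetric) distance matrix of shape (n, n)."""
    rng = np.random.default_rng(seed)
    n = distance.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every class to its closest medoid.
        labels = np.argmin(distance[:, medoids], axis=1)
        # Move each medoid to the member that minimizes the total
        # distance to the other members of its cluster.
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            costs = distance[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids

# Example: cluster the 10 MNIST classes into 3 groups.
# dist = confusion_to_distance(confusion_matrix)
# labels, medoids = k_medoid(dist, k=3)
```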

3.2 Inhibition mask based training

Runtime pruning is a type of network pruning in which the network is pruned dynamically: kernels, feature maps, layer nodes, and channels can be pruned at run time. Models pruned with static methods may permanently lose a significant amount of information, whereas runtime pruning avoids this problem by removing information only temporarily. We believe this behavior can be used effectively to attain modularity in neural networks. In our method, we use inhibition mask-based runtime feature map pruning: a pre-trained model is retrained to obtain a pruned model, and feature map pruning is enforced during retraining by an inhibition mask at each target layer. Target layers are the layers subject to pruning; the optimal number of target layers is decided by trial and error, and we use the terms target layer and modular layer interchangeably. The inhibition mask itself is a binary mask that acts as a filter, allowing only a certain number of feature maps to pass to the next layer in each feedforward pass. The inhibition mask for each layer is designed based on intuition. Figure 4 shows a high-level design of the process.

Fig. 4: High level diagram of inhibition mask based training process

3.2.1 Clusters

The clusters obtained using the k-Medoid algorithm in the previous step are given as input. They provide the basis for selecting a mask for each input image, so that all input images belonging to the same cluster pass through the same module in a layer. Consequently, the number of clusters at a target layer equals the number of modules at that layer. Mathematically, suppose there are \(L_T\) target layers out of the network's \(L_N\) total layers. Then the number of clusters C for the entire network equals the sum of the number of clusters in each target layer. The number of modules in the ith target layer is denoted \(M_{T}^{i}\), and the total number of modules in the network, represented by M, equals C. Thus, the number of clusters per target layer equals the number of modules in that layer.

3.2.2 Mask

The mask is a 3D binary tensor used for pruning. It works as a filter that allows some feature maps to move forward in the feedforward pass while blocking the rest. The size of the mask for the overall network equals the sum of the number of modules at each target layer, as shown in (3). The sparsity degree and the pattern of the mask are decided empirically.

$$ M_{N} = \sum\limits_{i=1}^{L_{T}} M_{T}^{i} $$
(3)

Equation (3) implies a sub-mask with a 2D shape for each target layer. Each layer has its own number of modules and number of units, and these two quantities define the shape of the sub-mask: the sub-mask of a specific target layer has shape equal to the number of modules assigned to the layer times the number of units in that layer.
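
As a concrete illustration, the sketch below builds a mutually exclusive sub-mask of shape (modules, units) in which each module keeps a contiguous block of units. The construction and the function name are ours; the actual masks in this work are hand-designed per layer.

```python
import numpy as np

def exclusive_sub_mask(n_modules, n_units):
    """Illustrative sub-mask builder (not the paper's hand-designed masks):
    shape (n_modules, n_units); each module keeps one contiguous block of
    units (1 = keep) and prunes the rest (0)."""
    mask = np.zeros((n_modules, n_units), dtype=np.float32)
    block = n_units // n_modules
    for m in range(n_modules):
        mask[m, m * block:(m + 1) * block] = 1.0
    return mask

# Example matching the first MNIST modular layer described later
# (2 modules, 16 feature maps): each row keeps 8 maps, i.e. 50% sparsity.
# sub_mask = exclusive_sub_mask(2, 16)
```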

3.2.3 Trained model

The weights of the trained model are used to initialize another identical model, which is then trained with runtime pruning enabled. This initialization lets the new model retain the knowledge of the pre-trained model that we used to obtain the clusters and gives it a good starting point for training. Initializing a new model from the pre-trained one is also essential because we want to keep the old weights for further experiments.

3.2.4 Retraining and runtime pruning

The new model, initialized with the old weights, is fine-tuned using all the above inputs. The retraining phase enables the model to support pruning and thereby achieve modularity. The difference between this retraining process and standard training is that it also uses our feature map pruning Algorithm 2. The algorithm receives the feature maps of a target layer as input and assigns them to one of the predefined clusters provided as a parameter. Based on the clustering result, one of the predefined masks is selected. A mask is a sequence of sub-masks, one per module in the layer, and each sub-mask is a binary vector whose length equals the number of filters in the layer. We have implemented our framework so that the feature maps flagged by a sub-mask are zeroed rather than removed, which maintains the tensor shape expected by the next layer. The effect of pruning appears in the backward pass: the gradients calculated for the filters corresponding to the pruned feature maps are zero. Figure 5 graphically represents this whole idea.

Fig. 5: Convolutional layer operations

Algorithm 2: Runtime feature map pruning during retraining
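
The following TensorFlow sketch illustrates the kind of training-time masking Algorithm 2 describes: the cluster index of each sample, derived from its true label, selects a binary sub-mask that zeroes part of the layer's feature maps, so the corresponding filters receive zero gradients in the backward pass. It is a simplified re-implementation of the idea under our own naming, not the authors' exact Algorithm 2.

```python
import tensorflow as tf

def apply_inhibition_mask(feature_maps, cluster_ids, sub_masks):
    """Illustrative sketch, not the paper's code.
    feature_maps: (batch, h, w, c) output of a modular conv layer.
    cluster_ids:  (batch,) integer cluster index per sample.
    sub_masks:    (n_modules, c) float binary masks, 1 = keep, 0 = prune."""
    masks = tf.gather(sub_masks, cluster_ids)        # (batch, c)
    masks = masks[:, tf.newaxis, tf.newaxis, :]      # broadcast over h and w
    # Zeroed feature maps contribute nothing forward, so their filters
    # also receive zero gradients in the backward pass.
    return feature_maps * masks

# During retraining, the cluster of each sample comes from its true label:
# cluster_ids = tf.gather(label_to_cluster, y_true)
# pruned = apply_inhibition_mask(conv_out, cluster_ids, sub_masks)
```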

3.3 Inhibition masked based prediction

With the previous step in place, the modular CNN can outperform a non-modular CNN of the same architecture. In a non-modular convolutional neural network, prediction on the test dataset consists only of a feedforward pass; the forward pass of our modular CNN additionally includes the runtime pruning algorithm. During training, the pruning algorithm uses the true labels provided with the training dataset to cluster the input at each modular layer, and a sub-mask is selected from a list of predefined masks based on this clustering result. Each modular layer has a different number of units and modules, so each modular layer has its own mask list. The selected mask activates only one module in each modular layer while deactivating the rest; in other words, all units of the modular layer that do not belong to the active module are pruned. Inhibition mask-based pruning is therefore a soft pruning technique. Algorithm 2 works well for training, but it cannot be embedded as-is in the feedforward pass of the testing phase, because the system has no prior knowledge of the true labels of the test data. Labels are of course provided with the test data, but they may only be used to evaluate the final performance of the trained network. To solve this problem, we present Algorithm 3, an extended version of Algorithm 2. The extended pruning algorithm uses neural networks to cluster the inputs; since every cluster corresponds to a mask, we call these neural networks mask prediction models. One mask prediction model is trained for each modular layer, and it predicts the index of the mask in the mask list for the input feature maps.

Algorithm 3: Runtime feature map pruning with mask prediction models

Training the mask prediction models The problem of predicting masks (clusters) for runtime feature map pruning in the testing phase is solved by deploying additional neural networks. The number of mask prediction models equals the number of modular layers, and each model is deployed after its corresponding modular layer: it accepts feature maps from the preceding layer and returns the index of a cluster. The additional computation for the mask prediction models is thus the only overhead in our modular CNN. For training the mask prediction models we conducted two experiments: in one we trained the models on the pre-activations of the modular layer, and in the other we used the post-activations. Pre-activations are the output of the modular layer before the activation function is applied, and post-activations are the output of the activation function. The concept is depicted in Fig. 6.

Fig. 6: Convolutional layer operations

Preparing the dataset The datasets for training the mask prediction models are prepared using the trained modular neural network, its input dataset, and the clustering information. The trained modular CNN is used to save the activations of the layer for which each mask prediction model is trained. For this purpose, all three splits (train, validation, and test) of the input dataset are passed through the modular CNN. The training activations are used to train the mask prediction models, the validation activations for validation, and the test activations to evaluate model performance after training. Labels for the activations are generated from the clustering information.

Training For the MNIST dataset we use fully connected neural networks (FCNNs), while CNN models are used for the CIFAR10 dataset. To reduce the computational complexity of the FCNNs, a preprocessing step precedes training: each n × n feature map is divided into an m × m grid, and the sum of every grid cell is normalized and fed to the input layer. This reduces the input size, and hence the computational complexity, by a factor of \((n/m)^{2}\). Further details about the experiments are given in the experiments section.
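
A minimal NumPy sketch of this block-sum preprocessing is given below. The grid construction follows the m × m description above; the normalization scheme (division by the maximum cell sum) is an assumption, since the paper does not specify it, and the function name is ours.

```python
import numpy as np

def block_sum(feature_map, m):
    """Reduce an (n, n) feature map to an (m, m) grid of cell sums,
    shrinking the FCNN input by a factor of roughly (n / m) ** 2."""
    n = feature_map.shape[0]
    cell = n // m
    grid = feature_map[:m * cell, :m * cell].reshape(m, cell, m, cell)
    summed = grid.sum(axis=(1, 3))
    # Normalization choice is an assumption; the paper only says the
    # cell sums are normalized before being fed to the input layer.
    return summed / (summed.max() + 1e-8)
```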

Deploying the mask prediction models Each mask prediction model works as a sub-process of our modular neural network and is called in every iteration of data through the network. The output of the mask prediction models is used to select the mask, which is then applied to the feature maps for dynamic pruning. One disadvantage of using sub-models is the extra computation; another is that the loss of these sub-models is added to the loss of the MCNN. This reduces the overall accuracy of the MNIST model by less than 1%, while for the CIFAR10 model it results in about a 2% increase in accuracy.

3.4 Single shot training for modular CNN

The framework presented so far to induce modularity in a CNN is a multi-step, time-consuming process: training all the models individually and manually integrating them is tedious. Another drawback is that the feature maps used to train the mask prediction models often require a massive amount of storage, occupying ample space and taking time to load into memory. To solve these problems, we experimented with a multi-output CNN containing an additional custom layer. A multi-output network is a neural network that consists of one or more sub-networks, each returning its own output. The sub-networks can take input either directly from the input layer or from any layer of any other sub-network, and all models are trained simultaneously in a single training loop. In supervised learning, the labels for every sub-model output are provided to the base model with a tag attached that identifies which data is intended for which model. All sub-models can share the basic configuration (loss function, optimizer, etc.) or be configured independently; in either case, the loss of every sub-model accumulates into the loss of the base model. We integrated our framework into a single training process using this multi-output design, and we moved all data processing outside the main training loop, so data only needs to be prepared once before training. In the resulting structure, the mask prediction models that were previously trained separately are now embedded inside the base model, so we call it the Embedded Modular CNN (EMCNN), shown in Fig. 7.
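
A minimal Keras sketch of this multi-output arrangement is shown below. The layer sizes and names are placeholders rather than the exact EMCNN configuration, and the runtime pruning layer that would consume the mask output (see Algorithm 4 below) is omitted for brevity; the point is that the base classifier and a mask prediction head share one training loop, with the head's loss added to the base loss.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical multi-output model; sizes are illustrative only.
inputs = tf.keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(16, 3, activation="relu", name="modular_conv")(inputs)
x = layers.MaxPooling2D()(x)

# Mask prediction head: reads the modular layer's feature maps and
# predicts the cluster (i.e. mask) index used for runtime pruning.
h = layers.Flatten()(x)
mask_out = layers.Dense(2, activation="softmax", name="mask_pred")(h)

# Base classifier continues from the same feature maps.
y = layers.Conv2D(120, 3, activation="relu")(x)
y = layers.Flatten()(y)
class_out = layers.Dense(10, activation="softmax", name="digit")(y)

emcnn = Model(inputs, [class_out, mask_out])
emcnn.compile(optimizer="adam",
              loss={"digit": "sparse_categorical_crossentropy",
                    "mask_pred": "sparse_categorical_crossentropy"})
# emcnn.fit(x_train, {"digit": y_labels, "mask_pred": cluster_labels}, ...)
```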

Fig. 7: Embedded CNN for MNIST

3.4.1 Runtime pruning layer algorithm

The runtime pruning layer has no trainable parameters. It receives the output of a mask prediction model and the feature maps of its preceding layer, denoted mout and F, respectively, in Algorithm 4. The mask M in the input is a 3-dimensional binary mask for the modular layer: if its shape is w × h × t, then w and h equal the width and height of the feature maps F passed to the algorithm, and t equals the number of modules in the layer. All feature maps corresponding to the zero matrices in the mask are pruned by the layer. The argmax function decodes the one-hot encoded output of the sub-network, and the multiply function performs pointwise multiplication on its arguments. The returned sparse feature maps, denoted sparse_fm, are forwarded to the next layer in the network.

Algorithm 4: Runtime pruning layer
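
A sketch of such a layer as a Keras custom layer is given below, following the description above (argmax over the mask prediction output, pointwise multiplication with the selected mask). The mask layout used here (modules × channels) and all names are our assumptions; Algorithm 4 as published may differ in detail.

```python
import tensorflow as tf

class RuntimePruningLayer(tf.keras.layers.Layer):
    """Illustrative non-trainable layer: picks one binary sub-mask per
    sample from the mask prediction output and zeroes the corresponding
    feature maps."""

    def __init__(self, sub_masks, **kwargs):
        super().__init__(**kwargs)
        # sub_masks: (n_modules, channels), 1 = keep, 0 = prune.
        self.sub_masks = tf.constant(sub_masks, dtype=tf.float32)

    def call(self, inputs):
        feature_maps, mask_pred = inputs               # F and mout
        idx = tf.argmax(mask_pred, axis=-1)            # decode one-hot output
        mask = tf.gather(self.sub_masks, idx)          # (batch, channels)
        mask = mask[:, tf.newaxis, tf.newaxis, :]      # broadcast over h, w
        sparse_fm = feature_maps * mask                # pointwise multiply
        return sparse_fm                               # forwarded to next layer

# Usage in a functional model (hypothetical tensor names):
# pruned = RuntimePruningLayer(sub_masks)([conv_out, mask_out])
```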

4 Experiments and results

4.1 Preparing datasets

We tested our method on two benchmark datasets: MNIST and CIFAR10. The MNIST dataset contains a total of 70,000 28 × 28 grayscale images of handwritten digits. The dataset is divided into 60,000 training samples and 10,000 test samples which are categorized into ten classes. CIFAR10 consists of 60,000 32 × 32 RGB images. All the samples are categorized into ten different classes. The dataset is divided into 50,000 training samples and 10,000 test samples. Furthermore, the proposed method is evaluated on the Facial Aging database, IFDB. The IFDB contains around 1200 facial images along with age information.

4.2 Training MNIST and CIFAR10 networks

First, we discuss the experimental setup and network architecture for the MNIST dataset. The architecture, outlined in Table 1 as CNNMNIST, has three convolution layers followed by two fully connected layers; each of the first two convolution layers is followed by a max-pooling layer with a 2 × 2 filter and a stride of 1 × 1. The CIFAR10 experiment is conducted using the CNNCIFAR10 architecture, also outlined in Table 1. We use the default parameter settings as reported in coarse_filter_pruning. The authors of [3] describe the network architecture with an alphanumeric string: (2 × 128C3) denotes two consecutive convolution layers with 3 × 3 kernels and 128 feature maps, and MP2 denotes a single overlapped max-pooling layer with a 3 × 3 kernel and a stride of 2 along both the x and y axes.
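
To make the string notation concrete, a block written as 2 × 128C3 followed by MP2 could be expressed in Keras roughly as below. This only illustrates the notation, not the full CNNCIFAR10 model from Table 1; the padding and activation choices are assumptions.

```python
from tensorflow.keras import layers, Sequential

# 2 x 128C3 -> MP2: two 3x3 convolutions with 128 feature maps each,
# then one overlapped 3x3 max-pooling layer with stride 2.
block = Sequential([
    layers.Conv2D(128, 3, padding="same", activation="relu"),  # assumed padding/activation
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=3, strides=2),
])
```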

Table 1 CNN architecture details

4.3 MNIST experiment

4.3.1 Clustering result for MNIST

In this step, we cluster the input domain using k-Medoid and the MNIST distance matrix calculated in Section 3.1.1. Column 1 of Table 2 shows the number of clusters, while column 2, named "Clusters", lists the items of each cluster. For example, the second row shows that the labels are grouped into two clusters, one containing 0, 5, 6, and 8 and the other containing 2, 1, 3, 4, 7, and 9. The first element of each cluster in the table is used as the medoid of that cluster (Table 2).

Table 2 MNIST clusters

4.3.2 Inducing modularity in MNIST network

CNNMNIST is extended to a modular CNN (MCNNMNIST) by marking two of its convolution layers as modular layers. Configuration details for MCNNMNIST are given in Table 3. The configuration table shows that the MNIST model has seven layers, of which the two layers at indexes 2 and 4 are modular. The first modular layer has 16 convolution units and the second has 120 convolution units, with 2 and 3 clusters (modules), respectively. In other words, modularity is induced in the second and third convolution layers with 2 and 3 modules: the feature maps of the second convolution layer are clustered into 2 clusters, while the feature maps of the third convolution layer are clustered into 3 clusters, according to Table 2.

Table 3 MCNNMNIST configuration

Inhibition mask for MCNNMNIST The binary inhibition masks for MCNNMNIST are designed intuitively. The mask for \(L_{T}^{1}\) has a shape of 2 × 16, since the layer has two modules and 16 feature maps, while the mask for \(L_{T}^{2}\) has a shape of 3 × 120. The masks \(m_{1}^{i=1,2}\) each induce 50% sparsity in the layer and are mutually exclusive. Similarly, the masks \(m_{2}^{i=1,2,3}\) each induce 50% sparsity in the layer but are partially inclusive: \(m_{2}^{1}\) and \(m_{2}^{2}\) are mutually exclusive to one another but 50% inclusive with \(m_{2}^{3}\). Table 4 clarifies the idea by presenting the structure of the inhibition masks for MCNNMNIST. It shows that the mask for the first module (i = 1) in the first modular layer (t = 1), \(m_{1}^{1}\), prunes the first eight feature maps (1-8, both inclusive) for input images that belong to cluster \(C_{1}^{1}\). The mask \(m_{1}^{2}\) does the opposite: it prunes the last eight feature maps (9-16) for input images that belong to cluster \(C_{1}^{2}\). In the second modular layer (t = 2), all three masks induce 50% sparsity in the layer by pruning 60 out of 120 feature maps.

Table 4 Inhibition mask structure for MCNNMNIST

Mask prediction models for MCNNMNIST A mask prediction model is trained for each modular layer in the network. For MNIST, we trained fully connected networks with different architectures, fed with the feature maps of the training, validation, and testing datasets obtained from each modular layer. All three datasets are preprocessed by dividing each channel of the feature maps into 2 × 2 sub-matrices and calculating the sum of each sub-matrix; the preprocessed feature maps are then normalized and flattened in row-major order. Since each feature map of the first modular layer has 16 channels, we get 784 values as input. For the second modular layer, the feature maps have a shape of 7 × 7 × 120, giving 1080 input values after preprocessing. Table 5 shows the details of the experiment. It is evident from the table that the first network performs well in both aspects, i.e., accuracy and computational complexity.

Table 5 MNIST mask prediction models details

Based on these results, we selected the first network architecture as the mask prediction model for both modular layers. It achieves the highest accuracy and lowest computational complexity among the given networks for \(L_{T}^{1}\). For \(L_{T}^{2}\) it only reaches the third-highest accuracy, but the difference between the top three accuracies in this case is negligible.

4.3.3 Modular CNN for MNIST results

Using the CNNMNIST model specified in Table 1, we achieved 99.39% accuracy on the test dataset. We induced modularity with 2 and 3 clusters in two consecutive convolution layers. After retraining the model with runtime pruning enabled, we calculated test accuracy in two ways: by simulating mask prediction models with 100% accuracy and by deploying the actual mask prediction models. Furthermore, we evaluated the effect of pruning at each layer by enabling pruning at each layer individually and iterating over the test data. To simulate mask prediction models with 100% accuracy, we cluster the input to a modular convolution layer by directly reading its label, so the feature maps are not forwarded to any of the mask prediction models.

Table 6 Modular convolution neural network for MNIST results

In Table 6, the column "Mask Prediction Model" indicates whether the mask prediction models are deployed. If the value is unchecked in a specific row, the accuracy is calculated by simulating the mask prediction models. The test accuracies in the table show that the loss of the mask prediction models contributes to the loss of the modular base model.

Furthermore, the abrupt drop in accuracy in the last row of the table, where the pruned modular CNN is evaluated with runtime pruning disabled, validates our hypothesis that our framework forces a specific set of convolution units to fit a particular input group. It enables different kernels to specialize in different sub-domains of the input by allowing only a subset of filters to train at a modular layer while precluding the others; we call this targeted training. The results in the table are also illustrated in the bar graph of Fig. 8, where the blue bars represent our modular CNN's accuracy on test data while simulating the mask prediction models, and the other bars represent the test accuracies reported in the table with the mask prediction model column enabled.

Fig. 8: Modular CNN for MNIST results bar graph

4.4 CIFAR10 experiment

4.4.1 Cluster result for CIFAR10

In this step, we cluster the input domain based on the distance matrix using the k-Medoid algorithm. First, we take the normalized confusion matrix of our trained CNNCIFAR10 model, which has 84% accuracy, shown in Fig. 9. It can be observed that 123 dog samples are confused with cats by the model, trucks are confused with airplanes, and deer are mostly confused with birds. As described in Section 3, we calculate the distance matrix, shown in Fig. 10, from the CIFAR10 normalized confusion matrix. Next, we apply the k-Medoid algorithm to the distance matrix to cluster the CIFAR10 dataset (Table 7).

Fig. 9: CIFAR10 normalized confusion matrix

Fig. 10: CIFAR10 distance matrix

Table 7 CIFAR10 clusters

4.4.2 Inducing modularity in CIFAR10 network

CNNCIFAR10 is extended to a modular CNN (MCNNCIFAR10) by marking three of its convolution layers as modular layers. Configuration details for MCNNCIFAR10 are given in Table 8. The table indicates that CNNCIFAR10 has a total of \(L_N\) = 10 layers, 3 of which are modular layers, denoted \(L_T\). The modular layers lie at indexes 2, 4, and 7 of MCNNCIFAR10 (indexes start from 1). \(L_{FM}^{i}\) denotes the total number of feature maps produced by the ith layer of the network, and M and C denote the total numbers of modules and clusters, respectively, in the entire modular CNN.

Table 8 MCNNCIFAR10 configuration

Inhibition mask for MCNNCIFAR10 We define a binary inhibition mask for each modular layer of MCNNCIFAR10. As shown in Table 8, the three modular layers have 128, 128, and 256 ReLU units and 2, 3, and 4 modules, respectively. These two factors define the shape of the inhibition mask for each modular layer: the first mask has a shape of 2 × 128, the second 3 × 128, and the third 4 × 256. Table 9 explains the structure of each mask. All masks induce 50% sparsity in their corresponding modular layer at runtime, and only the masks \(m_{1}^{i=1,2}\) are mutually exclusive. The sparsity of a mask determines its mutual exclusion property, so one must be traded off against the other; for example, to obtain mutually exclusive masks for \(M_{T}^{3}\), the sparsity must be greater than or equal to 75%. We use masks with 50% sparsity in all experiments in this work.
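
The trade-off can be stated compactly with a simple accounting, assuming sparsity here means the fraction of units pruned by one mask: if a modular layer has M mutually exclusive modules, each module can keep at most 1/M of the layer's units, so the per-mask sparsity s must satisfy

$$ s \geq 1 - \frac{1}{M}, $$

which gives s ≥ 75% for the four-module layer \(M_{T}^{3}\), consistent with the statement above.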

Table 9 Inhibition mask structure for MCNNCIFAR10

Mask prediction models for MCNNCIFAR10 The mask prediction models specified in Table 10 are convolutional neural networks. The "ReLU Layers" column refers to the number of hidden convolution layers and the number of convolution filters in each layer; all layers use 3 × 3 filters and ReLU as the activation function. The model input comes from the preceding modular layer, and the model output is used for mask selection during runtime pruning. The Adam optimizer is used with an initial learning rate of 0.001, along with an early stopping callback to stop training at the convergence point. The table lists all the model architectures we trained on the feature maps of their respective modular layers. The model in row 1 is selected for \(L_{T}^{1}\) and \(L_{T}^{2}\), and the model in row 2 is used for \(L_{T}^{3}\), due to their high accuracy on the corresponding layer's feature maps and lower computational complexity.

Table 10 CIFAR10 mask prediction models details

4.4.3 Modular CNN for CIFAR10 results

The modular CNN results for CIFAR10 are compiled in Table 11, where we collected results for different scenarios. MCNNCIFAR10 has been evaluated with both simulated and actual mask prediction models. Simulating the mask prediction models means that, even during inference, the mask for a set of feature maps is predicted based on its associated label; this lets us observe the base model's performance without any additional loss due to the mask prediction models. The tests that incorporate "Pruning" but not the "Mask Prediction Model" use this technique. The fourth row of the table shows that our modular convolutional neural network reaches 88.86% accuracy, which is the maximum accuracy the model could achieve if the mask prediction models were perfect. The last row reports 78.01% accuracy with the best trained model, about 6% less than the base model accuracy (84%). Based on our experiments, we hypothesized that the weights of the mask prediction models are not aligned with the base model weights, resulting in this significant drop in accuracy. To align them, we retrained MCNNCIFAR10 with all the mask prediction models enabled throughout the training phase. After retraining, our modular CNN outperforms the baseline accuracy by 2.78%, achieving 86.78%. We refer to this retraining as weight alignment training in Table 11. The results in the table are also illustrated in Fig. 11, where the blue bars represent our modular CNN's accuracy on test data while simulating the mask prediction models, and the other bars represent the test accuracies reported in the table with the mask prediction model column enabled.

Table 11 Modular convolution neural network for CIFAR10 results
Fig. 11: Modular CNN for CIFAR10 results bar graph

4.5 Modular CNN for MNIST: single shot training results

We trained the model shown in Fig. 7 using the Keras functional API with the Adam optimizer and early stopping (minimum delta of 0.0001, patience of six epochs, and the restore-best-weights flag enabled). The base model layers are initialized with the MNIST base model (CNNMNIST) weights, while the mask prediction sub-models are initialized with the default Keras algorithm. Furthermore, the MNIST training data provided to the model is split into 5,000 validation images and 45,000 training images. With all the above, we achieved 98.69% test accuracy on the base model. The mask prediction models embedded in the base model for \(L_{T}^{1}\) and \(L_{T}^{2}\) reach 99.12% and 99.36% accuracy, respectively. The loss of the overall mechanism is recorded as 0.90, the sum of the MNIST base model loss (0.45) and the two mask prediction model losses (0.25 and 0.20). We conclude from these results that the drop in the MNIST base model's accuracy is due to the performance of the mask prediction models. Trained mask prediction models add a certain overhead to the base model, so we theoretically calculated both this overhead and the computational complexity reduction achieved by runtime pruning in terms of multiplications. Table 12 reports the number of multiplications that can be avoided per sample passed through the model: the first column gives the layer's name, the second the total number of multiplications performed, and the third the number of multiplications that can potentially be avoided due to the sparsity added to the feature maps. The 16Conv layer is the first modular layer, and 120Conv is the second. The amount of complexity reduction is directly proportional to the sparsity of the pruning mask. The table further indicates that pruning at one layer also affects the next layer: although the inhibition mask for the 120Conv layer in Table 4 has a 50% sparsity ratio, we obtain 75% sparsity in the layer, because the additional 25% comes from the channel reduction in its preceding layer, 16Conv. Similarly, pruning at 120Conv influences its succeeding fully connected layer, named 84FC in the table. Despite the sparsity added to the model, the mask prediction models add extra computation to the base model: 289,920 and 107,712 multiplications in total for the \(L_{T}^{1}\) and \(L_{T}^{2}\) layers, respectively. Considering the sparsity in the modular and fully connected layers, we get a net overhead of 333,504 multiplications. The mask prediction model architecture directly influences this overhead, and multiple cost reduction techniques can be used to reduce it.
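
For reference, per-sample multiplication counts of the kind reported in Table 12 follow from the standard formula for a convolution layer (textbook material, not taken from the paper): with output spatial size \(H_{o} \times W_{o}\), kernel size \(k \times k\), \(C_{in}\) input channels, and \(C_{out}\) output filters,

$$ \text{mults} = H_{o} \, W_{o} \, k^{2} \, C_{in} \, C_{out}, $$

and pruning a fraction s of a layer's output feature maps removes roughly the same fraction of these multiplications, plus a further share of the next layer's multiplications through the reduced \(C_{in}\).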

Table 12 MCNNMNIST Modular layers complexity calculations

Our proposed model achieves accuracy close to the baseline on the MNIST dataset. For CIFAR10, our approach outperforms the baseline when weight alignment (i.e., retraining of the network) is performed, but without weight alignment the performance is almost 6% below the baseline. In this work we traded a small fraction of model performance for a saving of around 53% of the multiplications in the convolution layers; avoiding these multiplications yields low computational complexity.

4.6 Experiments of IFDB dataset

To further validate our approach, we evaluated it on the IFDB database and compared the results with the method proposed in [13]. The data was divided into four age groups, as shown in Table 13. There is some accuracy loss due to the limited number of samples per class, and because we reuse the same models for a fair comparison, they are not general enough to adapt to every problem. However, the computational time was reduced at the same time.

Table 13 Results on IFDB dataset

5 Conclusion

CNNs deliver state-of-the-art performance on various computer vision problems, and their early success on image classification drew researchers' interest to the field. The hierarchical structure of a CNN is inspired by the cortical regions of the human brain: the hidden convolution layers extract useful information from input images and forward it to succeeding layers. Despite this hierarchical processing mechanism, a CNN has a monolithic structure and works as a black box. In this work, we proposed a framework to induce modularity in a pre-trained convolutional neural network for image classification by exploiting the information available in the model's confusion matrix. We hypothesized that the confusion matrix of a trained (frozen) CNN provides enough information to cluster the input domain. After the initial composition of modules, we used the confusion matrix as a distance matrix for domain clustering, which divides the dataset into several groups or clusters.

For module composition, we manually selected the layers in which to induce modularity and composed the shape of the modules in each layer. We deactivate several modules through inhibition mask-based runtime feature map pruning, using artificial neural networks to route a group of input images through a specific path formed by modules. We call these routing networks mask prediction models. A mask prediction model accepts feature maps from a modular layer of the CNN and classifies them into one of the available clusters; based on the result, the inhibition mask for that cluster is selected and sparsity is produced by zeroing the corresponding input feature maps. To train the mask prediction models for the modular layers, we prepared train, validation, and test datasets by saving the corresponding layers' feature maps along with the available clustering and module configuration information.

Moreover, we use arithmetic integration of the module outputs, which operate in a competitive environment. The proposed framework is evaluated on two benchmark datasets, MNIST and CIFAR10. On MNIST, our proposed modular CNN achieves 98.51% accuracy compared to the baseline accuracy of 99.39%, while saving 53% of the multiplications in the network and thus significantly reducing its complexity. Similarly, on CIFAR10 our model achieves 78.01% accuracy, 6% less than the baseline (84%); however, when we retrain the network to align the weights, our model outperforms the baseline by 2.78%, achieving 86.78% accuracy. Modularity produces sparsity in the network, but the computational overhead added by the mask prediction models exceeds the savings from that sparsity, and modularity also adds extra hyper-parameters to tune. We therefore consider computational overhead reduction as future work. In the future, we will explore methods to predict the optimal configuration of the overall modular network structure, such as the number and composition of modules and the inhibition masks. We also consider adding modules to a pre-trained network for heterogeneous tasks, deeper hierarchical domain decomposition, and expanding modularity to the entire network. Furthermore, we intend to evaluate the capability of our modular CNN for knowledge distillation and transfer learning.