
1 Introduction

Cyber security is becoming increasingly important as cyberspace rapidly permeates everyday life. Every day, billions of images carrying massive amounts of information are generated and transmitted in cyberspace, which poses great challenges to its security. More efficient and robust image recognition methods are therefore needed to retrieve useful information from images automatically.

Over the last decade, convolutional neural networks (CNNs) have been widely used in various challenging computer vision tasks owing to their remarkable learning capacity. CNNs share weights across positions on the image to achieve translation invariance, which is reasonable but not robust enough when dealing with complex transformations caused by viewpoint changes or part deformation. To correct these deficiencies, Hinton et al. [5] proposed the concept of a capsule, which aims to learn features equivariant to transformations resulting from viewpoint changes. Built upon capsules, capsule networks [6, 11] have achieved state-of-the-art performance on the MNIST and smallNORB benchmarks.

However, there is still much room for capsule networks to approach state-of-the-art performance on natural image datasets. CapsNets [6, 11] comprise far fewer layers than current well-performing models such as ResNet [4] and DenseNet [7], which contain hundreds of layers. It has been empirically shown that deeper and wider neural networks are more capable of learning the complex hierarchies inside visual entities, so it is worthwhile to explore deeper capsule network architectures to improve performance on complex data. As discussed in [10], however, simply stacking fully-connected capsule layers into a deep architecture leads to undesirable problems such as expensive computational cost and vanishing gradients. To address these limitations, we start our work with an analysis of the routing algorithm, which involves complicated computations.

The routing algorithm in standard CapsNets routes capsules from a low level to a high level according to coupling coefficients, which are computed over multiple iterative steps. From another perspective, the routing procedure can be interpreted as a parallel attention mechanism. We therefore formulate the computation as a regression of multiple attention maps and implement the computation of coupling coefficients with two fully-connected layers. The capsules at each position are fed to this two-layer subnetwork, which outputs the coupling coefficients. The weights of the fully-connected layers are shared across positions, so capsules at different positions are routed according to the same criterion. As a result, the routing procedure can be accomplished in a single stage at much lower computational cost. We name this novel routing algorithm capsule-wise attention routing, since our motivation comes from the attention mechanism. Beyond the change in computing coupling coefficients, another modification is the adoption of 2D convolution with a larger kernel to transform capsules from the low level to the high level. Since the original matrix multiplication is equivalent to a \( 1 \times 1 \) convolution, this modification can be regarded as an enlargement of the convolutional kernel. To prevent the number of parameters from growing proportionally to the kernel size, we implement the convolutions following the idea of depthwise separable convolutions [2].

With the computational cost of routing successfully reduced, we are able to build a deeper capsule network with more capsule layers. We name our model CARNet after the routing algorithm proposed above. In addition, we add skip connections between different levels of capsules to help carry gradient flow into the lower layers during training. With these skip connections, low-level capsules can be routed directly to the top capsules and involved in the final inference.

The rest of the paper is organized as follows. In Sect. 2, we review related work on capsule networks. In Sect. 3, we introduce how capsule-wise attention routing works and elaborate the architecture of CARNet. In Sect. 4, we evaluate capsule-wise attention routing and CARNet on four object recognition benchmarks: MNIST, Fashion-MNIST, SVHN and CIFAR-10. Finally, we summarize our work and discuss possible future work in Sect. 5.

2 Related Work

A capsule is a neural unit that aims to learn viewpoint-equivariant instantiation parameters and a viewpoint-invariant activation probability of some visual entity in an image. In dynamic capsules [11], a capsule is organized as a vector called the activation vector. The entries of the vector are interpreted as multiple implicitly defined instantiation parameters of the visual entity, while the length of the vector indicates the probability that the entity is present. In EM capsules [6], the instantiation parameters and the activation probability are represented separately by a \( 4 \times 4 \) pose matrix and a logistic unit. This internal structure means that a capsule involves more complicated intra- and inter-capsule computation than a single neuron.

The routing procedure takes place between adjacent capsule layers. Each capsule in the lower layer first makes a prediction for every capsule in the higher layer. If two lower capsules make similar predictions for one higher capsule, they are supposed to be routed to that capsule. Each higher capsule receives a cluster of predictions from the capsules below and aggregates them into its output. With this routing-by-agreement mechanism, CapsNet not only achieved state-of-the-art performance on the MNIST and smallNORB benchmarks but also showed superiority in distinguishing overlapping digits [11] and resisting white-box adversarial attacks [6].

Capsule networks have been further explored in the literature. Wang and Liu [12] formulated dynamic routing as an optimization problem that minimizes a clustering loss with a KL regularization term and modified the routing procedure accordingly. Lenssen et al. [9] proposed group equivariant capsule networks with provable equivariance and invariance properties. Zhang et al. [15] proposed two fast routing methods based on kernel-weighted density estimation. These works improved the routing algorithm from different aspects, but the resulting networks were still relatively shallow.

Concurrently with our work, Rajasegaran et al. [10] proposed a deep capsule network architecture named DeepCaps with a motivation similar to ours. DeepCaps includes 17 capsule layers and impressively outperforms state-of-the-art capsule networks on Fashion-MNIST, SVHN and CIFAR-10 with far fewer parameters than the original CapsNet. DeepCaps keeps the framework of dynamic routing and adopts 3D convolution to implement the transformation in routing; weights among input capsules are shared so that the computational cost is reduced. DeepCaps also places a class-independent reconstruction network at the top of the network. In contrast, we propose a novel routing algorithm in which the coupling coefficients are computed by a two-layer subnetwork and the transformation is performed by 2D convolution. Moreover, our model uses skip connections to connect capsules at different levels in a different way from DeepCaps, and no reconstruction network is used for regularization. The performance of DeepCaps and our proposed approach is compared in Sect. 4.

3 CARNet

In this section, the details of the proposed capsule-wise attention routing algorithm and the architecture of CARNet are presented.

3.1 Capsule-Wise Attention Routing

Consider an intermediate capsule layer that processes \( N_l \) input capsules and outputs \( N_{l+1} \) capsules. We denote the i-th capsule in layer l at some position by \( \mathbf {u}_i^{(x, y)} \in \mathbb {R}^{a_l} \), where \( \left( x, y \right) \) is the coordinate of the capsule and \( a_l \) is the dimension of the activation vector.

First, the capsules at each position are concatenated into a vector \( \mathbf {u}^{(x, y)} \). Since the computation of coupling coefficients is performed in the same way at every position, we omit the coordinate below for clarity. We then pass \( \mathbf {u} \) through two cascaded fully-connected layers to regress the log prior probabilities \( \mathbf {h} \in \mathbb {R}^{N_l \times N_{l+1}} \). We choose \( \mathbf {u} \) as the input of the subnetwork because it aggregates all the semantic information of its local receptive field at the current layer, which helps generate more appropriate coupling coefficients. The computation of \( \mathbf {h} \) is written as:

$$\begin{aligned} \mathbf {h} = \mathbf {W}_2 \cdot g\left( \mathbf {W}_1 \mathbf {u} + \mathbf {b}_1 \right) + \mathbf {b}_2, \end{aligned}$$
(3.1)

where \( g \left( \cdot \right) \) is the ReLU function. We reorganize vector \( \mathbf {h} \) into an \( N_l \times N_{l+1} \) matrix and subsequently feed it to the softmax function to get coupling coefficients \( \mathbf {c} \).

$$\begin{aligned} \hat{\mathbf {u}}_{i|j} = c_{ij} \mathbf {u}_i,\qquad c_{ij} = \frac{\exp \left( h_{ij} \right) }{\sum _k \exp \left( h_{ik} \right) }, \end{aligned}$$
(3.2)

where \( c_{ij} \) is the coupling coefficient between capsule i and capsule j, and \( \hat{\mathbf {u}}_{i|j} \) is the weighted activation vector passed from capsule i to capsule j. Each capsule in the higher layer receives \( N_l \) weighted activation vectors from the lower layer and transforms them into the higher capsule space. Dynamic routing implements the low-to-high transformation by matrix multiplication (a \( 1 \times 1 \) convolution), which makes no use of features in the neighbourhood. Here we adopt 2D convolution with a larger receptive field to perform the transformation.
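To make the computation concrete, the snippet below gives a minimal TensorFlow sketch of Eqs. 3.1 and 3.2. The hidden width, the use of \( 1 \times 1 \) convolutions as position-wise fully-connected layers, and the variable names are illustrative assumptions rather than the exact implementation.

```python
import tensorflow as tf

def coupling_coefficients(u, n_lower, n_higher, hidden_units=128):
    """u: [B, H, W, n_lower * a_l], all capsules concatenated at each position.

    Returns c: [B, H, W, n_lower, n_higher] coupling coefficients."""
    # Two fully-connected layers shared across positions; position-wise dense
    # layers are expressed as 1x1 convolutions (Eq. 3.1).
    h = tf.keras.layers.Conv2D(hidden_units, 1, activation="relu")(u)
    h = tf.keras.layers.Conv2D(n_lower * n_higher, 1)(h)
    # Reorganize the logits and normalize over the higher capsules (Eq. 3.2).
    s = tf.shape(h)
    h = tf.reshape(h, [s[0], s[1], s[2], n_lower, n_higher])
    return tf.nn.softmax(h, axis=-1)

def weighted_predictions(u, c, n_lower, a_l):
    """Compute u_hat_{i|j} = c_ij * u_i (Eq. 3.2)."""
    s = tf.shape(u)
    u = tf.reshape(u, [s[0], s[1], s[2], n_lower, 1, a_l])
    return u * c[..., tf.newaxis]   # [B, H, W, n_lower, n_higher, a_l]
```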

In detail, for capsule j, all the weighted capsules \( \lbrace \hat{\mathbf {u}}_{i|j} \mid 1 \le i \le N_l \rbrace \) are concatenated into \( \hat{\mathbf {u}}_j \), which is passed through a Conv-BN-ReLU block to generate the output capsule \( \mathbf {v}_j \). Parameters in these blocks are not shared among the higher capsules, so the parallel blocks can learn part-whole relationships independently of each other.


These convolutions are performed in parallel over the capsules and are therefore equivalent to group convolutions. We implement them with inspiration from depthwise separable convolution [2], which splits the original convolution into a depthwise convolution and a pointwise convolution. First, we concatenate the weighted capsules and apply a depthwise convolution to the result. Second, we separate the tensor back into capsules and perform a pointwise matrix multiplication. When the receptive field is \( 1 \times 1 \), we omit the first step and perform only the second, which is equivalent to the transformation in dynamic routing. In this way, we avoid convolving each capsule tensor iteratively and benefit from the speed-up of convolution. In addition, for kernel size \( k > 1 \), our implementation reduces the number of parameters used for the transformation by

$$\begin{aligned} \varDelta N_{param} = k^2 N_l a_l N_{l+1} a_{l+1} - k^2 N_l a_l - N_l a_l N_{l+1} a_{l+1}. \end{aligned}$$
(3.3)

Since \( k^2 \ll N_{l+1} a_{l+1} \) for every layer l in practice, the parameters used for the transformation are reduced to roughly \( 1 / k^2 \) of those of a full convolution, which is a considerable saving even when \( k = 3 \).
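The two-step procedure can be sketched as follows, again assuming TensorFlow; the random initialization, the creation of variables inside the function and the use of tf.einsum for the per-capsule pointwise step are our own illustrative choices.

```python
import tensorflow as tf

def capsule_transform(u_hat, n_lower, n_higher, a_l, a_out, k=3):
    """u_hat: [B, H, W, n_lower, n_higher, a_l] weighted activation vectors.

    Returns [B, H, W, n_higher, a_out]; BN and ReLU would follow as in the
    Conv-BN-ReLU block."""
    s = tf.shape(u_hat)
    b, h, w = s[0], s[1], s[2]
    # Step 1: concatenate along channels and apply a k x k depthwise convolution.
    x = tf.reshape(u_hat, [b, h, w, n_lower * n_higher * a_l])
    dw = tf.Variable(tf.random.normal([k, k, n_lower * n_higher * a_l, 1], stddev=0.05))
    x = tf.nn.depthwise_conv2d(x, dw, strides=[1, 1, 1, 1], padding="SAME")
    # Step 2: separate back into capsules and apply a pointwise (1 x 1)
    # transformation with independent weights for every higher capsule j.
    x = tf.reshape(x, [b, h, w, n_lower, n_higher, a_l])
    x = tf.transpose(x, [0, 1, 2, 4, 3, 5])               # [B, H, W, j, i, a_l]
    x = tf.reshape(x, [b, h, w, n_higher, n_lower * a_l])
    wp = tf.Variable(tf.random.normal([n_higher, n_lower * a_l, a_out], stddev=0.05))
    return tf.einsum("bhwjc,jcd->bhwjd", x, wp)
```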

The activation probability of a capsule still depends on the length of its activation vector, as in [11], but we do not squash the vector length into \( \left[ 0, 1 \right] \) with the squashing function at the end of routing. Instead, when the probability is needed, we compute it with the function

$$\begin{aligned} P\left( \mathbf {v} \right) = \frac{\Vert \mathbf {v} \Vert ^2}{1 + \Vert \mathbf {v} \Vert ^2} \end{aligned}$$
(3.4)

We skip the vector-squashing operation in the intermediate capsule layers and use ReLU as the activation function to prevent vanishing gradients.

Capsule-wise attention routing computes the coupling coefficients with a two-layer subnetwork, turning the underlying mechanism from routing-by-agreement into routing-by-learning. In this way, we avoid computing the coupling coefficients iteratively and reduce the cost. Moreover, since both the computation of coupling coefficients and the low-to-high transformation can be implemented with convolutional operations, inference can be accelerated on GPUs. Thanks to the reduced computational cost of each capsule layer, we can cascade more capsule layers to attain a higher learning capacity.

3.2 CARNet Architecture

The architecture of our proposed CARNet is shown in Table 1. Like the standard capsule networks, our deep capsule network starts with several convolutional layers that extract low-level features from the original image. The feature map is then reorganized into a capsule tensor and passed through cascaded capsule layers. At the top of the network, we compute the prediction probability of each category from the corresponding capsule. The details of CARNet are described below.

Table 1. CARNet architecture for SVHN and CIFAR-10. Note that “conv” in the table refers to Conv-BN-ReLU block and “CAR” is short for “capsule-wise attention routing”. All the convolutional layers are performed with padding except the ones with superscript “\( * \)”. Layers bracketed together comprise a capsule block.

Low-Level Feature Extraction. The CapsNet proposed in [11] uses a single convolutional layer with a relatively large receptive field to extract low-level features from the image. The convolution is performed without padding, so the large receptive field scales down the feature map and thus reduces the computational cost of the subsequent capsule layers. In CARNet, to reap the benefit of deeper networks, we replace the single convolutional layer with multiple cascaded convolutional layers with smaller receptive fields, and we use zero-padded convolutions to keep the size of some feature maps fixed.
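As a rough illustration, the feature extractor can be written as a stack of Conv-BN-ReLU blocks; the kernel size of 3 and the uniform padding below are placeholder choices, since the exact configuration is given in Table 1.

```python
import tensorflow as tf

def conv_bn_relu(x, filters, kernel_size=3, strides=1, padding="same"):
    x = tf.keras.layers.Conv2D(filters, kernel_size, strides=strides,
                               padding=padding, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.nn.relu(x)

def extract_features(image, num_layers=5, filters=256):
    # Several small-kernel convolutions replace CapsNet's single large-kernel one.
    x = image
    for _ in range(num_layers):
        x = conv_bn_relu(x, filters)
    return x
```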

Skip Connections. We combine three cascaded capsule layers into a single capsule block and add short paths connecting capsule blocks at different levels. The short paths downsample lower capsules to make their size consistent with the higher ones and then merge them together.

Let us denote the input and output of the n-th capsule block by \( \mathbf {U}_n \) and \( \mathbf {V}_n \). Due to the convolutional operations in the capsule block, \( \mathbf {V}_n \) has a smaller spatial size than \( \mathbf {U}_n \). We use \( 1 \times 1 \) pooling with the same stride and padding as the convolutional layers to downsample the input \( \mathbf {U}_n \), so the receptive field of the downsampled tensor \( \mathbf {U}'_n \) is center-aligned with the receptive field of \( \mathbf {V}_n \) at every position. Subsequently, \( \mathbf {U}'_n \) and \( \mathbf {V}_n \) are concatenated and fed to the next capsule block, i.e. \( \mathbf {U}_{n+1} = \left[ \mathbf {U}'_n; \mathbf {V}_n \right] \). In this way, lower capsules with the same receptive field centers are preserved and delivered to the higher capsule blocks by the skip connections, which means capsules at all levels contribute to the final classification. In other words, every capsule block receives the outputs of all its preceding capsule blocks to generate its own output (Fig. 1).
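A minimal sketch of one such short path, assuming TensorFlow and spatial sizes that match after the \( 1 \times 1 \) pooling:

```python
import tensorflow as tf

def skip_connect(u_n, v_n, stride, padding="VALID"):
    """u_n: block input [B, H, W, C_in]; v_n: block output [B, H', W', C_out]."""
    # 1 x 1 pooling with the block's stride and padding subsamples u_n so that
    # its positions stay center-aligned with those of v_n.
    u_down = tf.nn.max_pool2d(u_n, ksize=1, strides=stride, padding=padding)
    # Concatenate along the capsule/channel axis: U_{n+1} = [U'_n; V_n].
    return tf.concat([u_down, v_n], axis=-1)
```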

Fig. 1. Short paths connecting capsules at different levels.

Implementation Details. At the bottom of CARNet, we place five convolutional layers to extract 256-channel features from the input image. Each convolutional layer is followed by a BN layer and a ReLU activation. The resulting tensor is then split into 32 tensors with 8 channels each, termed primary capsules. The primary capsules are subsequently fed to three cascaded capsule blocks. Every capsule layer in these blocks outputs 8D capsules of 16 types. Skip connections merge capsules from preceding capsule blocks and deliver them to the next capsule block. Finally, the primary capsules and the capsules from the three capsule blocks are merged and fed to the final capsule layer, which generates ten 16D category-specific capsules, from which we compute the recognition probability with Eq. 3.4.
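The classification head then reduces to evaluating Eq. 3.4 over the ten category capsules; the snippet below is a minimal sketch in which a placeholder tensor stands in for the output of the final capsule layer.

```python
import tensorflow as tf

def activation_probability(v):
    """Eq. 3.4: P(v) = ||v||^2 / (1 + ||v||^2), applied per capsule."""
    squared_norm = tf.reduce_sum(tf.square(v), axis=-1)
    return squared_norm / (1.0 + squared_norm)

# final_caps stands for the ten 16D category-specific capsules per image.
final_caps = tf.random.normal([4, 10, 16])        # placeholder values
probs = activation_probability(final_caps)        # [batch, 10]
pred = tf.argmax(probs, axis=-1)                  # predicted category
```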

3.3 Loss Function

The loss function we adopt is the margin loss proposed in [11], defined as

$$\begin{aligned} L = \max (0, m^+ - \Vert \mathbf {v}_t \Vert )^2 + \lambda \sum _{i \ne t} \max (0, \Vert \mathbf {v}_i \Vert - m^-)^2, \end{aligned}$$
(3.5)

where t is the index of the correct category. We use \( m^+ = 0.95 \) and \( m^- = 0.05 \) as the upper bound for the correct category and the lower bound for the wrong categories, respectively. The weight \( \lambda \) for the losses from wrong categories is set to 0.5 throughout training.
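A minimal TensorFlow sketch of Eq. 3.5 with one-hot labels follows; the averaging over the batch is our addition.

```python
import tensorflow as tf

def margin_loss(v, labels_onehot, m_pos=0.95, m_neg=0.05, lam=0.5):
    """v: [batch, classes, dim] category capsules; labels_onehot: [batch, classes]."""
    lengths = tf.norm(v, axis=-1)                             # ||v_i||
    pos = tf.square(tf.maximum(0.0, m_pos - lengths))         # correct-class term
    neg = tf.square(tf.maximum(0.0, lengths - m_neg))         # wrong-class terms
    per_class = labels_onehot * pos + lam * (1.0 - labels_onehot) * neg
    return tf.reduce_mean(tf.reduce_sum(per_class, axis=-1))  # mean over the batch
```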

4 Experiments

We empirically evaluated capsule-wise attention routing and CARNet on four object recognition benchmarks: MNIST, Fashion-MNIST, SVHN and CIFAR-10, each of which collects images from 10 categories. We implemented CARNet in the TensorFlow [1] framework and trained our models on GTX 1080 Ti GPUs. We used the Adam optimizer [8] with an initial learning rate of 0.001, which was halved every 20,000 steps.
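The optimizer setup can be reproduced roughly as follows; this is written against the TensorFlow 2 Keras API, which is not necessarily the API the original code used.

```python
import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=20000,
    decay_rate=0.5,        # halve the learning rate every 20,000 steps
    staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```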

4.1 Datasets

MNIST and Fashion-MNIST. MNIST is a dataset of handwritten digit images, while Fashion-MNIST is a dataset of fashion product images. Both provide the same number of images (60,000 for training and 10,000 for test) in the same format (\( 28 \times 28 \) greyscale images), with objects presented on a plain background. Each training image was randomly shifted in each direction by up to 2 pixels. No preprocessing was performed on the test images.

SVHN. The Street View House Numbers dataset provides \( 32 \times 32 \) RGB images of digits in natural scenes. A large number of images are available for training (604,388) and test (26,032). Training and testing were performed without data preprocessing or augmentation.

CIFAR-10. CIFAR-10 is also a natural image dataset, with objects drawn from daily life. It provides 50,000 training images and 10,000 test images, all \( 32 \times 32 \) color images. During training, we performed random shifts, random horizontal flips and random adjustments of brightness and contrast as data augmentation. Both training and test images were normalized to zero mean and unit variance before being fed to the network.
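An illustrative TensorFlow preprocessing pipeline for CIFAR-10 is sketched below; the shift range and the brightness/contrast bounds are assumptions, since the paper does not state them.

```python
import tensorflow as tf

def preprocess_train(image):
    image = tf.cast(image, tf.float32)
    # random shift via padding and cropping
    image = tf.pad(image, [[4, 4], [4, 4], [0, 0]], mode="REFLECT")
    image = tf.image.random_crop(image, [32, 32, 3])
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return tf.image.per_image_standardization(image)   # zero mean, unit variance

def preprocess_test(image):
    return tf.image.per_image_standardization(tf.cast(image, tf.float32))
```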

4.2 Capsule-Wise Attention Routing in CapsNet

To evaluate the effectiveness of our novel routing algorithm, we first designed an experiment to compare capsule-wise attention routing with dynamic routing in a shallow capsule network.

We trained a wider version of CapsNet, in which the number of intermediate capsules is increased to 32 and the dimensions of the activation vectors are increased to 8 and 16 in the primary and final capsule layers, respectively. We then obtained a second model by replacing the dynamic routing procedure in the final capsule layer with two layers of capsule-wise attention routing. Note that neither model uses a reconstruction network. Details of the two networks are given in Table 2.

As shown in Table 3, CapsNet with capsule-wise attention routing consumes only half the parameters of its counterpart yet achieves higher accuracy on both SVHN and CIFAR-10. Replacing the routing procedure also speeds up the training of CapsNet by about 50%.

Table 2. Architectures of the two capsule networks. "CapsNet*" denotes CapsNet with capsule-wise attention routing. The number n in "dynamic routing \( \times \, n \)" indicates the number of routing iterations.
Table 3. Accuracies (%) of CapsNet and CapsNet* on SVHN and CIFAR-10.

4.3 Performance of CARNet

We trained CARNet on the four benchmarks and compared its performance with previously proposed capsule networks. We also evaluated the effect of the skip connections in our model. Table 4 lists the accuracies achieved by our models, CapsNet, DeepCaps and variants of ResNet and DenseNet. All results are obtained with a single model.

As the capsule network goes deeper, its performance improves accordingly, especially on the natural image datasets. CARNet without skip connections already leads the capsule networks on SVHN and CIFAR-10 and performs close to the state-of-the-art capsule networks on MNIST and Fashion-MNIST. With skip connections, the performance is further improved on all four benchmarks. Our best model achieves an accuracy of 97.72% on SVHN and 92.46% on CIFAR-10, surpassing DeepCaps by 0.56 and 1.45 percentage points, respectively.

Moreover, CARNet uses parameters more efficiently than existing capsule networks. CARNet contains 3.96M parameters, which drops to 2.73M when the skip connections are removed. The number of trainable parameters in CARNet is smaller than that of CapsNet for MNIST (8.2M) or DeepCaps for CIFAR-10 (7.22M), and much smaller than those of the other well-performing CNN-based models listed in Table 4. This efficiency comes from the use of convolutions in the transformation step of capsule-wise attention routing, which allows capsules to leverage local features when learning the part-whole relationships inside a visual entity, while the implementation based on depthwise separable convolutions keeps the growth in the number of parameters within an acceptable limit.

Table 4. Accuracies (%) on MNIST, Fashion-MNIST, SVHN and CIFAR-10 datasets. “SC” is short for “skip connections”. Results in bold are the best in the domain of capsule networks.

5 Conclusion

In this paper, we proposed a novel routing algorithm, capsule-wise attention routing, which uses a two-layer subnetwork to regress the coupling coefficients as multiple attention maps. We adopted 2D convolution in place of the linear transformation so that local features can be utilized when transforming capsules from the low level to the high level. In addition, we formulated the parallel transformation among capsules as group convolutions and implemented it with inspiration from depthwise separable convolutions. The new implementation is consistent with the original transformation in dynamic routing and was shown to utilize parameters more efficiently.

Based on capsule-wise attention routing, we further proposed a deep capsule network called CARNet. We stacked multiple capsule layers and added skip connections to densely connect different levels of capsules. CARNet achieved state-of-the-art performance on MNIST and Fashion-MNIST and outperformed the state-of-the-art capsule network on SVHN and CIFAR-10, both of which contain natural images. This is an encouraging step for capsule networks towards state-of-the-art performance on natural image datasets, but a performance gap remains between capsule-network-based models and the best CNN models. In the future, we plan to explore more efficient routing algorithms and to further deepen the capsule network architecture for better performance on complex data.