1 Introduction

Image classification is a fundamental problem in computer vision with a large number of applications. One of the most popular approaches to image classification is the bag-of-features (BoF) model (Csurka et al. 2004), a statistics-based algorithm in which local features are extracted, encoded and summarized into a global image representation. Recently, with the availability of large-scale image databases (Deng et al. 2009) and powerful computational resources, convolutional neural networks (CNN) have become dominant both in large-scale image classification (Krizhevsky et al. 2012) and in extracting transferrable features (Donahue et al. 2014; Jia et al. 2014; Razavian et al. 2014) for various computer vision tasks.

Fig. 1 SIFT (Lowe 2004) matching with (red) and without (blue) reversal invariance (best viewed in color). In the latter case, it is difficult to find feature matches even between an image and its reversed copy (the above example). RIDE (illustrated in Sect. 4) brings reversal invariance to local descriptors, and significantly reduces the feature (e.g., BoF) distance between each pair of reversed objects (Color figure online)

People often capture images or photos without caring about their left/right orientation, since an image and its reversed copy usually deliver the same visual concept. However, as we shall see in Sect. 3, statistics-based image representation is often not robust to image reversal. The reason mainly lies in that handcrafted descriptors such as SIFT (Lowe 2004) may change completely after being reversed (Fig. 1), so that it is difficult to find feature correspondence between an image and its reversed version. Consequently, the BoF representation of an image might be totally different after it is reversed. Meanwhile, most CNN models are also sensitive to image reversal, since convolution is not reversal-invariant. This lack of feature stability limits machine learning algorithms from learning discriminative models. To cope with this, researchers have proposed an effective approach named data augmentation, which works by adding a reversed copy of each image (Chatfield et al. 2011; Chai et al. 2013), or by reversing each training image in the CNN training process with a probability of \(50\%\) (Krizhevsky et al. 2012). Although data augmentation consistently improves recognition accuracy, it has the disadvantage of being more computationally expensive, especially at the online testing stage of the BoF model.

This paper presents an alternative idea, i.e., designing reversal-invariant representations of local patterns for both the BoF and CNN models. Although this idea has been previously used to deal with descriptor matching issues (Guo and Cao 2010; Ma et al. 2010; Zhao and Ngo 2013), we argue that these existing approaches are not primarily designed for image classification, and their performance is unsatisfactory due to the lack of consideration of some key properties. We will detail this point in Sect. 4.5.

On the BoF model, we start by observing the difference between an original descriptor and its reversed counterpart, and then suggest computing the orientation of each descriptor so that we can cancel out the impact of image reversal. For orientation estimation, we adopt an approximate summation over the gradient-based histograms of SIFT. Based on this, we propose Max-SIFT and RIDE (Reversal-Invariant Descriptor Enhancement), two simple, fast yet general algorithms which bring reversal invariance to local descriptors. Both Max-SIFT and RIDE are guaranteed to generate identical representations for an image and its left-right reversed copy. Experiments reveal that Max-SIFT and RIDE produce consistent accuracy improvements in image classification. RIDE outperforms data augmentation with higher recognition rates and lower time/memory consumption. Max-SIFT and RIDE appeared in preliminary publications (Xie et al. 2015b) and (Xie et al. 2015d), respectively.

In this extended journal version, we generalize the idea to the state-of-the-art CNN architectures. We first propose RI-Deep, a simple algorithm which extracts reversal-invariant deep features by post-processing. Then we design a reversal-invariant convolution operation (RI-Conv) and plug it into conventional CNNs, so that we can train reversal-invariant deep networks, which generate reversal-invariant deep features directly (without requiring post-processing). RI-Conv enjoys the advantage of enlarging the network capacity without increasing the model complexity. Experiments verify the effectiveness of our algorithms, demonstrating the importance of reversal invariance in training efficient CNN models and transferring deep features.

The remainder of this paper is organized as follows. Section 2 briefly introduces related work. Section 3 elaborates the importance of reversal invariance of image representation. Sections 4 and 5 illustrate our algorithms towards reversal-invariant representation of local patterns, and the application on the BoF and CNN models, respectively. Experiments are shown in each section. Finally, we conclude our work in Sect. 6.

2 Related Work

2.1 The BoF Model

The BoF model (Csurka et al. 2004) starts with describing local patches. Due to the limited descriptive power of raw image pixels, handcrafted descriptors, such as SIFT (Lowe 2004), HOG (Dalal and Triggs 2005) and LCS (Clinchant et al. 2007), are widely adopted. Although the regions for these descriptors can be automatically detected using operators such as DoG (Lowe 2004) and MSER (Matas et al. 2004), the dense sampling strategy (Bosch et al. 2006; Tuytelaars 2010) often works better for classification tasks.

Next, a visual vocabulary (codebook) is trained to estimate the feature space distribution. The codebook is often computed with iterative algorithms such as K-Means or GMM. Descriptors are then encoded with the codebook. Popular feature encoding methods include hard quantization, sparse coding (Yang et al. 2009), LLC encoding (Wang et al. 2010), super-vector encoding (Zhou et al. 2010), Fisher vector encoding (Sanchez et al. 2013), etc.

At the final stage, quantized feature vectors are aggregated into a compact image representation. Sum pooling, max-pooling and \(\ell _p\)-norm pooling (Feng et al. 2011) are common choices, and visual phrases (Zhang et al. 2009; Xie et al. 2014a) and/or spatial pyramids (Grauman and Darrell 2005; Lazebnik et al. 2006) are constructed for richer spatial context modeling. The representation vectors are then summarized (Xie et al. 2015c) and fed into machine learning algorithms such as the SVM.
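To make the pipeline above concrete, the following is a minimal sketch of the simplest instantiation (K-Means codebook, hard quantization, sum pooling); it is only an illustration under simplifying assumptions, not the encoding used later in this paper (which relies on Fisher vectors), and the function names are ours.

```python
import numpy as np

def train_codebook(descriptors, k=256, iters=20, seed=0):
    """Toy K-Means codebook over an (N, D) numpy array of local descriptors."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign every descriptor to its nearest codeword
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

def encode_image(descriptors, centers):
    """Hard quantization followed by sum pooling into an l2-normalized histogram."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)
```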

It is also important to organize local features according to the property of the image dataset. A popular case is fine-grained object recognition, which is aimed at predicting the object class at a finer level of granularity. Given that each image contains, say, a bird, it remains to decide which species is depicted. As observed in Berg and Belhumeur (2013), Chai et al. (2013), Gavves et al. (2014), the key to fine-grained recognition is the alignment of semantic object parts, such as the head or tail of a bird. Meanwhile, for scene understanding, it is reasonable to capture other types of visual clues to assist recognition, such as orientations (Xie et al. 2014b) and important semantic regions (Lin et al. 2014b).

2.2 Convolutional Neural Networks

The Convolutional Neural Network (CNN) serves as a hierarchical model for large-scale visual recognition. It is based on the observation that a network with enough neurons is able to fit any complicated data distribution. In the early years, neural networks were shown to be effective for simple recognition tasks such as digit recognition (LeCun et al. 1990). More recently, the availability of large-scale training data (e.g., ImageNet (Deng et al. 2009)) and powerful GPUs has made it possible to train deep CNNs (Krizhevsky et al. 2012) which significantly outperform BoF-based models. A CNN is composed of several stacked layers. In each of them, responses from the previous layer are convolved with a set of filters and activated by a differentiable function. Hence, a CNN can be considered as a composite function, and is trained by back-propagating error signals defined by the difference between supervised and predicted labels at the top level. Recently, efficient methods were proposed to help CNNs converge faster and prevent over-fitting, such as ReLU activation (Krizhevsky et al. 2012), dropout and batch normalization (Ioffe and Szegedy 2015). It is believed that deeper networks produce better recognition results (Simonyan and Zisserman 2015; Szegedy et al. 2015).

The intermediate responses of CNNs, or the so-called deep features, serve as an efficient image description or a set of latent visual attributes (Donahue et al. 2014). They can be used for various vision applications, including image classification (Jia et al. 2014), image retrieval (Razavian et al. 2014; Xie et al. 2015a) and object detection (Girshick et al. 2014). A discussion of how different CNN configurations impact deep feature performance is available in Chatfield et al. (2014). Visualization also helps in understanding the behavior of CNN models (Zeiler and Fergus 2014).

2.3 The Invariance of Descriptors

One of the major shortcomings of the BoF and CNN models is the unsatisfactory stability of the image representation. An important direction of improvement is to study the invariance of local descriptors or patch operators. SIFT (Lowe 2004) achieves scale and rotation invariance by selecting the maxima in the scale space, picking a dominant orientation via gradient computation, and rotating the local patch accordingly. Other examples include Shape Context (Belongie et al. 2002), SURF (Bay et al. 2008), BRIEF (Calonder et al. 2010), BRISK (Leutenegger et al. 2011), ORB (Rublee et al. 2011), FREAK (Alahi et al. 2012), etc. Radial transform (Takacs et al. 2013) and polar analysis (Liu et al. 2014) play important roles in generating rotation-invariant local features.

In some vision tasks such as fine-grained recognition, objects may have different left/right orientations. Since handcrafted descriptors (such as SIFT) and convolution operations are not reversal-invariant, the feature representations of an image and its reversed version might be totally different. To this end, researchers have proposed to augment image datasets by adding a reversed copy of each original image, and to perform classification on the enlarged training and testing sets (Chatfield et al. 2011; Chai et al. 2013). In Paulin et al. (2014), it is even suggested to learn a larger set of image transformations for data augmentation. Similar strategies are also adopted in the CNN training process, including a popular method which reverses each training sample with a probability of \(50\%\) and, as a part of data augmentation, is often combined with other techniques such as image cropping (Krizhevsky et al. 2012). Although data augmentation improves recognition accuracy consistently, it brings heavier computational overheads, e.g., almost doubled time and memory consumption at the online testing stage of the BoF model, or the requirement of more training epochs for the CNN training process to converge.

There are also efforts on designing reversal-invariant descriptors for image retrieval. Some of them (Ma et al. 2010; Xie et al. 2015b) consider geometry-inverted and brightness-inverted variants, and apply a symmetric function, such as dimension-wise summation or maximization, to cancel out the reversal operation. Other examples include setting extra flag bits to represent the reversal information (Guo and Cao 2010), or enforcing that the flows of all regions follow a pre-defined direction (Zhao and Ngo 2013). These pieces of work suggest that symmetry is the key to reversal invariance (Skelly and Sclaroff 2007; Wang et al. 2011).

Despite their success, all of these methods are mainly designed for descriptor matching or object retrieval, and their performance on image classification is unsatisfactory (see Table 3). In this paper, we propose efficient algorithms towards reversal-invariant representation, which benefit the recognition task.

Fig. 2 Content-based image retrieval on the right-oriented Aircraft-100 dataset (best viewed in color). We use the same query image with different orientations. AP for average precision, TP for true-positive, FP for false-positive (Color figure online)

3 Why Reversal Invariance?

People often take pictures without caring about the left/right orientation, since an image and its left-right reversed copy usually have the same semantic meaning. Consequently, there exist both left-oriented and right-oriented objects in almost every popular image dataset, especially in the case of fine-grained object recognition on animals, man-made tools, etc. For example, among the 11,788 images of the Bird-200 dataset (Wah et al. 2011), at least 5000 birds are oriented to the left and another 5000 to the right. In the Aircraft-100 dataset (Maji et al. 2013), with 10,000 images, we can also find more than 4800 left-oriented and more than 4500 right-oriented aircrafts.

However, we argue that most image representation models are sensitive to image reversal, i.e., the features extracted from an image and its reversed version may be completely different. Let us take a simple case study using the BoF model which encodes SIFT with Fisher vectors (Perronnin et al. 2010). Detailed settings are shown in Sect. 4.6. We perform image classification and retrieval tasks on the Aircraft-100 dataset (Maji et al. 2013). We choose this dataset mainly because the orientation of an aircraft is more easily determined than, say, that of a bird. Based on the original dataset, we manually reverse all the left-oriented images, generating a right-aligned dataset.

With the standard training/testing split (around 2/3 of the images are used for training and the rest for testing), the recognition rate is \(53.13\%\) on the original dataset and rises to \(63.94\%\) on the right-aligned dataset, a more-than-\(10\%\) absolute accuracy gain (a more-than-\(20\%\) relative gain). This implies that orientation alignment brings a huge benefit to fine-grained object recognition.

Staying on the Aircraft-100 dataset, we run a diagnostic experiment: we use all (10,000) images in the right-aligned dataset for training, and evaluate the model on the entire right-aligned and left-aligned datasets, respectively. When we test on the right-aligned dataset, i.e., the training images are identical to the testing images, the classification accuracy is \(99.73\%\) (not surprising since we are just performing self-validation). However, when we test on the left-aligned dataset, i.e., each testing image is the reversed version of a training image, the accuracy drops dramatically to \(46.84\%\). This experiment reveals that a model learned from right-oriented objects may not recognize left-oriented objects very well.

Lastly, we perform image retrieval on the right-aligned dataset to directly evaluate the feature quality. Given a query image, we sort the candidates according to the \(\ell _2\)-distance between the representation vectors. Some typical results are shown in Fig. 2. When the query has the same orientation (right) as the database, the search result is satisfying (mAP is 0.4143, the first false-positive is ranked at \(\#18\)). However, if the query image is reversed, its feature representation changes completely, and the retrieval accuracy drops dramatically (mAP is 0.0025, the first true-positive is ranked at \(\#388\)). It is worth noting, in the latter case, that the reversed version of the query image is ranked at \(\#514\). This means that more than 500 images, most of them coming from different categories, are more similar to the query than its reversed copy, because the image feature is not reversal-invariant.

Although all the above experiments are based on the BoF model with SIFT and Fisher vectors, we emphasize that similar trouble also arises when extracting deep features from a pre-trained neural network. Since convolution is not reversal-invariant, the features extracted from an image and its reversed version are often different, even when the network is trained with data augmentation (each training image is reversed with a \(50\%\) probability). We will present a detailed analysis of this point in Sect. 5.

Since an image and its reversed copy might have totally different feature representations, in a fine-grained dataset containing both left-oriented and right-oriented objects we are implicitly partitioning the images of each class into two (or even more) prototypes. Consequently, the number of training images of each prototype is reduced and the risk of over-fitting is increased. With this observation, some algorithms (Chatfield et al. 2011; Chai et al. 2013) augment the dataset by generating a reversed copy of each image to increase the number of training cases per prototype, and the testing stage of deep networks often involves image reversal followed by score averaging (Krizhevsky et al. 2012; Simonyan and Zisserman 2015). We propose a different idea that generates reversal-invariant image representation in a bottom-up manner.

4 Reversal Invariance for BoF

This section introduces reversal invariance to the BoF model by designing reversal-invariant local descriptors. We first discuss the basic principle of designing reversal-invariant descriptors, and then provide a simple solution named Max-SIFT. After that, we generalize Max-SIFT as RIDE, and show that it can be applied to more types of local descriptors. Experiments on the BoF model and Fisher vector encoding verify the effectiveness of our algorithms.

Fig. 3 SIFT and its reversed version. Corresponding grids/gradients are marked with the same number. Numbers in the original SIFT indicate the order of collecting grids/gradients

4.1 Reversal-Invariant Local Descriptors

4.1.1 Reversal Invariance as a Symmetric Function

We start by observing how SIFT, a typical handcrafted descriptor, changes under left-right image reversal. The structure of a SIFT descriptor is illustrated in Fig. 3. A patch is partitioned into \(4\times 4\) spatial grids, and in each grid an 8-dimensional gradient histogram is computed. Here we assume that the spatial grids are traversed row by row, from top to bottom and from left to right within each row, and that the gradient intensities in each grid are collected in a counter-clockwise order. When an image is left-right reversed, all the patches on it are reversed as well. In a reversed patch, both the order of traversing the spatial grids and the order of collecting gradient values change, although the absolute gradient values in the corresponding directions do not. Take the lower-right grid in the original SIFT descriptor (\(\#15\)) as an example. When the image is reversed, this grid appears at the lower-left position (\(\#12\)), and the order of collecting gradients in the grid changes from \(\left( 0,1,2,3,4,5,6,7\right) \) to \(\left( 4,3,2,1,0,7,6,5\right) \).

Denote the original SIFT as \({\mathbf {d}}={\left( d_0,d_1,\ldots ,d_{127}\right) }\), in which \({d_{i\times 8+j}}={a_{i,j}}\) for \({i}={0,1,\ldots ,15}\) and \({j}={0,1,\ldots ,7}\). As shown in Fig. 3, each index (0 to 127) of the original SIFT is mapped to another index of the reversed SIFT. For example, \(d_{117}\) (\(a_{14,5}\), the bold arrow in Fig. 3) would appear at \(d_{111}\) (\(a_{13,7}\)) when the descriptor is reversed. Denote the index mapping function as \(f\left( \cdot \right) \) (e.g., \({f\left( 0\right) }={28}\), \({f\left( 117\right) }={111}\)), so that the reversed SIFT can be computed as: \({\mathbf {d}^\mathrm {R}}\doteq {\mathbf {f}\left( \mathbf {d}\right) }= {\left( d_{f\left( 0\right) },d_{f\left( 1\right) },\ldots ,d_{f\left( 127\right) }\right) }\).
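The index mapping \(f\left( \cdot \right) \) can be built once and applied to every descriptor. Below is a sketch under the grid/bin ordering assumed above (row-major grids, counter-clockwise bins mapped as \(\left( 0,\ldots ,7\right) \rightarrow \left( 4,3,2,1,0,7,6,5\right) \)); other SIFT implementations may order the dimensions differently, so the permutation would have to be adapted.

```python
import numpy as np

# Gradient-bin permutation under left-right reversal, as stated in the text:
# counter-clockwise bins (0,...,7) become (4,3,2,1,0,7,6,5).
BIN_MAP = np.array([4, 3, 2, 1, 0, 7, 6, 5])

def reversal_permutation():
    """Build the 128-d index map f(.) for a left-right reversed SIFT descriptor.

    Grids are assumed to be stored row-major (top to bottom, left to right
    within each row), consistent with f(0)=28 and f(117)=111 in the text.
    """
    f = np.empty(128, dtype=int)
    for row in range(4):
        for col in range(4):
            src_grid = 4 * row + col        # grid index in the original patch
            dst_grid = 4 * row + (3 - col)  # its mirrored position
            for j in range(8):
                f[dst_grid * 8 + BIN_MAP[j]] = src_grid * 8 + j
    return f

F = reversal_permutation()

def reverse_sift(d):
    """Return d^R = (d_{f(0)}, ..., d_{f(127)})."""
    return np.asarray(d)[F]
```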

Towards reversal invariance, we need to design a descriptor transformation function \(\mathbf {r}\left( \mathbf {d}\right) \), so that \({\mathbf {r}\left( \mathbf {d}\right) }={\mathbf {r}\left( \mathbf {d}^\mathrm {R}\right) }\) for any descriptor \(\mathbf {d}\). For this, we define \({\mathbf {r}\left( \mathbf {d}\right) }={\mathbf {s}\left( \mathbf {d},\mathbf {d}^\mathrm {R}\right) }\), in which \(\mathbf {s}\left( \cdot ,\cdot \right) \) satisfies symmetry, i.e., \({\mathbf {s}\left( \mathbf {d}_1,\mathbf {d}_2\right) }={\mathbf {s}\left( \mathbf {d}_2,\mathbf {d}_1\right) }\) for any pair \(\left( \mathbf {d}_1,\mathbf {d}_2\right) \). In this way reversal invariance is achieved: \({\mathbf {r}\left( \mathbf {d}\right) }={\mathbf {s}\left( \mathbf {d},\mathbf {d}^\mathrm {R}\right) }= {\mathbf {s}\left( \mathbf {d}^\mathrm {R},\mathbf {d}\right) }= {\mathbf {s}\left( \mathbf {d}^\mathrm {R},\left( \mathbf {d}^\mathrm {R}\right) ^\mathrm {R}\right) }= {\mathbf {r}\left( \mathbf {d}^\mathrm {R}\right) }\). We use the fact that \({\left( \mathbf {d}^\mathrm {R}\right) ^\mathrm {R}}={\mathbf {d}}\) holds for any descriptor \(\mathbf {d}\).

4.1.2 The Max-SIFT Descriptor

There are many symmetric functions \(\mathbf {s}\left( \cdot ,\cdot \right) \), such as dimension-wise summation or maximization. Here we consider an extremely simple case named Max-SIFT, in which we choose the one of \(\mathbf {d}\) and \(\mathbf {d}^\mathrm {R}\) with the larger sequential lexicographic order. By the sequential lexicographic order we mean to regard each SIFT descriptor as a sequence of length 128, in which, on each dimension (an element of the sequence), the larger value has the higher priority. The generalized algorithm for selecting the vector with the maximal sequential lexicographic order is provided in Algorithm 1.

Algorithm 1 Selecting the vector with the maximal sequential lexicographic order

Therefore, to compute the Max-SIFT descriptor for \(\mathbf {d}\), we only need to compare the dimensions of \(\mathbf {d}\) and \(\mathbf {d}^\mathrm {R}\) one by one and stop at the first difference. Let us denote the Max-SIFT descriptor of \(\mathbf {d}\) by \(\widehat{\mathbf {d}}\), and use the following notation:

$$\begin{aligned} {\widehat{\mathbf {d}}}={\mathbf {r}\left( \mathbf {d}\right) }={\widehat{\max }\left\{ \mathbf {d},\mathbf {d}^\mathrm {R}\right\} }, \end{aligned}$$
(1)

where \(\widehat{\max }\left\{ \cdot ,\cdot ,\ldots ,\cdot \right\} \) denotes the element with the maximal sequential lexicographic order (see Algorithm 1).

Algorithm 2 The Max-SIFT descriptor

The pseudo code of Max-SIFT is given in Algorithm 2. We point out that there are many other symmetric functions, but their performance is often inferior to that of Max-SIFT. For example, using Average-SIFT, i.e., \({\mathbf {r}\left( \mathbf {d}\right) }={\frac{1}{2}\left( \mathbf {d}+\mathbf {d}^\mathrm {R}\right) }\), leads to a 1–\(3\%\) accuracy drop in every single image classification case.
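The selection in Eq. (1) reduces to a dimension-by-dimension comparison that stops at the first difference. The following is a minimal sketch (not the authors' released code); it assumes the reversed descriptor has already been obtained, e.g., with the index map sketched in Sect. 4.1.1.

```python
import numpy as np

def max_sift(d, d_rev):
    """Max-SIFT: keep whichever of {d, d^R} has the larger sequential
    lexicographic order; d_rev can be computed with reverse_sift() above."""
    d = np.asarray(d, dtype=float)
    d_rev = np.asarray(d_rev, dtype=float)
    # Compare dimension by dimension and stop at the first difference.
    for a, b in zip(d, d_rev):
        if a != b:
            return d if a > b else d_rev
    return d  # d equals its reversed copy; either one can be returned
```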

4.2 RIDE: Generalized Reversal Invariance

4.2.1 The Orientation of SIFT

Let us choose the descriptor from \(\mathbf {d}\) and \(\mathbf {d}^\mathrm {R}\) in a more generalized manner. In general, we define an orientation quantization function \(q\left( \cdot \right) \), and choose the one in \(\left\{ \mathbf {d},\mathbf {d}^\mathrm {R}\right\} \) with the larger function value. Ideally, \(q\left( \cdot \right) \) captures the orientation property of a descriptor, e.g., \(q\left( \mathbf {d}\right) \) reflects the extent to which \(\mathbf {d}\) is oriented to the right. Recall that in the original version of SIFT (Lowe 2004), each descriptor is naturally assigned an orientation angle \({\theta }\in {\left[ 0,2\pi \right) }\), so that we could simply take \({q\left( \mathbf {d}\right) }={\cos \theta }\); however, orientation is often ignored in the implementation of dense SIFT (Bosch et al. 2006; Vedaldi and Fulkerson 2010). We aim at recovering the orientation with fast computations.

Fig. 4 Estimating the orientation of SIFT

The major conclusion is that the global orientation of a densely-sampled SIFT descriptor can be estimated by accumulating clues from the local gradients. For each of the 128 dimensions, we take its gradient value and look up its direction (1 of 8). The gradient value is then decomposed into two components along the x-axis and y-axis, respectively. The left/right orientation of the descriptor is then computed by collecting the x-axis components over all 128 dimensions. Formally, we define 8 orientation vectors \(\mathbf {u}_j\), \({j}={0,1,\ldots ,7}\). According to the definition of SIFT in Fig. 3, we have \({\mathbf {u}_j}={\left( \cos \left( j\pi /4\right) ,\sin \left( j\pi /4\right) \right) ^\top }\). The global gradient can be computed as \({\mathbf {G}\left( \mathbf {d}\right) }={\left( G_x,G_y\right) ^\top }={{\sum _{i=0}^{15}}{\sum _{j=0}^{7}}a_{i,j}\mathbf {u}_j}\). The computation of \(\mathbf {G}\left( \mathbf {d}\right) \) is illustrated in Fig. 4. The proof is provided in Appendix 1.
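The accumulation of \(\mathbf {G}\left( \mathbf {d}\right) \) can be sketched as follows, assuming the same (grid, bin) layout as in Fig. 3 (i.e., \(d_{i\times 8+j}=a_{i,j}\)); this is an illustrative sketch, not a verbatim reproduction of the authors' implementation.

```python
import numpy as np

def global_gradient(d):
    """Accumulate the global gradient G = (Gx, Gy) of a 128-d SIFT descriptor,
    with bin j pointing along the angle j*pi/4 (Fig. 3)."""
    a = np.asarray(d, dtype=float).reshape(16, 8)   # 16 grids x 8 gradient bins
    angles = np.arange(8) * np.pi / 4.0
    gx = float((a * np.cos(angles)).sum())          # sum of x-axis components
    gy = float((a * np.sin(angles)).sum())          # sum of y-axis components
    return gx, gy
```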

4.2.2 The RIDE Algorithm

We simply take \(G_x\) as the value of quantization function, i.e., \({q\left( \mathbf {d}\right) }={G_x\left( \mathbf {d}\right) }\) for every \(\mathbf {d}\). It is worth noting that \({q\left( \mathbf {d}\right) }={-q\left( \mathbf {d}^\mathrm {R}\right) }\) holds for any \(\mathbf {d}\), therefore we can simply use the sign of \(q\left( \mathbf {d}\right) \) to compute the reversal-invariant descriptor transform \(\widetilde{\mathbf {d}}\):

$$\begin{aligned} {\widetilde{\mathbf {d}}}={\mathbf {r}\left( \mathbf {d}\right) }={\left\{ \begin{array}{ll} \mathbf {d} &{}\quad {q\left( \mathbf {d}\right) }>{0} \\ \mathbf {d}^\mathrm {R} &{}\quad {q\left( \mathbf {d}\right) }<{0} \\ \widehat{\max }\left\{ \mathbf {d},\mathbf {d}^\mathrm {R}\right\} &{}\quad {q\left( \mathbf {d}\right) }={0} \end{array} \right. }\quad . \end{aligned}$$
(2)

We name the algorithm RIDE (Reversal-Invariant Descriptor Enhancement). When \({q\left( \mathbf {d}\right) }={0}\), RIDE degenerates to Max-SIFT. Since Max-SIFT first compares \(d_0\) and \(d_{28}\) (\({f\left( 0\right) }={28}\), see Sect. 4.1.1), we can approximate it as a special case of RIDE, with \({q\left( \mathbf {d}\right) }={d_0-d_{28}}\).
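Putting Eq. (2) together with the helpers sketched earlier (reverse_sift, global_gradient, max_sift), RIDE can be written as a few lines; the composition below is our own sketch of the procedure, not the original implementation.

```python
import numpy as np

def ride(d, index_map):
    """RIDE (Eq. 2): orient a descriptor so that q(d) = Gx(d) >= 0,
    falling back to Max-SIFT when Gx == 0.
    index_map is the reversal permutation from the earlier sketch."""
    d = np.asarray(d, dtype=float)
    d_rev = d[index_map]
    gx, _ = global_gradient(d)        # q(d) = Gx(d)
    if gx > 0:
        return d
    if gx < 0:
        return d_rev
    return max_sift(d, d_rev)         # tie-break with Max-SIFT
```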

4.2.3 Generalized RIDE

We generalize RIDE to (a) other local descriptors and (b) more types of reversal invariance.

When RIDE is applied to other dense descriptors, we can first extract SIFT descriptors on the same patches, then compute \(\mathbf {G}\) to estimate the orientation of those patches, and perform the reversal operation if necessary. A generalized flowchart of RIDE is illustrated in Algorithm 3. The extra time overhead of this process mainly comes from the computation of SIFT, which can be avoided in the case of Color-SIFT descriptors. For example, RGB-SIFT is composed of three SIFT vectors \(\mathbf {d}_\mathrm {R}\), \(\mathbf {d}_\mathrm {G}\) and \(\mathbf {d}_\mathrm {B}\), computed from the individual red, green and blue channels; therefore we can compute \(\mathbf {G}_\mathrm {R}\), \(\mathbf {G}_\mathrm {G}\) and \(\mathbf {G}_\mathrm {B}\) individually, and combine them as \({\mathbf {G}}={0.30\mathbf {G}_\mathrm {R}+0.59\mathbf {G}_\mathrm {G}+0.11\mathbf {G}_\mathrm {B}}\). For other Color-SIFT descriptors, the only difference lies in the linear combination coefficients. With this trick we can perform RIDE on Color-SIFT descriptors very fast.

Algorithm 3 The generalized flowchart of RIDE

In the case that RIDE is applied to fast binary descriptors for image retrieval, we can obtain the orientation vector \(\mathbf {G}\) without computing SIFT. Let us take the BRIEF descriptor (Calonder et al. 2010) as an example, as sketched below. For a descriptor \(\mathbf {d}\), \(G_x\left( \mathbf {d}\right) \) is obtained by accumulating the binary tests: for each tested pixel pair \(\left( p_1,p_2\right) \) with distinct x-coordinates, if the left pixel has the smaller intensity value, we add 1 to \(G_x\left( \mathbf {d}\right) \), otherwise we subtract 1 from \(G_x\left( \mathbf {d}\right) \). If the x-coordinates of \(p_1\) and \(p_2\) are the same, the pair is ignored. \(G_y\left( \mathbf {d}\right) \) is computed similarly. We still take \({q\left( \mathbf {d}\right) }={G_x\left( \mathbf {d}\right) }\) to quantize the left-right orientation. This idea also generalizes to other binary descriptors such as ORB (Rublee et al. 2011), which is based on BRIEF.
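A minimal sketch of this accumulation is given below. The function name, the pair representation, and the sign convention for \(G_y\) are illustrative assumptions; the text only specifies the \(G_x\) voting rule.

```python
def brief_orientation(pairs, patch):
    """Accumulate (Gx, Gy) votes from BRIEF-style binary tests.
    `pairs` is a list of pixel-coordinate pairs ((x1, y1), (x2, y2)) and
    `patch` a 2-D intensity array."""
    gx = gy = 0
    for (x1, y1), (x2, y2) in pairs:
        i1, i2 = patch[y1, x1], patch[y2, x2]
        if x1 != x2:
            # +1 if the left pixel of the pair is the darker one, else -1
            left_int, right_int = (i1, i2) if x1 < x2 else (i2, i1)
            gx += 1 if left_int < right_int else -1
        if y1 != y2:
            # analogous vote along the vertical axis (assumed convention)
            top_int, bottom_int = (i1, i2) if y1 < y2 else (i2, i1)
            gy += 1 if top_int < bottom_int else -1
    return gx, gy
```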

RIDE is also capable of cancelling out a larger family of reversal operations, including upside-down image reversal and image rotation by \(90^\circ \), \(180^\circ \) and \(270^\circ \). For this we need to take more information from the global gradient vector \({\mathbf {G}}={\left( G_x,G_y\right) ^\top }\). Recall that requiring \({G_x}>{0}\) selects 1 descriptor from 2 candidates, resulting in RIDE-2 (equivalent to the RIDE described above) for left-right reversal invariance. Similarly, requiring \({G_x}>{0}\) and \({G_y}>{0}\) selects 1 descriptor from 4 candidates, giving RIDE-4 for both left-right and upside-down reversal invariance, and requiring \({G_x}>{G_y}>{0}\) gives RIDE-8 for both reversal and rotation invariance. We do not use RIDE-4 and RIDE-8 in this paper, since upside-down reversal and heavy rotations are rarely observed, whereas the descriptive power of a descriptor is reduced by strong constraints. An experimental analysis of this issue can be found in Appendix 2.

Fig. 5 The distribution of \(q(\cdot )\) values on the Bird-200 dataset. For Max-SIFT, \({q(\mathbf {d})}={d_0{-}d_{28}}\) (see the texts in Sect. 4.2.2). All the SIFT descriptors are \(\ell _2\)-normalized so that \({\Vert \cdot \Vert _2}={1}\) (Color figure online)

4.3 Numerical Stability Issues

Both Max-SIFT and RIDE may suffer from numerical stability issues, especially in areas with low gradient magnitudes. When the quantization function value \(q\left( \mathbf {d}\right) \) is close to 0, small image noise may change the sign of \(q\left( \mathbf {d}\right) \) and, consequently, the Max-SIFT and/or RIDE descriptors. To quantitatively analyze the impact of image noise, we first estimate the distribution of \(q\left( \mathbf {d}\right) \) on the Bird-200 dataset (Wah et al. 2011). From the histogram in Fig. 5, one may observe that most SIFT descriptors have relatively small \(q\left( \cdot \right) \) values under Max-SIFT. With the descriptors normalized (\({\left\| \mathbf {d}\right\| _2}={1}\) for all \(\mathbf {d}\)), the median of the \(\left| q\left( \mathbf {d}\right) \right| \) values is 0.0556 for Max-SIFT and 0.1203 for RIDE, which implies that RIDE is more robust than Max-SIFT to small image noise. The reason is that RIDE summarizes the information of the whole SIFT descriptor, while Max-SIFT only considers a few dimensions.

Consider image classification on the Bird-200 dataset. We add random Gaussian noise with standard deviation 0.1203 (the median of all \(\left| q\left( \cdot \right) \right| \) values) to each \(q\left( \cdot \right) \) value of RIDE, and find that the random noise causes the classification accuracy to drop by less than \(1\%\), which is small compared to the gain of RIDE (\(6.37\%\), see Table 2(d)).

Experimental results are very similar on the Aircraft-100 dataset (Maji et al. 2013).

4.4 Application to Image Classification

The benefit brought by Max-SIFT and RIDE to image classification is straightforward. Consider an image \(\mathbf {I}\), and a set of, say, SIFT descriptors extracted from the image: \({\mathcal {D}}={\left\{ \mathbf {d}_1,\mathbf {d}_2,\ldots ,\mathbf {d}_M\right\} }\). When the image is left-right reversed, the set \(\mathcal {D}\) becomes \({\mathcal {D}^\mathrm {R}}={\left\{ \mathbf {d}_1^\mathrm {R},\mathbf {d}_2^\mathrm {R},\ldots ,\mathbf {d}_M^\mathrm {R}\right\} }\). If the descriptors are not reversal-invariant, i.e., \({\mathcal {D}\ne \mathcal {D}^\mathrm {R}}\), the feature representations produced by \(\mathcal {D}\) and \(\mathcal {D}^\mathrm {R}\) might be totally different. With Max-SIFT or RIDE, we have \({\widehat{\mathbf {d}}}={\widehat{\mathbf {d}^\mathrm {R}}}\) or \({\widetilde{\mathbf {d}}}={\widetilde{\mathbf {d}^\mathrm {R}}}\) for any \(\mathbf {d}\), therefore \({\widehat{\mathcal {D}}}={\widehat{\mathcal {D}^\mathrm {R}}}\) or \({\widetilde{\mathcal {D}}}={\widetilde{\mathcal {D}^\mathrm {R}}}\). Consequently, we generate the same representation for an image and its reversed copy.

A simple trick applies when Max-SIFT or RIDE is adopted with Spatial Pyramid Matching (Lazebnik et al. 2006). Note that corresponding descriptors may have different x-coordinates on an image and its reversed copy, e.g., a descriptor appearing at the upper-left corner of the original image appears at the upper-right corner of the reversed image, resulting in different spatial pooling bin assignments. To cope with this, we count the number of descriptors to be reversed, i.e., those satisfying \({\widehat{\mathbf {d}}}\ne {\mathbf {d}}\) or \({\widetilde{\mathbf {d}}}\ne {\mathbf {d}}\). If this number is larger than half of the total number of descriptors, we left-right reverse the descriptor set by replacing the x-coordinate of each descriptor with \(W-x\), where W is the image width (see the sketch below). This is equivalent to predicting the orientation of an image using the orientation of its SIFT descriptors (see Sect. 4.7.1). A safer alternative is to use symmetric pooling bins; in our experiments (see Sect. 4.6.1), we use a spatial pyramid with 4 regions (the entire image and three horizontal stripes).
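A sketch of this pooling trick, reusing the ride() helper and index map from the earlier sketches, is shown below; the function and argument names are illustrative.

```python
import numpy as np

def align_descriptor_set(descriptors, positions, width, index_map):
    """Apply RIDE to every descriptor; if more than half of them were reversed,
    mirror all x-coordinates (x -> W - x) so that spatial pooling bins match."""
    aligned, flipped = [], 0
    for d in descriptors:
        rd = ride(d, index_map)
        flipped += int(not np.array_equal(rd, np.asarray(d, dtype=float)))
        aligned.append(rd)
    if flipped > len(descriptors) / 2:
        positions = [(width - x, y) for (x, y) in positions]
    return aligned, positions
```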

4.5 Comparison with Previous Work

We first discuss the relationship between RIDE and the original SIFT descriptor. SIFT aligns the descriptor by rotating the detected region so that the dominant orientation points to the upright direction. After this, we have \({G_x}={0}\) (see “The Implementation of SIFT”), and thus RIDE degenerates to Max-SIFT. However, in the ordinary implementation of dense sampling, where descriptors are not aligned beforehand (Vedaldi and Fulkerson 2010), RIDE serves as a quick and efficient way of achieving reversal invariance which does not require considering rotation invariance explicitly.

Some recent work achieves reversal invariance with data augmentation (Wang et al. 2010; Chatfield et al. 2011; Chai et al. 2013; Paulin et al. 2014). This strategy aims at increasing the number of training samples for each prototype (e.g., left/right orientation, etc.). In Sect. 4.6.2, we will show that RIDE works better and faster than data augmentation, arguably because RIDE allows local reversal on each part of the object, which is more flexible than global reversal (see Sect. 4.7.3).

Although some reversal-invariant descriptors have been proposed for descriptor matching (Ma et al. 2010; Guo and Cao 2010; Zhao and Ngo 2013) or object retrieval (Guo and Cao 2010; Zhao and Ngo 2013; Xie et al. 2015b), these descriptors have not been adopted in fine-grained recognition. We implement several of them and compare them with our algorithms in Table 3. One can observe that Max-SIFT and RIDE significantly outperform these competitors in every single case. We will provide a detailed analysis of this issue in Sect. 4.7.2.

Table 1 Image classification datasets used in our experiments

RIDE shares the idea of accumulating gradients with FIND (Guo and Cao 2010), but uses a different way of computation. FIND partitions the descriptor bins into two parts (left and right) according to their spatial positions, and accumulates each gradient intensity without considering its horizontal (left/right) component. That is to say, all gradient bins in the same cell contribute equally, although they point to different directions. RIDE improves this strategy by decomposing each bin into a horizontal component and a vertical component, and accumulating each component, with its (plus/minus) sign, into a global gradient vector \(\mathbf {G}\). According to our proof in Appendix 1, this strategy makes better use of the gradient information. In experiments, RIDE works better than FIND in orientation prediction (see Sect. 4.7.1), which, we believe, is the main reason for the accuracy gap in classification (see Table 3).

Finally, we discuss the difference between RIDE and F-SIFT (Zhao and Ngo 2013). We predict the orientation of each descriptor based on the description vector, while F-SIFT normalizes each local patch before extracting the descriptor on it. In other words, RIDE computes the orientation based on the descriptor, while F-SIFT bases it on the patch. We point out that prediction based on the descriptor is more reliable: we are dealing with descriptors, not patches, in the subsequent feature encoding stage. In experiments, RIDE works better than F-SIFT in orientation prediction (see Sect. 4.7.1), and we believe this is the main reason for the accuracy gap in classification (see Table 3). This principle generalizes to the CNN case (see Sect. 5.2.5). In addition, RIDE works much faster than F-SIFT (see Sect. 4.6.3).

In summary, although our algorithms share some ideas with previous approaches, we achieve better performance in image classification by generating more discriminative image-level features. Moreover, our approach enjoys the advantages of easy implementation and low computational cost.

4.6 Experiments

4.6.1 Datasets and Settings

We evaluate our algorithm on four publicly available fine-grained object recognition datasets, three scene classification datasets and one generic object classification dataset. The detailed information of the used datasets is summarized in Table 1.

Basic experimental settings follow the recently proposed BoF model (Sanchez et al. 2013). An image is scaled, with the aspect ratio preserved, so that there are 300 pixels along the larger axis. We only use the region within the bounding box when it is available. We use VLFeat (Vedaldi and Fulkerson 2010) to extract dense RootSIFT (Arandjelovic and Zisserman 2012) descriptors. The spatial stride and window size of dense sampling are 6 and 12, respectively. On the same set of patches, LCS, RGB-SIFT and Opponent-SIFT (Sande et al. 2010) descriptors are also extracted. Max-SIFT or RIDE is thereafter computed for each type of descriptor. In the former case, we can only apply Max-SIFT to SIFT-based descriptors, so the LCS descriptors remain unchanged. When we apply RIDE to RootSIFT, we compute RIDE on the original SIFT, obtained by dimension-wise squaring of RootSIFT.

The dimensions of the SIFT, LCS and Color-SIFT descriptors are reduced by PCA to 64, 64 and 128, respectively. We cluster the descriptors with a GMM of 32 components, and use the improved Fisher vectors (IFV) for feature encoding. A spatial pyramid with 4 regions (the entire image and three horizontal stripes) is adopted. Features generated by the SIFT and LCS descriptors are concatenated as the FUSED feature. The final vectors are square-root normalized and then \(\ell _2\)-normalized (Lapin et al. 2014), and fed into LibLINEAR (Fan et al. 2008), a scalable SVM implementation, with the slack parameter \({C}={10}\). Averaged accuracy over categories is reported on the fixed training/testing split provided by the authors.
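The last two steps of this pipeline (normalization and the linear SVM) can be sketched as follows. This is a simplified illustration: `X_train` and `X_test` are assumed to already hold the pooled Fisher vectors described above, and scikit-learn's `LinearSVC` (which wraps LIBLINEAR) stands in for the LibLINEAR binary used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC   # liblinear-based linear SVM

def normalize_fisher(x):
    """Square-root (power) normalization followed by l2 normalization."""
    x = np.sign(x) * np.sqrt(np.abs(x))
    return x / (np.linalg.norm(x) + 1e-12)

def train_and_predict(X_train, y_train, X_test, C=10.0):
    """Normalize the Fisher vectors and classify with a linear SVM (C = 10)."""
    X_train = np.vstack([normalize_fisher(x) for x in X_train])
    X_test = np.vstack([normalize_fisher(x) for x in X_test])
    clf = LinearSVC(C=C).fit(X_train, y_train)
    return clf.predict(X_test)
```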

To compare our results with the state-of-the-art classification results, strong Fisher vectors are extracted by resizing the images to 600 pixels along the larger axis, using a spatial stride of 8, a window size of 16, and 256 GMM components.

4.6.2 Image Classification Results

We first report fine-grained object recognition accuracy with different descriptors in Table 2. Beyond the original descriptors, we implement Max-SIFT, RIDE and data augmentation. By data augmentation we mean generating a reversed copy of each training/testing image, using the augmented set to train the model, testing with both the original and reversed samples, and predicting the label with a soft-max function (Paulin et al. 2014).

Table 2 Classification accuracy (\(\%\)) of different models

In Table 2, one can see that both Max-SIFT and RIDE produce consistent accuracy gains over the original descriptors (ORIG). Moreover, when we use SIFT or Color-SIFT descriptors, RIDE also produces higher accuracy than data augmentation (AUGM). When the LCS descriptors are used, RIDE works a little worse than AUGM, probably because the orientation of LCS (not a gradient-based descriptor) is not very well estimated with SIFT gradients.

We emphasize that data augmentation requires almost doubled computational costs compared to RIDE (see Sect. 4.6.3 for details), since the time/memory complexity of many classification models is proportional to the number of training/testing images. To make the comparison fair, we double the codebook size used with RIDE to obtain longer features, since it is common knowledge that larger codebooks often lead to better classification results. Such a model, denoted by RIDE \(\times \) 2, works better than AUGM in every single case.

We also use strong features and compare Max-SIFT and RIDE with other reversal-invariant descriptors, namely MI-SIFT (Ma et al. 2010), FIND (Guo and Cao 2010) and F-SIFT (Zhao and Ngo 2013). We compute these competitors for each SIFT component of RGB-SIFT, and leave LCS unchanged in the FUSED feature. Apart from descriptor computation, all the other stages are exactly the same. Results are shown in Table 3. The consistent 3–\(4\%\) gain verifies that RIDE makes a stable contribution to visual recognition.

Table 3 Classification accuracy (\(\%\)) comparison with some recent work

It is worth noting that Bird-200 is a little different from the other datasets, since part detection and description often play important roles in this fine-grained recognition task. Recently, researchers have designed sophisticated part-based recognition algorithms, including Chai et al. (2013), Gavves et al. (2014), Xie et al. (2013), Zhang et al. (2013), Zhang et al. (2014b), Zhang et al. (2014a), Li et al. (2015), and Zhang et al. (2016). We also evaluate RIDE with SIFT on the detected parts provided by some of these approaches. RIDE boosts the recognition accuracy of Chai et al. (2013) and Gavves et al. (2014) from 56.6 to \(60.7\%\) and from 65.3 to \(67.4\%\), respectively. In comparison, Gavves et al. (2014) applies data augmentation to boost the accuracy from 65.3 to \(67.0\%\); RIDE produces better results with only half the time/memory consumption. With the parts learned by deep CNNs (Zhang et al. 2014a), the baseline performance is \(71.5\%\) with SIFT and LCS descriptors. With RIDE, we obtain \(73.1\%\), which is close to the accuracy obtained with deep features (\(73.9\%\) reported in the paper).

To show that Max-SIFT and RIDE also apply to more general classification tasks, we perform experiments on scene classification and generic object recognition. The FUSED (SIFT with LCS) features are used, and the results are summarized in Table 3. It is interesting to see that Max-SIFT and RIDE also work well here, outperforming the recent competitors. Thus, although Max-SIFT and RIDE are motivated by observations on the fine-grained case, they enjoy good recognition performance on a wide range of image datasets.

4.6.3 Computational Costs

We report the time/memory costs of RIDE with SIFT in Table 4. The time cost of Max-SIFT is consistently lower than that of RIDE, and the memory cost is the same.

Since the only extra computation of RIDE comes from gradient accumulation and descriptor permutation, the additional time cost of RIDE is merely about \(1\%\) of the SIFT computation. This is significantly lower than for some previous methods such as F-SIFT (about \(30\%\) extra time; Zhao and Ngo 2013). RIDE does not require any extra memory storage. In contrast, if the dataset is augmented with left-right image reversal, one needs to compute and store two instances of each image, descriptor and feature vector, resulting in almost doubled time and memory overheads, which is comparable with using a double-sized codebook, whereas the latter produces much better classification results.

Table 4 Time/memory cost in each step of the BoF model

4.7 Discussion

4.7.1 Object Orientation Prediction

As a diagnostic experiment, we predict the left/right orientation of an image based on the orientation quantization function \(q\left( \cdot \right) \). We use the Aircraft-100 dataset, in which the orientation (left or right) of each aircraft is manually labeled. We use the ground-truth bounding box to crop the image, so that the objects are better aligned. After cropping, all the images are resized so that the longer axis has 600 pixels, and dense SIFT descriptors are extracted using the VLFeat library (Vedaldi and Fulkerson 2010).

We use 2/3 of the images (approximately 67 per category) for training. Without loss of generality, we assume that all the training images are oriented to the right. For each testing image, we compute its orientation score by accumulating clues from each descriptor. Suppose the width and height of the testing image are W and H; then a descriptor at position \(\left( x,y\right) \) has the “relative position” \(\left( x/W,y/H\right) \). The descriptor starts with an evidence of 0. On each training image, we find the nearest descriptor measured by the \(\ell _2\)-distance between relative positions, and compare the orientation quantization function values: if they are of the same sign, we add 1 to the evidence, otherwise we subtract 1 from it. Each descriptor then contributes the sign of its evidence (in \(\left\{ +1,-1,0\right\} \)) to the orientation score of the entire image. After all the descriptors are processed, if the total score is positive, the testing image is predicted to be oriented to the right; if it is negative, to the left; if it is 0, a random guess is taken (this rarely happens).
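The voting procedure above can be sketched as follows. This is only an illustration: the q(.) values and relative positions are assumed to be precomputed numpy arrays, and the names are ours.

```python
import numpy as np

def predict_orientation(test_q, test_pos, train_images):
    """Left/right orientation voting described in the text.
    test_q:       q(.) values of the test image's descriptors, shape (M,);
    test_pos:     their relative positions (x/W, y/H), shape (M, 2);
    train_images: list of (train_q, train_pos) pairs, all right-oriented.
    Returns +1 (right), -1 (left) or 0 (tie)."""
    score = 0
    for q, pos in zip(test_q, test_pos):
        evidence = 0
        for train_q, train_pos in train_images:
            # nearest training descriptor by relative-position distance
            nn = np.argmin(((train_pos - pos) ** 2).sum(axis=1))
            evidence += 1 if np.sign(train_q[nn]) == np.sign(q) else -1
        score += int(np.sign(evidence))   # each descriptor votes +1, -1 or 0
    return int(np.sign(score))
```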

Fig. 6 Global versus local image reversal. Local reversal (with manually labeled regions in the yellow boxes) allows more flexible image representation, and produces smaller feature distances between the test and target images (Color figure online)

Different orientation quantization functions produce different results. Using RIDE (\({q\left( \mathbf {d}\right) }={G_x\left( \mathbf {d}\right) }\)), the prediction accuracy is \(65.45\%\), whereas using Max-SIFT (\({q\left( \mathbf {d}\right) }={d_0{-}d_{28}}\), see Sect. 4.2.2), the accuracy drops to \(54.69\%\), barely above the chance level (\(50\%\)). We also implement FIND (Guo and Cao 2010) (which ignores gradient components) and F-SIFT (Zhao and Ngo 2013) (which computes on raw pixels) for orientation prediction, and the accuracies are 56.19 and \(57.41\%\), respectively. In summary, RIDE predicts the orientation of a local patch more accurately. As we have analyzed in Sect. 3, this significantly helps image classification.

4.7.2 Classification Versus Retrieval

We discuss the difference between image classification and object retrieval, followed by the reason why some descriptors designed for image retrieval (Ma et al. 2010) fail to achieve good performance in image classification (see Table 3).

To retrieve a partial-duplicate or near-duplicate object from a set of candidate images, the most important thing is to find a sufficient number of local feature matches. As handcrafted descriptors such as SIFT are often sensitive to image transformations (e.g., reversal, rotation, etc.), a good strategy is to cancel out these possible variations by pooling over the corresponding dimensions of each descriptor. For instance, MI-SIFT (Ma et al. 2010) averages the values of each four-bin group to cancel out image reversal. Although this may cause a precision drop, i.e., introduce some false matches, it is still possible to filter them out with post-processing such as spatial verification.

For image classification, especially fine-grained object recognition, things are different: the most important task is to generate discriminative image features that distinguish one class from the others. It is therefore often unreasonable to destroy the structure of a local descriptor for the purpose of invariance. To this end, we propose to maximally preserve the descriptive power of local features and to avoid dimension-wise operations.

4.7.3 Global Reversal Versus Local Reversal

We point out that RIDE benefits image representation from the perspective of regional feature matching.

An essential difference between RIDE and data augmentation comes from the comparison of local and global image reversal. By local reversal we mean that RIDE can decide whether to reverse each descriptor individually, while data augmentation only allows choosing one image from two candidates, i.e., either the original or the globally reversed one. Figure 6 compares both strategies in an intuitive manner. In these cases, we aim at matching a target image with a possibly reversed test image. With global reversal, we have only two choices and the flexibility of the model is limited. With local reversal, however, it is possible to reverse smaller regions, such as the turned head of the bird or cat. In this way we can find a larger number of true feature matches and obtain more similar image representations, i.e., a smaller feature distance. It is therefore not difficult to understand why RIDE works even better than data augmentation.

4.8 Summary

In this section, we explore reversal invariance in the context of the BoF model. We propose the Max-SIFT descriptor and the RIDE (Reversal-Invariant Descriptor Enhancement) algorithm, which bring reversal invariance to local descriptors. Our idea is inspired by the observation that most handcrafted descriptors are not reversal-invariant, whereas many fine-grained datasets contain objects with different left/right orientations. Max-SIFT and RIDE cancel out the impact of image/object reversal by estimating the orientation of each descriptor, and then forcing all the descriptors to have the same orientation. Experiments reveal that both of them significantly improve the accuracy of fine-grained object recognition and scene classification with very little extra computational cost. Both Max-SIFT and RIDE are robust to small image noise. Compared with data augmentation, RIDE produces better results with lower time/memory consumption.

5 Reversal Invariance for CNN

In this section, we generalize the above ideas from the BoF model to CNNs. We first present a simple strategy to improve deep features, which demonstrates the importance of reversal invariance in CNNs. Motivated by this, we propose a new convolution operation which allows us to train reversal-invariant deep networks directly.

Table 5 Classification accuracy (\(\%\)) without or with reversal-invariant deep features

5.1 Reversal-Invariant Deep Features (RI-Deep)

5.1.1 Average-Deep and Max-Deep

We start by observing the behavior of deep features, i.e., the neuron responses of a testing image extracted from a pre-trained CNN model. In general, if an image is left-right reversed, the neuron responses on each layer change accordingly, because the convolution operation is not reversal-invariant. In most deep CNN models (Krizhevsky et al. 2012; Szegedy et al. 2015; Simonyan and Zisserman 2015), data augmentation with image reversal is widely adopted at both the training and testing stages. In training, each sample is reversed with a probability of \(50\%\), so that the network can learn from objects with different orientations. In testing, neuron responses on both the original and reversed versions are computed and averaged. We shall verify in later experiments that the data augmentation modules in training and testing are complementary to each other.

Let us denote an image as \(\mathbf {I}\) and its left-right reversed version as \(\mathbf {I}^\mathrm {R}\). Given a deep CNN model \(\mathcal {M}\) and a specified layer number l, the feature vector extracted on the l-th layer is \({\mathbf {f}_l\left( \mathbf {I};\mathcal {M}\right) }\in {{\mathbb {R}}^{K_l}}\), where \(K_l\) is the number of channels (convolution kernels) on that layer. With the reversed image, we can also compute the reversed deep feature on the same layer, i.e., \(\mathbf {f}_l\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) \). Most often, \({\mathbf {f}_l\left( \mathbf {I};\mathcal {M}\right) }\ne {\mathbf {f}_l\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) }\).

Inspired by Sect. 4.1.1, we seek a deep feature transformation function \(\mathbf {r}\left( \cdot \right) \) which satisfies \({\mathbf {r}\left( \mathbf {I}\right) }={\mathbf {r}\left( \mathbf {I}^\mathrm {R}\right) }\) for any image \(\mathbf {I}\). Here, we choose two simple symmetric operations, named Average-Deep and Max-Deep, respectively:

$$\begin{aligned} {\mathbf {r}_l^\mathrm {AVG}\left( \mathbf {I}\right) }&={\frac{1}{2}\left[ \mathbf {f}_l\left( \mathbf {I};\mathcal {M}\right) + \mathbf {f}_l\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) \right] }, \end{aligned}$$
(3)
$$\begin{aligned} {\mathbf {r}_l^\mathrm {MAX}\left( \mathbf {I}\right) }&={\max \left\{ \mathbf {f}_l\left( \mathbf {I};\mathcal {M}\right) , \mathbf {f}_l\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) \right\} }, \end{aligned}$$
(4)

where \(\max \left\{ \cdot ,\cdot \right\} \) denotes the element-wise maximization of two vectors. This strategy is different from the one used in Sect. 4.1.2, which chooses the vector with the larger sequential lexicographic order. Let us briefly illustrate the difference between SIFT descriptors and deep features. SIFT is a handcrafted descriptor in which each dimension corresponds to the gradient intensity in a specific direction. If we simply take the dimension-wise average or maximum of a SIFT descriptor and its reversed version (like MI-SIFT (Ma et al. 2010)), the inner structure as well as the relationship between corresponding dimensions may be damaged, leading to a significant accuracy drop. In a deep feature vector, however, each dimension corresponds to the extent to which a visual concept or attribute arises, therefore it is reasonable to take dimension-wise operations to consider the visual attributes contained in both the original and reversed images.

We point out that both Average-Deep and Max-Deep are similar to the testing strategy used in some state-of-the-art CNNs (Krizhevsky et al. 2012; Simonyan and Zisserman 2015; Szegedy et al. 2015), in which using visual clues from both the original and reversed testing images produces around a 0.2–\(0.5\%\) accuracy gain. We shall verify that this strategy is also useful when transferring features for image classification.

Regarding computational costs, both Average-Deep and Max-Deep require doubled time at the feature extraction stage, but they do not need extra time or memory at the online testing stage. Considering that feature extraction is performed only once, the extra cost is reasonable.
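Average-Deep and Max-Deep (Eqs. 3 and 4) amount to one extra forward pass on the mirrored image followed by a dimension-wise average or maximum. A minimal sketch is given below; `extract_fn` is a placeholder for a real CNN forward pass that returns a fixed-length feature vector (e.g., fc-6 responses), not a specific library call.

```python
import numpy as np

def ri_deep(extract_fn, image, mode="avg"):
    """Reversal-invariant deep features by post-processing (RI-Deep).
    `image` is an (H, W, 3) array; `extract_fn` maps an image to a 1-D
    feature vector on a chosen layer of a pre-trained CNN."""
    f_orig = np.asarray(extract_fn(image))
    f_rev = np.asarray(extract_fn(image[:, ::-1, :]))   # left-right reversal
    if mode == "avg":
        return 0.5 * (f_orig + f_rev)                   # Average-Deep, Eq. (3)
    return np.maximum(f_orig, f_rev)                    # Max-Deep, Eq. (4)
```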

5.1.2 Image Classification Experiments

We evaluate the models on all eight datasets introduced in Sect. 4.6. We use the AlexNet and the VGGNet (both the 16- and 19-layer models, provided by the MatConvNet library (Vedaldi and Lenc 2014)) as the pre-trained deep networks. To demonstrate the importance of reversal invariance, we also train another version of the AlexNet in which image reversal is not used as data augmentation during training. Its top-5 recognition error rate on the ILSVRC2012 validation set increases by about \(2\%\) (19.9 vs. \(21.9\%\)).

Most often, it is reasonable to pre-process the testing image according to the way the network was trained. For the AlexNet, we simply resize each image to \(227\times 227\) pixels and feed it into the network. In the original testing process (Krizhevsky et al. 2012), the image is resized to \(256\times 256\), five sub-images are cropped at different positions, and the average response is computed. While this strategy improves the accuracy consistently, we do not use it, so that the feature extraction stage is accelerated. For the VGGNet-16 and VGGNet-19, we maximally preserve the aspect ratio of the input image, constrain the width and height to be divisible by 32 (the down-sampling rate), and keep the number of pixels approximately \(512^2\). This strategy improves the performance of deep features significantly, compared to resizing all images to \(224\times 224\) pixels. After the neuron responses are computed, we extract the features from each layer by average-pooling over all spatial positions. Throughout the rest of this paper, we use the features extracted from the fc-6 layer, activated by ReLU (Krizhevsky et al. 2012). These feature vectors are \(\ell _2\)-normalized and sent to LIBLINEAR (Fan et al. 2008), a scalable SVM implementation, with the slack parameter \({C}={10}\). Averaged accuracy over categories is reported. Results are summarized in Table 5. Each number is the mean of 10 random training/testing splits.

5.1.3 Discussion

First, it is obvious that feature quality, reflected by classification accuracy, is improved by data augmentation techniques, either at the training stage (reversing each training sample with a probability of \(50\%\)) or at the testing stage (computing the average or maximal neuron responses of each image and its reversed copy), which reveals the importance of reversal invariance in training CNN models and transferring CNN features. In most cases, Average-Deep works slightly better than Max-Deep.

Let us take the results produced by the AlexNet as an example. On the one hand, when the network is trained with both original and reversed samples, the validation accuracy on ILSVRC2012 is improved by about \(2\%\), and, consequently, we obtain a consistent accuracy gain when using the transferred features for recognition. On the other hand, both Average-Deep and Max-Deep boost the classification accuracy, sometimes by a large margin, e.g., more than \(8\%\) on the Aircraft-100 dataset. Even when the network is trained with data augmentation, Average-Deep and Max-Deep still improve the classification rate consistently, although the gain becomes relatively small (approximately \(2\%\) on the Aircraft-100 dataset) due to diminishing returns. Considering that the baseline is already high in most cases and that both Average-Deep and Max-Deep are extremely easy to implement, the accuracy gain is significant yet practically free to obtain.

To compare with the BoF model built on handcrafted descriptors, we also copy part of Table 3 here. We can see that, in most cases, deep features outperform BoF significantly, except on the Aircraft-100 dataset: this set contains 100 aircraft models which are rigid (well suited to handcrafted descriptors) and do not appear in the pre-training set (the ILSVRC2012 dataset) of the deep networks; the BoF model obtains higher accuracy only on this dataset. In contrast, on the Pet-37 dataset, all the objects (cats or dogs) are deformable and the pre-training set contains many of these concepts, therefore deep features clearly dominate the BoF model. Finally, we observe that the reported accuracy on the Bird-200 dataset is inferior to some recent publications, mainly because we do not use part-based models (see the related contents in Sect. 4.6.2).

It is instructive to note that the accuracy gain brought by reversal invariance differs from case to case. For example, on the Aircraft-100 and Bird-200 datasets, the accuracy gain is impressive (\({>}1\%\) using VGGNet-19), whereas on the LandUse-21 and Pet-37 datasets it is less significant (\({<}0.2\%\)). The reason lies in the intrinsic properties of the datasets and their relationship with the pre-training data. The orientation of an aircraft or a bird is more significant, and also more meaningful for visual recognition, than that of a scene captured from the sky. Moreover, all the above networks are pre-trained on the ILSVRC2012 dataset, which contains a large number of cat and dog images (but no aircraft images), therefore it is easier to achieve reversal invariance when the testing image contains a related visual concept.

The above experiments suggest that designing for reversal invariance also helps improve the quality of deep features. In what follows, we design intrinsically reversal-invariant convolution modules, i.e., Average-Conv and Max-Conv, which lead to a more direct way of generating reversal-invariant deep features. The two strategies are compared in Sect. 5.3.

5.2 Reversal-Invariant Convolution (RI-Conv)

5.2.1 Average-Conv and Max-Conv

As an alternative to post-processing deep features for reversal invariance, we show that directly training a reversal-invariant deep CNN is possible and more efficient. Here, we say a CNN model is reversal-invariant if it produces symmetric neuron responses on each pair of reversed images, i.e., for an arbitrary image \(\mathbf {I}\), if we take \(\mathbf {I}\) and \(\mathbf {I}^\mathrm {R}\) as the input, the neuron responses on each layer of the pre-trained network \(\mathcal {M}\), i.e., \(\mathbf {f}_l\left( \mathbf {I};\mathcal {M}\right) \) and \(\mathbf {f}_l\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) \), are symmetric to each other: \({\mathbf {f}_l^\mathrm {R}\left( \mathbf {I};\mathcal {M}\right) }={\mathbf {f}_l\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) }\). In a reversal-invariant network, when we extract features from a fully-connected layer (e.g., fc-6 in the AlexNet), the original and reversed outputs are exactly the same, since the spatial resolution is \(1\times 1\). If the features are extracted from an earlier layer (e.g., conv-5 in the AlexNet), we can also achieve reversal invariance by average-pooling or max-pooling the responses over all spatial locations, similar to the strategy used in Sect. 5.1 and some previous publications (He et al. 2015).

The key to constructing a reversal-invariant CNN model is to guarantee that all the network layers perform symmetric operations. Among the frequently used network operations (convolution, pooling, normalization, non-linear activation, etc.), only convolution is non-symmetric, i.e., a local patch and its reversed copy may produce different convolution outputs. We therefore aim to design a new reversal-invariant convolution operation to replace the original one.

Mathematically, let l be the index of a convolution layer with \(K_l\) convolution kernels, and let \({\mathbf {f}_{l-1}}\doteq {\mathbf {f}_{l-1}\left( \mathbf {I};\mathcal {M}\right) }\) be the input of the l-th layer. \({\varvec{\theta }_l}\in {{\mathbb {R}}^{W_l\times H_l\times K_l}}\) and \({\mathbf {b}_l}\in {{\mathbb {R}}^{K_l}}\) are the weighting and bias parameters, respectively. The convolution operation takes a patch with the same spatial scale as the kernels, computes its inner product with each kernel, and adds the bias to the result. For the k-th kernel, \({k}={1,2,\ldots ,K_l}\), we have:

$$\begin{aligned} {f_l^{\left( a,b,k\right) }\left( \mathbf {I};\mathcal {M}\right) }= {\left\langle \mathbf {p}_{l-1}^{\left( a,b\right) },\varvec{\theta }_l^{\left( k\right) }\right\rangle +b_l^{\left( k\right) }}, \end{aligned}$$
(5)

Here, \(f_l^{\left( a,b,k\right) }\) denotes the unit at spatial position \(\left( a,b\right) \) on the l-th layer, produced by convolution with the k-th kernel, and \(\mathbf {p}_{l-1}^{\left( a,b\right) }\) denotes the corresponding data patch on the previous layer. Note that \(\mathbf {p}_{l-1}^{\left( a,b\right) }\) and \(\varvec{\theta }_l^{\left( k\right) }\) are of the same dimension. \(\left\langle \cdot ,\cdot \right\rangle \) denotes the inner-product operation.

Following the reversal-invariant deep features above, reversal invariance is achieved if we perform a symmetric operation on the neuron responses of a patch and its reversed copy. Again, we take the element-wise average and maximum, leading to the Average-Conv and Max-Conv formulations:

$$\begin{aligned}&{r_{l,\mathrm {AVG}}^{\left( a,b,k\right) }\left( \mathbf {I};\mathcal {M}\right) } \nonumber \\&\quad ={\frac{1}{2}\left[ \left\langle \mathbf {p}_{l-1}^{\left( a,b\right) },\varvec{\theta }_l^{\left( k\right) }\right\rangle + \left\langle \mathbf {p}_{l-1}^{\left( a,b\right) ,\mathrm {R}},\varvec{\theta }_l^{\left( k\right) }\right\rangle \right] + b_l^{\left( k\right) }}\nonumber \\&\quad ={\left\langle \frac{1}{2}\left[ \mathbf {p}_{l-1}^{\left( a,b\right) }+\mathbf {p}_{l-1}^{\left( a,b\right) ,\mathrm {R}}\right] , \varvec{\theta }_l^{\left( k\right) }\right\rangle +b_l^{\left( k\right) }}, \end{aligned}$$
(6)
$$\begin{aligned}&{r_{l,\mathrm {MAX}}^{\left( a,b,k\right) }\left( \mathbf {I};\mathcal {M}\right) } \nonumber \\&\quad ={\max \left\{ \left\langle \mathbf {p}_{l-1}^{\left( a,b\right) },\varvec{\theta }_l^{\left( k\right) }\right\rangle , \left\langle \mathbf {p}_{l-1}^{\left( a,b\right) ,\mathrm {R}},\varvec{\theta }_l^{\left( k\right) }\right\rangle \right\} + b_l^{\left( k\right) }}. \end{aligned}$$
(7)

Since Average-Conv and Max-Conv simply apply the corresponding pooling operation to two convolved data blobs, it is straightforward to derive the back-propagation formula. In the case of Average-Conv, we can accelerate both forward-propagation and back-propagation by modifying the input data (the original input \(\mathbf {p}_{l-1}^{\left( a,b\right) }\) is replaced by the average of \(\mathbf {p}_{l-1}^{\left( a,b\right) }\) and \(\mathbf {p}_{l-1}^{\left( a,b\right) ,\mathrm {R}}\)), so that the time-consuming convolution is performed only once. In the case of Max-Conv, we need to create a mask blob to store the indices of the forward-propagated units, as in max-pooling layers.
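As an illustration, both operations can be built from two standard convolutions; the PyTorch-style sketch below is not the original CAFFE implementation, and relies on the fact that reversing the patch is equivalent to flipping the kernel along its width axis, i.e., \(\left\langle \mathbf {p}^\mathrm {R},\varvec{\theta }\right\rangle =\left\langle \mathbf {p},\varvec{\theta }^\mathrm {R}\right\rangle \).

```python
import torch
import torch.nn.functional as F

def ri_conv2d(x, weight, bias, stride=1, padding=0, mode="max"):
    """Average-Conv / Max-Conv (Eqs. 6-7), sketched with standard conv2d.
    Convolving x with the width-flipped kernel gives, at every position,
    the response of the reversed patch against the original kernel."""
    y = F.conv2d(x, weight, bias=None, stride=stride, padding=padding)
    y_rev = F.conv2d(x, torch.flip(weight, [3]), bias=None,
                     stride=stride, padding=padding)
    out = torch.max(y, y_rev) if mode == "max" else 0.5 * (y + y_rev)
    return out + bias.view(1, -1, 1, 1)   # the bias is added after the symmetric op
```

With automatic differentiation, the back-propagation described above is obtained directly from this forward pass.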

In what follows, we plug the reversal-invariant convolution modules into conventional CNN models. We name a CNN model an RI-CNN if all its convolution layers, including the fully-connected layers, are made reversal-invariant. We start by discussing its reversal-invariance property and its cooperation with data augmentation strategies.

5.2.2 Reversal Invariance and Data Augmentation

It is obvious that both Average-Conv and Max-Conv are symmetric operations. We prove that an RI-CNN is reversal-invariant, i.e., the feature vectors extracted from an image and its reversed copy are identical.

We use mathematical induction, starting from the fact that an image and its reversed copy are symmetric to each other, i.e., \({\mathbf {f}_0^\mathrm {R}\left( \mathbf {I};\mathcal {M}\right) }= {\mathbf {f}_0\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) }\). Now, given that \({\mathbf {f}_{l-1}^\mathrm {R}\left( \mathbf {I};\mathcal {M}\right) }= {\mathbf {f}_{l-1}\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) }\), we show that \({\mathbf {f}_l^\mathrm {R}\left( \mathbf {I};\mathcal {M}\right) }= {\mathbf {f}_l\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) }\), provided both are computed with a reversal-invariant convolution operation on the l-th layer. Here we assume that when padding (widening the data with 0-valued stripes) is used, the left and right padding widths are the same, so that geometric symmetry is guaranteed.

Consider a patch \(\mathbf {p}_{l-1}^{\left( a,b\right) }\left( \mathbf {I};\mathcal {M}\right) \). By symmetry, we have \({\mathbf {p}_{l-1}^{\left( a,b\right) ,\mathrm {R}}\left( \mathbf {I};\mathcal {M}\right) }= {\mathbf {p}_{l-1}^{\left( W_{l-1}-a-1,b\right) }\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) }\), where \(\left( W_{l-1}-a-1,b\right) \) is the position left-right symmetric to \(\left( a,b\right) \). These two patches are fed into the k-th convolution kernel \(\varvec{\theta }_l^{\left( k\right) }\), and the outputs are \(f_l^{\left( a,b,k\right) }\left( \mathbf {I};\mathcal {M}\right) \) and \(f_l^{\left( W_{l-1}-a-1,b,k\right) }\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) \). These two scalars are equal because both Average-Conv and Max-Conv are symmetric, thus the neuron responses on the l-th layer are also symmetric: \({\mathbf {f}_l^\mathrm {R}\left( \mathbf {I};\mathcal {M}\right) }= {\mathbf {f}_l\left( \mathbf {I}^\mathrm {R};\mathcal {M}\right) }\). This completes the induction, i.e., the neuron responses of a pair of reversed inputs are symmetric.
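This property is easy to verify numerically. The sketch below assumes the ri_conv2d function from the previous listing and symmetric padding, and checks that the response maps of a random image and its reversed copy are mirror images of each other.

```python
import torch

x = torch.randn(1, 3, 32, 32)     # a random input "image"
w = torch.randn(8, 3, 5, 5)       # 8 convolution kernels of size 5x5
b = torch.randn(8)

y = ri_conv2d(x, w, b, padding=2, mode="max")                       # f_l(I)
y_rev = ri_conv2d(torch.flip(x, [3]), w, b, padding=2, mode="max")  # f_l(I^R)

# mirror symmetry: flipping the first map reproduces the second one
print(torch.allclose(torch.flip(y, [3]), y_rev, atol=1e-5))         # True
```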

We point out that this desirable property for feature extraction can become a significant drawback during network training, since an RI-CNN model cannot benefit from “reversal data augmentation”. Here, by reversal data augmentation we mean reversing each training sample with a probability of \(50\%\). As an RI-CNN model generates exactly the same (symmetric) neuron responses for an image and its reversed copy, these two training samples produce identical gradients with respect to the network parameters on each layer. Consequently, reversing a training image cannot provide any “new” information to the training process. Since reversal-invariant convolution increases the capacity of the CNN model (see Sect. 5.2.5), the effective decrease of training data may cause over-fitting, which harms the generalization ability of the model.

To deal with this, we intentionally break the reversal invariance of the network during training. To this end, we crop the training image to a smaller size, so that the geometric symmetry no longer holds. Take the AlexNet as an example: the original input size is \(227\times 227\), for which geometric symmetry holds on every convolutional/pooling layer. If the size becomes \(S'\times S'\) with \(S'\) slightly smaller than 227, then on some layers the padding margin on the left side differs from that on the right side. Note that \(S'\) must be at least 199 so that the input of the fc-6 layer still has a spatial resolution of \(6\times 6\). In practice, we simply use \({S'}={199}\), which allows us to generate as many training images as possible. As we shall see in Sect. 5.2.4, this strategy also improves the baseline accuracy slightly.
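A minimal sketch of this strategy, assuming the usual practice of cropping from a resized \(256\times 256\) training image (the actual CAFFE data layer may differ in details), is simply a random \(199\times 199\) crop:

```python
import numpy as np

def random_crop(image, crop_size=199):
    """Randomly crop an S'xS' region (S'=199) from a larger (H, W, C)
    training image; the smaller, randomly placed crop breaks the geometric
    symmetry that an RI-CNN would otherwise have on every layer."""
    h, w = image.shape[:2]
    top = np.random.randint(0, h - crop_size + 1)
    left = np.random.randint(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size]
```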

5.2.3 CIFAR Experiments

The CIFAR10 and CIFAR100 datasets (Krizhevsky and Hinton 2009) are subsets of the 80 million tiny images database (Torralba et al. 2008). Both of them have 50,000 training samples and 10,000 testing samples, each of which is a \(32\times 32\) color image, uniformly distributed among all the categories (10 and 100 categories, respectively). They are popular datasets for training relatively small-scale neural networks on simple recognition tasks.

We use a modified version of the LeNet (LeCun et al. 1990). A \(32\times 32\times 3\) image is passed through three units, each consisting of convolution, ReLU and max-pooling operations. In abbreviated notation, the network configuration for CIFAR10 can be written as:

C5(S1P2)-MP3(S2)-C5(S1P2)-MP3(S2)-C5(S1P2)-MP3(S2)-FC10

Here, C5(S1P2) means a convolutional layer with a kernel size 5, a spatial stride 1 and a padding width 2; MP3(S2) refers to a max-pooling layer with a kernel size 3 and a spatial stride 2; and FC10 indicates a fully-connected layer with 10 outputs. For CIFAR100, we replace FC10 with FC100 in order to categorize 100 classes. A 2-pixel-wide padding is added to each convolution operation so that the width and height of the data remain unchanged. We do not generate input images of multiple sizes, since the LeNet itself is not geometrically symmetric: on each pooling layer, the left padding margin is 0 while the right margin is 1. We apply 120 training epochs with the learning rate \(10^{-3}\), followed by 20 epochs with the learning rate \(10^{-4}\), and another 10 epochs with the learning rate \(10^{-5}\).
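For illustration only, this configuration can be sketched in PyTorch as follows; the numbers of convolution kernels per layer are not specified by the abbreviation, so the values used here (32/32/64) are assumptions of the sketch rather than the original setting.

```python
import torch.nn as nn

def make_lenet(num_classes=10, channels=(32, 32, 64)):
    """Modified LeNet: three {C5(S1P2)-ReLU-MP3(S2)} units followed by a
    fully-connected layer; kernel counts are illustrative assumptions."""
    layers, in_ch = [], 3
    for out_ch in channels:
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=1, padding=2),
                   nn.ReLU(inplace=True),
                   # ceil_mode mimics the asymmetric (right-only) pooling pad
                   nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)]
        in_ch = out_ch
    # with ceil-mode pooling, a 32x32 input is reduced to 4x4 spatially
    layers += [nn.Flatten(), nn.Linear(in_ch * 4 * 4, num_classes)]
    return nn.Sequential(*layers)
```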

We train six different models individually, i.e., using the original convolution, Average-Conv or Max-Conv (three choices), and using data augmentation (probabilistic training-image reversal) or not (two choices). We name these models LeNet, LeNet-AUGM (“AUGM” for augmentation), LeNet-AVG, LeNet-AVG-AUGM, LeNet-MAX and LeNet-MAX-AUGM, respectively. For instance, LeNet-MAX indicates the network with Max-Conv but without data augmentation. Note that reversal-invariant convolution also applies to the fully-connected layer. To assess statistical significance, we train 5 independent models in each case and report the average accuracy.

Results are summarized in Table 6. One can observe similar phenomena on both datasets. First, Average-Conv causes a dramatic accuracy drop (we analyze the reason in Sect. 5.2.5). On the other hand, data augmentation and Max-Conv improve the recognition accuracy consistently. On the CIFAR10 dataset, both data augmentation and Max-Conv boost the accuracy by about \(1\%\), and the two strategies cooperate to outperform the baseline by \(1.5\%\). On the CIFAR100 dataset, Max-Conv alone contributes a more-than-\(2\%\) accuracy gain, higher than the \(1.5\%\) gain of data augmentation, and the combination gives a nearly \(2.5\%\) gain. Finally, the improvement on CIFAR100 is much larger than that on CIFAR10, which indicates that CIFAR100 is a more challenging dataset (with more categories) and that Max-Conv increases the network capacity to benefit the recognition task.

Table 6 CIFAR classification error rate (\(\%\)) with respect to different versions of LeNet
Table 7 CIFAR classification error rate (\(\%\)) with respect to different versions of BigNet

To compare with the state of the art, we also evaluate our algorithm on a larger network named the BigNet, a 10-layer network (Nagadomi 2014) that resembles the VGGNet in its use of small convolution kernels. We inherit all the settings from the original author and report the accuracies produced by the six versions described above. The results summarized in Table 7 deliver the same conclusion as with the small network (LeNet).

Table 8 Comparison with some recent work in CIFAR classification error rates (\(\%\))

We also put our CIFAR results in the context of some recent publications on these datasets; the comparison is summarized in Table 8.

5.2.4 ILSVRC2012 Classification Experiments

We also evaluate our model on the ILSVRC2012 classification dataset (Russakovsky et al. 2015), a subset of the ImageNet database (Deng et al. 2009) which contains 1000 object categories. The training set, validation set and testing set contain \(1.3~\mathrm {M}\), \(50~\mathrm {K}\) and \(150~\mathrm {K}\) images, respectively. We use the AlexNet (provided by the CAFFE library (Jia et al. 2014), sometimes referred to as the CaffeNet). The input image is of size \(199\times 199\), randomly cropped from the original \(256\times 256\) image (see Sect. 5.2.2). The AlexNet structure is abbreviated as:

C11(S4)-MP3(S2)-LRN-C5(S1P2)-MP3(S2)-LRN-C3(S1P1)-C3(S1P1)-C3(S1P1)-MP3(S2)-FC4096-D0.5-FC4096-D0.5-FC1000

Here, LRN means local response normalization (Krizhevsky et al. 2012) and D0.5 means Dropout with a drop ratio of 0.5. Following the setting of CAFFE, a total of 450,000 mini-batches (approximately 90 epochs) are used for training, each of which contains 256 image samples, with an initial learning rate of 0.01, a momentum of 0.9 and a weight decay of 0.0005. The learning rate is divided by 10 after every 100,000 mini-batches.
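For illustration, a rough PyTorch equivalent of this solver configuration might look like the sketch below; the original experiments use the CAFFE solver, so this is only an approximation of the schedule, with sched.step() assumed to be called once per mini-batch.

```python
import torch

def make_optimizer(model):
    """SGD with the schedule described above: lr 0.01, momentum 0.9,
    weight decay 0.0005, decayed by 10x every 100,000 mini-batches
    (450,000 mini-batches of 256 images in total)."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01,
                          momentum=0.9, weight_decay=0.0005)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100000, gamma=0.1)
    return opt, sched
```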

Table 9 ILSVRC2012 classification error rate (\(\%\)) with respect to different versions of AlexNet

We individually train four models, i.e., using the original convolution or Max-Conv, and using data augmentation or not. Similarly, we name these variants AlexNet, AlexNet-AUGM, AlexNet-MAX and AlexNet-MAX-AUGM, respectively. Considering the large computational cost, we only train two individual networks for each setting. We do not train models based on Average-Conv, given the dramatic accuracy drop observed in the CIFAR experiments. Note that Max-Conv also applies to the fully-connected layers.

Results are summarized in Table 9. As we have slightly modified the data augmentation strategy, the baseline performance (\(80.48\%\) top-5 accuracy) is slightly better than that reported with the standard setting (approximately \(80.1\%\) top-5 accuracy). With Max-Conv, the top-5 accuracy is boosted to \(80.88\%\), which shows that Max-Conv and data augmentation cooperate to improve the recognition performance. We emphasize that the \(0.40\%\) accuracy gain is not small, given that the network structure is unchanged. Meanwhile, the conclusions drawn from the CIFAR experiments also hold in this large-scale image recognition task.

Fig. 7
figure 7

Error rate curves and training/testing loss curves on the CIFAR datasets and the ILSVRC2012 dataset. We report top-1 and top-5 error rates in CIFAR and ILSVRC2012, respectively

5.2.5 Discussion

The success of data augmentation and Max-Conv implies that it is beneficial to force the network to learn reversal invariance by constructing dedicated structures. This part provides some discussion based on the experimental results.

We first provide another perspective on the behavior of reversal-invariant convolution. Consider a convolution layer (the l-th layer), in which we compute the inner product of a patch \(\mathbf {p}_{l-1}^{\left( a,b\right) }\) (possibly together with its reversed copy) and each of the \(K_l\) convolution kernels \(\varvec{\theta }_l^{\left( k\right) }\), \({k}={1,2,\ldots ,K_l}\). Since the inner product measures the similarity between \(\mathbf {p}_{l-1}^{\left( a,b\right) }\) and \(\varvec{\theta }_l^{\left( k\right) }\), patches with an appearance similar to \(\varvec{\theta }_l^{\left( k\right) }\) produce strong neuron responses. In this sense, \(\varvec{\theta }_l^{\left( k\right) }\) behaves like a codeword and \(K_l\) is the codebook size. Meanwhile, image patterns are often left-right asymmetric, e.g., a slash may have either a positive or a negative angle. Without reversal-invariant convolution, we need two different codewords to encode a visual pattern and its reversed version, which significantly reduces the effective capacity of the limited codebook (of size \(K_l\)) and, consequently, the capacity of the network. Reversal-invariant convolution allows each local patch to be compared with a codeword and its reversed copy, so that, equivalently, only one codeword is needed to store a visual pattern and its reversed version.
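As a toy numerical example of this codebook argument (the numbers are illustrative only), a single diagonal “slash” codeword responds weakly to its mirrored pattern under ordinary convolution, but strongly under Max-Conv:

```python
import numpy as np

theta = np.eye(3)                  # codeword: a diagonal "slash" with positive angle
patch_rev = np.eye(3)[:, ::-1]     # a patch containing the mirrored (negative-angle) slash

plain = (patch_rev * theta).sum()                   # ordinary convolution: weak response (1.0)
ri = max((patch_rev * theta).sum(),
         (patch_rev[:, ::-1] * theta).sum())        # Max-Conv: strong response (3.0)
print(plain, ri)
```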

Table 10 Classification accuracy (\(\%\)) comparison with deep features extracted using different strategies. Note that the first part of this table is not the same as in Table 5, since we have used a different way of training AlexNet (see Sect. 5.2.2). With Max-Conv, we do not need to post-process the feature vector since it is naturally reversal-invariant

It is now easy to see the difference between Average-Conv and Max-Conv. Both of them compute the similarity between each codeword and each original/reversed local patch. After that, Average-Conv takes the average response while Max-Conv takes the larger one. Thus, with Average-Conv, a local patch obtains a high response only if it is similar to both the codeword and its reversed copy, which is unreasonable since image patterns are often left-right asymmetric. Another way of understanding this is that Eq. (6) is equivalent to replacing each input patch with the average of itself and its reversed copy (or, equivalently, averaging each kernel with its reversed copy before convolving), thus Average-Conv effectively constrains all the input patches to be left-right symmetric. In contrast, Max-Conv activates on those local patches which are similar to either the original or the reversed codeword, thereby improving the discriminative power and the invariance of the deep features. Consequently, in the experiments, Average-Conv causes a dramatic accuracy drop, while Max-Conv boosts the performance significantly.
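To make the equivalence explicit, note that reversal permutes the entries of a vector without changing inner products, i.e., \(\left\langle \mathbf {p}^\mathrm {R},\varvec{\theta }\right\rangle =\left\langle \mathbf {p},\varvec{\theta }^\mathrm {R}\right\rangle \), so Eq. (6) can be rewritten as

$$\begin{aligned} {r_{l,\mathrm {AVG}}^{\left( a,b,k\right) }\left( \mathbf {I};\mathcal {M}\right) }&={\left\langle \frac{1}{2}\left[ \mathbf {p}_{l-1}^{\left( a,b\right) }+\mathbf {p}_{l-1}^{\left( a,b\right) ,\mathrm {R}}\right] ,\varvec{\theta }_l^{\left( k\right) }\right\rangle +b_l^{\left( k\right) }}\\&={\left\langle \mathbf {p}_{l-1}^{\left( a,b\right) },\frac{1}{2}\left[ \varvec{\theta }_l^{\left( k\right) }+\varvec{\theta }_l^{\left( k\right) ,\mathrm {R}}\right] \right\rangle +b_l^{\left( k\right) }}, \end{aligned}$$

i.e., Average-Conv is an ordinary convolution with a symmetrized kernel (equivalently, with a symmetrized input patch), which discards the asymmetric information in the data.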

To take a closer look at network training with data augmentation and/or reversal-invariant convolution, we plot the testing error rate as well as the training/testing loss with respect to the number of training epochs. Note that both strategies augment the training data: data augmentation implicitly increases the number of training samples, while reversal-invariant convolution makes it possible to “see” more variations of local patches. From the results shown in Fig. 7, we can see that data augmentation slows down network training since it introduces regularization into the training process. With Max-Conv, however, network training converges faster since the network capacity is increased. The two strategies cooperate to make full use of the increased model capacity while preventing over-fitting.

Another strategy for achieving reversal invariance in CNNs is to pre-compute the orientation of each patch before convolution, e.g., using the method presented in Zhao and Ngo (2013) for patch normalization. We point out that RI-Conv is more effective than Zhao and Ngo (2013): RI-Conv always chooses the patch orientation that yields the higher score in template matching, whereas Zhao and Ngo (2013) determines the patch orientation without considering the templates learned by the network. As shown in Sect. 4.5, the former strategy often works better. In addition, RI-Conv is very easy to implement, while Zhao and Ngo (2013) may produce complicated gradients with respect to the input patch.

5.3 Model Comparison

We compare the two strategies discussed in this section, i.e., training a non-reversal-invariant deep network and post-processing the deep features for reversal invariance, versus training a reversal-invariant deep network that generates reversal-invariant deep features directly.

We use the networks trained in the previous experiments, namely AlexNet-AUGM and AlexNet-MAX-AUGM, to extract deep features on the image classification datasets used in Sect. 5.1. Results are summarized in Table 10. We observe that the features extracted from AlexNet-MAX-AUGM produce consistently higher accuracy than the original, non-reversal-invariant features. The performance is comparable to that of the Average-Deep and Max-Deep features, while the computation is cheaper. In short, designing intrinsically reversal-invariant modules is helpful to visual recognition.

5.4 Summary

In this part, we generalize the idea of reversal-invariant representation from the BoF model to deep CNNs, and verify that reversal invariance is also important in both deep feature extraction and deep network training. We propose two effective algorithms, RI-Deep and RI-Conv. First, computing neuron responses on a testing image as well as its reversed version makes it possible to extract reversal-invariant deep features from a pre-trained network that is not itself reversal-invariant. Second, a small modification of the convolution operation leads to a deep network that is intrinsically reversal-invariant, has larger capacity at unchanged complexity, and makes feature extraction more effective. Reversal-invariant convolution also cooperates well with data augmentation, creating the possibility of applying deep neural networks to even larger databases.

Last but not least, the Max-Conv operator is easy to implement and fast to carry out (less than \(20\%\) extra time is required).

6 Conclusions

It is important to consider reversal invariance in order to achieve more robust image representation, but conventional BoF and CNN models often lack an explicit implementation of reversal invariance. This paper presents a basic idea: designing reversal-invariant local patterns, such as Max-SIFT and RIDE (local descriptors), RI-Deep (deep features) and RI-Conv (convolution), so that reversal invariance is guaranteed in the representations built upon the BoF and CNN models. The proposed algorithms are very easy to implement and efficient to carry out, and they produce consistent accuracy improvements. The success of our algorithms also reveals that designing invariance directly is often more effective than using data augmentation, and that the two strategies can often cooperate towards better visual recognition.