1 Introduction

One of the goals of machine learning research is to obtain better generalization with as few parameters as possible. Convolutional neural networks (CNNs) [1, 2] are closer to this goal than fully connected networks because they share the weights of convolutional filters across image locations. As shown in Fig. 1a, this weight-sharing mechanism provides CNNs with invariance to spatial shifts and reduces their number of parameters [3]. However, no such weight-sharing mechanism exists in the rotation dimension, so CNNs still lack rotation invariance [4]. As shown in Fig. 1b, when the input image is rotated by a certain angle, the weights of the convolutional kernels seriously mismatch the regions to be convolved, which usually causes feature extraction and classification to fail in rotated image recognition (RIR) tasks.

Fig. 1 Differences between the invariances of feature extraction of CNN for translated and rotated images

To recognize arbitrarily rotated images, existing RIR research commonly trains CNNs with rotation data augmentation [5,6,7,8,9,10,11], which has three implementation approaches. The most straightforward and widely used approach [12] is to randomly rotate each training image so that the CNN outputs classification scores that are as identical as possible for an image and its rotated versions. This approach requires no change to the CNN architecture, but its performance depends heavily on the diversity of orientations among the training images; that is, the CNN must learn as many orientations as possible to achieve high RIR performance, and providing training images at all possible orientations is laborious. The second approach builds multiple rotation channels, which actively rotate the features extracted by the CNN [4, 13, 14]. For example, Laptev et al. [8] uniformly rotated the input image by 24 angles in \([ - 180^{\circ}, 180^{\circ}]\) and then applied max pooling to the features extracted from these 24 images. The 24 images are called image rotation channels, and they produce more robust rotation-invariant features than a single image channel. Because building multiple image channels is computationally expensive, multiple rotation channels have also been built by rotating feature maps [14,15,16,17,18] or convolution kernels [4, 6, 7, 19]. Both approaches exploit the powerful function-fitting ability of CNNs to 'memorize' images at all orientations so that the trained CNN can directly recognize images at any orientation. Different from these two approaches, other effective RIR methods are based on 'derotation' [20]. For example, the Spatial Transformer Network (STN) [21] was proposed to reduce the number of training images: it uses a spatial transformer layer to align rotated images to several canonical orientations, so the CNN only needs to learn those orientations. Note, however, that training the spatial transformer layer still relies on images at different orientations; that is, STN still uses rotation data augmentation.

Although rotation data augmentation has been widely used in RIR tasks, it has at least two disadvantages. First, the CNN has to learn images at as many orientations as possible, which significantly increases the number of training samples; more training samples require more CNN parameters to achieve good generalization and also increase training time. Second, the outputs of a CNN trained with rotation data augmentation are independent of, or insensitive to, the rotation of input images, which means the CNN loses orientation information. As a result, if the orientation needs to be predicted, an extra orientation regression task has to be built [22, 23]. These two disadvantages do not conform to how the human brain works, because the brain does not need to memorize images at all orientations. Psychological studies [24,25,26] suggest that a "mental rotation" process takes place when we recognize objects at unfamiliar orientations: an arbitrarily oriented mental image is rotated and recognized repeatedly until it reaches its normal orientation. Benefiting from mental rotation, the human brain can recognize arbitrarily oriented images after learning and memorizing images at a single orientation. This inspires us to develop a similar RIR mechanism for CNNs that improves RIR performance while decreasing the number of training samples, named Pre-Rotation Only at Inference stage (PROAI).

The human mental rotation process shows that rotation invariance can be achieved by sharing recognition ability across rotations. This is also the core principle by which PROAI achieves rotation invariance: sharing CNN weights across the rotation dimension of images. At the training stage, PROAI trains a CNN with images at only one orientation, so the CNN achieves high generalization with only a small number of parameters; accordingly, this CNN is expected to correctly classify images only at that orientation. At the inference stage, PROAI generalizes the recognition ability of the CNN to all other orientations through a pre-rotation operation, which rotates each test image into all possible orientations to generate multiple rotated versions. These versions are fed into the small CNN to calculate classification scores, and the maximum of these scores is used to simultaneously estimate both the category and the orientation of each test image.

Compared with existing RIR methods, PROAI makes two contributions: (1) the architecture and weights of the entire CNN are shared across the rotation dimension for the first time, so the CNN no longer needs to learn rotated images at arbitrary orientations in RIR tasks, which reduces both the number of free parameters and the training time; (2) PROAI builds an orientation-related learning task for the CNN, enabling it to estimate image orientations without adding an extra orientation regression task.

2 Pre-rotation Only at Inference-Stage (PROAI)

The workflow of PROAI illustrated in Fig. 2 is divided into two stages: the training stage and the inference stage. At the training stage, a CNN with a small number of parameters is trained on images at only one orientation; as a result, this CNN can recognize images only at that orientation. At the inference stage, a multi-channel weight-sharing mechanism generalizes this recognition ability so that images at any other orientation can also be recognized.

Fig. 2 Illustration of using PROAI to recognize rotated images

2.1 Training Procedure of PROAI

2.1.1 Training Procedure

Figure 3 shows rotated images from the MNIST [27] and Fashion MNIST [28] datasets. As shown, the angle between the top of the image object and the positive direction of the y-axis is defined as the rotation orientation of the image. In this paper, the orientation of an image is denoted as \(\varphi\) (\(\varphi \in [ - 180^{\circ}, + 180^{\circ}]\)), and an arbitrarily rotated image of an object with category \(i\) is expressed as \(x_{i}^{\varphi}\;(i \in Z)\). Specifically, \(\varphi = 0\) (the positive direction of the y-axis) refers to the normal orientation, and \(x_{i}^{0}\) refers to an image of category \(i\) at the normal orientation.

Fig. 3 Illustration of rotation orientations of images (red arrows are assumed to be the top of objects)

Based on the above definition of the normal orientation, the training loss of PROAI is defined in Eq. (1). Because PROAI uses only \(x_{i}^{0}\) to train the CNN, the cross-entropy loss function \(L[f(\cdot), i]\) is calculated as

$$L[f(x_{i}^{0}), i] = - \sum\limits_{c=1}^{C} y_{c} \cdot \log [f_{c}(x_{i}^{0})]$$
(1)

where \(f(\cdot)\) represents the output of a CNN and \(i\) is the category label of an image \(x_{i}^{0}\). \(C\) is the number of sample categories in the training dataset, and \(c \in [1, C]\). \(y\) is the one-hot vector for category \(i\), \(y_{c}\) is the \(c\)th element of \(y\), and \(f_{c}(x_{i}^{0})\) is the \(c\)th element of \(f(\cdot)\).
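For concreteness, the following is a minimal PyTorch sketch of this training objective (the names `cnn`, `optimizer`, `images`, and `labels` are placeholders; `F.cross_entropy` implements the cross-entropy of Eq. (1) on raw logits):

```python
import torch.nn.functional as F

def train_step(cnn, optimizer, images, labels):
    """One PROAI training step: images are assumed to be at the
    normal orientation only (x_i^0), so no rotation augmentation."""
    optimizer.zero_grad()
    logits = cnn(images)                    # f(x_i^0), shape (B, C)
    loss = F.cross_entropy(logits, labels)  # Eq. (1)
    loss.backward()
    optimizer.step()
    return loss.item()
```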

A CNN trained with images at only one orientation is expected to output higher classification scores for images at that orientation than for images at other orientations. Therefore, PROAI makes the CNN output its maximum classification score only at the normal orientation; in other words, the CNN output peaks at the normal orientation. This property is formulated as:

$$\left\{ \begin{array}{ll} \max [f(x_{j}^{\varphi})] \le \max [f(r^{\varphi}(x_{i}^{\varphi}))], & \text{'=' holds when } j = i \\ \max [f(x_{i}^{\varphi})] \le \max [f(x_{i}^{0})], & \text{'=' holds when } \varphi = 0 \end{array} \right.$$
(2)

where \(i\) and \(j\) are category labels of image objects, \(r^{\varphi}(\cdot)\) represents rotating an image by an angle \(\varphi\), and \(x_{i}^{\varphi} = r^{\varphi}(x_{i}^{0})\).

2.1.2 Comparison with the Training Procedures of Existing RIR Methods

Figure 4 compares PROAI with existing RIR methods from three aspects: training images, network architectures, and annotation.

Fig. 4 Comparison of the training stages of PROAI and existing RIR methods

In terms of training images, PROAI trains CNNs using images only at the normal orientation, while existing methods try to make the CNN 'memorize' images at as many orientations as possible (see Fig. 4 or Fig. 3). Take the implementation of Rotation Data Augmentation (RDA) integrated in PyTorch (see footnote 1) as an example: each training image is rotated to a random orientation in every training epoch, i.e., each training image is transformed into a new version every epoch. Denoting the number of training epochs as \(N_{epoch}\), data augmentation therefore requires \(N_{epoch}\) times more training images than PROAI.
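As an illustration of the difference, here is a hedged torchvision sketch of the two training-data pipelines (the dataset path is a placeholder; `RandomRotation(degrees=180)` draws a fresh angle in \([-180^{\circ}, 180^{\circ}]\) each time an image is loaded):

```python
from torchvision import datasets, transforms

# RDA: each image is re-rotated to a random orientation every epoch,
# so over N_epoch epochs the CNN sees roughly N_epoch rotated versions.
rda_transform = transforms.Compose([
    transforms.RandomRotation(degrees=180),
    transforms.ToTensor(),
])

# PROAI: images stay at their normal orientation (rot0); no rotation.
proai_transform = transforms.ToTensor()

rda_train = datasets.MNIST("data/", train=True, download=True,
                           transform=rda_transform)
proai_train = datasets.MNIST("data/", train=True, download=True,
                             transform=proai_transform)
```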

In terms of network architecture, PROAI requires a CNN with fewer parameters than existing RIR methods, because PROAI learns images at only one orientation and the variation of its training dataset is clearly smaller than that of RDA methods [29]. To achieve good performance under RDA, existing RIR methods must enlarge CNNs with more complex network architectures. For example, Transformation-Invariant Pooling (TIPooling) obtains the final rotation-invariant features by pooling features extracted from multiple image rotation channels [8]; Oriented Response Networks (ORN) and RotEqNet create multiple rotation channels by rotating convolutional filters to extract rotation-invariant features [6, 7]; STN trains a complex spatial transformation layer to align images to a similar orientation before using a CNN to recognize them [21].

In terms of annotation, PROAI only has to annotate category labels for training images, whether for classification or orientation estimation tasks. In contrast, existing RIR methods have to annotate both category and angle labels for classification and orientation estimation tasks.

The differences in the training procedures shown in Fig. 4 result in different CNN classification scores for rotated images. Take the rotated handwritten digit recognition task as an example: the classification scores output by a CNN for a digit '4' at different orientations are shown in Fig. 5. In Fig. 5a, the CNN trained with images only at the normal orientation outputs higher classification scores on the correct category label '4' for images oriented close to the normal orientation, and lower classification scores on wrong category labels or other orientations; this output satisfies Eq. (2). For comparison, as shown in Fig. 5b, rotation data augmentation makes the CNN insensitive to changes in orientation, resulting in similar responses on the correct category label '4' at different orientations, i.e., \(f(x_{i}^{0}) \approx f(x_{i}^{\varphi})\).

Fig. 5 The output of CNNs trained by PROAI and RDA for images at different orientations

2.2 Inference Procedure of PROAI

2.2.1 Inference Procedure

The inference procedure of PROAI illustrated in Fig. 2 is composed of three steps.

(1) Pre-rotate a test image into multiple orientations.

Given a test image at an orientation \(\theta\), \(x^{\theta} \in \mathbb{R}^{H \times W \times d}\) (\(H\), \(W\), and \(d\) are the height, width, and number of color channels of the test image), the image is first rotated by \(N\) angles uniformly distributed in \([-\Phi, +\Phi]\). The angle interval of rotation is \(\Delta\Phi = 2\Phi / N\). The rotated versions of the test image are denoted as

$$x^{\varphi_{n}} = r^{\pm \Delta \varphi_{n}}(x^{\theta})$$
(3)
$$\Delta \varphi_{n} = \pm n \cdot \Delta \Phi, \quad \varphi_{n} = \theta \pm n \cdot \Delta \Phi$$
(4)

where \(n = -\left(\frac{N}{2} - 1\right), -\left(\frac{N}{2} - 2\right), \ldots, 0, \ldots, \frac{N}{2} - 1, \frac{N}{2}\) and \(0^{\circ} \le \Phi \le 180^{\circ}\).
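A minimal sketch of this pre-rotation step is given below, using torchvision's functional `rotate`; for simplicity it samples the \(N\) angles uniformly over \([-\Phi, +\Phi]\) rather than indexing them by \(\pm n\) as in Eq. (4):

```python
import torch
import torchvision.transforms.functional as TF

def pre_rotate(x, N=36, phi=180.0):
    """Rotate a test image tensor x of shape (C, H, W) into N uniformly
    spaced orientations in [-phi, +phi]; returns (batch of shape
    (N, C, H, W), list of rotation angles in degrees)."""
    angles = [-phi + n * (2.0 * phi / N) for n in range(N)]
    return torch.stack([TF.rotate(x, a) for a in angles]), angles
```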

(2) Calculate the classification score matrix of multiple forward channels.

Applying a cluster of CNNs that share the same architecture and weights as the CNN trained with images only at the normal orientation, forward calculation is conducted independently in each channel, and the classification score vector \(\vec{f}_{n}\) of the \(n\)th channel is obtained as

$$\vec{f}_{n} = f(x^{\varphi_{n}}) = [f_{0}, \ldots, f_{m}, \ldots, f_{M-1}]$$
(5)

where \(f_{m}\) is the classification confidence for the \(m\)th class, \(m \in [0, M-1]\), and \(M\) denotes the number of possible categories. For an \(M\)-class classification problem, \(\vec{f}_{n}\) is a vector of length \(M\). All classification score vectors are arranged to form a classification score matrix \(F = \left[ \vec{f}_{-(N/2-1)}^{T}, \ldots, \vec{f}_{0}^{T}, \ldots, \vec{f}_{N/2}^{T} \right]\), where \(F \in \mathbb{R}^{N \times M}\).

(3) Estimate the category and orientation by finding the maximum of the classification score matrix.

According to Eq. (2), the CNN trained with images only at the normal orientation outputs higher classification scores for images oriented close to the normal orientation and lower scores for images oriented far away from it (see Fig. 5a). Therefore, both the category \(\hat{c}\) and the orientation \(\hat{\theta}\) of the test image can be estimated from the coordinates of the maximum of \(F\), i.e.,

$$\hat{c} = k, \quad \hat{\theta} = - j \cdot \Delta \Phi$$
(7)

where,

$$(j, k) = \mathop{\arg\max}\limits_{(n,m)} \{ \mathrm{softmax}[F^{N \times M}(n, m)] \} = \mathop{\arg\max}\limits_{(n,m)} [F^{N \times M}(n, m)]$$
(8)
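Combining steps (2) and (3), the following sketch builds the score matrix \(F\) and reads off the joint estimate of Eqs. (7) and (8); it reuses `pre_rotate` from the sketch above, and the softmax is omitted because, as Eq. (8) notes, it does not change the arg max:

```python
import torch

@torch.no_grad()
def proai_infer(cnn, x, N=36, phi=180.0):
    """Return (category, estimated orientation in degrees) for one image."""
    batch, angles = pre_rotate(x, N, phi)  # step (1): N rotation channels
    F_mat = cnn(batch)                     # step (2): score matrix, (N, M)
    j, k = divmod(F_mat.argmax().item(), F_mat.shape[1])
    return k, -angles[j]                   # step (3), Eq. (7)
```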

2.2.2 Comparison with the Inference Procedures of Existing RIR Methods

Figure 6 compares PROAI with existing RIR methods from three aspects: preprocessing of test images, network architectures, and outputs.

Fig. 6 Comparison of the inference stages of PROAI and existing RIR methods

In terms of the preprocessing of test images, PROAI pre-rotates each test image into multiple orientations to build image rotation channels, while RDA, ORN, and STN directly classify rotated images using a single channel.

In terms of network architectures, PROAI builds a cluster of copies of the CNN trained with normal images, while existing methods use the same CNN architecture as in their training stages. Although TIPooling and ORN also use multiple forward channels at the inference stage, PROAI is essentially different from them: the weights of the CNNs in existing methods are learned from all possible orientations, whereas the weights of the CNN in PROAI are trained with normal images first and then shared across the rotation channels.

In terms of the outputs, PROAI simultaneously outputs the category and orientation for each test image, while the existing methods only predict what they have learned at the training stage.

3 Results and Discussions

MNIST and Fashion MNIST are commonly used to evaluate RIR performance [6, 8, 21]; hence they are applied to evaluate PROAI. The evaluation is divided into training and inference stages. At the training stage, the number of parameters and training time of PROAI are compared with those of existing RIR methods. At the inference stage, the classification accuracy of PROAI is first compared with existing RIR methods, and then the orientation estimation precision of PROAI is compared with a CNN angle regressor. The CNN angle regressor is tailor-designed and trained in a supervised orientation regression task, because existing RIR methods have not reported orientation estimation precision. These comparisons demonstrate that PROAI achieves both higher classification accuracy and higher orientation estimation precision in RIR tasks.

In addition to quantitatively evaluating PROAI on MNIST and Fashion MNIST, PROAI has also been applied to the rotated face recognition task on the FDDB dataset [30] and the underwater rotated target recognition task on the SCTD dataset [31].

3.1 Preparation of Dataset and Design of CNN Architecture

3.1.1 Preparation of RIR Dataset

Two RIR datasets, Rotated MNIST and Rotated Fashion MNIST, are prepared for RIR experiments in this paper.

First, Rotated MNIST is generated by randomly rotating the images in MNIST within a certain angle range. Different angle ranges have been used in existing research, typically \([0^{\circ}, 0^{\circ}]\), \([-90^{\circ}, 90^{\circ}]\), and \([-180^{\circ}, 180^{\circ}]\). The rotated datasets generated with these three angle ranges are denoted rot0, rot180, and rot360, respectively. Assuming the images in the original MNIST are normally oriented, rot0 can be taken as the training dataset for PROAI. Consequently, PROAI only requires rot0 to train CNNs at the training stage, while existing RIR methods have to use rot180 or rot360. At the inference stage, the rot180 and rot360 test datasets are applied to compare the classification accuracies of PROAI and existing RIR methods.

In addition to the classification task, an orientation estimation task is also conducted on Rotated MNIST. Therefore, the rotation angles of images must be recorded as orientation labels. PROAI does not require orientation labels for training images, but existing RIR research must use them to train supervised orientation regressors. At the inference stage, the orientation labels of validation and test images are used to evaluate orientation estimation precision.

Second, Rotated Fashion MNIST is generated with the same method. It is obtained by applying rotational transformations to Fashion MNIST and is much more difficult to recognize than Rotated MNIST: for example, the data augmentation method achieves 95% test accuracy on Rotated MNIST but only 77% on Rotated Fashion MNIST. In addition, because the image orientations of Fashion MNIST are very close to the ideal normal orientation (see Fig. 3), Rotated Fashion MNIST can effectively examine whether PROAI is able to extend the recognition ability of a CNN from one orientation to arbitrary orientations.

In both Rotated MNIST and Rotated Fashion MNIST, the numbers of images in the training, validation, and test sets are 50,000, 10,000, and 10,000, respectively, and the image size is 28 × 28.

3.1.2 Design of CNN Architectures

To fairly compare PROAI with existing RIR methods, RDA is used as the baseline. The reasons for choosing RDA are twofold: (1) RDA is the most practical RIR method; (2) neither RDA nor PROAI needs to change the architecture of CNNs. The CNN architecture designed for PROAI and RDA is shown in Table 1. It is similar to the architecture used in prior RIR classification experiments [8] and is composed of classic convolution and max-pooling layers for feature extraction. The extracted features are transformed by the ReLU activation function, and to improve generalization, a Batch Normalization (BN) [32] layer is added after each ReLU activation layer. At the output of the CNN, the softmax activation function is applied, producing a classification score vector of length C (the number of sample categories in the training dataset).

Table 1 The CNN architecture designed for PROAI and RDA

PROAI and RDA require CNNs with different numbers of parameters to achieve their best generalization performance. The number of parameters of the CNN in Table 1 is adjusted by changing four architecture hyper-parameters: the number of channels of the initial convolutional layer (N1), the number of neurons of the fully connected layer (N2), and two multipliers for the channel numbers (M1 and M2). N1, N2, M1, and M2 are set to suit different RIR methods and datasets. Specifically, PROAI requires fewer parameters than RDA, and the CNNs designed for Rotated Fashion MNIST require more parameters than those for Rotated MNIST, because the images in Fashion MNIST are more complex and thus more convolution filters are needed. The architecture hyper-parameters, numbers of parameters, and numbers of computations of the CNNs used in the following experiments are listed in Table 2; these hyper-parameters allow PROAI and RDA each to achieve high generalization performance. The numbers of parameters and computations in Table 2 were calculated with the Python 'thop' library (see footnote 2).

Table 2 Architecture hyper-parameters, numbers of parameters, calculations of CNNs in the experiments
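To make the role of the four hyper-parameters concrete, below is a hedged PyTorch sketch of a CNN in the spirit of Table 1 (the exact layer counts and kernel sizes of Table 1 are not reproduced here, so those details are assumptions):

```python
import torch.nn as nn

def build_cnn(C, N1=16, N2=128, M1=2, M2=2):
    """Parametrized CNN: N1 initial channels, channel multipliers M1 and M2,
    N2 fully connected neurons, C output classes; 28x28 inputs assumed."""
    return nn.Sequential(
        nn.Conv2d(1, N1, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(N1),
        nn.MaxPool2d(2),                                   # 28 -> 14
        nn.Conv2d(N1, N1 * M1, 3, padding=1), nn.ReLU(),
        nn.BatchNorm2d(N1 * M1), nn.MaxPool2d(2),          # 14 -> 7
        nn.Conv2d(N1 * M1, N1 * M1 * M2, 3, padding=1), nn.ReLU(),
        nn.BatchNorm2d(N1 * M1 * M2),
        nn.Flatten(),
        nn.Linear(N1 * M1 * M2 * 7 * 7, N2), nn.ReLU(),
        nn.Linear(N2, C),  # softmax is applied in the loss / at inference
    )
```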

The reason for choosing the traditional and simple architecture in Table 1 is to demonstrate that PROAI is not picky about CNN architectures. In addition to this architecture, one of the most popular CNN architectures, ResNet [33], and a CNN architecture automatically designed by a state-of-the-art Neural Architecture Search (NAS) algorithm [34] are also applied to evaluate the RIR performance of PROAI.

3.1.3 Evaluation Metrics

Two important evaluation metrics for RIR tasks are classification accuracy and orientation estimation precision, calculated by Eqs. (9) and (10). The classification accuracy \(A\) represents the proportion of correct classifications among all targets, i.e.,

$$A = \frac{1}{N_{I}} \sum\limits_{n=1}^{N_{I}} I(y_{n} = \hat{y}_{n})$$
(9)

where \(N_{I}\) is the number of images in a dataset and \(I(\cdot)\) is the indicator function. The \(A\) calculated on the training, validation, and test datasets is referred to as the training, validation, and test accuracy, respectively. The higher \(A\) is, the more accurate the classification.

The orientation estimation precision is evaluated using the Mean Absolute Error (MAE), which is calculated by,

$$MAE = \frac{1}{{N_{I} }}\sum\limits_{i = 1}^{{N_{I} }} {\left| {\varphi_{i} - \hat{\varphi }_{i} } \right|}$$
(10)

where \(\varphi_{i}\) and \(\hat{\varphi}_{i}\) are the ground-truth and estimated rotation angles of the \(i\)th image. The lower the MAE, the more precise the orientation estimation.
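A minimal NumPy sketch of the two metrics follows; as in Eq. (10), the angular error is the plain absolute difference, with no wrap-around handling:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Eq. (9): proportion of correct classifications."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def orientation_mae(phi_true, phi_pred):
    """Eq. (10): mean absolute orientation error in degrees."""
    return float(np.mean(np.abs(np.asarray(phi_true) - np.asarray(phi_pred))))
```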

Besides these, other practical metrics are calculated to evaluate the training and inference of PROAI. At the training stage, the training time and the number of parameters of the CNNs used in PROAI and RDA are compared. Training time is the total time required for all training epochs, in seconds (s). The number of parameters is the number of learnable parameters of a CNN, reported in mega (M). At the inference stage, the inference time, classification accuracy, and orientation estimation precision of PROAI are compared with those of RDA. Inference time is the total time required to infer all images in the test set, in seconds (s).

3.2 Results at the Training Stage

This section evaluates the training time and number of parameters of CNNs trained by PROAI. As a comparison, the results of CNNs trained with RDA are also provided. The comparison demonstrates that the training method of PROAI effectively reduces both the training time and the number of parameters of CNNs.

3.2.1 Training Time

PROAI trains and validates CNNs on the rot0 datasets of Rotated MNIST and Rotated Fashion MNIST, using the CNN architectures designed in Table 2. Specifically, the CNNs with small numbers of parameters, CNN(small)@M and CNN(small)@F, are trained on Rotated MNIST and Rotated Fashion MNIST, respectively. The cross-entropy loss function in Eq. (1) and the Stochastic Gradient Descent (SGD) algorithm are used to optimize the parameters of the two CNNs. For training on both datasets, the momentum of SGD is 0.9, the batch size is 64, and the number of training epochs is 360.
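The stated optimization setup translates to the following sketch (the learning rate is not reported in the text, so the value here is a placeholder; `build_cnn`, `proai_train`, and `train_step` refer to the earlier sketches):

```python
import torch
from torch.utils.data import DataLoader

cnn = build_cnn(C=10)  # architecture sketch from Sect. 3.1.2
optimizer = torch.optim.SGD(cnn.parameters(), lr=0.01, momentum=0.9)  # lr assumed
loader = DataLoader(proai_train, batch_size=64, shuffle=True)

for epoch in range(360):           # 360 training epochs, batch size 64
    for images, labels in loader:
        train_step(cnn, optimizer, images, labels)
```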

As a comparison, RDA is applied to train and validate CNNs on the rot180 or rot360 datasets of Rotated MNIST and Rotated Fashion MNIST. The CNNs with large numbers of parameters, CNN(large)@M and CNN(large)@F, are trained on Rotated MNIST and Rotated Fashion MNIST, respectively. The other training hyperparameters are the same as those of PROAI.

The curves of cross-entropy loss versus training epoch are shown in Fig. 7; the loss curves of RDA are the results of training CNNs on rot360. As shown in Fig. 7a and b, both the training and validation losses of PROAI decrease and converge more quickly than those of RDA, and the losses of PROAI reach significantly lower values. These results demonstrate that PROAI converges faster during training, which Fig. 7c and d also confirm.

Fig. 7 Cross-entropy loss in relation to training epochs

To quantitatively evaluate how much faster the training of PROAI is, CNN(small)@M is trained by PROAI and CNN(large)@M by RDA on Rotated MNIST many times, with the total number of training epochs increased over [3, 480]. After each run, the classification accuracies on the training and validation datasets are calculated using Eq. (9), and the overall training time is recorded. The resulting training and validation accuracies versus training time are shown in Fig. 8. As shown, PROAI costs significantly less training time while achieving higher validation accuracies. In addition, PROAI and RDA cost 84 and 390 epochs, respectively, to reach their best generalization performance; under this condition, the training time of PROAI (\(t_{PROAI}\)) is only 10.3% of that of RDA (\(t_{DA}\)).

Fig. 8 Classification accuracies in relation to training time

3.2.2 Numbers of Parameters

PROAI trains CNNs on the rot0 dataset, so it requires CNNs with fewer parameters than RDA does. This is why we designed the CNNs with small numbers of parameters, CNN(small)@M and CNN(small)@F, for PROAI, and CNN(large)@M and CNN(large)@F for RDA.

To validate that PROAI requires fewer CNN parameters than RDA, both methods are used to train CNNs with large and small numbers of parameters. Each trained CNN is then applied to infer the images in the validation dataset, and the validation accuracies are calculated through Eq. (9). By observing the differences in validation accuracy, the preference of each method for the number of parameters can be concluded.

We train CNN(small)@M and CNN(large)@M on Rotated MNIST using PROAI and RDA, yielding the validation accuracies shown in Table 3. When training with PROAI, the CNN with fewer parameters, CNN(small)@M, achieves higher validation accuracy; in contrast, when training with RDA, the CNN with more parameters, CNN(large)@M, achieves higher validation accuracy. These results imply that PROAI requires fewer CNN parameters than RDA, i.e., PROAI can reduce the number of parameters of CNNs. For the Rotated MNIST dataset, the number of parameters of CNN(small)@M is only 2.92% of that of CNN(large)@M.

Table 3 Validation accuracies on Rotated MNIST of CNNs trained with PROAI and RDA

Moreover, we also train CNN(small)@F and CNN(large)@F on Rotated Fashion MNIST using PROAI and RDA, yielding the validation accuracies shown in Table 4. As can be seen, PROAI makes the CNN with fewer parameters achieve higher validation accuracy, while RDA makes the CNN with more parameters achieve higher validation accuracy. These results again imply that PROAI can reduce the number of parameters of CNNs.

Table 4 Validation accuracies on Rotated Fashion MNIST of CNNs trained with PROAI and RDA

3.2.3 Discussions

The results in Tables 3 and 4 show that PROAI reduces the number of CNN parameters in RIR tasks. This agrees with the fact that the training set of PROAI contains images at only one orientation while that of RDA contains images at all possible orientations: the variation within the PROAI training set is smaller, so the optimal number of parameters is lower. The advantages of this reduction are threefold: (1) the CNN requires less storage space for weights and a smaller memory footprint at run time; (2) the amount of computation is reduced, since with the same network architecture and image size, a CNN with fewer parameters is usually less computationally intensive; as shown in the "Numbers of Calculations" column of Table 2, the computation of the CNN is reduced to 32.22% and 37.28% of that of the RDA method on the two datasets; (3) the CNN weights converge faster during training; as shown in Fig. 8, PROAI achieves its highest generalization in 84 training epochs, while RDA requires 390 epochs.

Due to advantages (2) and (3) above, the total training time of PROAI is significantly lower than that of RDA.

3.3 Results at the Inference Stage

This section first compares the classification accuracies achieved by PROAI with those of existing RIR methods, demonstrating that PROAI achieves higher classification accuracies in RIR tasks.

Then this section compares the orientation estimation precision of PROAI with existing RIR methods. Because existing RIR methods have not reported orientation estimation precision, two tailor-designed CNN angle regressors are trained using the orientation labels of Rotated MNIST and Rotated Fashion MNIST. The orientation estimation experiments demonstrate that PROAI achieves higher orientation estimation performance in RIR tasks.

Finally, the inference time of PROAI is quantitatively evaluated.

3.3.1 Classification Accuracies

This sub-section compares the classification accuracy of PROAI with existing RIR methods. The classification accuracies in this section are calculated by applying Eq. (9) to the validation dataset. Rotated MNIST is first applied to calculate the classification accuracies of PROAI and RDA. For PROAI, the inference procedure of Sect. 2.2.1 is conducted: each test image is rotated into 36 orientations to form 36 rotation channels, in which the CNN architecture and weights are shared with the CNN(small)@M trained in Table 3. The classification accuracies of PROAI and existing RIR methods are shown in Table 5.

Table 5 Classification accuracy on rotated MNIST

In the first row of Table 5, the text before the arrow indicates the dataset used at the training stage and the text after the arrow indicates the dataset used at the inference stage; the first column gives the number of parameters of each CNN (with CNN(large)@M as the reference), and the remaining columns show the classification accuracies of each method on the rot0, rot180, and rot360 test sets. The results in the "rot0 → rot180" and "rot0 → rot360" columns show that PROAI using CNN(small)@M significantly increases the classification accuracy even though the training data are not augmented: in both columns, PROAI raises the classification accuracy to 99.2%, which is not only higher than the RDA method but also greater than the existing state-of-the-art accuracy (98.88%) achieved by the rotation-invariant ORN.

The results of CNN(small)@M in Table 5 reveal that PROAI can use a simple CNN architecture to achieve higher RIR performance than RDA. To demonstrate that PROAI also generalizes to other CNN architectures, two further architectures are evaluated. One is ResNet [33], one of the most popular CNN architectures; the other, called DARTSNet, is automatically designed for the MNIST dataset by a state-of-the-art NAS algorithm, Differentiable Architecture Search (DARTS) [34]. DARTSNet@6, composed of 6 levels of computation cells, is applied in PROAI, while DARTSNet@12, composed of 12 levels of computation cells, is applied in RDA. The classification accuracies achieved by PROAI and RDA under different CNN architectures are shown in Table 6.

Table 6 The classification accuracies under different CNN architectures

As shown in Table 6, first, DARTSNet@6 improves the classification accuracy of PROAI and achieves a state-of-the-art classification accuracy of 99.37%, while ResNet18 decreases the performance of PROAI; a possible cause of this drop is that the number of parameters of ResNet18 is still too large for applying PROAI to Rotated MNIST, which again shows that PROAI only requires CNN architectures with small numbers of parameters. Second, for each CNN architecture, PROAI achieves higher classification accuracy than RDA, which again shows that PROAI increases classification accuracy. Two conclusions can be drawn: (1) when the same CNN architecture is used, PROAI always outperforms RDA in classification accuracy; (2) designing an appropriate CNN architecture can improve the RIR performance of PROAI, and neural architecture search can be applied to design CNN architectures for PROAI.

In addition to Rotated MNIST, Rotated Fashion MNIST is also applied to calculate the classification accuracies of PROAI and RDA. For this dataset, the number of rotation channels N of PROAI is set to 144; the classification accuracies are shown in Table 7. As can be seen, the accuracy of PROAI is greater than that of RDA by 9.93%. This result implies that PROAI is also effective on Rotated Fashion MNIST, whose image patterns are more complex.

Table 7 Classification accuracies on rotated fashion MNIST

To examine the relationship between the RIR performance of PROAI and the number of pre-rotation channels, the "rot0 → rot360" classification accuracy curves of PROAI are calculated while the number of pre-rotation channels is varied in [1, 360]. The classification accuracy with respect to the number of rotation channels is shown in Fig. 9. On both datasets, the classification accuracy first noticeably increases and then stabilizes as the number of rotation channels grows. On Rotated MNIST, the highest classification accuracy, 99.2%, is achieved when N is 36, which is 3.9% greater than the highest accuracy achieved by RDA; on Rotated Fashion MNIST, the highest classification accuracy is achieved when N is 144, which is 12.7% greater than the highest accuracy achieved by RDA.

Fig. 9 Classification accuracies with respect to the number of rotation channels

Figure 9 also shows that the number of pre-rotation channels required for PROAI to achieve its best performance differs across RIR datasets, which implies that the inference time also differs. The inference time and the choice of N are discussed in Sects. 3.3.3 and 3.3.4.

3.3.2 Orientation Estimation Precisions

This sub-section evaluates the orientation estimation precision of PROAI, measured by MAE. By comparing the predicted orientations with the ground truth, the MAE is calculated using Eq. (10). Because the performance of PROAI depends on the number of pre-rotation channels, the MAE of PROAI is calculated for different numbers of pre-rotation channels, increased from 1 to 360. The resulting MAE curves are plotted as blue and orange lines in Fig. 10.

Fig. 10 Mean absolute error of orientation estimation with respect to the number of rotation channels

For comparison, the MAEs achieved by RDA are also calculated. Because existing RIR methods have not reported results for the orientation estimation task, the MAE of PROAI is compared with that of a tailor-designed CNN angle regressor. This regressor is built by adding two output neurons to CNN(large)@M and CNN(large)@F; the two neurons regress the sine and cosine of the image rotation angle [22, 23]. Since supervised orientation regressors need to be trained with images at all possible rotation angles, they are also regarded as RDA methods. The blue and orange dashed lines in Fig. 10 indicate the MAEs achieved by the supervised orientation regressors on Rotated MNIST and Rotated Fashion MNIST.
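A hedged sketch of this baseline's output head is shown below: two extra neurons regress the sine and cosine of the rotation angle, and the angle is recovered with `atan2` at inference time (the backbone and its feature dimension are placeholders):

```python
import torch
import torch.nn as nn

class AngleRegressor(nn.Module):
    """Supervised baseline: backbone features -> (sin, cos) of the angle."""
    def __init__(self, backbone, feat_dim):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, 2)  # two neurons: sin and cos

    def forward(self, x):
        s, c = self.head(self.backbone(x)).unbind(dim=1)
        return torch.rad2deg(torch.atan2(s, c))  # predicted angle in degrees
```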

The MAE curves of PROAI noticeably decrease first and then stabilize as the number of rotation channels increases. On Rotated MNIST, the lowest MAE, 8.739°, is achieved when N is 276, which is 6.38° lower than that of the supervised angle regressor; on Rotated Fashion MNIST, the lowest MAE, 26.93°, is achieved when N is 324, which is 11.63° lower than that of the supervised angle regressor.

Figure 10 demonstrates that PROAI can estimate orientation more precisely than the supervised angle regressor. Examples of rotated images labeled with the orientations estimated by PROAI are shown in Fig. 11.

Fig. 11 Examples of rotated images with the orientations estimated by PROAI

3.3.3 Inference Time

This sub-section compares the inference time of PROAI with existing RIR methods. Inference time here refers to the total time required to infer all images in the test sets. In the experiment, parallel computing is adopted for inference with a batch size of 64; the computer used is equipped with a Core i9 CPU and an NVIDIA 3090 GPU.

To observe the relationship between the number of pre-rotation channels and the inference time, the number of channels is gradually increased and the corresponding inference time is recorded. For each number of pre-rotation channels, the inference experiment is conducted 5 times to obtain 5 independent measurements of the inference time, which are plotted as boxplots. The resulting inference time curve of PROAI is shown in Fig. 12: the inference time of PROAI grows linearly with the number of pre-rotation channels, and the variance of the inference time also increases with the number of channels.

Fig. 12 Inference time of PROAI in relation to the number of pre-rotation channels

Figures 9 and 10 have demonstrated that the classification and orientation estimation performance of PROAI increase with the number of pre-rotation channels, and Fig. 12 has demonstrated that the inference time also increases with it. It is therefore valuable to study how many pre-rotation channels PROAI needs to outperform the classification and orientation estimation performance of RDA. To answer this question, the number of pre-rotation channels is gradually increased, and the classification and orientation estimation performance are evaluated at each setting. The resulting curves of classification accuracy and mean absolute error of orientation estimation are shown in Fig. 13, together with the results of RDA for comparison.

Fig. 13 Classification accuracies and mean absolute errors of orientation estimation vs. inference time

Figure 13a, b shows the classification accuracies with respect to inference time: the accuracies of PROAI are plotted with blue stars, and the result of RDA with an orange circle. As the two figures show, the inference time and classification accuracy of single-channel PROAI are lower than those of RDA, but the classification accuracy grows as the inference time increases. PROAI outperforms RDA when the numbers of pre-rotation channels are 6 and 18 on the two datasets; these two points are plotted as red squares, and the corresponding inference times of PROAI are 12.66 s and 79.74 s, which are 2.079 and 6.254 times that of RDA.

Figure 13c, d shows the MAEs of orientation estimation with respect to inference time: the MAEs of PROAI are plotted with blue stars, and the result of RDA with an orange circle. When using a single channel, the inference time of PROAI is lower than that of RDA, but its MAE is higher. As the number of pre-rotation channels increases, the MAE decreases while the inference time grows. When the number of pre-rotation channels increases to 8 and 24, PROAI outperforms RDA on Rotated MNIST and Rotated Fashion MNIST, respectively; these two points are plotted as red squares, and the corresponding inference times of PROAI are 16.88 s and 106.3 s, which are 2.772 and 8.337 times that of RDA.

3.3.4 Discussions

Figures 9 and 10 demonstrate that the RIR performance first increases and then stabilizes as the number of rotation channels grows, and different datasets require different numbers of rotation channels to reach their highest RIR performance. For Rotated MNIST, PROAI achieves the highest classification accuracy when N is 36, whereas for Rotated Fashion MNIST it does so when N is 144. The reason is that the orientations of handwritten digits in MNIST are distributed over a certain range, so a CNN trained directly on this dataset can recognize digits within a range of orientations rather than at a single orientation; consequently, a smaller number of pre-rotation channels suffices for PROAI to achieve the highest classification accuracy on Rotated MNIST. By contrast, the image orientations of Fashion MNIST are concentrated in a small range, so the trained CNN can only recognize images within a small range of orientations, and PROAI needs more rotation channels to achieve high classification accuracy on Rotated Fashion MNIST. This analysis shows that, whether the image orientation is normal or not, the weight-sharing mechanism of PROAI at the inference stage effectively generalizes a CNN's ability to recognize images at one or a few orientations to arbitrary orientations, which implies that PROAI can improve the generalization ability of any trained CNN. Since this inference procedure and its function are similar to test-time augmentation, future research on PROAI could focus on the explanation of test-time augmentation.

The results in Fig. 12 show that the inference time of PROAI is proportional to the number of channels, which depends on the dataset type, the task type, and the expected performance. In other words, PROAI can trade inference time for rotated image recognition performance, which includes both rotated classification performance and the additional orientation estimation ability obtained without learning orientation labels. To outperform RDA methods, different amounts of additional inference time are required on different datasets; it is therefore necessary to set a reasonable number of pre-rotation channels for the best trade-off between RIR performance and inference time. In the experiments of Sect. 3.3.1, no more than three times the inference time is sufficient to obtain better RIR performance than RDA, and in the experiments of Sect. 3.3.2, no more than nine times the inference time is sufficient to obtain better orientation estimation performance than RDA. Although PROAI requires longer inference time, the computation of each channel is completely independent, so parallel computation can effectively accelerate inference in practice.

In summary, the experiments in Sects. 3.3.1 and 3.3.2 demonstrate that PROAI achieves state-of-the-art performance in both rotated image classification and orientation estimation on the Rotated MNIST and Rotated Fashion MNIST datasets.

4 Conclusion

While existing rotated image recognition methods focus on making CNNs "memorize" as many images as possible during the training stage, this paper has proposed a novel rotated image recognition mechanism, PROAI, which simulates the mental rotation process of the human brain. At the training stage, images at only one orientation are learned by the CNN. At the inference stage, images at any orientation are fed into a cluster of CNNs sharing the same architecture and weights to calculate classification scores, whose maximum value is used to simultaneously estimate both the category and the orientation of each test image. PROAI significantly reduces the number of parameters and the training time of CNNs in RIR tasks and achieves state-of-the-art classification accuracies and orientation estimation precisions on several datasets.

The main limitation of PROAI is that its multi-channel inference architecture costs more computing power, so its inference time is proportional to the number of channels. However, since the computation of each channel is completely independent, parallel computation can effectively accelerate inference in practice. In addition, the trade-off between performance and inference time must be handled by setting a reasonable number of pre-rotation channels, so that both high accuracy and high inference speed can be achieved.

PROAI achieves state-of-the-art performance on the rotated digit recognition and rotated fashion recognition tasks. Since the inference procedure of PROAI is similar to test-time augmentation, it holds promise for explaining test-time augmentation. It would also be beneficial to generalize PROAI to other kinds of image transformations, such as scale transformations.