1 Introduction

Fine-grained visual classification (FGVC) aims at identifying sub-classes of a given object category, e.g., different species of birds, and models of cars and aircraft. It is a much more challenging problem than traditional classification due to the inherently subtle inter-class variations amongst sub-categories. The most effective solutions to date rely on extracting fine-grained feature representations at local discriminative regions, either by explicitly detecting semantic parts [11, 12, 38, 39, 41] or implicitly via saliency localization [4, 10, 25, 33]. Such locally discriminative features are then collectively fused to perform the final classification.

Early work mostly found discriminative regions with the assistance of manual annotations [2, 16, 21, 37, 40]. However, human annotations are difficult to obtain and can often be error-prone, resulting in performance degradation [41]. Research focus has consequently shifted to training models in a weakly-supervised manner given only category labels [4, 26, 33, 38, 41]. The success of these models can largely be attributed to their ability to locate more discriminative local regions for downstream classification. However, little or no effort has been made towards (i) establishing at which granularities these local regions are most discriminative, e.g., the head or beak of a bird, and (ii) determining how information across different granularities can be fused to improve classification accuracy, e.g., can head and beak work together?

Information across various granularities is, however, helpful for mitigating the effect of large intra-class variations. For example, experts sometimes need to identify a bird using both the overall structure of its head and finer details such as the shape of its beak. That is, it is often not sufficient to identify the discriminative parts; one must also capture how these parts interact with each other in a complementary manner. Very recent research has focused on the “zooming-in” factor [11, 39], i.e., not just identifying parts, but also focusing on the truly discriminative regions within each part (e.g., the beak, more than the head). Yet these methods mostly focus on a few parts while ignoring others, and go little beyond simple fusion when zooming in. More importantly, they do not consider how features from different zoomed-in parts can be fused together in a synergistic manner. Different from these approaches, we argue that one not only needs to identify parts and their most discriminative granularities, but also how parts at different granularities can be effectively merged.

In this paper, we take an alternative stance towards fine-grained classification. We neither explicitly nor implicitly attempt to mine fine-grained feature representations from parts (or their zoomed-in versions). Instead, we approach the problem with the hypothesis that the fine-grained discriminative information lies naturally within different visual granularities: it is all about encouraging the network to learn at different granularities while simultaneously fusing multi-granularity features together. This is best illustrated by Fig. 1.

More specifically, we propose a consolidated framework that accommodates part granularity learning and cross-granularity feature fusion simultaneously. This is achieved through two components that work synergistically with each other: (i) a progressive training strategy that effectively fuses features from different granularities, and (ii) a random jigsaw patch generator that encourages the network to learn features at specific granularities. Note that we refrain from using “scale” since we do not apply Gaussian blur filters on image patches; rather, we evenly divide and shuffle image patches to form different granularity levels.

Fig. 1.

Illustration of features learned by general methods (a and b) and our proposed method (c and d). (a) Traditional convolutional neural networks trained with cross entropy (CE) loss tend to find only the most discriminative parts. (b) Other state-of-the-art methods focus on how to find more discriminative parts. (c) Our proposed progressive training (here we use the last three stages for illustration) gradually locates discriminative information from shallow stages to deeper stages. Features extracted from all trained stages are then concatenated to ensure that complementary relationships are fully explored, which is represented by “Stage Concat.” (d) With the assistance of the jigsaw puzzle generator, the granularity of the parts learned at each step is restricted to within the patches.

As the first contribution, we propose a multi-granularity progressive training framework to learn the complementary information across different granularities. This differs significantly from prior art where parts are first detected and later fused in an ad-hoc manner. Our progressive framework operates in steps during training, where each step focuses on cultivating granularity-specific information with a corresponding stage of the network. We start with finer granularities, which are more stable, and gradually move onto coarser ones, which avoids the confusion caused by the large intra-class variations that appear in large regions. On its own, this is akin to a “zooming out” operation, where the network first focuses on a local region, then zooms out to a larger patch surrounding this local region, and finishes when it reaches the whole image. More specifically, when each training step ends, the parameters trained at the current step are passed to the next training step as its parameter initialization. This passing operation essentially enables the network to mine information of larger granularity based on the regions learned in its previous training step. Features extracted from all stages are concatenated only at the last step to further ensure that complementary relationships are fully explored.

However, applying progressive training naively would not benefit fine-grained feature learning, because the multi-granularity information learned via progressive training may tend to focus on similar regions. As the second contribution, we tackle this problem by introducing a jigsaw puzzle generator to form a different granularity level at each training step, with only the last step trained on original images. This effectively encourages the model to operate at patch level, where patch sizes are specific to a particular granularity. It essentially forces each stage of the network to focus on local patches rather than holistically across the entire image, thereby learning information specific to a given granularity level. This effect is demonstrated in Fig. 1, and Fig. 2 illustrates the learning process of progressive training with the jigsaw puzzle generator. Note that the very recent work of [4] first adopted a jigsaw solver for fine-grained classification. We differ significantly in that we do not employ a jigsaw solver as part of feature learning. Instead, we simply generate jigsaw patches randomly as a means of introducing different object-part levels to assist progressive training.

The main contributions of this paper can be summarized as follows:

  1.

    We propose a novel progressive training strategy for fine-grained visual classification (FGVC). It operates in different training steps, fusing information from previous levels of granularity at each step, and ultimately cultivates the inherent complementary properties across different granularities for fine-grained feature learning.

  2.

    We adapt a simple yet effective jigsaw puzzle generator to form images with different levels of granularity. This allows the network to focus on different “scales” of features, as in prior work.

  3.

    The proposed Progressive Multi-Granularity (PMG) training framework obtains state-of-the-art or competitive performance on three standard FGVC benchmark datasets.

2 Related Work

2.1 Fine-Grained Classification

Recent studies on FGVC have moved from strongly-supervised scenarios with additional annotations, e.g., bounding boxes [2, 16, 21, 37, 40], to weakly-supervised conditions with only category labels [11, 12, 22, 36, 38, 39, 41, 42].

In the weakly-supervised configuration, recent studies mainly focus on locating the most discriminative parts, more complementary parts, and parts of various granularities. However, few have considered how to fuse the information from these discriminative parts. Current fusion techniques can be roughly divided into two categories. The first category makes predictions based on different parts and then directly combines their probabilities. For example, Zhang et al. [39] trained several networks focusing on features of different granularities to produce diverse prediction distributions, and then weighted their results before combining them. The second category concatenates the features extracted from different parts for the next prediction [11, 12, 38, 41]. Fu et al. [11] found that region detection and fine-grained feature learning can reinforce each other, and built a series of networks in which each network locates discriminative regions for the next one while making its own prediction. With a similar motivation, Zheng et al. [41] jointly learned part proposals and the feature representations of each part, and located various discriminative parts before prediction. Both of these train a fully-connected fusion layer to fuse the features extracted from different parts. Ge et al. [12] went one step further, fusing features from complementary object parts with two stacked LSTMs.

Fusing features from different parts remains a challenging problem that has received limited attention. In this work, we tackle it based on an intrinsic characteristic of fine-grained objects: despite large intra-class variation, the subtle details exhibit stability within local regions. Hence, instead of locating discriminative parts first, we guide the network to learn features progressively, from small granularity to large granularity.

2.2 Image Splitting Operations

Splitting an image into equal-sized pieces has been utilized for various tasks in prior art. Amongst them, one typical use is solving the jigsaw puzzle [6, 31]. One can also go a step further by adopting the jigsaw puzzle solution as the initialization of a weakly-supervised network, which leads to better transfer performance [35]. This helps the network exploit the spatial relationships of local image regions. In one-shot learning, the image splitting operation has been used for data augmentation [5], splitting two images and exchanging patches between them to generate new training samples. More recently, DCL [4] first adopted the image splitting operation for FGVC, destructing images to emphasize local details and then reconstructing them to learn the semantic correlations among local regions. However, it split images into patches of the same size during the whole training process, which made it difficult to exploit multi-granularity regions. In this work, we apply a jigsaw puzzle generator to restrict the granularity of the regions learned at each training step.

2.3 Progressive Training

The progressive training methodology was originally proposed for generative adversarial networks [18], where training started with low-resolution images and the resolution was progressively increased by adding new layers to the network. Instead of learning from all scales at once, this strategy allows the network to discover the large-scale structure of the image distribution and then shift attention to increasingly finer-scale details. Recently, progressive training has been widely utilized for generation tasks [1, 19, 29, 34], since it simplifies information propagation within the network through intermediate supervision.

For FGVC, the fusion of multi-granularity information is critical to model performance. In this work, we adopt the idea of progressive training to design a single network that can learn this information through a series of training stages. The input images are first split into small patches to train the low-level layers of the model. The patch size is then progressively increased, and the corresponding higher-level layers are added and trained. Most existing work on progressive training focuses on the task of sample generation; to the best of our knowledge, it has not been attempted before for FGVC.

Fig. 2.

Illustration of the progressive training process. The network is trained from shallow stages with smaller patches to deeper stages with larger patches. At the end of each training step, the parameters from the current step initialize those of the following step. This enables the network to further mine information of larger granularity based on the detailed knowledge learned in the previous training step.

3 Approach

In this section, we present the proposed Progressive Multi-Granularity (PMG) training framework. We encourage the model to learn stable fine-grained information in the shallower layers, and to gradually shift to learning more abstract information of larger granularity in the deeper layers as training progresses. Please refer to Fig. 2.

3.1 Progressive Training

Network Architecture. Our network design for progressive training is generic and can be implemented on top of any state-of-the-art backbone feature extractor, such as ResNet [14]. Let F be our backbone feature extractor with L stages. The output feature map of any intermediate stage is represented as \({F}^{l} \in \mathbb {R}^{H_{l} \times W_{l} \times C_{l}}\), where \(H_{l}\), \(W_{l}\), \(C_{l}\) are the height, width and number of channels of the feature map at the l-th stage, and \(l \in \{1,2, \ldots ,~L\}\). Our objective here is to impose a classification loss on the feature maps extracted at different intermediate stages. Hence, in addition to F, we introduce a convolution block \({H}_{conv}^{l}\) that takes the l-th intermediate stage output \({F}^{l}\) as input and reduces it to a vector representation \({V}^{l} = {H}_{conv}^{l}({F}^{l})\). Thereafter, a classification module \({H}_{class}^{l}\) corresponding to the l-th stage, consisting of two fully-connected layers with BatchNorm [17] and ELU [7] non-linearity, predicts the probability distribution over the classes as \(y^{l} = {H}_{class}^{l}({V}^{l})\). Here, we consider the last S stages: \(l = L, L-1, \dotsc , L-S+1\). Finally, we concatenate the outputs of the last S stages as

$$\begin{aligned} {V}^{concat} = \text {concat}[{V}^{L-S+1}, \dotsc , {V}^{L-1}, {V}^{L}] \end{aligned}$$
(1)

This is followed by an additional classification module \(y^{concat} = {H}_{class}^{concat}({V}^{concat})\).
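A minimal PyTorch sketch of this design for a ResNet50 backbone (\(L=5\), \(S=3\)) is given below; the channel widths, the hidden dimension of 512, and the exact layer composition of \({H}_{conv}^{l}\) and \({H}_{class}^{l}\) are illustrative assumptions rather than the exact experimental configuration:

```python
import torch
import torch.nn as nn
import torchvision

class PMGHead(nn.Module):
    """Per-stage head: H_conv^l reduces the stage feature map F^l to a
    vector V^l, and H_class^l (two FC layers) predicts class scores y^l.
    The softmax of Fig. 3 is folded into the cross-entropy loss."""
    def __init__(self, in_channels, feat_dim, num_classes):
        super().__init__()
        self.conv = nn.Sequential(                    # H_conv^l
            nn.Conv2d(in_channels, feat_dim, kernel_size=1),
            nn.BatchNorm2d(feat_dim), nn.ELU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(feat_dim), nn.ELU(inplace=True),
            nn.AdaptiveMaxPool2d(1))
        self.classifier = nn.Sequential(              # H_class^l
            nn.Linear(feat_dim, feat_dim),
            nn.BatchNorm1d(feat_dim), nn.ELU(inplace=True),
            nn.Linear(feat_dim, num_classes))

    def forward(self, f):
        v = self.conv(f).flatten(1)                   # V^l
        return v, self.classifier(v)                  # (V^l, y^l)

class PMG(nn.Module):
    """Backbone F plus heads on the last S=3 stages and a concat classifier."""
    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=True)
        self.stem = nn.Sequential(                    # stages 1-3 of F
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1, backbone.layer2)
        self.stage4, self.stage5 = backbone.layer3, backbone.layer4
        self.head3 = PMGHead(512, feat_dim, num_classes)
        self.head4 = PMGHead(1024, feat_dim, num_classes)
        self.head5 = PMGHead(2048, feat_dim, num_classes)
        self.classifier_concat = nn.Sequential(       # H_class^concat
            nn.Linear(3 * feat_dim, feat_dim),
            nn.BatchNorm1d(feat_dim), nn.ELU(inplace=True),
            nn.Linear(feat_dim, num_classes))

    def forward(self, x):
        f3 = self.stem(x)                             # F^3
        f4 = self.stage4(f3)                          # F^4
        f5 = self.stage5(f4)                          # F^5
        (v3, y3), (v4, y4), (v5, y5) = self.head3(f3), self.head4(f4), self.head5(f5)
        y_concat = self.classifier_concat(torch.cat([v3, v4, v5], dim=1))  # Eq. (1)
        return y3, y4, y5, y_concat
```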

Training Process. During training, each iteration contains \(S+1\) steps, where the low-level stages of the model are trained first and new stages are progressively added. Since the receptive field and representation ability of the low-level stages are limited, the network is forced to first exploit discriminative information from local details (i.e., object textures). Directly training the whole network tends to learn all the granularities simultaneously. In contrast, step-wise incremental training naturally allows the model to mine discriminative information from local details to global structures as the features are gradually sent to higher stages.

For training, we compute the cross entropy (CE) loss \(\mathscr {L}_{CE}\) between the ground truth label y and the predicted output of every stage.
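Using the weights \(\alpha \) and \(\beta \) introduced in Sect. 4.4 (our reconstruction of the step-wise objectives, consistent with the hyper-parameters reported in Sect. 4.1), the losses can be written as

$$\begin{aligned} \mathscr {L}^{l} = \alpha \cdot \mathscr {L}_{CE}({y}^{l},\,y), \quad l = L-S+1, \dotsc , L, \qquad \mathscr {L}^{concat} = \beta \cdot \mathscr {L}_{CE}({y}^{concat},\,y), \end{aligned}$$

where \(\mathscr {L}^{l}\) is minimized at the step that trains the \(l^{th}\) stage and \(\mathscr {L}^{concat}\) at the final step.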

At each iteration, a batch of data d is used for \(S+1\) steps, and we only train one stage's output at each step, in sequence. It should be made clear that all parameters used in the current prediction will be optimized, even if they may have been updated in previous steps; this helps all stages in the model work together.

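A minimal sketch of one training iteration under these rules is given below (PyTorch; `PMG` follows the architecture sketch above, `jigsaw_generator` is the patch-shuffling operation described in Sect. 3.2, and \(\alpha \), \(\beta \) are the loss weights discussed in Sect. 4.4):

```python
import torch.nn as nn

def train_iteration(model, optimizer, images, labels, alpha=1.0, beta=2.0):
    """One iteration = S+1 steps (here S=3, L=5, so n = 8, 4, 2, then 1)."""
    criterion = nn.CrossEntropyLoss()
    # Steps 1..S: one granularity-specific stage output per step.
    for step, n in enumerate([8, 4, 2]):             # n = 2^(L-l+1) for l = 3, 4, 5
        optimizer.zero_grad()
        inputs = jigsaw_generator(images, n)         # P(d, n), Sect. 3.2
        outputs = model(inputs)                      # (y3, y4, y5, y_concat)
        loss = alpha * criterion(outputs[step], labels)
        loss.backward()                              # optimizes every parameter
        optimizer.step()                             #   used in this prediction
    # Step S+1: original images, concatenated features.
    optimizer.zero_grad()
    y_concat = model(images)[-1]
    loss = beta * criterion(y_concat, labels)
    loss.backward()
    optimizer.step()
```

For clarity, this sketch runs the full forward pass at every step; since only the selected output contributes to the loss of that step, exactly the parameters used for that prediction receive gradients.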

3.2 Jigsaw Puzzle Generator

Jigsaw puzzle solving [35] has been found to be suitable as a self-supervised task in representation learning. In contrast, we borrow the notion of the jigsaw puzzle to generate input images for the different steps of progressive training. The objective is to devise regions of different granularity and force the model to learn information specific to the corresponding granularity level at each training step. Given an input image \(d\in \mathbb {R}^{3\times W\times H}\), we equally split it into \(n \times n\) patches of dimension \(3\times \frac{W}{n}\times \frac{H}{n}\). The patches are then shuffled randomly and merged together into a new image \(P(d,\,n)\). Here, the granularity of the patches is controlled by the hyper-parameter n.
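A possible implementation of \(P(d,\,n)\) is sketched below; for simplicity, this version applies a single random permutation to the whole batch, whereas per-image permutations are equally valid:

```python
import torch

def jigsaw_generator(images, n):
    """Randomly shuffle an n x n grid of equal-sized patches: P(d, n).
    `images` is a (B, 3, H, W) tensor with H and W divisible by n."""
    b, c, h, w = images.shape
    ph, pw = h // n, w // n
    # Split into n*n patches of size (ph, pw).
    patches = images.unfold(2, ph, ph).unfold(3, pw, pw)   # (B, C, n, n, ph, pw)
    patches = patches.contiguous().view(b, c, n * n, ph, pw)
    # Shuffle the patch order (one permutation per batch, for simplicity).
    perm = torch.randperm(n * n)
    patches = patches[:, :, perm]
    # Merge the shuffled patches back into a full image.
    patches = patches.view(b, c, n, n, ph, pw)
    out = patches.permute(0, 1, 2, 4, 3, 5).contiguous().view(b, c, h, w)
    return out
```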

Fig. 3.

The procedure of the progressive training strategy, which consists of \(S+1\) steps at each iteration (here \(S=3\) for illustration). The Conv Block represents the combination of two convolution layers and a max pooling layer, and the Classifier represents two fully connected layers with a softmax layer at the end. At each iteration, the training data are augmented by the jigsaw generator and sequentially fed into the network over \(S+1\) steps. In our training process, the hyper-parameter n is \(2^{L-l+1}\) for the \(l^{th}\) stage. At each step, the output from the corresponding classifier is used for loss computation and parameter updating.

Regarding the choice of the hyper-parameter n for each stage, two conditions need to be satisfied: (i) the size of the patches should be smaller than the receptive field of the corresponding stage, otherwise the effect of the jigsaw puzzle generator is reduced; (ii) the patch size should increase proportionally with the receptive fields of the stages. Usually, the receptive field of each stage is approximately double that of the previous stage. Hence, we set n to \(2^{L-l+1}\) for the \(l^{th}\) stage's output. For example, with a ResNet50 backbone (\(L=5\)) and \(S=3\), the supervised stages \(l=3,4,5\) use \(n=8,4,2\) respectively, and the final concatenated step uses the original image (\(n=1\)).

During training, a batch of training data d is first augmented into several jigsaw puzzle generator-processed batches P(d, n). All these batches share the same label y. Then, for the \(l^{th}\) stage's output \({y}^{l}\), we input the batch \(P(d,\,n)\) with \(n=2^{L-l+1}\), and optimize all the parameters used in this propagation. Figure 3 illustrates the whole progressive training process with the jigsaw puzzle generator step by step.

It should be clarified that the jigsaw puzzle generator cannot always guarantee the completeness of parts smaller than the patch size, since they still have a chance of being split. However, this is not bad news for model training: we adopt random cropping, a standard data augmentation strategy, before the jigsaw puzzle generator, so parts of the appropriate granularity that are split at one iteration will not always be split at other iterations. Hence, it brings the additional advantage of forcing our model to find more discriminative parts at the specific granularity level.

3.3 Inference

At the inference phase, we merely input the original images into the trained model; the jigsaw puzzle generator is unnecessary. If we only use \({y}^{concat}\) for prediction, the FC layers of the other three stages can be removed, which reduces the computational budget. In this case, the final result \(C_1\) can be expressed as

$$\begin{aligned} C_1 = \arg \max ({y}^{concat}). \end{aligned}$$
(2)

However, the prediction from each single stage, carrying information of a specific granularity, is unique and complementary, and simply combining all outputs with equal weights leads to better performance. The multi-output combined prediction \(C_2\) can be written as

$$\begin{aligned} C_2 = \arg \max \Big (\sum _{l=L-S+1}^L {y}^{l} + {y}^{concat}\Big ). \end{aligned}$$
(3)
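In code, both decision rules reduce to a single forward pass (a sketch assuming the model returns the four outputs \(({y}^{3}, {y}^{4}, {y}^{5}, {y}^{concat})\) as in Sect. 3.1; the equal-weight combination is taken here over softmax scores):

```python
import torch

@torch.no_grad()
def predict(model, images):
    """Inference per Eqs. (2) and (3); the jigsaw generator is not used."""
    model.eval()
    y3, y4, y5, y_concat = model(images)          # raw scores (logits)
    c1 = y_concat.argmax(dim=1)                   # Eq. (2)
    # Eq. (3): equal-weight combination of all outputs.
    combined = sum(y.softmax(dim=1) for y in (y3, y4, y5, y_concat))
    c2 = combined.argmax(dim=1)
    return c1, c2
```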

4 Experimental Results and Discussion

In this section, we evaluate the performance of the proposed method on three fine-grained image classification datasets: CUB-200-2011 (CUB) [32], Stanford Cars (CAR) [20], and FGVC-Aircraft (AIR) [27]. Implementation details are introduced in Sect. 4.1. Classification accuracy comparisons with other state-of-the-art methods are provided in Sect. 4.2. To illustrate the advantages of the different components and design choices in our method, a comprehensive ablation study and a visualization are provided in Sects. 4.3 and 4.5. In addition, the hyper-parameter selection and the fusion technique are discussed in Sect. 4.4.

4.1 Implementation Details

We perform all experiments using PyTorch [28] (version 1.3 or higher) on a cluster of GTX 2080 GPUs. The proposed method is evaluated on two widely used backbone networks: VGG16 [30] and ResNet50 [14], for which the total number of stages is \(L=5\). For the best performance, we set \(S=3\), \(\alpha =1\), and \(\beta =2\). The category labels of the images are the only annotations used for training. The input images are resized to a fixed size of \(550 \times 550\) and randomly cropped to \(448 \times 448\), and random horizontal flipping is applied for data augmentation during training. During testing, the input images are resized to a fixed size of \(550 \times 550\) and center-cropped to \(448 \times 448\). All the above settings are standard in the literature.

We use the stochastic gradient descent (SGD) optimizer and batch normalization as the regularizer. The learning rates of the newly added convolution and FC layers are initialized to 0.002 and reduced following the cosine annealing schedule [24]. The learning rates of the pre-trained convolution layers are maintained at 1/10 of those of the newly added layers. For all the aforementioned models, we train for up to 200 epochs with a batch size of 16, a weight decay of 0.0005, and a momentum of 0.9.
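A sketch of this optimizer configuration is given below; the grouping of parameters by attribute name assumes the `PMG` model sketched in Sect. 3.1:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = PMG(num_classes=200)   # CUB-200-2011; PMG from the Sect. 3.1 sketch
new_params, backbone_params = [], []
for name, p in model.named_parameters():
    # 'stem'/'stage*' mark the pre-trained backbone layers in the sketch above.
    (backbone_params if name.startswith(('stem', 'stage')) else new_params).append(p)

optimizer = torch.optim.SGD(
    [{'params': new_params, 'lr': 0.002},           # newly added layers
     {'params': backbone_params, 'lr': 0.0002}],    # pre-trained layers: 1/10
    momentum=0.9, weight_decay=0.0005)
scheduler = CosineAnnealingLR(optimizer, T_max=200) # cosine annealing, 200 epochs
```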

4.2 Comparisons with State-of-the-Art Methods

Comparisons of our method with other state-of-the-art methods on CUB-200-2011, Stanford Cars, and FGVC-Aircraft are presented in Table 1. Both the accuracy of the single output \(C_1\) and that of the combined output \(C_2\) are listed. In addition, we run our method 5 times with random initialization and conduct one-sample Student's t-tests to confirm the significance of our results in Table 2. The results show that our improvements are statistically significant at significance level 0.05.

Table 1. Comparison with other state-of-the-art methods.
Table 2. The p-values of one-sample Student's t-tests between the combined accuracies of our method and the methods with the closest performance on the three datasets. The proposed method is statistically significantly different from a referenced technique if the corresponding p-value is smaller than 0.05.

CUB-200-2011. We achieve a competitive result on this dataset with a much simpler experimental procedure, since only a single feed-forward pass through one network is needed during testing. Our method outperforms RA-CNN [11] and MGE-CNN [39] by 4.3% and 1.1%, even though they build several different networks to learn information of various granularities. They train the classification of each network separately and then combine their information for testing, which proves our advantage of exploiting multi-granularity information gradually in one network. Moreover, although Stacked LSTM [12] obtains better performance than our method, it is a two-phase algorithm that requires Mask-RCNN [13] and CPF to provide complementary object parts and then uses a bi-directional LSTM [15] for classification, which leads to longer inference time and a larger computation budget.

Stanford Cars. Our method achieves state-of-the-art performance with ResNet50 as the base model. Since the performance of \({y}^{concat}\) is already good, the improvement from combining multi-stage outputs is not obvious. Our method surpasses PC [10] even though the latter gains considerably from adopting a more advanced backbone network, i.e., DenseNet161. Compared with MA-CNN [41] and NTS-Net [38], which first locate several different discriminative parts and then combine the features extracted from each of them for final classification, we outperform them by large margins of 2.3% and 1.2%, respectively.

FGVC-Aircraft. On this task, the multi-output combined result of our method also achieves state-of-the-art performance. Although S3N [9] finds both discriminative and complementary parts for feature extraction, and applies an additional inhomogeneous transform to highlight these parts, we still outperform it by 0.6% with the same ResNet50 backbone, and show competitive results even when adopting VGG16 as the base model.

4.3 Ablation Study

We conduct ablation studies to understand the effectiveness of the progressive training strategy and the jigsaw puzzle generator. We choose the CUB-200-2011 dataset for these experiments and ResNet50 as the backbone network, which means the total number of stages L is 5. We first design different runs with the number of supervised stages S increasing from 1 to 5 and no jigsaw puzzle generator, as shown in Table 3. The \({y}^{concat}\) output is kept for all runs, and the number of steps is \(S+1\). It is clear that increasing S boosts the model performance significantly when \(S<4\). However, we also notice that the accuracy starts to decrease when \(S=4\). The likely reason is that the low-stage layers mainly focus on class-irrelevant features, and the additional supervision forces them to distill class-relevant information prematurely, which affects the overall performance.

In Table 3, we also report the results of our method with the jigsaw puzzle generator. The hyper-parameter n of the jigsaw puzzle generator for the \(l^{th}\) stage follows the pattern \(n=2^{L-l+1}\). It is evident that the jigsaw puzzle generator improves model performance on top of progressive training when \(S<4\). When \(S=4\), the model with the jigsaw puzzle generator shows no advantage, and when \(S=5\) the jigsaw puzzle generator lowers the model performance. This is because when \(n>8\) the split patches are too small to retain meaningful information, which causes confusion in model training.

According to the above analysis, progressive training is beneficial for the fine-grained classification task when an appropriate S is chosen. In that case, the jigsaw puzzle generator can further improve performance.

Table 3. The performance of the proposed method for different values of the hyper-parameter S, with and without the jigsaw puzzle generator.
Table 4. The combined accuracies of our method with different \(\alpha \) and \(\beta \).

4.4 Discussions

The Choice of Hyper-parameters \(\varvec{\alpha }\) and \(\varvec{\beta }\). In our training procedure, the first S steps and the last step are trained with different goals: learning features of increasing granularity as the network goes deeper, and learning the correlations between multi-granularity features. Hence, we introduce two hyper-parameters \(\alpha \) and \(\beta \) to weight their training losses. The model performances for different choices of \(\alpha \) and \(\beta \) are listed in Table 4. Keeping \(\alpha =1\), it can be observed that the accuracy first increases and then decreases as \(\beta \) changes, and the model achieves the best performance on all three datasets when \(\beta =2\).

Fusion of Multi-granularity Information. In the experiments, we generate images containing multi-granularity information via the jigsaw puzzle generator with \(n=\{8,4,2,1\}\), and fuse this information with one network in a progressive manner. To demonstrate the advantage of this fusion strategy under the same configuration, we conduct two experiments: (i) training four different networks separately on generated images with \(n=\{8,4,2,1\}\) and concatenating their features for final classification with a fully connected fusion layer, which is similar to the fusion technique used in RA-CNN [11], and (ii) training a model with the same architecture as ours but back-propagating the losses of the four outputs in one step. We choose the CUB-200-2011 dataset for these experiments, with ResNet50 as the base model; the results are listed in Table 5. The performance of the four separately trained networks is higher than that of many state-of-the-art methods, but our method still outperforms it by a large margin, which indicates the effectiveness of our fusion technique. When we back-propagate the losses of the four outputs in one step, i.e., when multi-granularity information is learnt simultaneously, the performance clearly drops even though the other configurations are unchanged. Hence, the unique advantage of progressively learning multi-granularity information is significant.

Table 5. Comparison between our fusion technique and alternative fusion strategies.

4.5 Visualization

To demonstrate that our motivation is achieved, we apply Grad-CAM to visualize the convolution layers of the last three stages of both our method and the baseline model. Columns (a)–(c) in Fig. 4 are visualizations of the convolution layers from the third to the fifth stage of our model's backbone network, which are supervised by jigsaw puzzle generator-processed images with \(n=\{8,4,2\}\) sequentially. Column (a) shows that at the third stage the model concentrates on discriminative parts of small granularity, such as bird eyes and small patterns or textures of the birds' feathers. By column (c), the fifth stage of the model pays attention to parts of larger granularity. The visualization results demonstrate that our model truly makes predictions based on discriminative parts, moving gradually from small granularity to large granularity.

Compared with the activation maps of the baseline model, our model shows more meaningful concentration on the target object, while the baseline model only shows correct attention at the last stage. This difference indicates that the intermediate supervision of progressive training can help the model distill useful information at low-level stages. Moreover, we find that the baseline model usually concentrates on only one or two parts of the object at the last stage, whereas the attention regions of our method nearly cover the whole object at each stage, which indicates that jigsaw puzzle generator-processed images force the model to learn more discriminative parts at each granularity level.
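For reference, the per-stage visualization can be reproduced with a minimal hook-based Grad-CAM along the following lines (an illustrative re-implementation, not necessarily the exact tool used to produce Fig. 4):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM over one stage's convolution output,
    e.g., target_layer = model.stage5 in the Sect. 3.1 sketch."""
    model.eval()
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    logits = model(image.unsqueeze(0))[-1]             # score from y_concat
    idx = int(logits.argmax(dim=1)) if class_idx is None else class_idx
    model.zero_grad()
    logits[0, idx].backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # GAP over gradients
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode='bilinear',
                        align_corners=False)
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)                    # normalized heat map
```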

Fig. 4.

Activation maps of selected results on the CUB dataset with ResNet50 as the base model. Columns (a)–(c) and (d)–(f) are visualizations of the last three stages' convolution layers of our model and the baseline model, respectively.

5 Conclusions

In this paper, we approached the problem of fine-grained visual classification from a rather unconventional perspective: we neither explicitly nor implicitly mine for object parts; instead, we show that fine-grained features can be extracted by learning across granularities and effectively fusing multi-granularity features. Our method can be trained end-to-end without manual annotations other than category labels, and needs only one network with one feed-forward pass during testing. We conducted experiments on three widely used fine-grained datasets and obtained state-of-the-art performance on two of them while being competitive on the third.