1 Introduction

Recently, a large body of computer vision research has focused on the fine-grained image recognition problem in several domains, such as animal breeds or species [3, 4, 27], plant species [17, 23] and architectural styles [21]. Fine-grained recognition concerns the task of distinguishing subordinate categories of the same superordinate category. It is a challenging task, as fine-grained subordinate categories share a high degree of visual similarity, with small intra-class variances caused by factors such as poses, viewpoints or lighting conditions [18, 30]. Moreover, fine-grained recognition algorithms that perform well within specific fine-grained domains can provide valuable insight into a variety of challenging applications [2, 6, 14, 15, 34], such as the recommendation of relevant products in e-commerce and surveillance systems (Fig. 1).

Fig. 1

The top row images show the minute intra-category variances among different subordinate categories of the bird. The bottom row images show that Faster RCNN with softmax loss frequently misclassifies horses and sheep as cows, since it focuses on capturing inter-category variances rather than intra-category variances.

Most of the current state-of-the-art fine-grained recognition systems [5, 33] are part-based methods, as leveraging parts can capture the subtle appearance differences in specific object parts and achieve better performance. However, part annotations are more difficult to obtain than object annotations. In this paper, we formulate the fine-grained recognition problem as an object detection problem [7, 8] without considering parts. When we train a standard Faster RCNN, the existence of many background samples makes the feature representation less discriminative between different subordinate categories and more focused on separating an object category from the background. To address this concern, we introduce a cascaded structure to eliminate excessive background samples. Our cascaded framework consists of a standard Faster RCNN and a modified Fast RCNN with a one-vs-rest loss function. For simplicity, we denote the first standard Faster RCNN as SFNet and the unified recognition framework as RFNet. An overview of our proposed framework for fine-grained recognition is shown in Fig. 2. In our unified recognition framework, the standard Faster RCNN first generates primitive detections, which usually contain many background regions. We therefore first eliminate primitive detections with low scores, which are more likely to be part of the background, and then use the balanced data to further train a modified Fast RCNN. Finally, the predicted label of the detected box with the highest score is used as the predicted label of the whole image. Our unified framework is trained to detect only the whole object, so it does not need part annotations at the training stage and is free from any annotations at the testing stage.

Fig. 2

An overview of our RFNet. The red rectangle indicates SFNet (a standard Faster RCNN) and the blue rectangle indicates the one-vs-rest Fast RCNN.

Fine-grained recognition tasks require distinguishing objects at the subordinate level. A good fine-grained recognition framework should be able to capture variances among different subordinate categories. However, Fast RCNN and Faster RCNN use the (N + 1)-class softmax loss function (N object categories plus background), which creates a gap between detection and fine-grained recognition in terms of feature learning. The feature learning of the softmax detection network is still affected by the background class even though we have eliminated most of the background samples using the cascaded structure. Besides, it is very difficult for the softmax detection network to distinguish objects that have a similar appearance or belong to semantically related categories. For example, Faster RCNN can distinguish animals from the background, but it frequently misclassifies horses and sheep as cows (shown in Fig. 1), since horse, sheep and cow are all subordinate categories of animals and have significant intra-category variances. To bridge this gap, we replace the softmax loss function of Fast RCNN with a novel one-vs-rest loss function, which consists of N (the number of subordinate categories) two-class cross entropy losses, each of which is responsible for capturing the variances between one specific subordinate category and its similar categories. This design enables the one-vs-rest loss function to focus on the variances between each category and its similar categories, which makes it suitable for fine-grained recognition tasks.

The main contributions of this paper are as follows:

  1) First, we propose a novel cascaded detection framework for fine-grained recognition tasks. The unified recognition framework does not need expensive part annotations at the training stage and is free from any annotations at the testing stage.

  2) Second, we introduce a cascaded structure to eliminate excessive background samples, and then train a better detector using the balanced data. The cascaded structure frees our framework from the influence of excessive background samples, and the learned features are suitable for object categorisation.

  3) Third, to the best of our knowledge, this is the first work to introduce a one-vs-rest detection network into fine-grained recognition tasks. Owing to the ability of the one-vs-rest loss function to capture intra-category variances, the cascaded detection network is well adapted to fine-grained recognition tasks.

2 Related work

Fine-grained recognition

Current top-performing fine-grained recognition methods [5, 33] leverage object parts, as it is widely acknowledged that the subtle differences between objects can help deliver better performance. [19, 31] focus on localizing and describing discriminative object parts in the fine-grained domain and explicitly require both box and part annotations during the training and testing phases. Aiming at training fine-grained classifiers without part annotations, [16] introduces co-segmentation to localize the whole object and then performs alignment across all the images. [13] also leverages better segmentation [1, 9] to localize object parts and proposes an efficient architecture for inference, but it requires both bounding box and part annotations in training, and even needs specific annotations during testing. Towards the goal of performing fine-grained recognition without any annotations, some unsupervised methods have emerged. [28] presented a visual attention model to support fine-grained classification without any annotations. [24] reported a method to localize parts with a constellation model, which incorporates CNN into the deformable part model. Although unsupervised methods [24, 28] are free from box and part annotations, their performance is still not comparable to that of part-based methods. The comparison of part-based methods, bounding box-based methods and unsupervised methods can be seen in Table 1. To strike a good balance between accuracy and annotation demands, we propose a novel cascaded detection framework for fine-grained recognition.

Table 1 The comparison of part-based methods, bounding box-based methods and unsupervised methods

Object detection

RCNN [12] is one of the most notable region-based frameworks for object detection. It demonstrated state-of-the-art performance on standard detection benchmarks when it was introduced and has inspired most of the state-of-the-art detection methods. RCNN first exploits the standard selective search algorithm [26] to generate hundreds or thousands of region proposals per image, and then trains a CNN to classify these region proposals. To further boost the detection performance, the standard Fast RCNN [11] and Faster RCNN [22] introduced a multi-task loss function to simultaneously classify region proposals and regress the bounding box coordinates. However, most of the current detection networks use the softmax loss function and produce a large number of misclassification errors. Recently, [29] introduced a one-vs-rest loss function in order to reduce misclassification errors in generic object detection. We also use the one-vs-rest loss function, here for fine-grained recognition. Different from [29], we propose a novel cascaded detection framework for fine-grained recognition tasks and improve system performance.

3 The proposed method

Our proposed framework consists of a standard Faster RCNN [22], followed by a modified Fast RCNN with the one-vs-rest loss function. The standard Faster RCNN first generates primitive detections which usually contain a large number of background parts. We first eliminate excessive backgrounds in the primitive detections, and then use the balanced data to further train a one-vs-rest Fast RCNN. Finally, the predicted label of the highest scored detection box is used as the predicted label of the whole image. The cascaded structure enables the one-vs-rest Fast RCNN to be free from the influence of excessive background components and the learned features are suitable for object categorisation. Besides, the softmax loss function of the Fast RCNN is replaced by a novel one-vs-rest loss function which can capture the variances between different subordinate categories.
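The inference flow described above can be sketched in a few lines of Python. This is a minimal illustration rather than the actual implementation: the `Detection` record, the function names and the 0.5 score threshold are all assumptions made for the example, and the re-scoring of the surviving boxes by the one-vs-rest Fast RCNN is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    """Hypothetical record for one primitive detection."""
    label: int    # predicted subordinate category index
    score: float  # confidence for that category, in [0, 1]


def drop_likely_background(dets: List[Detection],
                           min_score: float = 0.5) -> List[Detection]:
    """Discard low-scored detections, which are more likely to be background.

    min_score is an assumed threshold; the source does not state a value.
    """
    return [d for d in dets if d.score >= min_score]


def image_label(dets: List[Detection]) -> Optional[int]:
    """Return the label of the highest-scored box, or None if none survive."""
    if not dets:
        return None
    return max(dets, key=lambda d: d.score).label


# Toy primitive detections for a single image.
primitive = [Detection(3, 0.12), Detection(7, 0.81),
             Detection(3, 0.64), Detection(9, 0.05)]
kept = drop_likely_background(primitive)
print(len(kept), image_label(kept))  # prints: 2 7
```

In the full framework the kept boxes would be classified again by the second network before the highest-scored box is chosen, so this sketch only shows the filtering and the image-level decision rule.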

3.1 Cascaded detection network

In order to perform fine-grained recognition without part annotations, we propose a cascaded detection framework that detects the whole object in the image, so that it needs only box annotations at the training stage and is free from any annotations at the testing stage. Our cascaded framework consists of a standard Faster RCNN, followed by a one-vs-rest Fast RCNN. When training the standard Faster RCNN, the existence of many background samples causes the feature representation to capture less intra-category variance (i.e., variance between different subcategories) and more inter-category variance (i.e., between the object category and the background), leading to many false positives between ambiguous object categories (e.g., horses and sheep are mistakenly classified as cows). To train a better detector, it is necessary to eliminate excessive background samples to achieve a good balance. So after eliminating the background in the primitive detections of the standard Faster RCNN, we add another one-vs-rest Fast RCNN and train it with the balanced data. The cascaded structure shields our framework from the influence of excessive background clutter. Ref. [33] uses a Fast RCNN network to refine small semantic part candidates generated by a novel top-down proposal method and a classification sub-network to extract features from the detected parts, and combines them for recognition. In the same way, our cascaded detection network can also incorporate object parts in addition to the whole object. Better system performance is expected when object parts are considered.

Previous work [31] reported a bottom-up selective search method to generate part and object proposals, and used RCNN to perform object detection. In their experiments, the authors found that the region proposals are the bottleneck for precise fine-grained recognition. Salient differences among fine-grained bird species are more likely to reside in small parts. Once the crucial discriminative small parts are lost due to unreliable proposal methods, it is hard for the sub-classification network to further distinguish them. For example, as shown in Fig. 3, it is not straightforward to distinguish between a Ring-billed gull and a California gull without identifying the pattern of their beaks. In our method, the Faster RCNN network can generate high-quality proposals, since it exploits an effective region proposal network (RPN). The RPN uses a multi-task loss function for classification and bounding-box regression of the translation-invariant anchors. The loss function is defined as:

$$ L\left(\left\{{p}_i\right\},\left\{{t}_i\right\}\right)=\frac{1}{N_{cls}}{\sum}_i{L}_{cls}\left({p}_i,{p}_i^{\ast}\right)+\lambda \frac{1}{N_{reg}}{\sum}_i{p}_i^{\ast }{L}_{reg}\left({t}_i,{t}_i^{\ast}\right) $$
(1)

where i is the index of an anchor in a mini-batch and \( {p}_i \) is the predicted probability of anchor i being an object. The ground truth label \( {p}_i^{\ast }=1 \) if the anchor is positive, and \( {p}_i^{\ast }=0 \) if the anchor is negative. \( {t}_i \) is a vector representing the four parameterized coordinates of the predicted bounding box, and \( {t}_i^{\ast } \) is that of the ground truth box associated with a positive anchor. The classification loss \( {L}_{cls} \) is the log loss over the two classes (object vs. background). The regression loss function \( {L}_{reg} \) is of a robust L1 form, defined as:

$$ {L}_{reg}\left({t}_i,{t}_i^{\ast}\right)={\sum}_{j\in \left\{x,y,w,h\right\}}{smooth}_{L_1}\left({t}_{i,j}-{t}_{i,j}^{\ast}\right) $$
(2)
$$ {smooth}_{L_1}(x)=\left\{\begin{array}{ll}0.5{x}^2& \mathrm{if}\ \mid x\mid <1\\ {}\mid x\mid -0.5& \mathrm{otherwise}\end{array}\right. $$
(3)

where j indexes the four parameterized box coordinates.

Fig. 3

The salient difference between a California gull and a Ring-billed gull lies in the pattern of their beaks.

The two terms in Eq. 1 are normalized by \( {N}_{cls} \) and \( {N}_{reg} \), respectively, and balanced by the weight \( \lambda \).
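As a rough numerical illustration of Eq. 1, the following NumPy sketch evaluates the RPN loss on a toy mini-batch, using the standard smooth-L1 form (0.5x² for |x| < 1, |x| − 0.5 otherwise). The function names, the default normalizers (both set to the number of anchors in the batch) and the default λ = 1 are assumptions chosen for the example, not values taken from our experiments.

```python
import numpy as np


def smooth_l1(x):
    """Standard smooth-L1: 0.5*x^2 if |x| < 1, |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=float)
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)


def rpn_loss(p, p_star, t, t_star, lam=1.0, n_cls=None, n_reg=None):
    """Multi-task RPN loss of Eq. 1 for one mini-batch of anchors.

    p      : (A,)   predicted objectness probabilities p_i
    p_star : (A,)   ground truth labels p_i^* (1 positive, 0 negative)
    t      : (A, 4) predicted box parameters t_i
    t_star : (A, 4) ground truth box parameters t_i^*
    lam, n_cls, n_reg are assumed defaults for illustration only.
    """
    p = np.asarray(p, dtype=float)
    p_star = np.asarray(p_star, dtype=float)
    t = np.asarray(t, dtype=float)
    t_star = np.asarray(t_star, dtype=float)
    n_cls = len(p) if n_cls is None else n_cls
    n_reg = len(p) if n_reg is None else n_reg

    eps = 1e-12  # guards against log(0)
    l_cls = -(p_star * np.log(p + eps) + (1.0 - p_star) * np.log(1.0 - p + eps))
    # Sum smooth-L1 over the four box coordinates of each anchor.
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # p_star masks the regression term so only positive anchors contribute.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg


# One positive and one negative anchor.
p = [0.9, 0.2]
p_star = [1, 0]
t = [[0.5, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]]
t_star = np.zeros((2, 4))
print(rpn_loss(p, p_star, t, t_star))
```

The large box error of the negative anchor has no effect on the result, because the factor \( {p}_i^{\ast } \) in Eq. 1 switches the regression term off for negative anchors.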

In our experiments, SFNet achieves 82.0% accuracy with an average of only 10 high-quality proposals per image, far fewer than the thousands of bounding boxes produced by the selective search method [26].

3.2 Objective function

3.2.1 Softmax loss

Both Fast RCNN and Faster RCNN drop the one-vs-rest SVM used in RCNN in order to obtain an end-to-end system. However, the softmax loss encourages the feature representation to learn inter-category variances instead of intra-category variances. This can be explained by the definition of the softmax loss and its derivative in Eqs. 4 and 5.

$$ L=-\sum \limits_{n=1}^N\sum \limits_{c=1}^C{t}_{n,c}\log {p}_{n,c},\kern1em \mathrm{where}\;{p}_{n,c}=\frac{e^{net_{n,c}}}{\sum \limits_{c^{\prime }=1}^C{e}^{net_{n,{c}^{\prime }}}} $$
(4)

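Denote \( {t}_{n,c} \) and \( {p}_{n,c} \) as the ground truth label and the predicted probability for the nth sample and the cth class. \( {t}_{n,c}=1 \) if the nth sample belongs to the cth class, and \( {t}_{n,c}=0 \) otherwise. \( {net}_{n,c} \) is the classification prediction from the neural network. In Eq. 4 only, N denotes the number of training samples and C the number of classes (including the background). Denoting \( \theta \) as the parameters of the network, the derivative is: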

$$ \frac{\partial L}{\partial \theta }=\sum \limits_{n,c}\left({p}_{n,c}-{t}_{n,c}\right)\frac{\partial {net}_{n,c}}{\partial \theta } $$
(5)

Eq. 5 shows that the number of samples belonging to class c influences the gradient of the parameters. If the prediction errors \( {p}_{n,c}-{t}_{n,c} \) have similar magnitudes for all samples, then a class with more samples contributes a much larger gradient magnitude than the other classes. As a result, the network parameters are dominated by the class with the most samples. Therefore, the dominant background samples (3/4 of all the training samples) drive the feature representation towards capturing inter-category variances.
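The argument can be checked with a small numerical example. The sketch below assumes a toy training set of 400 samples in which the background makes up 3/4 of the data, and an untrained network that outputs uniform probabilities, so that every sample has the same error magnitude. The class count and sample counts are illustrative assumptions; only the 3/4 background ratio comes from the text.

```python
import numpy as np

num_classes = 5                              # background (index 0) + 4 object classes
counts = np.array([300, 25, 25, 25, 25])     # background is 3/4 of 400 samples
labels = np.repeat(np.arange(num_classes), counts)

t = np.eye(num_classes)[labels]              # one-hot targets t_{n,c}
p = np.full_like(t, 1.0 / num_classes)       # uniform predictions p_{n,c}

# Per-sample error magnitude sum_c |p_{n,c} - t_{n,c}|, identical for all samples here.
err = np.abs(p - t).sum(axis=1)

# Total error mass that each ground truth class feeds into the sum of Eq. 5.
per_class = np.array([err[labels == c].sum() for c in range(num_classes)])
share = per_class / per_class.sum()
print(share[0])  # share contributed by background samples: 0.75
```

With equal per-sample errors, each class contributes to the gradient in proportion to its sample count, so the background accounts for three quarters of the total in this toy setting.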

3.2.2 One-vs-rest loss

For the Fast RCNN in the proposed framework, we replace the softmax loss function with a novel one-vs-rest loss, which is designed to capture variances among different subordinate categories. The one-vs-rest loss consists of N (the number of subordinate categories) two-class cross entropy losses, and each two-class cross entropy loss function focuses on capturing the variances between one specific subordinate category and its similar categories. The objective function is the sum of the N two-class cross entropy losses. At training time, primitive detections with low scores, which are more likely to be background, are discarded. This step is especially important since it makes the one-vs-rest Fast RCNN network learn more discriminative features of different subordinate categories. Then each two-class cross entropy classifier is trained using the detections that have high scores on that specific category, as those high-scored detections may be true positives or false positives (i.e. detections misclassified by SFNet whose ground truth labels are similar to that specific category). In this way, the negative training samples of each two-class cross entropy classifier belong to categories similar to the specific category, allowing each two-class cross entropy classifier to capture the variances between the specific category and its similar categories. At test time, after the non-maximum suppression (NMS) operation on the primitive detections, fewer but higher-quality detections remain. Each of the remaining detections is then classified and regressed again by the one-vs-rest Fast RCNN, and the output scores (N categories) are averaged with the primitive scores in a category-by-category way (different from the multiply operation used in [29]) to obtain the final scores. Finally, the predicted label of the highest-scored box is used as the predicted label of the whole image.
The whole training process and the testing stage of RFNet are illustrated in Processes 1 and 2, respectively.
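To make the objective and the test-time score fusion more concrete, here is a minimal NumPy sketch. It assumes that each two-class classifier is a sigmoid unit on one output of the network, that "high score" means exceeding a fixed threshold (0.5 here), and that averaging means the arithmetic mean of the two scores per category. All function names and numbers are hypothetical and serve only as an illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))


def select_training_detections(primitive_scores, min_score=0.5):
    """(M, N) boolean mask: detection m trains classifier c only if SFNet
    scored it highly on category c. min_score is an assumed threshold."""
    return np.asarray(primitive_scores, dtype=float) >= min_score


def one_vs_rest_loss(logits, labels, train_mask=None):
    """Sum of N two-class cross entropy losses over the selected detections.

    logits : (M, N) raw outputs, one column per subordinate category
    labels : (M,)   ground truth category index of each detection
    """
    logits = np.asarray(logits, dtype=float)
    m, n = logits.shape
    targets = np.zeros((m, n))
    targets[np.arange(m), np.asarray(labels, dtype=int)] = 1.0
    if train_mask is None:
        train_mask = np.ones((m, n), dtype=bool)
    q = sigmoid(logits)
    eps = 1e-12  # guards against log(0)
    bce = -(targets * np.log(q + eps) + (1.0 - targets) * np.log(1.0 - q + eps))
    return float((bce * train_mask).sum())


def fuse_scores(primitive_scores, ovr_scores):
    """Category-by-category average of SFNet and one-vs-rest scores."""
    return 0.5 * (np.asarray(primitive_scores, dtype=float)
                  + np.asarray(ovr_scores, dtype=float))


# Two detections, three categories (toy numbers).
primitive = np.array([[0.70, 0.60, 0.05], [0.10, 0.20, 0.90]])
ovr = np.array([[0.30, 0.90, 0.10], [0.05, 0.10, 0.95]])

mask = select_training_detections(primitive)
loss = one_vs_rest_loss(np.zeros((2, 3)), labels=[0, 2], train_mask=mask)

fused = fuse_scores(primitive, ovr)
box, label = np.unravel_index(fused.argmax(), fused.shape)
print(loss, int(box), int(label))
```

In this toy case the first detection scores highly on categories 0 and 1, so it serves as a positive sample for classifier 0 and as a hard negative for classifier 1, which is the kind of similar-category negative that the one-vs-rest loss is meant to exploit.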


4 Experimental results

4.1 Dataset

We evaluate the performance of our proposed framework for fine-grained recognition on the CUB-200-2011 dataset [27], which is generally considered one of the most extensive and competitive datasets in the literature. CUB-200-2011 contains 11,788 images of 200 bird species. Each image has a single bounding box annotation, a rough segmentation and 15 annotated key points; the key points are not used in our method.

4.2 Implementation details

The baseline models of our two networks are based on the VGG16 model [25], as done in current state-of-the-art methods [5, 33]. All the experiments are performed on a single NVIDIA K40 GPU. Parameters of the SFNet are initialized from the model pre-trained on the ImageNet dataset. Parameters of the one-vs-rest Fast RCNN are initialized from the SFNet model, and the new one-vs-rest loss layer is initialized from a Gaussian distribution.

4.3 Results and comparisons

We first conduct some ablation experiments to analyse the cascaded structure and the one-vs-rest loss with regard to recognition performance, and then move on to the comparison against the previous work.

4.3.1 Ablation experiments

How important is the cascade structure?

To evaluate the effectiveness of the cascaded structure, we compare SFNet with softmax RFNet, which consists of a standard Faster RCNN (SFNet) and a standard Fast RCNN with the softmax loss function. For softmax RFNet, the baseline model of the standard Fast RCNN is VGG16 and its parameters are initialized from the SFNet model, in the same way as for RFNet. From Table 2, we observe that softmax RFNet improves accuracy by 0.9% over SFNet, which validates the effectiveness of the cascaded structure in eliminating the influence of excessive background samples during feature learning.

Table 2 Recognition performance comparisons between SFNet, softmax RFNet and RFNet on CUB-200-2011. Softmax RFNet consists of a standard Faster RCNN (SFNet) and a standard Fast RCNN with softmax loss

Softmax loss vs. one-vs-rest loss

The comparison between softmax RFNet and RFNet shows that the one-vs-rest loss improves accuracy by 1.1% over the softmax loss. The results shown in Fig. 4 verify the ability of the one-vs-rest loss function to further capture intra-category variances among the subordinate categories, and to reduce the false positives that arise mainly between ambiguous categories.

Fig. 4

Examples on the CUB-200-2011 dataset of SFNet detections (blue), RFNet detections (red) and ground truth bounding box (green). Images misclassified by SFNet are rectified by one-vs-rest Fast RCNN network

4.3.2 Comparison with other state-of-the-art methods

This section shows the comparison results of our method against the previous work. For fair comparison, we report the results with varying degrees of supervision such as part annotation or bounding-boxes at the training and the testing time.

The comparison results in Table 3 show that our RFNet performs much better than the previous unsupervised methods [10, 24, 28], and outperforms several part-based methods [13, 19, 31, 32]. RFNet also achieves performance comparable to the state-of-the-art part-free fine-grained recognition method [20]. [20] presents bilinear models that exploit two CNNs to extract features, while we use a single cascaded structure, which is easier to train. However, our method is slightly worse than the current state-of-the-art methods [5, 33], due to the significant advantage of exploiting part information for bird recognition. With box-level annotation at both the training and testing stages, [10] achieves about 13.4% higher accuracy than without any annotation. [16] introduced box-level annotation at the testing time, and also achieved better performance. These results confirm that leveraging additional supervision results in higher performance. It is worth emphasizing that RFNet improves the detection and the loss layers for better feature learning. We anticipate that leveraging part annotations in our cascaded detection framework will result in higher performance due to the additional supervision.

Table 3 Recognition performance comparisons of the current state of the art methods on CUB-200-2011, sorted by the amount of annotation used

5 Conclusion and discussion

In this paper, we have proposed a novel cascaded detection framework for fine-grained recognition tasks without considering parts. The proposed framework is well adapted to fine-grained recognition through the introduction of a one-vs-rest loss function, which can capture more intra-category variances. Experiments showed that our proposed recognition framework achieves performance comparable to other state-of-the-art part-free fine-grained recognition methods on the CUB-200-2011 Birds dataset.

The cascaded framework boosts the classification accuracy, but the two networks are trained separately and the framework cannot yet meet the requirements of many real-time applications. Improving the speed of the proposed framework and applying it to tasks such as surveillance systems and the recommendation of relevant products in e-commerce are among our future research directions.