1 Introduction

Concept-based video annotation, also known as video concept detection, refers to the problem of annotating a video fragment (e.g., a keyframe) with one or more semantic concepts (e.g., “table”, “dog”) [14]. The state-of-the-art approach is to train a deep convolutional neural network (DCNN) on a set of concepts [3, 5, 12]; a test keyframe can then be forward propagated through the DCNN to obtain a set of concept labels. DCNN training requires learning millions of parameters, which means that a small training set can easily lead to over-fitting. It has been shown that the bottom layers of a DCNN learn rather generic features, useful across different domains, while the top layers are task-specific [18]. Transferring a pre-trained network to a new dataset by fine-tuning its parameters is a common strategy that takes advantage of the generic bottom layers while adjusting the top layers to the target dataset and the new target concepts [2, 5, 18].

In this study we compare three fine-tuning methods in order to investigate the best way to transfer the parameters of a DCNN trained on a source dataset to a different target dataset that requires a different set of concepts. Although DCNN fine-tuning has been presented in previous studies [2, 5, 18], this is, to our knowledge, the first work that systematically compares fine-tuning strategies at this scale, considering three different pre-trained DCNNs, two different subsets of concepts for which the pre-trained networks are fine-tuned, two different target datasets, and three fine-tuning strategies, each evaluated under many parameter settings. Experiments performed on the TRECVID 2013 SIN [10] and the PASCAL VOC-2012 classification [4] datasets show that increasing the depth of a pre-trained network by one fully-connected layer and fine-tuning the remaining layers on the target dataset can improve the network's concept detection accuracy compared to other fine-tuning approaches.

2 Related Work

Fine-tuning is a process in which the weights of a pre-trained DCNN are used as the starting point for training on a new target dataset and are modified in order to adapt the network to that dataset. The fine-tuned DCNN can then be used either as a feature generator, where the output of one or more hidden layers serves as a global keyframe/image representation [13], or as a standalone classifier that directly predicts the final class labels.

Different DCNN-based transfer learning approaches have been successfully applied to many datasets. The most straightforward approach replaces the classification layer of a pre-trained DCNN with a new output layer that corresponds to the categories of the target dataset [2, 5, 18]. Generalizing this approach, the weights of the first K network layers can remain frozen, i.e., they are copied from the pre-trained DCNN and kept unchanged, while the remaining layers (be it just the last one or several) are learned from scratch [1, 9]. Alternatively, the copies of the first K layers can be allowed to adapt to the target dataset with a low learning rate. For example, [18] investigates which layers of Alexnet [7] are generic, i.e., can be directly transferred to a target domain, and which layers are dataset-specific. Furthermore, experiments in [18] show that fine-tuning the transferred layers of a network works better than freezing them. However, neither of these studies investigates how low the learning rate of the transferred layers should be, relative to the new layers, during fine-tuning. Other studies extend the pre-trained network by one or more fully-connected layers, which appears to improve the above transfer learning strategies [1, 8, 9, 15]; however, the optimal number and size of such extension layers have not been investigated before. Overall, although fine-tuning has been applied in many studies, which fine-tuning parameters (e.g., number and size of extension layers, learning rate) work best has not been examined in depth, and a thorough comparison of the available fine-tuning alternatives is yet to appear in the literature.

3 Comparison of Fine-Tuning and Extension Strategies for DCNNs

Let \(D_s\) denote a pre-trained DCNN, trained on \(C_s\) categories of a source dataset, and \(D_t\) denote the target DCNN, fine-tuned on \(C_t\) categories of a different target dataset. In this section we present the three fine-tuning strategies (Fig. 1) that we compare for the problem of visual annotation, i.e., for effectively adapting a DCNN \(D_s\) trained on a large visual dataset to a new target video/image dataset. The three strategies are as follows:

Fig. 1. Fine-tuning strategies outline.

  • FT1-def: Default fine-tuning strategy: The typical approach, which replaces the last fully-connected layer of \(D_s\) with a new \(C_t\)-dimensional classification fully-connected layer, so that the network produces the desired number of outputs \(C_t\).

  • FT2-re: Re-initialization strategy: Similar to FT1-def, the last fully-connected layer is replaced by a new \(C_t\)-dimensional classification layer. In addition, the N layers preceding the classification layer are re-initialized, i.e., their weights are reset and learned from scratch.

  • FT3-ex: Extension strategy: As in the previous two strategies, the last fully-connected layer is replaced by a new \(C_t\)-dimensional classification fully-connected layer. Subsequently, the network is extended with E fully-connected layers of size L, inserted directly before the new classification layer. These additional layers are initialized and trained from scratch during fine-tuning, at the same rate as the new classification layer. An example of a modified network after the insertion of one extension layer, for the two popular DCNN architectures also used in our experimental study (Sect. 4), is presented in Fig. 2. Since the GoogLeNet architecture has two additional auxiliary classifiers, an extension layer was also inserted in each of them. A code sketch of the three strategies follows this list.
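The paper implements all strategies in Caffe [6]; purely as an illustration, the following PyTorch sketch shows the model surgery behind each strategy on a generic backbone. The `classifier` attribute, the feature dimension and the layer handles used here are assumptions for the sketch, not the actual CaffeNet/GoogLeNet definitions.

```python
import torch.nn as nn

def ft1_def(net, feat_dim, c_t):
    # FT1-def: replace only the last fully-connected layer with a new
    # C_t-dimensional classification layer.
    net.classifier = nn.Linear(feat_dim, c_t)
    return net

def ft2_re(net, feat_dim, c_t, layers_to_reset):
    # FT2-re: as FT1-def, but additionally re-initialize the N layers
    # preceding the classification layer so they are learned from scratch.
    net.classifier = nn.Linear(feat_dim, c_t)
    for layer in layers_to_reset:              # e.g. the last one or two fc layers
        layer.reset_parameters()
    return net

def ft3_ex(net, feat_dim, c_t, ext_sizes=(1024,), p_drop=0.5):
    # FT3-ex: insert E new fc layers of size L (each followed by ReLU and
    # Dropout) between the pre-trained layers and the new classification layer.
    ext, in_dim = [], feat_dim
    for size in ext_sizes:                     # ext_sizes has E entries, each of size L
        ext += [nn.Linear(in_dim, size), nn.ReLU(inplace=True), nn.Dropout(p_drop)]
        in_dim = size
    net.classifier = nn.Sequential(*ext, nn.Linear(in_dim, c_t))
    return net
```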

Fig. 2. A simplified illustration of the CaffeNet [7] (left) and GoogLeNet [16] (right) architectures after the insertion of one extension layer. Each inception layer of GoogLeNet consists of six convolution layers and one pooling layer. The figure also indicates the direct output of each network and the outputs of the last three layers that were used as features for the FT3-ex strategy; the corresponding layers were used for the FT1-def and FT2-re strategies.

Each fine-tuned network \(D_t\) is used in two different ways to annotate new test keyframes/images with semantic concepts. (a) Direct classification: each test keyframe/image is forward propagated through \(D_t\) and the network's output is used as the final class distribution assigned to the keyframe/image. (b) Feature generation: the training set is forward propagated through the network and the features extracted from one or more layers of \(D_t\) are used as feature vectors to train one supervised classifier (e.g., Logistic Regression) per concept. Each test keyframe/image is then described by the same DCNN-based features, which serve as input to the trained Logistic Regression classifiers.
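As an illustration of variant (b), assuming the DCNN-based features of the training and test keyframes have already been extracted into arrays (the array names below are hypothetical), one Logistic Regression model per concept could be trained and applied roughly as follows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_concept_lr(train_feats, train_labels, C=1.0):
    """train_feats: (n_samples, feat_dim) DCNN features from one layer;
    train_labels: (n_samples, n_concepts) binary annotation matrix."""
    models = []
    for c in range(train_labels.shape[1]):
        clf = LogisticRegression(C=C, max_iter=1000)
        clf.fit(train_feats, train_labels[:, c])   # one binary classifier per concept
        models.append(clf)
    return models

def score_per_concept(models, test_feats):
    # Positive-class probability for every test keyframe and every concept.
    return np.stack([m.predict_proba(test_feats)[:, 1] for m in models], axis=1)
```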

4 Experimental Study

4.1 Datasets and Experimental Setup

The TRECVID 2013 SIN task [10] dataset and the PASCAL VOC-2012 [4] dataset were used to train and evaluate the compared fine-tuned networks. The TRECVID SIN dataset consists of low-resolution videos segmented into video shots; each shot is represented by one keyframe. The dataset is divided into a training and a test set (approx. 600 and 200 h of video, respectively). The training set is partially annotated with 346 semantic concepts, while the test set is evaluated on 38 concepts, a subset of the 346. The PASCAL VOC-2012 [4] dataset consists of images, each annotated with one of 20 available object class labels, and is divided into training, validation and test sets (5717, 5823 and 10991 images, respectively). We used the training set to train the compared methods and evaluated them on the validation set. We did not use the original test set because its ground-truth annotations are not publicly available (evaluation on the test set is possible only through the evaluation server provided by the PASCAL VOC competition, submissions to which are restricted to two per week).

For each dataset we fine-tuned the following three pre-trained DCNNs: (i) CaffeNet-1k, the Caffe [6] reference implementation of Alexnet [7], trained on 1000 ImageNet categories, (ii) GoogLeNet-1k [16], trained on the same 1000 ImageNet [11] categories, and (iii) GoogLeNet-5k, trained on 5055 ImageNet [11] categories. Each of these networks was fine-tuned on 345 TRECVID SIN concepts (i.e., all the available TRECVID SIN concepts, except for one that was discarded because only 5 positive samples are provided for it), which resulted in a training set of 244619 positive examples. CaffeNet-1k was also fine-tuned on a subset of 60 TRECVID SIN concepts. We refer to these fine-tuned networks as CaffeNet-1k-345-SIN, GoogLeNet-1k-345-SIN, GoogLeNet-5k-345-SIN and CaffeNet-1k-60-SIN, respectively. In addition, each of the three pre-trained networks was fine-tuned on the positive examples of the PASCAL VOC-2012 training set; these networks are labeled CaffeNet-1k-VOC, GoogLeNet-1k-VOC and GoogLeNet-5k-VOC, respectively.

When fine-tuning the pre-trained DCNNs, we compared the three strategies presented in Sect. 3. Specifically, in all cases we discarded the classification fully-connected (fc) layer of the pre-trained network and replaced it with a 60-dimensional or 345-dimensional fc classification layer for the 60 or 345 concepts of the TRECVID SIN dataset, respectively, or with a 20-dimensional classification layer for the 20 object categories of the PASCAL VOC-2012 dataset. We examined two values for parameter N of the FT2-re strategy; we refer to these configurations as FT2-re1 (for \(N=1\)) and FT2-re2 (for \(N=2\)). The FT3-ex strategy was examined for two network extension settings, \(E\in \{1, 2\}\), i.e., extending the network by one or two fc layers, each followed by ReLU (Rectified Linear Units) and Dropout layers. The size of each extension layer was examined for 7 different dimensions: \(L\in \{64, 128, 256, 512, 1024, 2048, 4096\}\). We refer to these configurations as FT3-exE-L. The new layers' learning rate and weight decay were set to 0.01 and \(5\mathrm {e}{-4}\), whereas the mini-batch size was restricted by our hardware resources and set to 256 and 128 for the CaffeNet and GoogLeNet configurations, respectively.
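For bookkeeping, the 17 configurations compared per pre-trained network can be enumerated as follows (a sketch of the naming scheme only, not training code):

```python
from itertools import product

ext_sizes = [64, 128, 256, 512, 1024, 2048, 4096]            # L
configs = (["FT1-def", "FT2-re1", "FT2-re2"]
           + [f"FT3-ex{e}-{l}" for e, l in product([1, 2], ext_sizes)])
assert len(configs) == 17                                    # 1 + 2 + 2 * 7 configurations
```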

For evaluation, we tested each fine-tuned network on the TRECVID SIN 2013 test set, which consists of 112677 representative keyframes and 38 semantic concepts, on the indexing problem: given a concept, return the 2000 test keyframes that are most likely to represent it. In addition, we examined classification performance on the PASCAL VOC-2012 validation set, consisting of 5823 images and 20 object categories. We fine-tuned a total of 17 configurations for each of the 7 networks on a Tesla K40 GPU, over a period of two months. All networks were trained and implemented in Caffe [6].
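The indexing protocol on the TRECVID SIN test set amounts to a per-concept ranking of the detection scores; a minimal sketch, assuming a score matrix has already been computed for the test keyframes:

```python
import numpy as np

def top_k_keyframes(scores, concept_idx, k=2000):
    """scores: (n_keyframes, n_concepts) array of detection scores.
    Returns the indices of the k test keyframes ranked highest for one concept."""
    return np.argsort(-scores[:, concept_idx])[:k]   # descending score order
```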

4.2 Preliminary Experiments for Parameter Selection

A set of preliminary experiments was performed on CaffeNet-1k-60-SIN with the FT1-def strategy, in order to investigate how the learning rate of the pre-trained layers and the number of training epochs affect the performance of a fine-tuned network. Specifically, we partitioned the training set of the TRECVID SIN dataset into training and validation sets of 71457 and 3007 keyframes, respectively. Momentum and weight decay were set to 0.9 and \(5\mathrm {e}{-4}\), respectively. We examined learning rate values for the pre-trained layers equal to \(LR_{pre}={k} \times {LR_{new}}\), where \(k \in \{0, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1\}\) and \(LR_{new}=0.01\) is the learning rate of the new classification layer that is trained from scratch. A value of \(k=0\) keeps the pre-trained layers' weights “frozen”, while \(k=1\) makes them learn as fast as the new layers. To investigate the effect of the training epochs, each fine-tuning run was examined for a maximum number of epochs in \(\{0.25, 0.5, 1, 2, 4, 8, 16, 32\}\).
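In Caffe this multiplier is typically expressed through per-layer learning-rate multipliers (lr_mult); as an equivalent illustration only, in PyTorch the pre-trained and new parameters would go into separate optimizer parameter groups (the parameter partition below is a hypothetical example, not the paper's solver definition):

```python
import torch

def make_optimizer(pretrained_params, new_params, k, lr_new=0.01,
                   momentum=0.9, weight_decay=5e-4):
    # LR_pre = k * LR_new: k = 0 effectively freezes the pre-trained layers,
    # k = 1 lets them learn as fast as the new classification layer.
    return torch.optim.SGD(
        [{"params": pretrained_params, "lr": k * lr_new},
         {"params": new_params}],                 # uses the default lr below
        lr=lr_new, momentum=momentum, weight_decay=weight_decay)
```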

Table 1. Classification accuracy for CaffeNet-1k-60-SIN and the FT1-def strategy for different values of the learning rate multiplier of the pre-trained layers (k), on the vertical axis, and different numbers of training epochs (e), on the horizontal axis. For each value of parameter e the best accuracy reached is underlined. The globally best accuracy is bold and underlined.

Table 1 presents the results w.r.t. the accuracy on the validation set, as this metric is implemented in the Caffe framework. We can observe that smaller learning rate values for the pre-trained layers and higher values for the training epochs improve accuracy. Consequently, we selected the best values of 0.1 and 32 for the learning rate multiplier and the maximum number of epochs, respectively, and kept them fixed for the rest of the experiments.

4.3 Main Findings of the Study

Table 2 presents the results, in terms of Mean Extended Inferred Average Precision (MXinfAP), of CaffeNet-1k-60-SIN (left) and CaffeNet-1k-345-SIN (right) for the three fine-tuning strategies of Sect. 3. In addition, Table 3 presents the MXinfAP of GoogLeNet-1k-345-SIN (top) and GoogLeNet-5k-345-SIN (bottom). MXinfAP [17] is an approximation of MAP, suitable for the partially annotated TRECVID SIN dataset. Similarly, Table 4 presents the results in terms of MAP for CaffeNet-1k-VOC, and Table 5 presents the MAP of GoogLeNet-1k-VOC (top) and GoogLeNet-5k-VOC (bottom).

Table 2. MXinfAP (%) for CaffeNet-1k-60-SIN (sub-table (A), left) and CaffeNet-1k-345-SIN (sub-table (B), right). For each sub-table, the best result per column is underlined. The globally best result per sub-table is bold and underlined.
Table 3. MXinfAP (%) for GoogLeNet-1k-345-SIN (sub-table (A), top) and GoogLeNet-5k-345-SIN (sub-table (B), bottom). For each sub-table, the best result per column is underlined. The globally best result per sub-table is bold and underlined.
Table 4. MAP (%) for CaffeNet-1k-VOC. For the FT2-re strategy we trained the network with a learning rate 10 times lower than in all other cases; otherwise, the network did not converge. The best result per column is underlined. The globally best result is bold and underlined.
Table 5. MAP (%) for GoogLeNet-1k-VOC (sub-table (A), top) and GoogLeNet-5k-VOC (sub-table (B), bottom). For each sub-table, the best result per column is underlined. The globally best result per sub-table is bold and underlined.

For each pair of network and fine-tuning strategy we evaluate: (i) the direct output of the network (Tables 2, 3, 4 and 5: col. (a)); (ii) Logistic Regression (LR) classifiers trained on DCNN-based features. Specifically, the output of each of the last three layers of each fine-tuned network was used as a feature to train one LR model per concept (Tables 2, 3, 4 and 5: col. (b)–(d)). Furthermore, we present results for the late-fused output (arithmetic mean) of the LR classifiers built on the last three layers (Tables 2, 3, 4 and 5: col. (e)). For the GoogLeNet-based networks, evaluations are also reported for the two auxiliary classifiers (Tables 3 and 5: col. (f)–(i)). The details of the two DCNN architectures (CaffeNet, GoogLeNet) and the extracted features are also illustrated in Fig. 2. Based on the results reported in these tables, we reach the following conclusions:

  1. According to Table 2, fine-tuning a pre-trained network on more concepts (going from 60 to 345) leads to better concept detection accuracy for all the fine-tuning strategies.

  2. Across all the networks and for both datasets, the FT3-ex strategy almost always outperforms the other two fine-tuning strategies (FT1-def, FT2-re) for specific (L, E) values.

  3. With respect to the direct output, FT3-ex1-64 and FT3-ex1-128 constitute the top-two methods for the TRECVID SIN dataset irrespective of the employed DCNN. On the other hand, FT3-ex1-2048 and FT3-ex1-4096 are the top-two methods for the PASCAL VOC-2012 dataset and the GoogLeNet-based networks, while FT3-ex1-512 and FT3-ex1-1024 are the best performing strategies for the CaffeNet network on the same dataset. That is, the FT3-ex strategy with one extension layer is always the best solution, but the optimal dimension of the extension layer varies, depending on the target domain dataset and the network architecture.

  4. The highest concept detection accuracy for each network is always reached with the FT3-ex strategy, when LR classifiers are trained on features extracted from the last fully-connected layer for TRECVID SIN and from the second-last fully-connected layer for PASCAL VOC-2012. That is, features extracted from the top layers are more effective than features from layers positioned lower in the network, but the optimal layer varies depending on the target domain dataset.

  5. DCNN-based features significantly outperform the direct output alternative in the vast majority of cases, although in a few cases the direct network output works comparably well. The choice between the two approaches should be based on the application in which the DCNN will be used; e.g., the time and memory limitations of real-time applications would most probably make using DCNNs as feature extractors in conjunction with additional learning (LR or SVMs) prohibitive. Furthermore, we observe that the features extracted from the final classifier of GoogLeNet-based networks outperform those from the two auxiliary classifiers in most cases.

  6. When DCNN layers' responses are used as feature vectors, FT3-ex1-512 is among the top-five methods irrespective of the employed DCNN, the extracted feature and the used dataset; on PASCAL VOC-2012 this holds in all cases except for the features extracted from the third-last layer of the CaffeNet network (Table 4: col. (d)). On the other hand, FT3-ex2-64 is always among the five worst fine-tuning methods. The remaining FT3-ex configurations fluctuate in performance across the different DCNNs and DCNN-based features.

  7. Finally, it is better to combine features extracted from several layers: performing late fusion on the outputs of the LR classifiers trained on each of the last three fully-connected layers almost always outperforms using any single such classifier, irrespective of the employed network (Tables 2, 3, 4 and 5: col. (e)). The same conclusion was reached for the auxiliary classifiers of the GoogLeNet-based networks, but due to space limitations we only present the fused output for each of these auxiliary classifiers (Tables 3 and 5: col. (g), (i)). A minimal sketch of this fusion is given after the list.
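The late fusion in the last finding is simply the arithmetic mean of the per-concept probabilities produced by the layer-specific classifiers; a minimal sketch with hypothetical score arrays:

```python
import numpy as np

def late_fuse(score_matrices):
    """score_matrices: list of (n_keyframes, n_concepts) arrays, one per
    layer-specific set of Logistic Regression classifiers."""
    return np.mean(np.stack(score_matrices, axis=0), axis=0)
```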

5 Conclusions

In this paper we presented a large comparative study of three fine-tuning strategies applied to three different pre-trained DCNNs, two different subsets of semantic concepts and two target datasets. Experiments performed on the TRECVID 2013 SIN dataset [10] and the PASCAL VOC-2012 classification dataset [4] show that increasing the depth of a pre-trained network by one fully-connected layer and fine-tuning the remaining layers on the target dataset can improve the network's concept detection accuracy compared to other fine-tuning approaches. Using layers' responses as feature vectors for a learning model such as Logistic Regression can lead to additional gains compared to using the network's direct output, at the cost of additional computation time and memory.