1 Introduction

With the availability of cheap devices such as digital cameras and smartphones, video has become an essential part of the multimedia communication environment. As a result of these advances, social networking sites are seeing a rapid increase in videos uploaded with or without semantic tags. According to YouTube statistics, approximately 200 hours of video content is uploaded to YouTube every minute, and approximately 11 million videos are posted to Twitter every day without any accompanying text or tags. As online videos without semantic tags grow in popularity, robust content-based video analysis techniques are in high demand. Content-based video retrieval (CBVR) provides solutions for applications such as in-video advertising, content filtering, video navigation, video indexing and video surveillance. In in-video advertising, the goal is to retrieve target videos that are similar to, and suitable for, inserting an advertisement into. In content filtering, inappropriate content is excluded, which can also be cast as retrieving videos similar to an inappropriate query video.

Significant research progress has been made over the last decades in image retrieval [1], including fine-grained search [2], but CBVR has received comparatively little attention in the multimedia community. Traditional search techniques struggle to process large-scale video databases due to their high computational cost, and considerable effort has been devoted to this problem. In [3], video content indexing by objects is presented: moving objects are detected in the wavelet domain by combining morphological color segmentation at a lower scale with global motion estimation. Histograms of multi-scale wavelet coefficients of the objects are then computed and matched against the database to retrieve similar videos. The limitation is that the system depends on the accuracy of object segmentation, and a technique is needed to exploit temporal dynamics even when objects are only roughly segmented.

Recently, CNNs have shown tremendous success in computer vision, especially for tasks such as image classification, object detection, segmentation and image retrieval. This progress has also extended to the video retrieval problem. For example, Lou et al. [4] propose a compact and discriminative CNN descriptor for video retrieval; a limitation is that they do not consider the relationship between CNN feature maps, which could be incorporated to compute temporal features. Podlesnaya et al. [5] use CNN features for video clip representation; the limitation is that the size of the feature vectors makes video matching costly, and their dimensionality could be reduced to enable search in logarithmic time. Kumar et al. [6] address movie scene retrieval with a CNN and an LSTM. There is also work on hashing-based video search [7,8,9]. All of these methods learn a new subspace in the binary (hash) domain where similar videos lie close together and dissimilar videos lie far apart. For instance, [8] proposes a deep autoencoder-decoder framework that uses a two-layer hierarchical LSTM to learn binary codes, and Kumar et al. [9] also exploit a CNN with an LSTM for video retrieval. Moreover, several recent studies employ AI/ML approaches [10,11,12].

In this paper, we investigate the significance of features from the middle and higher layers of 2D CNNs for video representation. First, we conduct a systematic assessment of the performance of features from different CNN layers in video retrieval tasks. Then, we determine which feature fusion combinations can boost performance.

2 Materials and Method

2.1 CNN Architecture

A CNN can be considered an extension of the multi-layer perceptron (MLP) that exploits the rich 2D spatial structure of images, which the MLP fails to do: the initial (convolutional) layers sense spatial relationships among nearby pixels, and the final layers generate a lower-dimensional representation with a higher-level abstraction of the image. The general CNN architecture is shown in Fig. 1. Once the network is trained on a sufficiently large dataset (proportional to the number of network parameters), each layer extracts the rich information present in the image in a hierarchical manner. Early layers extract low-level image properties such as edges and object contours, middle layers extract shape, color and texture, and higher layers extract features responsible for global-level abstraction such as a face, mouth, nose, etc.

Fig. 1 General CNN architecture

Three CNN architectures are used in this paper: AlexNet [13], GoogleNet [14] and ResNet18 [15]. Tables 1, 2 and 3 show the layer names and output sizes of the respective CNNs. The reader may refer to [13,14,15] for detailed information on the implementation of each CNN.

Table 1 AlexNet architecture
Table 2 ResNet18 architecture
Table 3 GoogleNet architecture

AlexNet: This CNN consists of five convolutional layers and three dense layers. It won first place in ILSVRC 2012. It takes a 227 × 227 × 3 RGB image as input and passes it through all intermediate layers to output the final class scores. Owing to the dense layers at the end, the network has over 61 M parameters.

GoogleNet: This CNN is deeper than AlexNet, introduces the inception block, and won first place in ILSVRC 2014. Each inception module consists of multiple convolutions with kernel sizes 1 × 1, 3 × 3 and 5 × 5; the 1 × 1 convolutional layers in the middle reduce the dimensionality of the feature space. In total, nine inception modules are connected sequentially. More details can be found in [14].

ResNet18: This CNN introduces residual connections, which mitigate the vanishing-gradient problem when training deeper CNNs. A residual connection provides a shortcut for the gradients so that they can reach the input without vanishing as severely as in a network without residual connections. The ResNet architecture won ILSVRC 2015. This model contains four residual stages with {2, 2, 2, 2} blocks. For more details, refer to [15].
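For concreteness, the following MATLAB sketch (our own illustration, not part of the original experiments) loads the three pretrained ImageNet networks via the Deep Learning Toolbox and lists their layer names and input sizes; the AlexNet, GoogLeNet and ResNet-18 support packages are assumed to be installed.

```matlab
% Illustrative only: load the three pretrained ImageNet networks and
% inspect their layer names and expected input sizes.
% Assumes the Deep Learning Toolbox plus the AlexNet, GoogLeNet and
% ResNet-18 support packages are installed.
nets = {alexnet, googlenet, resnet18};
for k = 1:numel(nets)
    net = nets{k};
    fprintf('Input size: %s\n', mat2str(net.Layers(1).InputSize));
    disp({net.Layers.Name}');   % layer names, e.g. 'conv5', 'fc6', 'fc7'
end
```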

2.2 Feature Extraction

To represent a video frame, activations from a particular layer of the CNN are extracted. Following the findings of [16] and [17], we choose the last two convolutional and fully connected layers as feature representations (the layers shown in bold in Tables 1, 2 and 3 are used as descriptors). Let \(f_{CNN}\) be the feature transformation function that maps a video frame in \(R^{m \times n}\) (m × n is the frame resolution) to the feature space \(R^{u \times v}\) (u × v is the size of the feature map). Given the T consecutive frames of the ith clip sampled from the nth video, the feature vectors of the ith clip at a particular layer L are denoted \(CFmax_{ni}^{L}\) and \(CFmean_{ni}^{L}\), computed as:

$$CFmax_{ni}^{L} = \max_{t = 1, \ldots, T} \left( f_{CNN}^{L}\left( V_{ni}^{(t)} \right) \right)$$
(1)
$$CFmean_{ni}^{L} = \frac{1}{T} \sum_{t=1}^{T} f_{CNN}^{L}\left( V_{ni}^{(t)} \right)$$
(2)

where \(CFmax\) and \(CFmean\) represent the features obtained by max and mean pooling over the temporal dimension, respectively.

All clip-level features are averaged to generate the video-level descriptor.
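As a concrete illustration of Eqs. (1) and (2), the following MATLAB sketch (our own, with hypothetical variable names; it assumes the Deep Learning Toolbox) computes the temporally max- and mean-pooled descriptors of a single clip from a chosen layer.

```matlab
% Sketch of Eqs. (1)-(2). clipFrames is an H-by-W-by-3-by-T stack of one
% clip's frames, already cropped to the network input size; layerName is
% a descriptor layer such as 'fc6' (exact names can be read from net.Layers).
function [cfMax, cfMean] = clipDescriptor(net, clipFrames, layerName)
    T = size(clipFrames, 4);
    % One row of activations per frame.
    F = activations(net, clipFrames, layerName, 'OutputAs', 'rows');  % T-by-D
    cfMax  = max(F, [], 1);   % Eq. (1): temporal max pooling
    cfMean = sum(F, 1) / T;   % Eq. (2): temporal mean pooling
end
```

The video-level descriptor is then the mean of the clip-level vectors, e.g. mean(cat(1, clipVecs{:}), 1) for a cell array clipVecs of per-clip descriptors.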

3 Experimental Settings

3.1 Dataset and Setting

We conduct the experiments on the UCF-101 dataset [18], which consists of about 13 k videos from 101 categories. The standard train/test split 1 of the dataset is used. Retrieval is performed by treating the videos of the testing set as queries and the training videos as the retrieval set. We adopt the standard mean average precision at k (mAP@k) for evaluation. MATLAB 2019b and a Tesla K40 GPU are employed for all experiments.
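Since mAP@k formulations differ slightly across papers, the MATLAB sketch below shows the variant we assume here (our own helper, with hypothetical names): for each query, precision is accumulated over the relevant items among its k nearest neighbors under cosine distance, and the per-query average precisions are then averaged. pdist2 from the Statistics and Machine Learning Toolbox is assumed to be available.

```matlab
% mAP@k sketch. queryFeat (Q-by-D) and dbFeat (N-by-D) hold video
% descriptors; queryLabels (Q-by-1) and dbLabels (N-by-1) hold class ids.
function score = mapAtK(queryFeat, dbFeat, queryLabels, dbLabels, k)
    D = pdist2(queryFeat, dbFeat, 'cosine');   % cosine distance matrix
    [~, order] = sort(D, 2, 'ascend');         % nearest neighbors first
    ap = zeros(size(queryFeat, 1), 1);
    for q = 1:size(queryFeat, 1)
        rel = dbLabels(order(q, 1:k)) == queryLabels(q);
        rel = rel(:);                          % force column orientation
        prec = cumsum(rel) ./ (1:k)';          % precision at each rank
        ap(q) = sum(prec .* rel) / max(sum(rel), 1);
    end
    score = mean(ap);                          % mean over all queries
end
```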

3.2 Implementation

First, 10 clips are evenly sampled from each video; then, following [13], each clip undergoes spatial center cropping to the network's input size. All networks are pretrained on ImageNet and are not trained on the video dataset, which confirms that the experiments are conducted under unsupervised settings. Features are extracted as discussed in Sect. 2.2, and we choose T = 16 frames per clip. Convolutional features are high-dimensional, so we apply spatial average pooling (see Table 4 for filter sizes) to extract lower-dimensional features from them. For matching the video clips, the cosine distance is adopted.

Table 4 Spatial pooling strategy in different layers
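The MATLAB sketch below (our own; the per-layer pool sizes actually used are those in Table 4) illustrates the two steps just described: center-cropping a frame to the network input size and spatially average-pooling a convolutional feature map down to a channel-wise vector.

```matlab
% Spatial average pooling of a convolutional activation. fmap stands in
% for the H-by-W-by-C feature map of one frame (random placeholder here,
% sized like an AlexNet conv5 output of 13 x 13 x 256).
fmap = rand(13, 13, 256);
pooledVec = squeeze(mean(fmap, [1 2]))';        % 1-by-C descriptor

% Center crop of an RGB frame to the network input size, e.g. [227 227]
% for AlexNet or [224 224] for GoogleNet/ResNet18.
crop = centerCrop(uint8(255 * rand(240, 320, 3)), [227 227]);

function out = centerCrop(frame, inSz)
    [H, W, ~] = size(frame);
    r0 = floor((H - inSz(1)) / 2) + 1;
    c0 = floor((W - inSz(2)) / 2) + 1;
    out = frame(r0:r0 + inSz(1) - 1, c0:c0 + inSz(2) - 1, :);
end
```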

4 Results

In this section, we first explore the effectiveness of individual features, and then we examine the usefulness of fusing these features.

4.1 Effectiveness of Different Layers' Features

In the following, we inspect each network’s performance.

Experiment using AlexNet

In Table 5, we can see that the higher layers (fc6 and fc7) outperform the middle layers (conv4 and conv5); the reason is that the higher layers capture rich, globally distinctive features with a higher level of abstraction than the middle layers. Using temporal max pooling, fc6 (53.95 mAP@1) performs slightly better than fc7 (53.30 mAP@1). The reason seems to be that the last fully connected layer contains class-specific features that generalize only to seen classes, whereas fc6 generalizes better to unseen classes. We can also observe that temporal max pooling performs better than temporal mean pooling (see Fig. 2).

Table 5 Neural codes of AlexNet and its performance analysis on basis of mAP@k
Fig. 2 mAP of different layers of the three networks; dotted lines denote performance under temporal mean pooling, solid lines under temporal max pooling

Experiment using GoogleNet

For the last two layers of GoogleNet, temporal mean pooling performs better than max pooling, as reported in Table 6. However, for the second-to-last inception block, inception4e, temporal max pooling performs better. Contrary to AlexNet, the last layer (pool5) outperforms the others.

Table 6 Neural codes of GoogleNet and its performance analysis on basis of mAP@k

Experiment using ResNet18

Consistent with the findings for the above two networks, max pooling performs better for the second-to-last residual block conv4b (a.k.a. res4b), and also for pool5, but not for conv5b. Using mean pooling, conv5b outperforms the others (see Table 7 and Fig. 2).

Table 7 Neural codes of ResNet18 and its performance analysis on basis of mAP@k

4.2 Influence of Fusion of Multiple Layers' Features

Given the above findings, we investigate the significance of fusing different features in the context of video search. In this experiment, each CNN layer is used with its best-performing temporal pooling for fusion (denoted as a subscript in Table 8). We also use two handcrafted features, LBP [19] and HOG [20], for the sake of comparison. Both LBP and HOG features are computed for each (grayscale) frame of a clip and then averaged across all clips of the video to generate a video descriptor. To fuse the CNN activations, we first apply the L2 norm to the individual features and then concatenate them, as sketched below.
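A minimal MATLAB sketch of this fusion scheme follows (our own variable names; extractLBPFeatures and extractHOGFeatures require the Computer Vision Toolbox): each descriptor is L2-normalized before concatenation, and the handcrafted descriptors are computed per grayscale frame before being averaged into a video descriptor.

```matlab
% Late fusion by concatenating L2-normalized descriptors. fc6Desc and
% pool5Desc stand in for any two video-level descriptors.
fc6Desc   = rand(1, 4096);                      % placeholder AlexNet fc6 descriptor
pool5Desc = rand(1, 1024);                      % placeholder GoogleNet pool5 descriptor
l2 = @(x) x ./ max(norm(x), eps);               % guard against zero vectors
fusedDesc = [l2(fc6Desc), l2(pool5Desc)];

% Handcrafted descriptors for one grayscale frame; in our setting they are
% averaged over the frames and clips of a video to form its descriptor.
frame   = uint8(255 * rand(240, 320, 3));       % placeholder RGB frame
gray    = rgb2gray(frame);
lbpDesc = extractLBPFeatures(gray);             % 59-D uniform LBP by default
hogDesc = extractHOGFeatures(gray);             % length depends on frame size
```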

Table 8 Comparison of mAP@k of different layers fusion strategies

Table 8 reports the mAP@k for different combinations of features. We can observe that the deep learning features easily outperform the handcrafted ones by a large margin. We can also see that fusing layers, in any combination, does not improve performance over the best standalone layer feature. For example, in the case of GoogleNet, Pool5mean alone performs better than its fusion with lower-layer features. The reason seems to be that a higher layer already captures the essential compact information from the preceding layers, so fusing a lower layer with a higher layer yields largely redundant features. Hence, direct fusion of layers within the same network is not beneficial.

4.3 Effectiveness of Fusion of Different Network Features

Next, we explore the influence of fusing different networks' features on nearest-neighbor search. The results are reported in Table 9, where we can see that every fusion combination performs better than the best standalone layer. This suggests that multi-model fusion works better.

Table 9 Comparison of mAP@k of different networks fusion strategies

5 Conclusion

This paper analyzes and discusses the significance of different layers' features of a network for the nearest-neighbor search task. In particular, AlexNet, GoogleNet and ResNet18 are deployed to extract features that represent videos. We explored the effectiveness of each layer's features, and of their fusion, on video retrieval performance. The results suggest that direct fusion of middle-layer features with higher-layer features of the same network architecture does not boost performance over the standalone features; in future work, we will investigate how to tackle this issue. The results also suggest that fusing features from different networks can boost performance, but this also increases the memory requirement, time complexity, etc. In addition, learning video representations requires a large labeled dataset, which is labor-intensive to build. Future work will explore self-supervised learning, as it is a promising direction for tackling the need for large-scale annotated video datasets.