1 Introduction

Object detection in videos is a critical and challenging branch of object detection that has received increasing attention in recent years. With the rapid development of deep learning, CNN-based detectors (Ren et al. 2015; Dai et al. 2016; Cai and Vasconcelos 2018; Zhang et al. 2020) have become the mainstream object detection algorithms, and state-of-the-art methods deliver strong detection accuracy. Since these existing studies mainly focus on detecting objects in single images, we refer to them as static detectors. However, unlike single images, videos contain spatiotemporal information; that is, objects in video frames are continuous in the temporal and spatial domains (Zhu et al. 2017b). Serious problems therefore arise when static detectors are applied to video object detection. As shown in Fig. 1a, the features of consecutive frames in a video are similar. Static detectors ignore this similarity and extract features for every frame, resulting in computational redundancy. On the other hand, as illustrated in Fig. 1b, video frames often suffer from conditions such as object occlusion and motion blur, leading to false positives. Even state-of-the-art static detectors cannot resolve the accuracy degradation caused by these false positives. Therefore, the challenges of video object detection lie in the computational redundancy and accuracy degradation brought by static detectors.

Fig. 1

Problems existing in video object detection caused by static detectors. a Duplicate extraction of similar features from neighboring frames. b False positives due to object blurring, occlusion, etc

The key to solving the above challenges of computational redundancy and accuracy degradation is to exploit the spatiotemporal information of videos. Feature propagation is an effective technique in video object detection; it applies spatiotemporal information at the feature level and can be used to associate features between frames. Feature aggregation is another important technique, mainly used to improve the features of certain frames. To address the challenge of accuracy degradation, the video object detection methods FGFA (Zhu et al. 2017a), STMN (Xiao and Jae Lee 2018), and STSN (Bertasius et al. 2018) apply feature aggregation to strengthen the features of deteriorated frames using nearby frames. FGFA aligns pixel-level features using optical flow (Dosovitskiy et al. 2015) and aggregates nearby features to improve the feature quality of each frame. Building on the pixel-level feature calibration of FGFA, MANet (Wang et al. 2018a) further fuses instance-level calibration to deal with occlusion. However, these accuracy-focused studies rely on expensive CNN-based feature extraction networks to enhance detection accuracy, which leads to low detection speed.

For the high computational complexity of accuracy-focused studies, an ideal solution is to apply feature propagation to reduce computing cost while maintaining detection accuracy. In video object detection, optical flow, which exploits motion information, and the memory-based Long Short-Term Memory network (LSTM) are often used to propagate features. For instance, DFF (Zhu et al. 2017b) extracts feature maps only for sparse key frames and estimates the feature maps of the other non-key frames by optical flow. Because the optical flow network (Dosovitskiy et al. 2015) is less complex than the convolutional feature network, the total detection time is reduced. However, since the motion of high-level feature pixels differs considerably from that of image pixels, estimating high-level features with optical flow that represents image pixel motion may introduce artificial errors. In addition, DFF uses a fixed key frame strategy, which leads to missed detections of objects that newly appear in non-key frames. Liu et al. (2019) propose an interleaved framework that propagates and aggregates features through LSTM (Xingjian et al. 2015), and further improve results with an adaptive key frame policy based on reinforcement learning (Hasselt et al. 2016). However, an inherent defect of LSTM is that the memory of an object persists after it has moved to a different position, so LSTM cannot accurately align features. LSTM is also relatively time-consuming. Thus it is not an optimal choice for feature propagation.

The self-attention mechanism (Vaswani et al. 2017) is a feature learning method that has recently become common in visual analysis. For instance, the attention mechanism (Bahdanau et al. 2015) is used to capture long-range dependencies in Non-Local networks (Wang et al. 2018b), where connections between two pixels within an image or across frames are established using attention. Compared with optical flow and LSTM, self-attention directly computes the correspondence between features, so attention-guided feature propagation is more accurate and lightweight.

In this work, we focus on improving video object detection speed by reducing the redundant computation in feature extraction while preserving detection accuracy. Based on the high similarity between features of nearby frames and on attention-guided feature propagation, we propose a dynamic video object detection network with a switchable feature extraction process. A complete feature extraction network and a lightweight one are designed for sparse key frames and dense non-key frames, respectively. The features extracted for key frames are high-level semantic features suitable for generating detection results. The low-level features produced by the lightweight network are fast to extract but cannot be fed directly to the detection network; feature propagation is therefore employed to establish semantic features for non-key frames. To propagate features accurately and quickly, we introduce a reliable and lightweight feature propagation method named feature temporal attention (FTA) based on self-attention. We apply the self-attention mechanism in the time domain to establish connections between feature pixels across frames. In addition, a lightweight transform network is used to further improve the semantic information of the low-level features. Based on FTA, the temporal attention based feature propagation module (TAFPM) predicts the final features of non-key frames from the key frame features and the transformed non-key frame features. Furthermore, since the alternating frequency of the complete and lightweight feature extraction networks is determined by the key frames, we propose an adaptive key frame decision strategy based on the similarity of low-level features between frames. The integration of TAFPM and the key frame strategy allows our network to achieve comparable detection accuracy while greatly increasing detection speed.

In summary, the contributions of this paper are as follows:

  • We propose a new online dynamic video object detection network, which significantly improves detection speed by reducing redundant calculation in feature extraction.

  • We introduce a lightweight attention-guided feature propagation method, which establishes an accurate connection between inter-frame features.

  • We design a new adaptive key frame decision strategy based on the low-level features to further balance detection accuracy and computing time.

  • We verify the proposed detection network on the ImageNet VID dataset, obtaining satisfactory detection performance.

2 Related work

2.1 Object detection in images

Currently, CNN-based approaches are the leading object detection methods. Since they generate default boxes from anchors, the methods in Ren et al. (2015), Dai et al. (2016), Liu et al. (2016), Bochkovskiy et al. (2020) and Cai and Vasconcelos (2018) are called anchor-based methods. In contrast, CornerNet (Law and Deng 2020) and CentripetalNet (Dong et al. 2020) are anchor-free methods. Anchor-based methods fall into two categories: one-stage methods (Liu et al. 2016; Bochkovskiy et al. 2020) and two-stage methods (Ren et al. 2015; Dai et al. 2016; Cai and Vasconcelos 2018). YOLOv4 (Bochkovskiy et al. 2020) is a state-of-the-art one-stage method that detects objects by regression and runs at high speed. However, the detection accuracy of one-stage methods is generally lower than that of two-stage methods. Faster R-CNN (Ren et al. 2015) is the most representative two-stage method, which treats detection as a classification problem: the extracted features are first used to propose candidate regions, which are then classified to produce the detection results. As a result, Faster R-CNN is highly accurate but time-consuming. R-FCN (Dai et al. 2016) increases the number of shared feature layers to 101, which improves computing speed compared with Faster R-CNN. Cascade R-CNN (Cai and Vasconcelos 2018) combines the cascade idea with the Faster R-CNN detection framework, thus improving detection accuracy. CornerNet treats object detection as a keypoint problem: it generates the detection box by finding the top-left and bottom-right corner points. CentripetalNet builds on CornerNet; for accurate keypoint matching, it proposes a corner matching method based on centripetal shift, along with a cross-star deformable convolution module.

Based on the characteristics of the above detection methods, the two-stage method with higher accuracy is more suitable for our study. Therefore, R-FCN with ResNet-101 (He et al. 2016) is chosen as the static detector in the proposed video object detection method.

2.2 Object detection in videos

Video object detection methods incorporate video-specific spatiotemporal information into static detectors to improve detection performance. The spatiotemporal information can be fused either in the post-processing stage or inside the static detection network; the former studies are called box-level methods, and the latter feature-level methods.

Box-level methods operate on detection boxes in the time domain during post-processing. For example, Seq-NMS (Han et al. 2016) proposes sequence NMS, which links boxes of adjacent frames into box sequences to boost weak detections; Seq-NMS can be embedded in other video object detection methods to further improve performance. T-CNN (Kang et al. 2017b) utilizes box propagation to reduce false negatives and introduces tracking to establish long-term connections between boxes. TCN (Kang et al. 2016) designs a strategy to classify and re-score tubelets. D&T (Feichtenhofer et al. 2017) computes the cross-correlation between features of adjacent frames to track objects and form tracklets, by which inter-frame detections are linked to improve detection accuracy. With a spatiotemporal cuboid proposal network, the method in Tang et al. (2018) links detections over short and long ranges to improve classification quality. These box-level methods rely on complex post-processing to enhance detection accuracy and are therefore time-consuming.

State-of-the-art methods for detecting objects in videos are feature-level methods, in which feature propagation and aggregation are usually applied to optimize the detection structure. In research aimed at boosting accuracy, a common operation is to strengthen features by aggregating features from other frames, e.g., FGFA (Zhu et al. 2017a) and MANet (Wang et al. 2018a). The memory-guided method STMN (Xiao and Jae Lee 2018) aggregates features with the proposed Spatial-Temporal Memory Module (STMM) and aligns features with the MatchTrans module. STSN (Bertasius et al. 2018) uses deformable convolution to aggregate features. Deng et al. (2019), Shvets et al. (2019) and Chen et al. (2020) aggregate features at the proposal level. RDN (Deng et al. 2019) propagates and aggregates object relations over supportive proposals, and the aggregated features are then used to augment the feature of each reference object proposal. Shvets et al. (2019) propose a temporal relation module that establishes similarities between inter-frame proposals and selects proposals from nearby frames to strengthen the current proposals. In MEGA (Chen et al. 2020), the candidate box features of the current frame are augmented with global and local information to achieve high accuracy. The above studies enhance detection accuracy at the cost of computing time.

Among the methods that consider both speed and accuracy, Zhu et al. (2018) combine the ideas of DFF and FGFA into a unified optical flow based detection framework with high detection performance; they also replace the fixed key frame strategy with a temporally-adaptive key frame scheduling scheme. TSSD-OTA (Chen et al. 2019) temporally integrates multi-scale features with ConvLSTM, and introduces an attention mechanism to select the optimal features for the ConvLSTM memory module. Liu and Zhu (2018) propose an efficient Bottleneck-LSTM to reduce the computational cost of feature propagation. Later, Liu et al. (2019) design a dynamic framework that includes multiple feature extractors and aggregates features using the Bottleneck-LSTM. Yao et al. (2020) integrate detection and tracking at the object level: a real-time tracker updates detections and propagates box features between frames, and LSTM is then used to aggregate the object-level features. Jiang et al. (2020) adopt the fixed key frame idea and propagate features with the proposed attention-based Learnable Spatio-Temporal Sampling (LSTS) module. From this body of research on balancing detection speed and accuracy, the feature propagation method and the key frame strategy are the key elements for reducing computation while ensuring accuracy. For accurate and fast feature propagation, we use the self-attention mechanism as a relation module to model inter-frame feature dependencies.

2.3 Key frame strategy

The key frame idea is used to select sparse frames and thus improve computational efficiency when processing a video. It plays an important role in video object detection, video behavior recognition and video object segmentation. AdaFrame (Wu et al. 2019) proposes a framework that adaptively selects relevant frames for fast video recognition, using a memory-augmented LSTM as the key frame selector. Li et al. (2018) and Xu et al. (2018) design lightweight CNN networks to determine the key frames in video semantic segmentation.

In existing video object detection research, most methods use fixed key frame strategies, such as DFF, FGFA and MEGA. Adaptive key frame strategies are adopted in Liu et al. (2019) and Zhu et al. (2018). Zhu et al. (2018) define key frames based on the output of optical flow. The density of key frames in Chen et al. (2018) depends on the propagation difficulty. Relying on reinforcement learning, key frames are selected in Liu et al. (2019) and Yao et al. (2020). These observations show that adaptive key frame selection has been less studied in video object detection. We propose an effective and lightweight key frame strategy that leverages the features themselves, creating a complete and efficient detection network.

Fig. 2

Pipeline of the proposed video object detection method. The key frame decision module defines the property (key or non-key) of each input frame. Key frame features are extracted by the complete feature network (ResNet-101). The lightweight low-level feature network and the feature propagation module are designed for extracting and producing non-key frame features. The detection network is the same for each frame: it takes semantic features as input and outputs detection results

3 Methods

We develop an attention-based dynamic framework for video object detection that reduces running time while maintaining detection performance. In this section, we present the framework and its implementation details. We first give an overall outline of the framework. We then introduce in detail the principle of feature propagation, feature temporal attention (FTA), and the two component modules: the temporal attention based feature propagation module (TAFPM) and the key frame decision module.

3.1 Overview

Our method is based on the well-known static detector R-FCN. As shown in Fig. 2a, two steps are required to produce detection results in the R-FCN network. Input images are first fed into the CNN-based feature extractor \(N_{f}\) to produce feature maps f, which are then used as inputs of the RPN to generate region proposals (RoIs). Finally, through position-sensitive RoI pooling layers and softmax layers, the RoIs are processed into the final detection results. Since they undertake the detection task, we define the subnetworks after \(N_{f}\) as the detection network \(N_{d}\). Compared with \(N_{d}\), \(N_{f}\) is more time-consuming due to its many convolution operations. However, when a video sequence serves as the input, the output features of \(N_{f}\) are similar for neighboring frames, as shown in Fig. 1a. This means that extracting features for every frame is unnecessary in video object detection, and the feature similarity of adjacent frames can be exploited to eliminate the computational redundancy.

We propose a dynamic video object detection network to avoid the complex feature extraction for non-key frames. Moreover, based on the feature similarity, a key frame determination strategy is applied to further optimize the detection performance. Figure 2b illustrates the pipeline of the proposed dynamic framework. The key frames \(I_{k}\) and non-key frames \(I_{t}\) are defined by the key frame decision module, which is detailed in Fig. 4.

In this paper, we divide the original \(N_{f}\) into a low-level feature network \(N_{f}^{l}\) and high-level feature network \(N_{f}^{h}\). Output features \(f^{l}\) of the lightweight \(N_{f}^{l}\) contain more detailed information. Semantic information of images needed in object detection is mainly reflected in the output features \(f^{h}\) of \(N_{f}^{h}\). For key frames \(I_{k}\), \(N_{f}\) is used to extract both low-level features \(f_{k}^{l}\) and high-level features \(f_{k}^{h}\), that is, the final detection results of \(I_{k}\) are given by the complete R-FCN.

We define the starting frame of each input video as the first key frame. For each current frame, \(f_{t}^{l}\) is first extracted by \(N_{f}^{l}\). Then, \(f_{k}^{l}\) and \(f_{t}^{l}\) act as inputs of the key frame decision module to determine whether the current frame is the next key frame. If the current frame is a non-key frame, no high-level features are extracted. Given that the extracted low-level features \(f_{t}^{l}\) lack the semantics needed for the later detection task, a feature semantic enhancement step is designed to produce approximate high-level features \(f_{t}^{h_{appr}}\). The proposed feature temporal attention (FTA) then acts on \(f_{k}^{h}\) and \(f_{t}^{h_{appr}}\) to produce the propagated high-level features of the non-key frame \(f_{t}^{h}\), which are fed to \(N_{d}\) to generate the detection results. The overall control flow is summarized by the sketch below.
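In the sketch, the callables N_f_low, N_f_high, N_l, fta, N_d and is_new_key are placeholders for the networks and modules introduced in this section, not the authors' released implementation; the sketch only illustrates the per-frame dispatch between the two feature extraction paths.

```python
def detect_video(frames, N_f_low, N_f_high, N_l, fta, N_d, is_new_key):
    """Per-frame control flow of the proposed dynamic pipeline (all callables are placeholders)."""
    f_k_low, f_k_high = None, None
    results = []
    for t, frame in enumerate(frames):
        f_t_low = N_f_low(frame)                         # lightweight low-level features f_t^l
        if t == 0 or is_new_key(f_k_low, f_t_low):       # first frame, or similarity to the key frame drops below sigma
            f_k_low, f_k_high = f_t_low, N_f_high(f_t_low)  # complete feature extraction for the key frame
            f_t_high = f_k_high
        else:
            f_t_appr = N_l(f_t_low)                      # semantic enhancement of f_t^l (Sect. 3.3)
            f_t_high = fta(f_k_high, f_t_appr)           # FTA propagates key-frame features (Sect. 3.2)
        results.append(N_d(f_t_high))                    # shared detection network (RPN + PS-RoI pooling)
    return results
```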

3.2 Feature temporal attention

Self-attention can assign a weight to each feature unit through autonomous learning between feature maps, thereby extracting more useful feature maps. We apply self-attention in the time domain and propose feature temporal attention (FTA), by which high-level feature maps of key frames are propagated to non-key frames. FTA propagates features in three steps: we first calculate the similarity between pairs of feature maps, then normalize the similarity matrix to generate the corresponding weights, and finally produce the propagated features based on these weights.

We define the feature maps of frames \(I_{k}\) and \(I_{k+\tau }\) as \(F_{k}\) and \(F_{k+\tau }\), respectively; both features have a size of \(N*W*H\). The similarity matrix of the two feature maps is calculated by the dot-product function, as shown in Eq. (1):

$$\begin{aligned} f(F_{k}^{i},F_{k+\tau }^{j})=\theta (F_{k}^{i})^{T}\phi (F_{k+\tau }^{j}) \end{aligned}$$
(1)

where \(F_{k}^{i}\) denotes an arbitrary position of \(F_{k}\) and, similarly, \(F_{k+\tau }^{j}\) a position of \(F_{k+\tau }\). \(f(\cdot )\) refers to the dot-product function, and the resulting similarity matrix has size \(WH*WH\). \(\theta (F_{k}^{i})\) and \(\phi (F_{k+\tau }^{j})\) are two embedding functions with the same structure. They are defined in Eq. (2):

$$\begin{aligned} \left\{ \begin{matrix} \theta (F_{k}^{i})=W^{\theta }F_{k}^{i}\\ \phi (F_{k+\tau }^{j})=W^{\phi }F_{k+\tau }^{j} \end{matrix}\right. \end{aligned}$$
(2)

where \(W^{\theta }\) and \(W^{\phi }\) represent the feature transformations applied to \(F_{k}^{i}\) and \(F_{k+\tau }^{j}\), respectively, which share the same structure. Taking \(W^{\theta }\) as an example, the features \(F_{k}\) are first convolved with a \((N/8) *1*1\) convolutional kernel to generate intermediate features, which are then unfolded into a feature matrix with resolution \((N/8)*WH\).

Since the similarity matrix \(f(F_{k}^{i},F_{k+\tau }^{j})\) is used as a weight in the self-attention mechanism, we normalize it with the softmax function to construct the attention map \(att_{j,i}\) between \(F_{k}\) and \(F_{k+\tau }\):

$$\begin{aligned} att_{j,i}=\frac{exp(f(F_{k}^{i},F_{k+\tau }^{j}))}{\sum _{i=1}^{n}exp(f(F_{k}^{i},F_{k+\tau }^{j}))} \end{aligned}$$
(3)

where \(att_{j,i}\) represents the attention paid to \(F_{k}^{i}\) when generating \(F_{k+\tau }^{j}\), and n indicates the number of spatial positions of the feature map, that is, all possible positions of i, \(n=WH\).

According to attention map \(att_{j,i}\) and \(F_{k}\), the propagated feature map \(F_{k+\tau }^{j_{pro}}\) of \(I_{k+\tau }\) can be estimated with Eq. (4):

$$\begin{aligned} F_{k+\tau }^{j_{pro}}=\sum _{i=1}^{n}(att_{j,i}\cdot F_{k}^{i}) \end{aligned}$$
(4)

where n covers all possible positions of i, \(n=WH\). Through a 1*1 convolution again, \({F_{k+\tau }^{pro}}\) is transformed into the same dimension as the extracted feature map \(F_{k+\tau }\). \({F_{k+\tau }^{pro}}\) is represented by Eq. (5):

$$\begin{aligned} F_{k+\tau }^{pro}=(F_{k+\tau }^{1_{pro}},F_{k+\tau }^{2_{pro}},\ldots ,F_{k+\tau }^{j_{pro}},\ldots F_{k+\tau }^{m_{pro}}) \end{aligned}$$
(5)

where m is the number of positions of \(F_{k+\tau }^{pro}\), whose dimension is \(N*W*H\). Therefore, through the rule of FTA, we propagate the feature map of \(I_{k}\) to \(I_{k+\tau }\).
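As a concrete illustration, the following PyTorch-style sketch implements Eqs. (1)-(5) under our reading of the text: the channel reduction by a factor of 8 via 1*1 convolutions and the final 1*1 convolution follow the description above, while the framework, layer names and batching are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class FeatureTemporalAttention(nn.Module):
    """Sketch of FTA (Eqs. 1-5): propagates key-frame features F_k to frame k+tau."""

    def __init__(self, channels):
        super().__init__()
        reduced = channels // 8
        self.theta = nn.Conv2d(channels, reduced, kernel_size=1)  # embedding W^theta for F_k
        self.phi = nn.Conv2d(channels, reduced, kernel_size=1)    # embedding W^phi for F_{k+tau}
        self.out = nn.Conv2d(channels, channels, kernel_size=1)   # final 1*1 convolution producing F_{k+tau}^{pro} (Eq. 5)

    def forward(self, F_k, F_kt):
        # F_k, F_kt: (B, N, H, W) feature maps of the key frame and of frame k+tau
        B, N, H, W = F_k.shape
        theta = self.theta(F_k).flatten(2)             # (B, N/8, WH), columns indexed by position i
        phi = self.phi(F_kt).flatten(2)                # (B, N/8, WH), columns indexed by position j
        sim = torch.bmm(phi.transpose(1, 2), theta)    # Eq. (1): similarity matrix of size WH*WH
        att = torch.softmax(sim, dim=-1)               # Eq. (3): att[j, i], normalized over positions i of F_k
        values = F_k.flatten(2)                        # (B, N, WH), values taken from the key frame
        prop = torch.bmm(values, att.transpose(1, 2))  # Eq. (4): weighted sum over i for every position j
        return self.out(prop.view(B, N, H, W))         # propagated feature map F_{k+tau}^{pro}
```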

Fig. 3

The lightweight network for feature semantic enhancement handling. It takes low-level features (\(f_{t}^{l}\)) as input and outputs the estimated features \(f_{t}^{h_{appr}}\) suitable for FTA

3.3 Temporal attention based feature propagation module

Up to this point, we have elaborated the basis of feature propagation (FTA), that is, how to update feature \(F_{k+\tau }\) based on feature \(F_{k}\). FTA requires same-level features of the two related frames. However, in our study, only low-level features are extracted for non-key frames to keep detection fast. Therefore, we propose a temporal attention based feature propagation module (TAFPM) to resolve this feature propagation problem.

Next, we detail how the high-level features of the key frame \(f_{k}^{h}\) and the low-level features of the non-key frame \(f_{t}^{l}\) are used to obtain the high-level features of the non-key frame. Since the high-level features used in the subsequent detection network \(N_{d}\) express semantic information, which is exactly what \(f_{t}^{l}\) lacks, we design a lightweight network \(N_{l}\) to enhance the semantic information of \(f_{t}^{l}\). The resulting features are the approximate semantic features \(f_{t}^{h_{appr}}\).

The structure of \(N_{l}\) is shown in Fig. 3. A convolution layer with a 1*1 kernel is first used to reduce the number of feature channels. The network then includes two 3*3 convolutional layers with 512 and 1024 channels, respectively. The output feature maps \(f_{t}^{h_{appr}}\) have the same dimensions as \(f_{k}^{h}\) to ensure that feature propagation can be carried out. A minimal sketch of this network is given below.
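In the sketch, the input channel count (1024) and the 1*1 reduction to 256 channels are not specified in the text and are assumptions, as are the ReLU activations; only the two 3*3 layers with 512 and 1024 channels follow the description above.

```python
import torch.nn as nn

class SemanticEnhanceNet(nn.Module):
    """Sketch of the lightweight transform network N_l in Fig. 3."""

    def __init__(self, in_channels=1024, reduced=256):  # channel counts here are assumptions
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, reduced, kernel_size=1),     # 1*1 layer: reduce the feature channels
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, 512, kernel_size=3, padding=1),  # 3*3 layer with 512 channels
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 1024, kernel_size=3, padding=1),     # 3*3 layer with 1024 channels, same dims as f_k^h
        )

    def forward(self, f_t_low):
        return self.body(f_t_low)  # approximate semantic features f_t^{h_appr}
```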

Feature propagation from key frame to non-key frame is performed based on FTA. We take \(f_{k}^{h}\) and \(f_{t}^{h_{appr}}\) as inputs of FTA:

$$\begin{aligned} \left\{ \begin{array}{l} F_{k}=f_{k}^{h}\\ F_{k+\tau }=f_{t}^{h_{appr}} \end{array}\right. \end{aligned}$$
(6)

where \(f_{k}^{h}\) are the extracted high-level features of key frames, and \(f_{t}^{h_{appr}}\) are the outputs of semantic enhancement handling of non-key frames.

After defining the inputs of FTA, the output \(f_{t}^{h_{pro}}\) is calculated through Eqs. (1)–(5) as the propagated high-level features of the non-key frame, which can be sent to \(N_{d}\) to produce the detection results of the non-key frame.
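Putting the pieces together, a hypothetical wiring of TAFPM for one non-key frame might look as follows; it reuses the SemanticEnhanceNet and FeatureTemporalAttention sketches above, and the tensor shapes are illustrative only.

```python
import torch

f_k_high = torch.randn(1, 1024, 38, 63)  # extracted high-level features of the key frame, f_k^h
f_t_low = torch.randn(1, 1024, 38, 63)   # extracted low-level features of the non-key frame, f_t^l

N_l = SemanticEnhanceNet(in_channels=1024)
fta = FeatureTemporalAttention(channels=1024)

f_t_appr = N_l(f_t_low)            # Eq. (6): F_{k+tau} = f_t^{h_appr}
f_t_pro = fta(f_k_high, f_t_appr)  # Eqs. (1)-(5): propagated high-level features f_t^{h_pro}
# f_t_pro is then sent to the detection network N_d.
```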

Fig. 4

Our key frame decision strategy. We process low-level features (\(f_{t}^{l}\)) to obtain feature similarity, which is used as the basis for determining key frames

3.4 Key frame decision module

The key frame module is the switching device of the proposed dynamic network, and the feature similarity of video frames is its basis. Due to object emergence, disappearance, or changes in appearance, the feature maps of video frames change over time. Shelhamer et al. (2016) show that, compared with semantic feature layers, intermediate layers better reflect the changes in video frames. In this paper, we design an adaptive key frame decision method based on measuring low-level feature similarity.

Fig. 5

Example variation of feature similarity. The red circles represent key frames. Between two key frames, feature similarity of key frame and current frame gradually decreases as the frame number increases

As shown in Fig. 4, the module takes the previous key frame and the current frame as inputs and outputs their feature similarity. The sizes of the low-level features of the previous key frame \(f_{k}^{l}\) and the current frame \(f_{t}^{l}\) are \(N*W*H\). We first convolve \(f_{k}^{l}\) and \(f_{t}^{l}\) with a 1*1*1 convolution kernel to reduce their feature channels to 1. The resulting features are then unfolded into feature vectors of size \(1*WH\), denoted by \(v_{k}^{l}\) for the previous key frame and \(v_{t}^{l}\) for the current frame. Cosine similarity is used to measure the similarity of these two feature vectors, so the similarity parameter \(s_{k,t}\) of \(v_{k}^{l}\) and \(v_{t}^{l}\) is obtained by Eq. (7):

$$\begin{aligned} s_{k,t}=\frac{v_{k}^{l}\cdot v_{t}^{l}}{\left\| v_{k}^{l}\right\| \left\| v_{t}^{l}\right\| }=\frac{\sum _{i=1}^{WH}a_{k}^{i}\,a_{t}^{i}}{\left\| v_{k}^{l}\right\| \left\| v_{t}^{l}\right\| } \end{aligned}$$
(7)

where \(a_{k}^{i}\) is the i-th element of the feature vector \(v_{k}^{l}\), \(a_{t}^{i}\) is the i-th element of \(v_{t}^{l}\), \(0<i\le W*H\), and \(\left\| \cdot \right\|\) denotes the 2-norm.

Through \(s_{k,t}\), the properties (key frame or non-key frame) of the current frame can be defined according to Eq. (8):

$$\begin{aligned} K_{t}=\left\{ \begin{array}{ll} 1, &{}\quad s_{k,t}< \sigma \\ 0, &{}\quad s_{k,t}\ge \sigma \\ \end{array} \right. \end{aligned}$$
(8)

where \(K_{t}\) stands for the key frame indicator. The current frame t is selected as a new key frame when \(K_{t}=1\), that is, when its similarity to the previous key frame has dropped below the threshold; otherwise, the value 0 means a non-key frame. \(\sigma\) represents the threshold on \(s_{k,t}\), and we set \(\sigma =0.94\). The optimal value 0.94 is obtained by analyzing the influence of \(\sigma\) on accuracy and running time, which is detailed in the ablation study of the Experiment section.
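The decision rule of Eqs. (7)-(8) can be sketched as follows; sharing a single 1*1 reduction convolution between the two frames and the batched cosine similarity are our assumptions, not details given in the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class KeyFrameDecision(nn.Module):
    """Sketch of the key frame decision module (Fig. 4, Eqs. 7-8)."""

    def __init__(self, in_channels, sigma=0.94):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, 1, kernel_size=1)  # reduce the feature channels to 1
        self.sigma = sigma

    def forward(self, f_k_low, f_t_low):
        v_k = self.reduce(f_k_low).flatten(1)     # (B, WH) feature vector v_k^l of the previous key frame
        v_t = self.reduce(f_t_low).flatten(1)     # (B, WH) feature vector v_t^l of the current frame
        s = F.cosine_similarity(v_k, v_t, dim=1)  # Eq. (7): similarity parameter s_{k,t}
        return s < self.sigma                     # Eq. (8): K_t = 1 (new key frame) when similarity drops below sigma
```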

Figure 5 shows an example of how the similarity parameter \(s_{k,t}\) varies with the frame number. The first red point is the first frame of the video and is selected as the first key frame. The next six red points are the key frames selected by the proposed method. It can be observed that the frame interval between adjacent key frames varies. Also, non-key frames farther away from the previous key frame have a smaller \(s_{k,t}\).

4 Experiment

In this section, we evaluate our method on the ImageNet VID dataset and present the experimental results both qualitatively and quantitatively. All experiments are run on a computer equipped with a single GPU (NVIDIA GeForce GTX 1080 Ti), an Intel i7-6800K CPU (12 threads), and 32 GB of RAM.

4.1 Experiment setup

4.1.1 Dataset and evaluation metric

The ImageNet VID dataset (Russakovsky et al. 2015) is currently the most representative dataset for video object detection. It contains 5354 videos: 3862 in the training set, 555 in the validation set, and 937 in the testing set. The frames of the training and validation sets are fully annotated. The 30 categories in the VID dataset are a subset of the 200 categories in the DET dataset. The data of each category in VID are imbalanced, and the sample quality of VID is poorer than that of DET. Therefore, like most previous VID methods, we train the detection model on a mixture of VID and DET (using the same categories as VID). We sample 10 frames from each video in the VID dataset and up to 2K images per class from the DET dataset to compose our training set. As in other video object detection research (Wang et al. 2018a; Zhu et al. 2017b), detection performance is evaluated on the validation set.

Average precision (AP) and mean average precision (mAP) are the most widely used metrics in object detection. AP is defined as the mean of the precision values at 11 equally spaced recall levels (0, 0.1, ..., 1) on the precision-recall (PR) curve. We use AP and mAP to evaluate the accuracy of our method, and runtime is expressed in frames per second (fps). In the experiments, following R-FCN, an IoU threshold of 0.5 is applied between RPN proposals and the ground truth.
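For reference, the 11-point interpolated AP can be computed as in the following sketch; this is a generic implementation of the metric, not the authors' evaluation code.

```python
import numpy as np

def eleven_point_ap(recall, precision):
    """11-point interpolated AP: average of the maximum precision at recall >= r, for r in {0, 0.1, ..., 1}."""
    recall, precision = np.asarray(recall), np.asarray(precision)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        ap += (precision[mask].max() if mask.any() else 0.0) / 11.0
    return ap
```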

4.1.2 Implementation details and training

In our study, R-FCN is selected as the static detector. For feature extraction, we use ResNet-101 pre-trained on the ImageNet as our backbone network \(N_{f}\). The convolution layer res4b3 is defined as the boundary between \(N_{f}^{l}\) and \(N_{f}^{h}\). Layers up to res4b3 belong to \(N_{f}^{l}\), and the higher layers are \(N_{f}^{h}\).

The detection model is trained end-to-end with stochastic gradient descent (SGD). During training, a sample consists of two frames randomly sampled within a certain range of a VID video; the former acts as the key frame and the latter as the non-key frame. To produce samples of the same form from DET, each sampled DET image is duplicated once. The number of iterations is set to 120K, with learning rates of \(10^{-3}\) and \(10^{-4}\) for the first 80K and last 40K iterations, respectively. In both training and testing, input frames are resized so that their shorter side is 600 pixels.
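The stated schedule could be expressed, for instance, as in the sketch below; the framework, the stand-in model, and the momentum/weight-decay values are assumptions, since only the iteration counts and learning rates are given in the text.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # stand-in for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
# lr = 1e-3 for the first 80K iterations, then 1e-4 for the remaining 40K (120K in total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80_000], gamma=0.1)
# scheduler.step() is called once per training iteration, so the milestone is counted in iterations.
```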

Fig. 6

Influence of feature similarity threshold \(\sigma\) on detection accuracy and running time

Table 1 Average precision (in %) of our method and the base detector on the ImageNet VID dataset

4.2 Ablation study

4.2.1 Parameter analysis

The similarity threshold \(\sigma\) is an important parameter of the key frame strategy. It determines the density of key frames and has a significant impact on detection accuracy and speed. We investigate the influence of \(\sigma\) on detection accuracy and running time and display the results in Fig. 6. As \(\sigma\) rises, key frames become denser and detection accuracy increases, but the frame rate decreases. When \(\sigma\) takes its maximum value of 1, every frame is a key frame and complete features are extracted for all of them, so the proposed detector reduces to the static detector. Conversely, a lower \(\sigma\) produces sparser key frames, which lowers detection accuracy but increases running speed. Moreover, as the two curves show, accuracy decreases slowly in the initial stage, while the running time increases slowly in the later stage. Since the feature difference between frames is mainly caused by the objects, this difference has a limit, which corresponds to the minimum value of the feature similarity. When \(\sigma\) is too low, key frames become too sparse and most frames detect objects from inaccurately propagated features, resulting in a sharp drop in accuracy. Based on the above analysis and the tradeoff between accuracy and speed, we choose 0.94 as the optimal \(\sigma\).

4.2.2 Tradeoffs of accuracy and speed

Table 1 shows the accuracy of our method and of the base detector R-FCN on the ImageNet VID dataset. We obtain an mAP of 73.7%, only 0.2% lower than the 73.9% of the base detector. This demonstrates that the proposed FTA-based feature propagation causes only a slight decline in accuracy while accelerating processing. In addition, our AP is higher than that of R-FCN in several categories (e.g., bear, bus). This result is mainly due to the inter-frame association established at the feature level: the extracted features of non-key frames are replaced with propagated features, thus avoiding detection failures on deteriorated non-key frames. This illustrates the necessity of inter-frame feature propagation in video object detection.

Since the detection network and post-processing are identical for every frame in our method, the total running time depends on the feature extraction time. We therefore analyze the feature extraction time for key and non-key frames, as shown in Table 2. Input frames are preprocessed to 600*1000. Compared with the 72 ms needed to extract complete features for a key frame, extracting the low-level features of a non-key frame takes only 12 ms, one sixth of the key frame time. The proposed key frame module and feature propagation module take 2 ms and 6 ms, respectively. Therefore, for a non-key frame, producing features usable for detection takes 20 ms in total, less than 1/3 of the key frame time. These results show that the proposed temporal attention based feature propagation module (TAFPM) substantially speeds up feature processing for non-key frames.

Table 2 Feature processing time (in ms) for key and non-key frames
Table 3 Performance comparison of fixed and adaptive key frame strategy

To verify the performance of our key frame strategy, we compare it with the fixed key frame strategy. As shown in Table 3, with the proposed adaptive key frame strategy the mAP is 73.7%, 0.4% higher than with the fixed strategy. A similar result holds for runtime: ours is 0.81 fps higher than that of the fixed strategy. This is because the fixed key frame strategy determines the property of the current frame solely from its frame distance to the previous key frame and does not take inter-frame feature variation into account. Consequently, large changes in object appearance and newly emerging objects cannot be handled in time, resulting in failures to detect the objects involved. Our adaptive key frame strategy effectively compensates for this deficiency, and the gains in accuracy and speed show that it improves upon the fixed key frame strategy.

Table 4 Accuracy and runtime comparison with state-of-the-arts on the ImageNet VID validation set
Fig. 7

Example detection results of our method on the ImageNet VID validation dataset. The images in each row belong to one scene. For each scene, we sample one frame every 5 frames and display its detection results. Our method achieves satisfactory results in these scenes

4.2.3 Comparison with the state-of-the-art

A comparison with state-of-the-art object detectors is reported in Table 4. Our method outperforms Faster R-CNN in both accuracy and runtime, reaching 21.53 fps, about 3 times faster than Faster R-CNN. While our accuracy is comparable to that of R-FCN, our processing speed is twice as fast. Among the compared video object detectors, the accuracy-focused methods FGFA, MANet, and STSN achieve higher detection accuracy because they use multi-frame feature aggregation to enhance feature quality. However, their complex feature operations make their detection speed lower than that of DFF and TSSD-OTA. In other words, the video object detectors that focus on improving accuracy (FGFA, MANet, and STSN) sacrifice speed for accuracy.

The optical flow based method DFF shares the same research focus and static detector as ours. Compared with the 73.1% mAP of DFF, we observe a 0.6% mAP improvement brought by FTA and the adaptive key frame strategy, and our runtime is 1.28 fps higher than that of DFF. These accuracy and runtime results show that the proposed FTA-based feature propagation outperforms optical flow. To realize real-time processing, TSSD-OTA adopts the lightweight base network VGG16 (Simonyan and Zisserman 2015) and the one-stage base detector SSD; it runs at roughly the same speed as our method, but its mAP is 8.4% lower than ours. Because it uses the time-consuming Fast R-CNN (Girshick 2015) and LSTM, TPN has the lowest computing speed among these video object detection methods, and its mAP is 5.3% lower than that of the proposed method.

Figure 7 visualizes qualitative detection results of the proposed method on the ImageNet VID validation dataset. We show six scenes; Scene 1 corresponds to the first row of images and Scene 6 to the sixth row. In Scene 1, the orientation of the red car changes significantly (from facing right-front, to front, and finally left-front), and our method successfully detects the car in all orientations. In the remaining four scenes (e.g., Scene 3 with small objects and the complex Scene 4), our method also detects objects accurately. In the case of large scale variation and occlusion in Scene 6, our method still detects the car successfully.

5 Conclusion

This paper aims at fast video object detection while ensuring detection accuracy. We propose an attention-guided dynamic video object detection method in which complete and low-level features are extracted for the defined key frames and non-key frames, respectively. The complete features of key frames can be used directly for the detection task. For non-key frames, the semantic information of the low-level features is first enhanced by a lightweight network; then, based on the proposed feature temporal attention (FTA), features are propagated from key frames to non-key frames to produce the final features for detection. Furthermore, according to the feature similarity between frames, we design a new adaptive key frame decision method, which serves as the selection criterion between the two feature extraction processes. We demonstrate that our method offers a speed advantage while maintaining accuracy compared to the base detector, and that it is competitive with the state-of-the-art methods that focus on fast video object detection.

In the future, we will continue to study algorithms for object detection in videos. We plan to further optimize the key frame decision method, whose core problem is to determine the interval between two adjacent key frames. Referring to the keyframe scheduling in Yao et al. (2020), the key frame interval can be set to a short interval, long interval, or mean interval, which correspond to fast change, slow change, and mean change of the objects, respectively. In this way the key frame decision problem can be viewed as a multiple attribute decision-making problem. Given the success of spherical fuzzy sets (SFSs) (Ashraf et al. 2019; Jin et al. 2019) and picture fuzzy sets (PFSs) (Qiyas et al. 2020) in the field of decision-making, we will explore using related concepts (e.g., linguistic picture fuzzy Dombi (LPFD) aggregation operators (Qiyas et al. 2019a) and triangular picture fuzzy linguistic induced ordered weighted aggregation operators (Qiyas et al. 2019b)) to solve our key frame decision problem.