1 Introduction

Object proposal generation, i.e., proposing object-like regions from millions of sliding windows, has become a promising and helpful technique in both multimedia and computer vision [44, 56, 57]. It is motivated by visual cognitive and neuropsychological evidence that humans can quickly and accurately identify objects before recognizing them [15, 16]. With object proposals, multimedia tasks can concentrate on a few proposal bounding boxes that probably contain objects, instead of starting from millions of sliding windows. Utilizing object proposals as a pre-processing procedure benefits many applications in both efficiency and effectiveness, e.g., object tracking [26, 67], image/video segmentation [21, 43] and classification [13, 31, 59, 65], video summarization [22, 38], activity recognition [6, 66], object retrieval [47, 58, 63], multimodality retrieval [10, 12, 61], landmark recognition [11, 36, 50], and image/video storytelling [14, 35, 62]. Furthermore, domain knowledge should also be considered when applying object proposals to specific or novel applications.

Since most video object proposal methods start from image object proposals [27, 40, 56], object proposals for both images and videos generally fall into two categories: segment-based proposals [3, 51] and window-based proposals [24, 32]. The former usually starts by generating multiple segments and then merges them into the proposed regions. Obtaining good segments requires sophisticated algorithms, which brings higher computational cost. In contrast, window-based proposals aim at assigning high scores to the bounding boxes that probably contain objects. Due to this lightweight design, even some low-level features can achieve good results in both accuracy and efficiency [9, 68]. Considering how human vision recognizes objects, by roughly localizing them first and accurately recognizing them afterwards, the latter model is more intuitive and suitable for pre-processing. In this paper, we focus on applying the window-based model to videos because of its simplicity, which makes object proposals much more efficient, especially when running as a pre-processing procedure in videos.

Although many works concentrate on object proposals [8, 33, 34], few recent works focus on video object proposals. To the best of our knowledge, most existing video object proposal methods are mainly devoted to proposing moving or dominant objects [27, 40, 53]. Merely locating moving objects, however, does not achieve the goal of object proposals; it is closer to moving object segmentation and tracking [28, 54]. To achieve multi-object proposals in videos, it is straightforward to apply image object proposals [2, 9, 45, 68] frame by frame. However, our experiments show that applying image methods frame by frame may lead to proposal inconsistency. This phenomenon is illustrated in Fig. 1b, which clearly shows that proposal inconsistency exists even among frames with similar content and structure. Despite the high detection rate of image object proposals, directly applying these methods to videos still results in omitted objects. Two reasons lead to this omission. One is that image object proposals exploit no dynamic cues because they are designed for detecting static objects. The other is that motion blur and color ambiguities degrade edge or contour based proposal results. Inspired by the above, we further explore the criteria that should be considered when applying object proposals to videos.

Fig. 1

Given (a) an unlabeled video, our model produces (c) a set of spatio-temporal bounding box proposals for both foreground and background objects at the same time. (b) shows the proposal inconsistency caused by utilizing image object proposals frame by frame

Good extendibility

Object proposals have been studied extensively and remarkable achievements have been made. It is wasteful to set these image-based achievements aside and design a completely different path for videos. Therefore, it is better to seek a good scheme for extending image methods to videos.

Multi-object proposals

As a pre-processing procedure, a good video object proposal method should propose all object-like regions, no matter whether they are in the foreground or background, dominant or not. Besides, proposing multiple objects benefits more applications.

Proposal consistency constraint

Directly applying object proposals frame by frame may lead to three defects. First, omitting objects is inevitable even in consecutive or similar frames. Second, motion blur and color ambiguities may degrade edge or contour based proposal results. Third, temporal information preserved in neighboring frames is not used.

Based on the above considerations, we propose an adaptive context-aware model for multi-object proposals in videos. Image methods are extended to videos by considering motion cues and spatio-temporal evolution in our model. The additional computational cost lies in calculating the motion cues, which is inevitable in video processing. To achieve multi-object proposals, both spatial and temporal bounding box generation are considered. By adaptively integrating spatial and temporal proposals, the proposal consistency constraint is preserved as much as possible. To evaluate the efficiency of the proposed method, we build a multi-object dataset specially for video object proposals: 30 shots are collected from five famous movies. This benchmark is suitable for evaluating multi-object proposals in videos, since the average number of objects per frame reaches 3.34 and the ground truth is offered frame by frame. For a comprehensive evaluation, we also compare our method with the state-of-the-art on a public motion segmentation dataset, the Freiburg-Berkeley motion segmentation dataset [5, 39], which includes 30 shots with keyframe ground truths. We extend this motion segmentation dataset to a video object proposal dataset by offering bounding box annotations per frame. The proposed method achieves good performance on both datasets, showing that it is competent for video object proposals. A work similar to this paper was proposed in [18], but its context-aware model is strictly bidirectional and no classification is considered when conducting motion-estimation-based mapping. Different from [18], we present an adaptive context-aware model with more elaborate processing. It can be transformed into a unidirectional model along the temporal sequence by omitting the temporal scoring refinement. Therefore, it is more efficient while achieving an improved detection rate, which makes it easier to apply to real-time applications. Besides, we expand the proposed dataset and validate the effectiveness of our method on a public dataset.

The contributions of this paper can be briefly stated as follows.

  • We propose an adaptive context-aware model for video object proposals, which contributes to proposing both still and moving objects in videos, no matter whether they are in the background or foreground;

  • We integrate the spatial and temporal boxes by introducing an adaptive, classified motion based mapping, which can be extended to other applications. Since no complicated computation is involved, the efficiency remains high;

  • We employ a temporal scoring refinement mechanism to further improve the detection rate;

  • We build a challenging dataset for multi-object proposals in videos, which is collected from five famous movies with 30 shots in total (about 3.34 objects per frame), and has bounding box annotations frame by frame.

The rest of this paper is organized as follows. In Section 2, we give a review of related works on both frame-based and sequence-based object proposals in videos. Then we introduce the main body of the proposed context-aware model in Section 3 and the temporal scoring refinement in Section 4. Next, we demonstrate the experimental results and give some discussions of the results in Section 5. At last, we give a brief conclusion and perspective of our work in Section 6.

2 Related work

Few methods are specially designed for video multi-object proposals. Most of them start from per-frame object proposals, which are then generalized in the temporal domain. Based on our survey of related work, we divide video object proposals into frame-based object proposals and sequence-based object proposals.

Frame-based object proposals

The concept of object proposals was first presented by Alexe et al. [1], aiming at reducing the number of sliding windows that contain no object. They further explored this idea and introduced a generic objectness measure to quantify how likely a candidate window is to contain an object. Rahtu et al. [45] scored the windows by utilizing an effective linear feature combination, which achieved better results than [1]. By adopting low-level features such as gradient, saliency, and superpixel straddling [1], most methods can achieve good performance in both efficiency and accuracy. As these techniques mature, image methods can be directly applied to videos frame by frame. Cheng et al. [9] proposed a very fast method to filter the initial sliding windows at 300 fps by merely utilizing the gradient feature. This method seems to fulfill the demands of real-time applications in videos, but it only achieves good results at an intersection over union (IoU) of 0.5, which is insufficient for practical applications. Zitnick et al. [68] balanced accuracy and efficiency well by using the edge feature. Although it has the best performance among window-based proposals even at challenging overlap thresholds, its computing time has no competitive advantage over [9]. As to segment-based proposals, such as [3, 37, 46, 52], although they achieve accurate segmented results to some extent, their complicated computations cause much higher time consumption when applied to videos, making them unsuitable for running as pre-processing procedures. In short, although frame-based object proposals can accomplish the task of video multi-object proposals, our experiments show that frame-by-frame usage may lead to proposal inconsistency across temporal sequences. Therefore, temporal information should be subtly incorporated into video object proposals with a limited increase in computational effort.

Sequence-based object proposals

Few methods specially serve video multi-object proposals. Most related works on video object proposals mainly aim at proposing dominant objects in the temporal domain, which we call sequence-based object proposals. Gilad et al. [48] aimed at finding the dominant objects in the scene and obtaining rough yet consistent segmentations thereof. Due to its use of multiple segments, it is unsuitable to serve as a pre-processing procedure. Van den Bergh et al. [53] proposed a novel method for the online extraction of video superpixels, delivering tubes of bounding boxes throughout extended time intervals. Though efficient in acquiring video superpixels, it is similar to the task of object tracking. Oneata et al. [40] explored the problem of generating video tube proposals for spatio-temporal action detection. This research is a branch of action detection in videos, whereas our method is devoted to proposing category-independent bounding boxes that probably contain objects, no matter whether they are still or moving. Perazzi et al. [42] performed an SVM-based pruning step to retain high-quality foreground proposals. Xiao et al. [56] presented an unsupervised approach to generate spatio-temporal tubes that localize the foreground objects. Though considering the importance of proposal consistency, these methods aim at keeping the proposal consistency of foreground objects only. In brief, most related methods [27, 41, 60] are explicitly designed to propose dominant or moving objects for video object detection; they are closer to moving object segmentation than to video object proposals.

With the emergence of deep learning, an increasing number of works turn to deep architectures for help, and the task of object proposals is no exception [19, 20, 30]. Zhang et al. [64] leveraged a convolutional neural network model to generate location proposals of salient objects. Kong et al. [29] presented a deep hierarchical network that handles region proposal generation and object detection jointly. Hayder et al. [23] proposed an approach to co-generate object proposals in multiple images by introducing a deep structured network that jointly predicts the objectness scores and the bounding box locations of multiple object candidates. Although most of these methods achieve pleasing results, Chavali et al. [7] reported the gameability of the current object proposal evaluation protocol, especially for learning-based methods, arguing that evaluating object proposals on a partially annotated dataset is problematic. Learning-based methods define an object as the set of annotated classes in the dataset, which blurs the boundary between a proposal algorithm and an object detector. In order to generalize object proposals as a pre-processing procedure in videos and localize category-independent objects as much as possible, low-level features are more acceptable and explicable.

3 Adaptive context-aware object proposal model

Given a video, we aim at generating a series of spatio-temporal bounding box proposals for both foreground and background objects at the same time, by leveraging the advantages of image object proposals and the basic motion feature in videos. Our solution is devoted to minimizing the additional computing cost, so that the method remains suitable as a pre-filtering process, while improving the detection rate compared with the frame-by-frame usage of image object proposals. The main procedures are outlined in Fig. 2, including spatial candidate box generation, temporal box mapping, box confidence coefficient calculation and the weighted scoring system. The temporal scoring refinement is introduced in Section 4.

Fig. 2

The framework of the proposed method. bcc is the box confidence coefficient

3.1 Spatial candidate box generation

The standard practice for generating initial bounding boxes is to start from densely sampled sliding windows, millions of which are filtered by well-designed selection rules. In fact, there is no need to generate so many boxes for every frame. Proposal boxes generated by image methods can be used as initial candidate boxes. There are three benefits of generating spatial candidate boxes by image methods. First, image methods can generate proposals covering both foreground and background objects. Second, the detection rate of image methods has increased enough to meet the demands of applications. Third, starting from image methods significantly increases computational efficiency because it avoids handling so many boxes. Let t denote the index of the t-th frame \(f_{t}\) of a video, and let n be the number of generated spatial candidate boxes in \({B_{n}^{t}}\), which can be written as:

$$ {B_{n}^{t}} = \{b_{i}|b_{i} \in I(f_{t}, M)~, n\leq M \}, $$
(1)

where M is the maximum number of generated bounding boxes and \(I(\cdot)\) denotes the image object proposal method. The computational cost can be adjusted by setting M.
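For illustration, a minimal sketch of this step is given below. The helper `image_proposals` is a hypothetical stand-in for any off-the-shelf image proposal method (e.g. EdgeBox or BING); it merely samples random boxes so the sketch runs without external dependencies, and only the wrapper reflects (1).

```python
import numpy as np

def image_proposals(frame, max_boxes):
    """Hypothetical stand-in for I(f_t, M): an image proposal method returning
    boxes as (x1, y1, x2, y2) rows, sorted by its own objectness score.
    Random boxes are used here only to keep the sketch self-contained."""
    h, w = frame.shape[:2]
    x1 = np.random.randint(0, w - 8, size=max_boxes)
    y1 = np.random.randint(0, h - 8, size=max_boxes)
    x2 = np.minimum(w - 1, x1 + np.random.randint(8, w // 2, size=max_boxes))
    y2 = np.minimum(h - 1, y1 + np.random.randint(8, h // 2, size=max_boxes))
    return np.stack([x1, y1, x2, y2], axis=1)

def spatial_candidate_boxes(frame, n=1000, M=10_000):
    """Eq. (1): keep at most n (n <= M) proposals from the image method as B_n^t."""
    boxes = image_proposals(frame, M)
    return boxes[:min(n, M)]

# usage on a dummy 240x320 frame
frame = np.zeros((240, 320, 3), dtype=np.uint8)
B_t = spatial_candidate_boxes(frame, n=1000)
```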

3.2 Temporal box mapping

As a pre-processing procedure, there is no need to pay much attention to feature extraction or bounding box matching. In order to perform an effective temporal box mapping, we classify the relationship between each bounding box and the object it surrounds based on the motion fields. Each frame has a corresponding motion field, e.g., calculated by an optical flow method. We use the motion distribution to guide the temporal box mapping. Obviously, a bounding box should be moved according to the motion of the main part it surrounds, which is regarded as the displacement of the object. Therefore, it is essential to find the exact displacement of the object. To achieve this, we first optimize the initial bounding box set \({B_{n}^{t}}\) of frame t by making each box approach the boundary of its object. Let \(b_{i}\) (\(b_{i} \in {B_{n}^{t}}\)) represent one generated bounding box; its coordinates can be denoted as:

$$ c_{b_{i}} = \{ ({x_{l}^{i}}, {y_{l}^{i}}) |l \in [1,P], l \in N^{+}, P \geq 2\}, $$
(2)

where \(({x_{l}^{i}}, {y_{l}^{i}})\) are the coordinates of the sampling points of bounding box \(b_{i}\), l is the index of the sampling points, of which there are at least two (\(P \geq 2\)), and \(l \in N^{+}\) means that l is a positive integer. We use the four corners as sampling points in our experiments.

Let \(f_{b_{i}}\) represent the corresponding motion field of bounding box \(b_{i}\); the optimization can be formulated as (3).

$$\begin{array}{@{}rcl@{}} c_{{b_{i}^{o}}}=\arg \min\limits_{{x_{l}^{i}},{y_{l}^{i}}} \sum [ f_{b_{i}}({\Gamma}({\Phi}({x_{l}^{i}}, {y_{l}^{i}}, \beta)))-f_{b_{i_{c}}}({x_{c}^{i}}, {y_{c}^{i}}) ], \end{array} $$
(3)

where Γ(·) represents a transformation of coordinates and Φ(·) shrinks the bounding box's coordinates. β is the step-size rate for each shrink, which equals 0.1 in our experiments. In the optimization process, we utilize the Γ(·) transformation to find the midpoint of each edge of the rectangle as the bounding box shrinks. Compared with the four corners, the edge midpoints are more likely to approach the object, and are therefore more helpful for mapping the temporal box. The new temporal box is mapped based on its corresponding motion field. Not every bounding box can be optimized within a fixed number of iterations. If the box converges, the new coordinates \(c_{b_{i,1}^{t+1}}\) can be denoted as shown in (4).

$$\begin{array}{@{}rcl@{}} c_{b_{i,1}^{t+1}}=Mapping(c_{b_{i}^{t,o}}, ~\omega f_{b_{i}^{t,o}}({\Gamma}({x_{l}^{i}}, {y_{l}^{i}}))+ (1-\omega) f_{b_{i_{c}}^{t,o}}({x_{c}^{i}}, {y_{c}^{i}})), \end{array} $$
(4)

where \(Mapping(\cdot)\) is a motion-field-based mapping function that transforms one set of coordinates into another. ω weights the object's displacement so as to suppress noisy motion as much as possible. As to the bounding boxes that cannot be optimized, motion mapping is performed on the four corners based on the corresponding motion. Note that we apply a median filter over an s × s patch to denoise the motion vector at each corner. The mapped coordinates of bounding box \(b_{i,2}^{t+1}\) can then be denoted as:

$$\begin{array}{@{}rcl@{}} c_{b_{i,2}^{t+1}}=Mapping(c_{{b_{i}^{t}}}, ~\omega f_{{b_{i}^{t}}}({\Gamma}({x_{l}^{i}}, {y_{l}^{i}}))+ (1-\omega) f_{b_{i_{c}}^{t}}({x_{c}^{i}}, {y_{c}^{i}})), \end{array} $$
(5)

The difference between (4) and (5) lies in the reference bounding box. If the initially generated bounding box can be optimized, the input for mapping the temporal box is the optimized box \({b_{i}^{o}}\); otherwise, the input is the original box \(b_{i}\). This tactic contributes to making the proposed bounding box fit the object's boundary as much as possible. The final mapped temporal box set can be described as:

$$ B_{n}^{t+1} = \{ b_{i}^{t+1}~|~b_{i}^{t+1} \in b_{i,1}^{t+1} ~or~ b_{i}^{t+1} \in b_{i,2}^{t+1},~i\in[1,n]\}, $$
(6)

Guided only by the motion field, not every bounding box can be successfully optimized. Our strategy is to keep a bounding box that contains a background object moving with that object without obvious appearance change, while letting a bounding box that contains a moving object shift with the object and gradually tighten toward the object it surrounds. A detailed description is given in Algorithm 1.

Algorithm 1
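As a concrete illustration of the mapping step, the sketch below follows the spirit of Eqs. (2)-(5) rather than reproducing them exactly: the Γ/Φ shrink of Eq. (3) is replaced by a simple shrink-toward-the-center loop that stops once the patch-median flow at the edge midpoints agrees with the flow at the box center (within Th1), and the displacement then blends edge and center motion with weight ω. The function names and the stopping rule are our assumptions, not the authors' implementation.

```python
import numpy as np

def flow_at(flow, x, y, s=5):
    """Median of the flow vectors in an s x s patch around (x, y); flow has shape (H, W, 2)."""
    h, w = flow.shape[:2]
    x, y = int(np.clip(x, 0, w - 1)), int(np.clip(y, 0, h - 1))
    x0, x1 = max(0, x - s // 2), min(w, x + s // 2 + 1)
    y0, y1 = max(0, y - s // 2), min(h, y + s // 2 + 1)
    return np.median(flow[y0:y1, x0:x1].reshape(-1, 2), axis=0)

def map_box(box, flow, omega=0.5, beta=0.1, Th1=3.0, max_iter=10):
    """Map one box (x1, y1, x2, y2) from frame t to frame t+1 via the motion field."""
    x1, y1, x2, y2 = map(float, box)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    f_center = flow_at(flow, cx, cy)                      # motion at the box center
    for _ in range(max_iter):                             # simplified stand-in for Eq. (3)
        mids = [((x1 + x2) / 2, y1), ((x1 + x2) / 2, y2),
                (x1, (y1 + y2) / 2), (x2, (y1 + y2) / 2)]
        f_mid = np.mean([flow_at(flow, mx, my) for mx, my in mids], axis=0)
        if np.abs(f_mid - f_center).sum() <= Th1:         # edge motion agrees with center
            break
        x1 += beta * (cx - x1); x2 += beta * (cx - x2)    # shrink toward the center
        y1 += beta * (cy - y1); y2 += beta * (cy - y2)
    dx, dy = omega * f_mid + (1 - omega) * f_center       # blended displacement, Eqs. (4)-(5)
    return np.array([x1 + dx, y1 + dy, x2 + dx, y2 + dy])

# usage: a zero flow field leaves the box (almost) in place
flow = np.zeros((240, 320, 2), dtype=np.float32)
print(map_box([40, 30, 120, 110], flow))
```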

3.3 Box confidence coefficient calculation

Due to motion blur, not every temporal mapping produces a pleasing result. For example, inaccurate motion fields may lead to ambiguous displacements. In order to reduce the impact of ambiguous motion, not every frame is suitable for temporal box mapping. Therefore, an adaptive strategy should be introduced to determine whether the bounding boxes of the current frame can be temporally mapped. Instead of directly evaluating the accuracy of the motion fields, we introduce the box confidence coefficient bcc, which is calculated from the proportion of boxes lost in each frame after the temporal mapping, as shown in (7).

$$ bcc=\frac{\mathcal{N}_{b^{t}_{loss}}}{\mathcal{N}_{{B_{n}^{t}}}}, $$
(7)

where \(b^{t}_{loss}\) represents the set of lost bounding boxes, which can be denoted as (8):

$$ b^{t}_{loss}=\{{b_{i}^{t}}~|~w_{{b_{i}^{t}}}h_{{b_{i}^{t}}}\le Th2\}. $$
(8)

where \(w_{{b_{i}^{t}}}h_{{b_{i}^{t}}}\) is the area of a bounding box, i.e., its number of pixels. It is assumed that the mapping error increases as the number of such small, lost bounding boxes grows. We utilize (9) to decide whether the bounding boxes of the current frame should be temporally mapped. If D = 1, the bounding boxes are generated by temporal mapping. The detailed procedure for the adaptive context-aware temporal mapping is illustrated in Algorithm 2. It differs from Algorithm 1: Algorithm 1 describes how to generate temporal bounding boxes, while Algorithm 2 specifies when boxes should instead be generated by the spatial method.

$$\begin{array}{@{}rcl@{}} D^{t+1}= \left\{ \begin{array}{rl} 1, ~~& bcc \le Th3 \\ 0, ~~& \text{otherwise} \end{array} \right. \end{array} $$
(9)
Algorithm 2
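The following is a minimal sketch of Eqs. (7)-(9), using the thresholds reported in Section 5.2 (Th2 = 100, Th3 = 0.01); the exact bookkeeping of lost boxes inside Algorithm 2 may differ, so treat this only as one plausible reading.

```python
import numpy as np

def bcc_decision(mapped_boxes, n_initial, Th2=100, Th3=0.01):
    """Eqs. (7)-(9): bcc is the fraction of boxes that are 'lost' after mapping
    (area <= Th2 pixels); D = 1 keeps the temporally mapped boxes for the next
    frame, D = 0 falls back to spatial box generation."""
    boxes = np.asarray(mapped_boxes, dtype=float)          # rows: (x1, y1, x2, y2)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    n_lost = int(np.sum(areas <= Th2))                     # Eq. (8)
    bcc = n_lost / max(n_initial, 1)                       # Eq. (7)
    return bcc, (1 if bcc <= Th3 else 0)                   # Eq. (9)
```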

3.4 Weighted scoring system

Bounding box based object proposals are ranked by window scores. Since the spatial boxes are generated by image methods, their window scores are assigned by the scoring system of the image method. The spatial score \(s_{b_{i}}^{t,s}\) of a bounding box \(b_{i}\) in the t-th frame can be denoted as:

$$ s_{b_{i}}^{t,s}=IS(f_{t},{b_{i}^{t}}), $$
(10)

where \(IS(\cdot)\) is the scoring system of the image method. It differs from the function \(I(\cdot)\) described in Section 3.1: the former assigns a score to each window, while the latter generates the spatial bounding boxes with the image object proposal method.

As to the temporally mapped bounding boxes, there are two steps to obtain their final scores. First, the mapped windows are scored by (10) to get their spatial scores. Second, considering that these windows are mapped from the previous frame, temporal impacts should be incorporated into the scoring system. To simplify the scoring procedure, we adopt a linear weighted scoring strategy for temporally mapped bounding boxes, as shown in (11).

$$ s_{b_{i}}^{t,tm}=\lambda s_{b_{i}}^{t-1}+(1-\lambda) s_{b_{i}}^{t,s}, $$
(11)

where \(s_{b_{i}}^{t,tm}\) is the temporally weighted score of the bounding box \(b_{i}\). Although the scores for temporally mapped windows are assigned between two neighboring frames, the mapping relationship can extend over several frames. The bounding boxes and their scores are thus jointly determined over the temporal sequence by the proposed context-aware model, owing to its global and temporal strategies.
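The linear blend of Eq. (11) is simple enough to state directly in code. The sketch below assumes the spatial scores of the mapped boxes have already been obtained from the image method's scoring function (Eq. (10)), and uses λ = 0.5 as in Section 5.2.

```python
import numpy as np

def score_mapped_boxes(prev_scores, spatial_scores, lam=0.5):
    """Eq. (11): s^{t,tm} = lam * s^{t-1} + (1 - lam) * s^{t,s} for every temporally
    mapped box.  prev_scores are the boxes' scores in frame t-1; spatial_scores are
    the scores the image method assigns to the mapped boxes in frame t (Eq. (10))."""
    return lam * np.asarray(prev_scores, float) + (1 - lam) * np.asarray(spatial_scores, float)

# e.g. a box scored 0.9 in frame t-1 and 0.7 by the image method in frame t keeps 0.8
print(score_mapped_boxes([0.9], [0.7]))   # -> [0.8]
```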

4 Temporal scoring refinement

Although improved results can already be obtained by the above processing, further refinement can be applied to the generated window-based proposals. There are two ways to do the refinement. One is adjusting the shape of the generated proposals. The other is refining the scores of the generated proposals to make sure that the proposals containing objects obtain top ranks. We choose the temporal scoring refinement because the previous processing steps already achieve a pleasing improvement. Besides, our method aims at presenting a pre-processing routine, which means that the refinement strategy should be designed to be as simple as possible.

Our strategy is derived from the temporal consistency constraint. Good methods should satisfy temporal consistency when they are applied to videos [4, 25, 55]. The same constraint is essential for the scores of neighboring frames, i.e., a mapped box should have scoring consistency with its neighboring boxes. We utilize a centered moving average filter to perform the temporal scoring refinement. The refined score \(s_{b_{i}}^{r}(t)\) of the bounding box \(b_{i}\) in the t-th frame is therefore updated by (12).

$$ s_{b_{i}}^{r}(t)= \frac{s(t-\frac{\pi -1}{2}) + s(t-\frac{\pi -1}{2}+1) + ... + s(t)+...+s(t+\frac{\pi -1}{2})}{\pi}, $$
(12)

where π is an adaptive moving-window size used to eliminate score noise across the temporal segments delimited by (9). Note that the refinement is performed in the temporal domain, centered at the current frame. Therefore, it can be used as a post-processing procedure to further improve the detection rate. Meanwhile, since it relies only on temporal score denoising, the improvement is limited.
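Below is a minimal sketch of the centered moving average in Eq. (12). In the paper π is chosen adaptively per temporal segment (via the decision in Eq. (9)); this sketch assumes a fixed odd π and simply truncates the window at the segment borders, which is our own simplification.

```python
import numpy as np

def refine_scores(segment_scores, pi=5):
    """Eq. (12): centered moving average of one box's scores over the frames of a
    temporally mapped segment; the window is truncated near the segment borders."""
    s = np.asarray(segment_scores, dtype=float)
    half = (pi - 1) // 2
    out = np.empty_like(s)
    for t in range(len(s)):
        lo, hi = max(0, t - half), min(len(s), t + half + 1)
        out[t] = s[lo:hi].mean()
    return out

# e.g. the noisy dip at frame 1 is pulled back toward its neighbours
print(refine_scores([0.90, 0.40, 0.85, 0.84, 0.82], pi=3))
```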

5 Experiment and analysis

5.1 Dataset

The proposed method is evaluated on two datasets. One is designed for multi-object proposals and built in this paper. The other is a public dataset for motion segmentation, the Freiburg-Berkeley motion segmentation dataset [5, 39]. The former dataset is built from five famous movies: Mission Impossible, Monsters University, Kung Fu Panda, X-Men and Toy Story. We randomly select six shots from each movie, forming 30 shots in total. Five subjects, three men and two women, were invited to annotate the dataset. They first annotated some keyframes, then mapped the bounding boxes to the other frames by motion-based mapping, and finally adjusted the annotations with obvious offsets. In this way, we offer bounding box annotations for every frame in the proposed dataset. The average number of objects per frame in the proposed dataset reaches 3.34. The detailed description of our dataset is given in Table 1.

Table 1 Description of the proposed dataset. Num. Shot is the number of shots, Ave. Obj. is the number of average objects, and Ave. Frame is the number of average frames of all shots in each movie

As to the FBMS-59 dataset [5, 39], there are 29 shots in the training set and 30 shots in the testing set. Since there is no learning process in the proposed method, i.e., the framework is unsupervised, we only adopt the testing set for the experiments. Because the dataset is designed for the motion segmentation task, its ground truths are object segmentations, and only a few keyframes are labeled. Moreover, video frames differ from image sets: although a video contains consecutive frames, objects may not appear in every frame. Because our method focuses on proposing object-like windows, we pay particular attention to the frames containing objects. Note that some keyframes contain only a tiny part of an object, and some neighboring frames contain no objects at all; it is impossible to recognize the objects in those frames. Therefore, we re-annotated this dataset in the same way as we labeled the proposed dataset, and removed several frames without obvious main objects. For an overall illustration, we grouped this testing set, FBMS-30, into 13 classes based on its categories. Table 2 shows the details of our annotated FBMS-30 dataset.

Table 2 Description of the FBMS dataset [5, 39]. Num. Shot is the number of shots, Ave. Obj. is the number of average objects, and Ave. Frame is the number of average frames of all shots in each category

5.2 Experimental setting

Our approach is implemented in Matlab on a desktop PC with an Intel i5 4590 CPU and 8 GB memory. To show the efficiency of the proposed method in eliminating the proposal inconsistency in sequential frames, we compare our method with state-of-the-art bounding box based object proposals: Edgebox [68], Bing [9], Rahtu [45] and Objectness [2]. For efficiency and fairness, the authors' public source codes with the optimized parameters from their papers are adopted in all the experiments. Three popular evaluation metrics are utilized to quantitatively evaluate the performance of the proposed method, the same as in [17]: the detection rate (DR) for a given number of windows (#WIN) (DR-#WIN), DR for a varying IoU threshold with a fixed number of proposals (DR-IOU), and the average detection rate (ADR), i.e., average recall (AR) [24] between IoU 0.5 and 1, obtained by averaging over the overlaps between the images' annotations and their closest matched proposals (ADR-#WIN). Let #GT represent the number of annotated ground-truth boxes in one image and o the IoU overlap; DR-#WIN and ADR are calculated according to (13) and (14), respectively.

$$\begin{array}{@{}rcl@{}} \text{DR-\#WIN} = \frac{\text{\#}(o > \epsilon)@\text{\#WIN}}{\text{\#GT}}~~\epsilon \in \{x|0.5\leq x \leq 1\} , \end{array} $$
(13)
$$\begin{array}{@{}rcl@{}} ADR=2{\int}_{0.5}^{1}DR(o)\text{d}o , \end{array} $$
(14)

where the DR-#WIN curve is plotted for a fixed IoU threshold 𝜖 between 0.5 and 1 with an increasing number of windows, while the DR-IOU curve is plotted over varying IoU thresholds between 0.5 and 1 with a fixed number of windows. The ADR is calculated from the DR at distinct IoU thresholds while changing the number of proposals.
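A minimal per-image sketch of these metrics is given below, assuming boxes in (x1, y1, x2, y2) form. Each ground-truth box is matched greedily against all top-#WIN proposals (no one-to-one matching constraint), and the ADR integral of (14) is approximated with the trapezoidal rule; this is one straightforward reading of the protocol, not the exact evaluation code.

```python
import numpy as np

def iou(gt, boxes):
    """IoU of one ground-truth box against an array of proposals, all (x1, y1, x2, y2)."""
    ix1 = np.maximum(gt[0], boxes[:, 0]); iy1 = np.maximum(gt[1], boxes[:, 1])
    ix2 = np.minimum(gt[2], boxes[:, 2]); iy2 = np.minimum(gt[3], boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    area_pr = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_gt + area_pr - inter)

def detection_rate(gt_boxes, proposals, eps=0.7, num_win=1000):
    """Eq. (13): fraction of ground-truth boxes hit (IoU > eps) by the top num_win proposals."""
    top = np.asarray(proposals, dtype=float)[:num_win]
    gts = np.asarray(gt_boxes, dtype=float)
    hits = sum(1 for gt in gts if iou(gt, top).max() > eps)
    return hits / max(len(gts), 1)

def average_detection_rate(gt_boxes, proposals, num_win=1000, steps=11):
    """Eq. (14): ADR = 2 * integral of DR(o) over o in [0.5, 1], via the trapezoidal rule."""
    o = np.linspace(0.5, 1.0, steps)
    dr = np.array([detection_rate(gt_boxes, proposals, eps=t, num_win=num_win) for t in o])
    step = o[1] - o[0]
    return 2 * step * (dr.sum() - 0.5 * (dr[0] + dr[-1]))
```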

The parameters are set as {M, ω, s, Th1, Th2, Th3, λ} = {\(10^{4}\), 0.5, 5, 3, 100, 0.01, 0.5}. We use M = \(10^{4}\) as the upper bound on the number of generated bounding boxes. ω and λ are weighting values, and both are set to 0.5. s × s is the window size used to filter the motion fields, with s = 5. Th1 defines the similar-motion difference in pixels and is set to 3. Th2 is the area threshold below which a bounding box is regarded as lost, set to 100. Th3 is the threshold for deciding whether to perform temporal mapping, set to 0.01. Besides, we utilize [49] to calculate the motion fields in our experiments for its accuracy and efficiency. In fact, any motion field of adequate accuracy and high efficiency can be used in our framework. This basic video feature is preferably pre-computed, though it can also be computed within our model if the estimation is efficient enough.

5.3 Comparison

Our method focuses on eliminating the proposal inconsistency that arises when applying object proposals frame by frame, and manages to obtain large gains at little extra cost by introducing image object proposals into videos. Besides, we aim at presenting a framework suitable for pre-processing and potentially extensible to real-time applications with better hardware configurations. Considering that few methods are designed specifically for spatio-temporal bounding box based multi-object proposals in videos, we compare the proposed method with the bounding box based state-of-the-art [2, 9, 45, 68], according to the survey in [24], applied to the temporal sequences frame by frame. Considering both the accuracy and efficiency of existing object proposals, we recommend the frame-by-frame usage of Edgebox [68] as the baseline for the task of video multi-object proposals.

Qualitative evaluation

Since we evaluate our method on two datasets, we exhibit qualitative comparisons with different methods separately in Figs. 3 to 10. The green solid rectangle is the ground truth, and the green dashed rectangle is the hit proposal. The red solid rectangles are the missed ground truths. In order to show the performance of our method at IoU 0.7 and 0.8, both of which are accepted overlaps with the bounding box ground truth in real applications, we present two kinds of comparative results for each dataset. Figures 3 and 4 exhibit five consecutive proposals for two shots of our dataset with #WIN=1000 and IoU o = 0.7, and Figs. 5 and 6 exhibit five consecutive proposals for two shots of our dataset with #WIN=1000 and IoU o = 0.8. Figures 7 and 8 show seven sequential proposal results for two shots of the FBMS dataset with #WIN=1000 and IoU o = 0.7, and Figs. 9 and 10 show seven sequential proposal results for two shots of the FBMS dataset with #WIN=1000 and IoU o = 0.8. The spatio-temporal bounding box proposals generated by Rahtu [45], Objectness [2], Bing [9], Edgebox [68] and our method are shown in rows. Our method clearly achieves the best performance in eliminating the proposal inconsistency at both IoU 0.7 and 0.8 on the two datasets.

Fig. 3

Comparisons of spatio-temporal bounding box proposals for 5 sequential frames from one shot of our dataset. These proposals are generated by Rahtu [45], Objectness [2], Bing [9], Edgebox [68] and our method from the first line to the fifth line. Green solid rectangles are the annotated ground truths, while green dashed rectangles are the hit proposals. The red solid rectangles are the missed ground truths. (o = 0.7 and #WIN=1000)

Fig. 4

Comparisons of spatio-temporal bounding box proposals for 5 sequential frames from one shot of our dataset. These proposals are generated by Rahtu [45], Objectness [2], Bing [9], Edgebox [68] and our method from the first line to the fifth line. Green solid rectangles are the annotated ground truths, while green dashed rectangles are the hit proposals. The red solid rectangles are the missed ground truths. (o = 0.7 and #WIN=1000)

Fig. 5

Comparisons of spatio-temporal bounding box proposals for 5 sequential frames from one shot of our dataset. These proposals are generated by Rahtu [45], Objectness [2], Bing [9], Edgebox [68] and our method from the first line to the fifth line. Green solid rectangles are the annotated ground truths, while green dashed rectangles are the hit proposals. The red solid rectangles are the missed ground truths. (o = 0.8 and #WIN=1000)

Fig. 6

Comparisons of spatio-temporal bounding box proposals for 5 sequential frames from one shot of our dataset. These proposals are generated by Rahtu [45], Objectness [2], Bing [9], Edgebox [68] and our method from the first line to the fifth line. Green solid rectangles are the annotated ground truths, while green dashed rectangles are the hit proposals. The red solid rectangles are the missed ground truths. (o = 0.8 and #WIN=1000)

Fig. 7

Comparisons of spatio-temporal bounding box proposals for 7 sequential frames from one shot of FBMS dataset. These proposals are generated by Rahtu [45], Objectness [2], Bing [9], Edgebox [68] and our method from the first line to the fifth line. Green solid rectangles are the annotated ground truths, while green dashed rectangles are the hit proposals. The red solid rectangles are the missed ground truths. (o = 0.7 and #WIN=1000)

Fig. 8

Comparisons of spatio-temporal bounding box proposals for 7 sequential frames from one shot of FBMS dataset. These proposals are generated by Rahtu [45], Objectness [2], Bing [9], Edgebox [68] and our method from the first line to the fifth line. Green solid rectangles are the annotated ground truths, while green dashed rectangles are the hit proposals. The red solid rectangles are the missed ground truths. (o = 0.7 and #WIN=1000)

Fig. 9

Comparisons of spatio-temporal bounding box proposals for 7 sequential frames from one shot of FBMS dataset. These proposals are generated by Rahtu [45], Objectness [2], Bing [9], Edgebox [68] and our method from the first line to the fifth line. Green solid rectangles are the annotated ground truths, while green dashed rectangles are the hit proposals. The red solid rectangles are the missed ground truths. (o = 0.8 and #WIN=1000)

Fig. 10

Comparisons of spatio-temporal bounding box proposals for 7 sequential frames from one shot of FBMS dataset. These proposals are generated by Rahtu [45], Objectness [2], Bing [9], Edgebox [68] and our method from the first line to the fifth line. Green solid rectangles are the annotated ground truths, while green dashed rectangles are the hit proposals. The red solid rectangles are the missed ground truths. (o = 0.8 and #WIN=1000)

Quantitative evaluation

In order to present the overall performance of the proposed method, we make a comprehensive quantitative evaluation using the three popular object proposal metrics defined in Section 5.2: DR-#WIN, DR-IOU and ADR-#WIN. Figure 11a and b show the detection rate on our dataset with IoU=0.7 and IoU=0.8, and Fig. 12a and b show the same for the FBMS dataset. We give the DR-IOU curves for our dataset and the FBMS dataset in Fig. 13a and b. Figure 14a and b illustrate the ADR-#WIN curves for both our dataset and the FBMS dataset. Since our dataset consists of five different movies, we also give separate quantitative evaluations of the shots from each movie to show the distribution of the improvement in Fig. 15. The proposed method achieves the best results on each movie set compared with the others. As to the FBMS dataset, we give an overall evaluation of its different classes in Fig. 16, where the height of each bar represents the detection rate at IoU = 0.7 and #WIN=800. The proposed method achieves good performance across categories, i.e., there is no obvious category bias. For a further comparison, Tables 3 and 4 present the detailed detection rates of our method and the state-of-the-art on our dataset and the FBMS dataset with different proposal numbers under IoU=0.7 and IoU=0.8.

Fig. 11

Detection rate curves of different methods with (a) o = 0.7 and (b) o = 0.8 on our dataset

Fig. 12

Detection rate curves of different methods with (a) o = 0.7 and (b) o = 0.8 on FBMS dataset

Fig. 13

DR-IOU curves of different methods on (a) our dataset and (b) FBMS dataset with #WIN=1000

Fig. 14

ADR-#WIN curves of different methods on (a) our dataset and (b) FBMS dataset with o ∈ [0.5,1]

Fig. 15

Detection rate curves of the shots from different movies in our dataset. (a)–(e) are from Mission Impossible, Monsters University, Kung Fu Panda, X-Men and Toy Story, respectively, with o = 0.7

Fig. 16

The comparison of detection rate distribution on different category of FBMS dataset with o = 0.7 and #WIN=800

Table 3 Comparison of our method and the frame-by-frame usage of proposal methods with different #WIN under o = 0.7, o = 0.8 and average IoU on our dataset
Table 4 Comparison of our method and the frame-by-frame usage of proposal methods with different #WIN under o = 0.7, o = 0.8 and average IoU on FBMS dataset

Running time comparison

Since our contribution lies in eliminating the proposal inconsistency that occurs across temporal sequences, we only compare the running time of generating object proposals. The motion field calculation is pre-computed in our framework, because motion is a basic video feature and many motion estimation methods already exist. In short, our model relies only on the computed motion fields, not on any particular motion estimation method. Table 5 compares the running time of our method and the state-of-the-art for generating temporal object proposals. Although not as efficient as Bing [9], the proposed method achieves a better balance between accuracy and efficiency. In addition, our running time is the average computational time across all the resolutions in our dataset.

Table 5 Average running time comparison on our dataset

5.4 Discussion

Our method extends image object proposals to videos by proposing multiple objects instead of focusing only on dominant objects. Unlike moving object segmentation methods, it is designed for multi-object proposals. Besides, the proposed method can eliminate the proposal inconsistency caused by the frame-by-frame usage of image object proposals. In general, the proposed method has both advantages and disadvantages. Its strengths can be summarized as good performance, category independence and being unsupervised; we also discuss its limitations.

Good performance

Figure 13 shows that the detection rate of the proposed method drops slowly as the IoU threshold increases. Our method achieves good results at both IoU 0.7 and 0.8, while some state-of-the-art methods only achieve an improvement at a fixed or lower IoU value. Considering the requirements of real applications, IoU 0.7 and 0.8 are sufficient to balance accuracy and practicality. Figures 17 and 18 illustrate more results on some temporal frames of our dataset and the FBMS dataset. Although no complicated calculation is involved, our method localizes the objects as accurately as possible.

Fig. 17

More results about spatio-temporal bounding box proposals generated by the proposed method for our dataset. The green solid rectangle is the annotated ground truth and the green dashed rectangle is the hit proposal. (o = 0.8 and #WIN=1000)

Fig. 18

More results about spatio-temporal bounding box proposals generated by the proposed method for FBMS dataset. The green solid rectangle is the annotated ground truth and the green dashed rectangle is the hit proposal. (o = 0.8 and #WIN=1000)

Category independent

From the perspective of the experimental datasets, there are many kinds of objects, with both regular and irregular shapes. Figures 15 and 16 show the detection rate on different movies and categories; no obvious category bias can be observed. No matter whether the object is real or animated, the proposed method yields a further improvement, though the gains differ. Therefore, our method is category independent and suitable for practical applications.

Unsupervised method

There is no learning stage and no category bias in our method; it is therefore an unsupervised method independent of datasets. With no such prerequisites, our method is well suited to serve as a pre-processing procedure.

Limitations

Our method is an extension of image object proposals to videos as a pre-processing procedure. Therefore, some defects of the underlying image method may be inherited, although this issue will diminish as image object proposals improve. Furthermore, because of motion blur and inaccurate motion fields, bounding box proposals cannot be accurately mapped in every frame. If most bounding boxes cannot be temporally or accurately mapped, our method may degrade to the frame-by-frame usage. This is why we do not achieve significant improvements on every shot. Fortunately, with advances in motion field estimation, this problem will hopefully be alleviated in the near future.

6 Conclusion

An adaptive context-aware model is proposed for video object proposals in this paper. It aims at eliminating the proposal inconsistency that arises when applying image methods frame by frame, while taking advantage of both image methods and video features. With the proposed context-aware model, image object proposals can be successfully migrated to video processing with large gains at little extra cost. To evaluate the efficiency of the proposed video multi-object proposals, we build a specific multi-object dataset with bounding box ground truths annotated frame by frame, and we also annotate one public dataset in the same way. Experiments on these challenging datasets demonstrate that the proposed approach outperforms the frame-by-frame usage of state-of-the-art methods on single frames in sequences.

Our future work will focus on the refinement strategy of our method to further improve the object detection rate, rather than only refining the temporal ranking scores. We will also explore parameter optimization of our model to provide more targeted parameter settings.