1 Introduction

Multimedia event detection (MED) has attracted considerable interest, largely due to the explosive growth of user-generated videos on the Internet [5, 8, 18, 42, 44]. For example, 300 hours of video are uploaded every minute to YouTube, the most popular video-sharing website in the world. The task aims to identify videos of a particular event of interest, e.g., making a cake or landing a fish, where an event is a higher-level semantic abstraction of a long video clip consisting of multiple concepts [42]. For example, an event like landing a fish can be described by multiple concepts, such as objects (e.g., human, fish), actions (e.g., standing, pulling) and scenes (e.g., beside a river or lake). Compared with previous visual content analysis tasks such as action detection and object recognition, MED is more challenging and complicated due to dynamic content variations and uncontrolled capture conditions [33]. Techniques for recognizing such complex events are fundamental to many practical applications such as web video search, consumer video management, and user recommendation.

Multimedia data is usually represented by multiple features. Generally, these features can be divided into two categories, namely high-level and low-level features. Low-level features capture the local appearance and texture statistics of objects in the video at particular interest points, while high-level features are represented by a real number estimating the probability of observing a concept in the video [19]. Different features characterize different aspects of the multimedia data. Although high-performance feature descriptors have been developed to help characterize videos, it is still difficult for a single feature to provide enough information to discriminate between different kinds of complex events. Therefore, it is widely agreed that combining multiple types of features or video sources achieves better performance [7, 18, 19, 26, 32,33,34, 44]. For example, Chang et al. [7] proposed to investigate the varying contribution of semantic representations from different image/video sources, thus enhancing the exploitation of semantic representation at the source level. Ma et al. leveraged attributes from multiple sources to evaluate the negativity of negative examples, demonstrating better performance than an approach that exploited each attribute source separately [26].

For a multimedia data object, its multiple features, which are diverse and complementary, may assist the detection of a specific complex event to varying extents. To illustrate this point, Fig. 1 shows several features associated with two event-related videos, where the top and bottom videos correspond to the events landing a fish and birthday party, respectively. We can see that motion features such as MoSIFT [9] could be beneficial for identifying the event landing a fish, as event-related quick actions like “reeling or lifting” frequently appear in the top video. However, for detecting the event birthday party, motion features are relatively insignificant because they are not valuable indicators for this event. Different from motion features, some concepts such as scene descriptors are useful for both types of events. For example, “beside a river or lake” helps indicate a landing a fish event, while “inside a room” is an important part of the event birthday party. On the other hand, the multi-feature representation of multimedia data such as videos usually produces high-dimensional, large-volume data objects. As one of the most widely used classification tools in the MED community, the Support Vector Machine (SVM) has been proven robust and effective for the complex event detection task. However, previous research [2, 3, 23, 33, 41] mainly paid attention to feature selection or construction, and there are few studies on the problem of efficiency when applying SVM to complex event detection. As a result, multimedia event detectors suffer from a heavy computational burden and are time-consuming for real-world applications. In summary, it is challenging and complicated to involve multiple features of video data to enhance the performance of complex event detection.

Fig. 1

An example showing the varying influence of different features with respect to the detection of a specific event. A dark line denotes that the feature has an important impact on detecting the event, while a light one means the feature is relatively insignificant for indicating the target event

In light of this, there are mainly two issues to consider when employing multiple features to enhance the performance of complex event detection. The first issue is how to leverage these diverse and complementary features reasonably when combining them to detect events. The second issue is how to solve the SVM quickly to reduce the heavy computational cost caused by the high-dimensional, large-volume multi-feature representation. To address both issues, we propose a self-adaptive multi-feature learning framework with a fast SVM solver for complex event detection, which is able to combine different features effectively and efficiently. In order to utilize multiple features more reasonably, we adopt an adaptively weighted linear combination of these features. This scheme is simple yet effective, and assigns a particular significance to each feature to improve the performance of the MED task. Moreover, motivated by the inspiring progress in SVM-related research [17, 25, 30], we design a fast SVM solver to alleviate the expensive computational cost of complex event detection. Specifically, in the proposed alternating optimization algorithm, our solver employs an approximate solution obtained by gradient descent rather than the relatively costly closed-form expression. Figure 2 displays the workflow of the proposed multi-feature learning framework with the efficient SVM solver for complex event detection.

Fig. 2

The workflow of the proposed framework for complex event detection. First, our model extracts multiple features from the training data. Next, for each kind of feature, the alternating optimization is employed to update the feature-specific weight \( {\theta _{v}^{r}} \) and the detector \(W_{v}\), which is obtained by a gradient descent method. Finally, we recognize events in testing videos by using a combination of the learned weights and detectors over multiple features

We summarize our contributions as follows:

  • Considering that different features have varying influence in indicating a specific event, we design an adaptively weighted combination of multiple features, rather than fusing them directly, to enhance the performance of complex event detection.

  • In order to alleviate the heavy computational burden caused by the large-volume, high-dimensional multi-feature data, the proposed alternating optimization algorithm employs an approximate solution obtained by gradient descent in the large-scale scenario.

  • We conduct extensive experiments on the datasets of TRECVID MEDTest 2013 and 2014 for evaluation. The promising results demonstrate the effectiveness and superiority of the proposed method.

The rest of this paper is organized as follows. In Section 2, we review related work on MED with multi-feature learning and fast primal SVM solvers. Sections 3 and 4 present the details of the proposed multi-feature learning framework and the alternating optimization algorithm, respectively. The experimental settings and evaluation results are presented in Section 5. Section 6 concludes the paper.

2 Related work

With the rapid growth of web videos, how to exploit multiple features for complex event detection efficiently and effectively has been receiving increasing attention in recent years. We briefly review the existing related work from two perspectives: multi-feature learning and SVM for MED.

2.1 Multi-feature learning

Utilizing multiple types of features achieves better performance for the MED task because a single-view representation of video data can rarely capture all the information related to the target event within such complex data. Generally, there are two major categories of multi-feature combination methods [31]. The first is early fusion, which combines different features before the training process and then feeds the aggregated vector into the classifier. Spatial pyramid matching methods like [24] and [1] are the most representative early-fusion research. They encode the spatial information of an image by fusing features extracted from different tiles generated by dividing the image. The second is late fusion, which combines the predicted values after the training process. Some studies [21, 43] have concluded that early fusion of features is less effective than late fusion in multimedia content analysis when the features are independent or heterogeneous.

In light of this, plenty of research has been dedicated to late fusion mechanisms for multiple features. Canonical Correlation Analysis (CCA) [29], SVM-2K [13] and Multiple Kernel Learning (MKL) [16] are the most classical late fusion approaches. CCA maximizes the correlation between two features in a compact subspace. In SVM-2K, the authors combine two-stage learning (kernel CCA followed by SVM) into a single optimization. MKL is widely used in computer vision, but its computation is costly due to the construction of multiple kernels. Recently, a number of important low-level visual features and their combinations were evaluated for complex event detection [33], which systematically analyzed these mainstream features. Yang et al. propose a semi-supervised framework [44] to improve the performance of multimedia semantic understanding by exploiting unlabeled multi-view data. In [34], Tang et al. present a method that selects different subsets of features to combine for certain classes. Jiang et al. use a graph-based approach in [19] to diffuse scores among different videos, which makes the fusion result interpretable to humans. In [6], Chang et al. present a multi-feature learning method that embeds feature interaction into a joint framework to capture the nonlinear property within the data while simultaneously combining the linear and nonlinear effects. An unsupervised event saliency revealing framework that extracts features from multiple modalities is designed in [47] to represent each shot in a given video collection. Coşar et al. [11] propose a unified multi-modal fusion framework that incorporates the output of object trajectory analysis with pixel-based analysis to detect abnormal behaviors related to the speed and direction of object trajectories. Although it is one of the most effective approaches for the MED task, multi-feature learning also incurs a heavy computational burden, so fast optimization algorithms are desirable.

2.2 SVM for MED

SVM is widely used for classification tasks as a result of its robust performance. In the field of MED, a series of SVM-based algorithms have been proven effective for complex event detection in both practical applications [22, 23, 41, 45] and scientific research [4, 15, 36, 37]. In terms of practical applications, these methods mainly involve two steps: first, they construct proper features according to the characteristics of the specific data; then, they directly use a traditional SVM as the classifier for the final classification task. For example, Lan et al. [23] introduce a double fusion scheme, which simply combines early fusion and late fusion to incorporate their advantages, and then employ SVM as the classifier to detect the event of interest. Xu et al. [41] propose a discriminative video representation by leveraging deep convolutional neural networks, and then apply a linear SVM over the learned features to advance event detection. In short, this kind of method emphasizes feature construction or combination.

In terms of scientific research, authors focus on designing SVM-based models to enhance the performance of MED. Specifically, Gkalelis et al. [15] present a two-phase approach that combines a novel nonlinear generalized subclass discriminant analysis (GSDA) method, which identifies a discriminant subspace, with a linear SVM that efficiently learns the event in the derived subspace. In order to deal with the limited number of positive and related event videos, Tzelepis et al. [37] extend the linear SVM with Gaussian sample uncertainty (LSVM-GSU), assuming isotropic uncertainty, into a new kernel-based algorithm (KSVM-iGSU). Furthermore, they also extend KSVM-iGSU based on the relevance degree kernel SVM (RD-KSVM) proposed in [36]. As a result, related samples can be effectively exploited as positive or negative examples with automatic weighting. Recently, Chang et al. [4] present a semantic saliency and nearly-isotonic SVM framework to detect events in long videos that may last for hours. First, each shot of the event is assessed and prioritized according to its saliency score. Next, they propose a new isotonic regularizer that exploits the semantic ordering information, and the resulting nearly-isotonic SVM classifier exhibits higher discriminative power. However, research at the SVM-solver level in the MED field, aimed at coping with the growing volume of data in large-scale scenarios, is still in its infancy.

3 The proposed methodology

In this section, we explain how to construct a self-adaptive multi-view learning framework along with a generalized SVM classifier for the MED task. Suppose we have n training data represented by V different features and denote them as \(X_{v} = [\mathbf {x}_{1}^{v},\mathbf {x}_{2}^{v},\cdots ,\mathbf {x}_{n}^{v}] \in \mathbb {R}^{d_{v} \times n}\) (v = 1,2,⋯ ,V ), where \(d_{v}\) is the feature dimension of the v-th view. Let \(\mathbf{y} = [y_{1},y_{2},\cdots,y_{n}]^{\top} \in \{-1,1\}^{n \times 1}\) be the labels of the training data, where \(y_{i} = 1\) if the i-th video is a positive example and \(y_{i} = -1\) otherwise.

Generally, the low-level features are associated with the high-level events by a prediction function f. For the v-th view \( \mathbf{x}_{i}^{v} \) of the i-th video among the n samples, we have:

$$ \begin{array}{llllll} f_{i}(\mathbf{x}_{i}^{v}) = W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v}, \end{array} $$
(1)

where \(W_{v} \in \mathbb {R}^{d_{v} \times 1}\) is the event detector with respect to the v-th view, which correlates \(\mathbf{x}_{i}^{v}\) with its label \(y_{i}\), and \(b_{v}\) is the bias term, also associated with the v-th view. In order to determine \(f_{i}\), we minimize the following objective:

$$ \begin{array}{llllll} \min\limits_{f_{i}} loss\left( f(\mathbf{x}_{i}^{v}),y_{i}\right) + {\Omega}(f_{i}), \end{array} $$
(2)

where \(loss(\cdot)\) is a loss function and \({\Omega}(f_{i})\) is the regularization function on \(f_{i}\).

Considering that it has been widely used and has shown robustness, we employ the SVM in this paper for the MED task. Specifically, we use the hinge loss, i.e., \(loss(f_{i},y_{i})=\max (1-f_{i}y_{i},0)\), as the loss function and the 2-norm as the regularization term. For better extensibility, we generalize the popular hinge loss and squared hinge loss into a p-order form. Thus, for the v-th view of all n videos, we get:

$$ \begin{array}{llllll} \min\limits_{W_{v},b_{v}} \ \ {\sum}_{i=1}^{n}\left( 1-\left( W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v}\right)y_{i} \right)_{+}^{p} + \frac{C}{2}\|W_{v}\|_{2}^{2}, \end{array} $$
(3)

where the operator \((a)_{+} \overset {def}{=} \max (a,0)\) returns the scalar a if a is non-negative, and zero otherwise. Here, C is the parameter that balances the relative importance of the loss term and the regularization term, and p is a constant, typically \(1 \le p \le 2\), for the loss to be meaningful.
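As an illustration, the per-view objective (3) can be computed as in the short NumPy sketch below; the function name and interface are ours, not part of the paper:

```python
import numpy as np

def p_order_hinge_objective(W, b, X, y, p=1.0, C=1.0):
    """Objective (3) for one view: sum_i (1 - (W^T x_i + b) y_i)_+^p + (C/2)||W||_2^2.

    X is the (d, n) feature matrix of the view; y holds labels in {-1, +1}.
    """
    margins = 1.0 - (X.T @ W + b) * y        # 1 - f(x_i) y_i for every sample
    hinge = np.maximum(margins, 0.0) ** p    # p = 1: hinge loss, p = 2: squared hinge
    return hinge.sum() + 0.5 * C * np.dot(W, W)
```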

For an event, different views usually contribute differently to the detection task, owing to the complementary property of multiple features. Therefore, it is reasonable to assign different weights \(\boldsymbol{\theta} = [\theta_{1},\theta_{2},\cdots,\theta_{V}]\) to different views. A larger coefficient \(\theta_{v}\) indicates that the corresponding feature plays a more important role in generating the final detector. For ease of notation, we use \(L_{v}={\sum }_{i=1}^{n}\left (1-(W_{v}^{\top }\mathbf {x}_{i}^{v} + b_{v})y_{i} \right )_{+}^{p}\) to denote the loss of all the videos with respect to the v-th view. Thus, the multi-view optimization problem with the p-order loss based primal SVM for MED is:

$$ \begin{array}{llllll} \min\limits_{\{W_{v},b_{v},\theta_{v}\}_{v=1}^{V}} \ \ &{\sum}_{v=1}^{V}\left( {\theta_{v}^{r}}L_{v} + \frac{C}{2}\|W_{v}\|_{2}^{2}\right) \\ s.t. \ \ \ \ \ \ \ \ \ & {\sum}_{v=1}^{V}\theta_{v}=1, \ \ \theta_{v}\ge 0 \end{array} $$
(4)

where r > 1. Note that if we directly used \(\boldsymbol{\theta} = [\theta_{1},\theta_{2},\cdots,\theta_{V}]\) as the feature weights, the solution would be \(\theta_{v} = 1\) for \(v=\arg \min _{v}\{L_{v}\}\) and \(\theta_{v} = 0\) otherwise. In other words, only the best view for this event would be kept. Therefore, following the strategy in [38, 40], we adopt \(\theta_{v}^{r}\) instead of \(\theta_{v}\) in the objective function (4) to weight the v-th view. With this trick, our model avoids degenerating into a single-view method, which would consider only the best view and ignore the complementary property of multiple features for an event.

Given a testing video with V features \(\{\mathbf{x}_{t}^{v}\}_{v=1}^{V}\), we can compute the predicted score by combining the obtained detector \(W_{v}\) with its corresponding weight \({\theta_{v}^{r}}\) over each view as follows:

$$ \begin{array}{llllll} y_{t} = {\sum}_{v=1}^{V} {\theta_{v}^{r}}W_{v}^{\top}\mathbf{x}_{t}^{v}, \end{array} $$
(5)

where \(y_{t}\) is the predicted score of the testing video. Specifically, we assign a positive label to this video when \(y_{t} > 0\), and a negative label when \(y_{t} \le 0\).
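A minimal sketch of the prediction rule (5), assuming the per-view detectors \(W_{v}\) and weights \(\theta_{v}\) have already been learned (variable names are ours):

```python
import numpy as np

def predict_event(x_views, W_views, theta, r):
    """Score a testing video via Eq. (5): y_t = sum_v theta_v^r * W_v^T x_t^v.

    x_views and W_views are lists of per-view feature and detector vectors;
    theta is the (V,) weight vector and r the smoothing exponent.
    """
    y_t = sum((theta[v] ** r) * np.dot(W_views[v], x_views[v])
              for v in range(len(x_views)))
    label = 1 if y_t > 0 else -1             # positive when y_t > 0, negative otherwise
    return y_t, label
```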

In summary, by designing a multi-view learning framework for video data, which consists of multiple features with complementary properties, our model adaptively exploits different aspects of the training data for the MED task. Moreover, our model employs the p-order hinge loss and the 2-norm regularization term to obtain a more flexible SVM classifier with better generalization ability.

4 Optimization algorithm

In this section, we present how to obtain the event detector. Considering the non-smoothness of the hinge loss used in the objective function (4), we exploit an alternating optimization algorithm to solve this challenging problem effectively. We describe the alternating algorithm for optimization problem (4) in Algorithm 1.

Algorithm 1: The alternating optimization algorithm for problem (4)

Update feature weight

First, we fix \(\{W_{v},b_{v}\}_{v=1}^{V}\) to update \(\boldsymbol{\theta}\). In order to turn the objective function (4) into an unconstrained optimization problem, we introduce a Lagrange multiplier λ so that we have the Lagrange function of (4) as follows:

$$ \begin{array}{llllll} \mathcal{L}(\boldsymbol{\theta},\lambda) = {\sum}_{v=1}^{V}\left( {\theta_{v}^{r}}L_{v} + \frac{C}{2}\|W_{v}\|_{2}^{2}\right) - \lambda\left( {\sum}_{v=1}^{V}\theta_{v}-1\right). \end{array} $$
(6)

By setting the derivatives of \(\mathcal {L}(\boldsymbol {\theta },\lambda )\) with respect to \(\theta_{v}\) and λ to zero, we have:

$$ \left\{\begin{array}{llllll} \frac{\partial{\mathcal{L}(\boldsymbol{\theta},\lambda)}}{\partial{\theta_{v}}} =r\theta_{v}^{r-1}L_{v}-\lambda=0, \ v = 1,2,\cdots,V \\ \frac{\partial{\mathcal{L}(\boldsymbol{\theta},\lambda)}}{\partial{\lambda}}=\sum\limits_{v=1}^{V}\theta_{v}-1=0. \end{array}\right. $$
(7)

Thus, \(\theta_{v}\) can be obtained as

$$ \begin{array}{llllll} \theta_{v} = \frac{\left( \frac{1}{L_{v}}\right)^{\frac{1}{r-1}}}{\sum\limits_{v=1}^{V}\left( \frac{1}{L_{v}}\right)^{\frac{1}{r-1}}}. \end{array} $$
(8)

As mentioned in Section 3, we set r > 1 to make the weight \(\theta_{v}\) negatively correlated with the loss \(L_{v}\). As a result, the larger the loss \(L_{v}\) is, the smaller the weight \(\theta_{v}\) is. If \(r \rightarrow \infty \), the weights \(\theta_{v}\) of different views approach each other, which means all features play the same role in detecting the event. If \(r \rightarrow 1\), the optimal solution to \(\boldsymbol{\theta}\) is \(\theta_{v} = 1\) for \(v=\arg \min _{v}\{L_{v}\}\) and \(\theta_{v} = 0\) otherwise, which indicates that only the best feature for the event is kept. Generally, the value of r should be determined according to the complementary property of all views. Views that possess rich complementary information prefer a large r; otherwise, a small r is suitable.
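The closed-form update (8) is straightforward to implement; below is a sketch under the assumption that every per-view loss \(L_{v}\) is strictly positive:

```python
import numpy as np

def update_theta(losses, r):
    """View weights from Eq. (8): theta_v proportional to (1 / L_v)^(1 / (r - 1)).

    losses is the (V,) vector of per-view losses L_v; requires r > 1 and L_v > 0.
    """
    inv = (1.0 / np.asarray(losses, dtype=float)) ** (1.0 / (r - 1.0))
    return inv / inv.sum()                   # normalized so the weights sum to 1
```

As r grows, the computed weights flatten toward the uniform value 1/V, matching the analysis above.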

Update primal SVM variables for each feature

With the feature weight \(\boldsymbol{\theta}\) fixed, we update \(\{W_{v},b_{v}\}_{v=1}^{V}\) with the objective function below:

$$ \begin{array}{llllll} \min\limits_{\{W_{v},b_{v}\}_{v=1}^{V}} \ \ &{\sum}_{v=1}^{V}\left( {\theta_{v}^{r}}L_{v} + \frac{C}{2}\|W_{v}\|_{2}^{2}\right) \end{array} $$
(9)

which can be decomposed into V independent problems with respect to \(W_{v}, b_{v}\):

$$ \begin{array}{llllll} \min\limits_{W_{v},b_{v}} \ \ & {\theta_{v}^{r}}L_{v} + \frac{C}{2}\|W_{v}\|_{2}^{2} \\ \Leftrightarrow \ \ \min\limits_{W_{v},b_{v}} \ \ & {\theta_{v}^{r}}{\sum}_{i=1}^{n}\left( 1-\left( W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v}\right)y_{i} \right)_{+}^{p} + \frac{C}{2}\|W_{v}\|_{2}^{2} \end{array} $$
(10)

Note that \(y_{i} \in \{-1,+1\}\) in this paper, so it holds that \(1-(W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v})y_{i} = y_{i}y_{i}-(W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v})y_{i} = y_{i}(y_{i}-(W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v}))\). Inspired by [27], we introduce the auxiliary variables \( {\alpha_{i}^{v}} = y_{i}-(W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v})\), \(1 \le i \le n\), and the objective function with respect to \(W_{v}, b_{v}\) is equivalent to:

$$ \begin{array}{llllll} \min\limits_{W_{v},b_{v},\boldsymbol{\alpha}^{v}} \ \ & {\theta_{v}^{r}}{\sum}_{i=1}^{n}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} + \frac{C}{2}\|W_{v}\|_{2}^{2} \\ s.t. \ \ \ \ \ & {\alpha_{i}^{v}} = y_{i}-\left( W_{v}^{\top}\mathbf{x}_{i}^{v}+b_{v}\right), \ i=1,2,\cdots,n \end{array} $$
(11)

In order to solve the objective function (11), we adopt the Augmented Lagrangian Method (ALM) [14] to update \(W_{v}, b_{v}\) and \(\boldsymbol{\alpha}^{v}\). To begin with, by introducing a set of Lagrangian multipliers \(\boldsymbol{\beta}^{v}\) with respect to the v-th view to take the n constraints into consideration, we have the Lagrangian function of (11) as:

$$ \begin{array}{llllll} \mathcal{L}(W_{v},b_{v},\boldsymbol{\alpha}^{v},\boldsymbol{\beta}^{v})= {\theta_{v}^{r}}{\sum}_{i=1}^{n}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} \ + \ \\ \frac{C}{2}\|W_{v}\|_{2}^{2}+(\boldsymbol{\beta}^{v})^{\top}\left( X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}\right) \end{array} $$
(12)

where \(\mathbf {1}=[1,1,\cdots ,1]^{\top }\in \mathbb {R}^{n \times 1}\) and \(\boldsymbol {\alpha }^{v}=[{\alpha _{1}^{v}},{\alpha _{2}^{v}},\cdots ,{\alpha _{n}^{v}}]^{\top }\in \mathbb {R}^{n \times 1}\). The last term is the inner product of the amounts of violation of the n constraints \({\alpha_{i}^{v}} - y_{i} + (W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v}) = 0\) with the vector \(\boldsymbol {\beta }^{v}=[{\beta _{1}^{v}},{\beta _{2}^{v}},\cdots ,{\beta _{n}^{v}}]^{\top}\in \mathbb {R}^{n \times 1}\) consisting of n Lagrangian multipliers. Next, by adding a supplemental term to (12), we get the augmented Lagrangian function of (11) as:

$$ \begin{array}{llllll} \tilde{\mathcal{L}}(W_{v},b_{v},\boldsymbol{\alpha}^{v},\boldsymbol{\beta}^{v},\eta_{v}) = {\theta_{v}^{r}}{\sum}_{i=1}^{n}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} + \frac{C}{2}\|W_{v}\|_{2}^{2} \ + \ \\ (\boldsymbol{\beta}^{v})^{\top}\left( X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}\right) + \frac{\eta_{v}}{2}\|X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}\|_{2}^{2} \end{array} $$
(13)

where \(\eta _{v} \in \mathbb {R}\) is the augmented penalty parameter with respect to the v-th view, which “augments” toward infinity, and the supplemental term \(\frac {\eta _{v}}{2}\|X_{v}^{\top }W_{v}+\mathbf {1}b_{v}-\mathbf {y}+\boldsymbol {\alpha }^{v}\|_{2}^{2}\) forces the n constraints to be satisfied. By rearranging the last two terms in (13), we have the quadratic form as follows:

$$ \begin{array}{llllll} \tilde{\mathcal{L}}(W_{v},b_{v},\boldsymbol{\alpha}^{v},\boldsymbol{\beta}^{v},\eta_{v}) = {\theta_{v}^{r}}{\sum}_{i=1}^{n}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} \ + \ \\ \frac{C}{2}\|W_{v}\|_{2}^{2} + \frac{\eta_{v}}{2}\|X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}+\frac{\boldsymbol{\beta}^{v}}{\eta_{v}}\|_{2}^{2} \end{array} $$
(14)

Compared with (13), (14) adds a term \(\frac {\boldsymbol {\beta }^{v}}{\eta _{v}}\) inside the quadratic, which is constant when updating the variables \(W_{v}, b_{v}\) and \(\boldsymbol{\alpha}^{v}\) within a single iteration. Note that \(\eta _{v}\rightarrow \infty \); thus this term approaches zero and is eventually negligible.

After obtaining the augmented Lagrangian function (14), we analyze how to update the event detector \(W_{v}, b_{v}\), the auxiliary variables \(\boldsymbol{\alpha}^{v}\), and the Lagrangian multipliers \(\boldsymbol{\beta}^{v}\), as well as how to determine the augmented penalty parameter \(\eta_{v}\) with respect to the v-th view. These variables are also updated in an alternating fashion.

First, we calculate the Lagrangian multiplier vector \(\boldsymbol{\beta}^{v}\) at the t-th iteration with \(W_{v}, b_{v}\) and \(\boldsymbol{\alpha}^{v}\) fixed. Following the strategy used in [39], we update \(\boldsymbol{\beta}^{v}\) with the amounts of violation of the n constraints:

$$ \begin{array}{llllll} \boldsymbol{\beta}^{v}_{(t)} = \boldsymbol{\beta}^{v}_{(t-1)} + (\eta_{v})_{(t)}\left( X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}\right) \end{array} $$
(15)

Here, \((\eta_{v})_{(t)}\) is monotonically non-decreasing according to Lemma 3 in [27]. Because the first two terms \({\theta _{v}^{r}}{\sum }_{i=1}^{n}(y_{i}{\alpha _{i}^{v}})_{+}^{p}\) and \(\frac {C}{2}\|W_{v}\|_{2}^{2}\) in (14) would be squeezed out by an extremely large term \(\frac {\eta _{v}}{2}\|X_{v}^{\top }W_{v}+\mathbf {1}b_{v}-\mathbf {y}+\boldsymbol {\alpha }^{v}+\frac {\boldsymbol {\beta }^{v}}{\eta _{v}}\|_{2}^{2}\), \(\eta_{v}\) cannot actually go to infinity. Therefore, \(\eta_{v}\) is generated under an upper bound, which is \(10^{5}\) in this paper.
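The multiplier update (15), together with the capped growth of \(\eta_{v}\), can be sketched as follows; the growth factor rho is our own assumption, since the text only requires \((\eta_{v})_{(t)}\) to be monotonically non-decreasing and bounded by \(10^{5}\):

```python
import numpy as np

def update_beta(beta, eta, X, W, b, y, alpha, rho=1.1, eta_max=1e5):
    """Eq. (15): beta <- beta + eta * (X^T W + 1 b - y + alpha), then grow eta.

    X is (d, n); beta, y and alpha are (n,) vectors. rho is a hypothetical
    growth factor; eta stays non-decreasing and capped at 1e5.
    """
    violation = X.T @ W + b - y + alpha      # amount of constraint violation
    beta = beta + eta * violation
    eta = min(rho * eta, eta_max)
    return beta, eta
```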

Second, with \(W_{v}, b_{v}\) and \(\boldsymbol{\beta}^{v}\) fixed, the optimization problem (14) can be decomposed into n independent problems with respect to \( {\alpha_{i}^{v}} \):

$$ \begin{array}{llllll} \min\limits_{{\alpha_{i}^{v}}} \ \ &{\theta_{v}^{r}}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} + \frac{\eta_{v}}{2}\|W_{v}^{\top}\mathbf{x}_{i}^{v}+b_{v}-y_{i}+{\alpha^{v}_{i}}+\frac{{\beta_{i}^{v}}}{\eta_{v}}\|_{2}^{2} \\ \Leftrightarrow \ \ \min\limits_{{\alpha_{i}^{v}}} \ \ &\gamma_{v}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} + \frac{1}{2}\left( {\alpha_{i}^{v}}-{t_{i}^{v}}\right)^{2} \end{array} $$
(16)

where \(\gamma _{v} = \frac {{\theta _{v}^{r}}}{\eta _{v}}\) and \({t_{i}^{v}} = y_{i}-W_{v}^{\top }\mathbf {x}_{i}^{v}-b_{v}-\frac {{\beta _{i}^{v}}}{\eta _{v}}\). The objective function in (16) is easy to solve, as \({\alpha_{i}^{v}}\) is the minimizer of a single-variable, two-piece piecewise function. The research in [27] gives the result of a problem similar to ours. Based on this, for the 1-order hinge loss based primal SVM (p = 1), we get:

$$ {\alpha_{i}^{v}}= \left\{\begin{array}{llllll} {t_{i}^{v}}-y_{i}\gamma_{v} &\text{ when \ \(y_{i}{t_{i}^{v}} > \gamma_{v}\)} \\ 0 &\text{ when \ \(0 \le y_{i}{t_{i}^{v}} \le \gamma_{v}\)} \\ {t_{i}^{v}} &\text{ when \ \(y_{i}{t_{i}^{v}} < 0\)} \end{array}\right. $$
(17)
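For p = 1, the three cases of (17) translate directly into vectorized code; a sketch with our own naming, where gamma stands for \(\theta_{v}^{r}/\eta_{v}\) and t holds the values \({t_{i}^{v}}\):

```python
import numpy as np

def update_alpha(t, y, gamma):
    """Per-sample minimizer of Eq. (16) for p = 1, following the cases of Eq. (17).

    t holds the values t_i^v, y the labels in {-1, +1}, gamma = theta_v^r / eta_v.
    """
    yt = y * t                                   # decides which branch of (17) applies
    return np.where(yt > gamma, t - y * gamma,   # case y_i t_i > gamma
                    np.where(yt < 0.0, t, 0.0))  # case y_i t_i < 0, else 0
```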

Finally, the loss term of (14) has no effect on the result when fixing \(\boldsymbol{\beta}^{v}\) and \(\boldsymbol{\alpha}^{v}\) to update \(W_{v}\) and \(b_{v}\). Therefore, the optimization problem becomes:

$$ \begin{array}{llllll} \min\limits_{W_{v},b_{v}} \ \frac{C}{2}\|W_{v}\|_{2}^{2} + \frac{\eta_{v}}{2}\|X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}+\frac{\boldsymbol{\beta}^{v}}{\eta_{v}}\|_{2}^{2} \end{array} $$
(18)

Let \(\boldsymbol {\tau }=\boldsymbol {\alpha }^{v}+\frac {\boldsymbol {\beta }^{v}}{\eta _{v}}-\mathbf {y}\); it is easy to observe that τ is a constant vector when updating \(W_{v}\) and \(b_{v}\). As a result, problem (18) turns into a 2-norm regularized Least Square Regression (LSR) problem:

$$ \begin{array}{llllll} J(W_{v},b_{v}) = \min\limits_{W_{v},b_{v}} \ \frac{C}{\eta_{v}}\|W_{v}\|_{2}^{2} + \|X_{v}^{\top}W_{v}+\mathbf{1}b_{v} + \boldsymbol{\tau}\|_{2}^{2} \end{array} $$
(19)

We set \(z_{v}= \left [ \begin {array}{c} W_{v} \\ b_{v} \end {array} \right ] \), \(A_{v}= \left [ \begin {array}{cc} X_{v}^{\top } & \mathbf {1} \\ (\frac {C}{\eta _{v}})^{\frac {1}{2}}I & \mathbf {0} \end {array} \right ] \) and \(d_{v}= \left [ \begin {array}{c} -\boldsymbol {\tau } \\ \mathbf {0} \end {array} \right ] \); then the optimization problem (19) can be turned into a standard LSR problem as follows:

$$ \begin{array}{llllll} J(W_{v},b_{v}) = J(z_{v}) = \min\limits_{z_{v}} \ \|A_{v}z_{v}-d_{v}\|_{2}^{2} \end{array} $$
(20)

By setting the derivative of (20) with respect to \(z_{v}\) to zero, we have:

$$ \begin{array}{llllll} z_{v} = \left( A_{v}^{\top}A_{v}\right)^{-1}A_{v}^{\top}d_{v}. \end{array} $$
(21)

Thus, we get the closed-form solution with respect to \(W_{v}\) and \(b_{v}\).
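For reference, the exact update (21) may be sketched as below; in practice one solves the normal equations rather than inverting \(A_{v}^{\top}A_{v}\) explicitly (names and structure are ours):

```python
import numpy as np

def solve_lsr_exact(X, tau, C, eta):
    """Closed form z_v = (A_v^T A_v)^{-1} A_v^T d_v of Eq. (21); returns (W_v, b_v).

    X is the (d, n) feature matrix of the view; tau is the (n,) constant vector.
    """
    d, n = X.shape
    A = np.block([[X.T, np.ones((n, 1))],
                  [np.sqrt(C / eta) * np.eye(d), np.zeros((d, 1))]])
    rhs = np.concatenate([-tau, np.zeros(d)])
    z = np.linalg.solve(A.T @ A, A.T @ rhs)  # solve instead of an explicit inverse
    return z[:d], z[d]
```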

However, evaluating (21) is as costly as computing a matrix inverse. Considering that it is time-consuming to handle large-volume, high-dimensional multi-view video data, such exact methods are not suitable for the complex MED task because of their relatively high time complexity. In large-scale scenarios, an approximate solution of the optimization problem is usually enough to produce a good model [17]. Motivated by this, we adopt an optimal step-size gradient descent method to update \(W_{v}\) and \(b_{v}\) more efficiently and effectively. The gradients of \(J(W_{v},b_{v})\) with respect to \(W_{v}\) and \(b_{v}\) are as follows:

$$ \left\{\begin{array}{llllll} W_{v}^{\prime}=\frac{\partial{J(W_{v},b_{v})}}{\partial{W_{v}}} = \frac{C}{\eta_{v}}W_{v}+X_{v}\left( X_{v}^{\top}W_{v}+\mathbf{1}b_{v}+\boldsymbol{\tau}\right) \\ b_{v}^{\prime}=\frac{\partial{J(W_{v},b_{v})}}{\partial{b_{v}}} = \mathbf{1}^{\top}\left( X_{v}^{\top}W_{v}+\boldsymbol{\tau}\right)+nb_{v} \end{array}\right. $$
(22)

Thus, we can get the optimal step-size \(l_{v}\) with respect to the v-th view by minimizing the single-variable quadratic function:

$$ \begin{array}{llllll} \min\limits_{l_{v}} \ \frac{C}{\eta_{v}}\|W_{v}-l_{v}W_{v}^{\prime}\|_{2}^{2} + \|X_{v}^{\top}(W_{v}-l_{v}W_{v}^{\prime})+\mathbf{1}(b_{v}-l_{v}b_{v}^{\prime}) + \boldsymbol{\tau}\|_{2}^{2} \end{array} $$
(23)

which has the explicit solution:

$$ \begin{array}{llllll} l_{v} = \frac{{W_{v}^{\prime}}^{\top}W_{v}^{\prime}+{b_{v}^{\prime}}^{2}}{\left( X_{v}^{\top}W_{v}^{\prime}+\mathbf{1}b_{v}^{\prime}\right)^{\top}\left( X_{v}^{\top}W_{v}^{\prime}+\mathbf{1}b_{v}^{\prime}\right)+\frac{C}{\eta_{v}}{W_{v}^{\prime}}^{\top}W_{v}^{\prime}}. \end{array} $$
(24)
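One full gradient step with the optimal step size can then be sketched as follows, reusing the shared residual so that only three matrix-by-vector products are needed (a sketch under our own naming):

```python
import numpy as np

def gradient_step(W, b, X, tau, C, eta):
    """One optimal-step-size gradient update for (19), following Eqs. (22)-(24).

    X is (d, n); tau is (n,). Each call costs O(n d) time.
    """
    res = X.T @ W + b + tau                  # shared residual X^T W + 1 b + tau
    gW = (C / eta) * W + X @ res             # gradient w.r.t. W_v, Eq. (22)
    gb = res.sum()                           # gradient w.r.t. b_v, Eq. (22)
    Xg = X.T @ gW + gb                       # X^T W' + 1 b'
    l = (gW @ gW + gb ** 2) / (Xg @ Xg + (C / eta) * (gW @ gW))  # step size, Eq. (24)
    return W - l * gW, b - l * gb
```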

The alternating algorithm for optimization problem (14) with respect to each feature is summarized in Algorithm 2. The time complexity of the proposed Algorithm 1 consists of V parts, where V is the number of feature types. For each view, at each iteration, Algorithm 2 only needs three matrix-by-vector multiplications with complexity \(\mathcal {O}(nd_{v})\), where \(d_{v}\) is the feature dimension of the v-th view and n is the number of samples. The several pointwise additions and multiplications between two vectors in (22) and (24) have complexity either \(\mathcal {O}(d_{v})\) or \(\mathcal {O}(n)\), which is negligible compared to \(\mathcal {O}(nd_{v})\). Therefore, the entire time complexity of the proposed Algorithm 1 is \(\mathcal {O}(Vn\bar {d})\), where \(\bar {d}\) is the average feature dimension. In large-scale scenarios, high-dimensional features are usually reduced by dimension reduction methods. For each view, the proposed Algorithm 2 has linear computational cost with respect to the number of exemplars n, which is much lower than the standard LSR solution that is as costly as computing a matrix inverse.

Algorithm 2: The ALM-based fast SVM solver for problem (14) with respect to each feature

5 Experiment

To demonstrate the effectiveness and superiority of the proposed framework, in this section we conduct a thorough experimental evaluation on real-world datasets and compare with other state-of-the-art methods for complex event detection.

5.1 Datasets

We evaluate on two large-scale real-world datasets: TRECVID MEDTest 2013 and TRECVID MEDTest 2014, which were collected by NIST for the TRECVID competition. The datasets consist of about 30,000 videos covering 30 events of interest, with 100 positive examples per event. Specifically, we use the videos of events E006 to E015 in the MEDTest 2013 dataset and of events E021 to E040 in the MEDTest 2014 dataset. Please refer to Table 1 for the complete list of event names. Several examples from the datasets are illustrated in Fig. 3.

Table 1 30 Events of TRECVID MEDTest 2013 and 2014
Fig. 3

Example videos with respect to four events from the two datasets

In order to evaluate the performance of MED with multi-feature combination, we adopt four types of features:

  • SIN [28]: The SIN feature derives from the TRECVID Semantic Indexing (SIN) Task and contains 346 kinds of concepts. These concepts include objects, actions, scenes, attributes and non-visual concepts which are all the basic elements for an event, e.g., Baby, Outdoor, Sitting down.

  • YFCC [35]: The YFCC feature derives from the Yahoo Flickr Creative Commons (YFCC100M) data, which contains 0.8 million amateur videos from Flickr and 609 classes of concepts. For example, the top 5 concepts detected in the YFCC100M dataset are Outdoor, Indoor, People, Nature, and Architecture.

  • SPORTS [20]: The SPORTS feature derives from the YouTube Sports-1M dataset, which consists of 1 million YouTube videos belonging to a taxonomy of 487 classes of sports. These classes are arranged in a manually-curated hierarchical taxonomy. For example, it contains 6 different types of bowling and 7 different types of American football.

  • DIY [46]: The DIY feature derives from Do It Yourself (DIY) data, which is collected from online archives such as Creative Commons, Youku, Tudou and YouTube in an unsupervised fashion. These instructional videos are designed to facilitate learning for humans and cover 1,601 concepts such as Yoga, Juggling, and Cooking.

5.2 Comparison methods and experimental setup

We compare the proposed algorithm with the following important baselines:

  • Standard Least Square Regression (S-LSR): This algorithm turns the optimization function into a standard LSR problem when using the augmented Lagrangian method (ALM) to solve the SVM problem. An exact solution can be obtained with this algorithm; the details are presented in Section 4.

  • Early Fusion (EF) [31]: EF is a combination scheme that runs before classification. We simply concatenate the four different features into a new high-dimensional feature space. A disadvantage of this approach is the difficulty of combining features into a common representation. The proposed fast SVM solver is used for the classification.

  • Late Fusion (LF) [31]: LF happens after classification and focuses on the individual strengths of modalities. We train a classifier on each feature and then combine their predictions by averaging. This scheme needs more computational effort and may lose the correlations in the mixed feature space. The proposed fast SVM solver is used for the classification on each feature.

  • Early Fusion with Principal Component Analysis (EF-PCA): EF-PCA is a modified version of EF based on PCA. Different from EF, we employ PCA to reduce the dimensionality of the combined features. As in EF, the proposed fast SVM solver is used for the classification.

  • Rule-Based Multiple Kernel Learning (RBMKL) [12]: RBMKL obtains a valid kernel by taking the summation or multiplication of several valid kernels. We use RBMKL to train an SVM with the product of the combined kernels in the experiments.

  • Centered-Alignment-Based Multiple Kernel Learning (CABMKL) [10]: CABMKL is a two-stage learning algorithm. In the first step, CABMKL uses the analytical solution in [10] to determine the kernel weights. In the second step, CABMKL trains an SVM with the kernel calculated from these weights.

  • The Adaptively Weighted Feature Late Fusion: The proposed algorithm, designed for MED using an adaptively weighted multi-feature combination scheme. During each iteration, our model first updates the weight of each feature adaptively, and then applies the proposed fast SVM solver over each view for the classification.

Average precision (AP) and mean average precision (mAP) are well-known and popular measures in the field of video retrieval and classification. According to the literature [42], AP is a measure combining recall and precision for ranked retrieval results. The AP is the mean of the precision scores after each relevant sample is retrieved. Generally, AP can be calculated as follows:

$$ \begin{array}{llllll} AP = \frac{1}{m}{\sum}_{i=1}^{n}P_{i}r_{i} \end{array} $$
(25)

where m is the number of relevant samples in the dataset, n is the total number of samples, and \(P_{i}\) is the precision over the top-i retrieved samples. \(r_{i} = 1\) when the i-th sample is relevant; otherwise \(r_{i} = 0\). mAP is the average performance over all events, obtained as the mean of all AP values. A higher AP (mAP) value indicates better performance.
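A compact sketch of how AP in (25) can be computed from ranked prediction scores (function and variable names are ours; at least one relevant sample is assumed):

```python
import numpy as np

def average_precision(scores, relevant):
    """AP per Eq. (25): mean precision-at-i over the ranks of relevant samples.

    scores holds the predicted scores, relevant holds booleans; mAP is then
    the mean of the per-event AP values.
    """
    order = np.argsort(-np.asarray(scores))          # rank by descending score
    rel = np.asarray(relevant, dtype=float)[order]
    prec_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1.0)
    return float((prec_at_i * rel).sum() / rel.sum())
```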

We cross-validated the regularization parameters in the range \(\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10^{1}, 10^{2}, 10^{3}\}\), and the parameter r of the proposed method is tuned over {1.1, 5, 10, 15, 20, 25, 30}. For simplicity, we set p = 1 in our experiments. We report the best results for each algorithm. In particular, we report the average of the mAP values over the four features with the proposed fast SVM solver. All experiments are conducted on an 8-core Intel Xeon E5-2660 2.00 GHz Windows server with 128 GB of memory.

5.3 Experimental results analysis

We present the comparison of AP, mAP and training time in this section.

AP comparison

To begin with, we compare the AP performance of each single feature with the proposed fast SVM solver on TRECVID MEDTest 2013 and 2014, respectively. The AP performance with respect to each event is presented in Figs. 4 and 5. We observe from the experimental results that: 1) With the SPORTS feature, the proposed fast SVM solver achieves the best performance for 27 out of 30 events, indicating that sports-related concepts are useful for detecting events in TRECVID MEDTest 2013 and 2014; 2) The other features, especially SIN and YFCC, take second place on different events with varying degrees of success, which suggests that giving different weights to different features is a promising fusion strategy; 3) The DIY feature achieves better performance when detecting instructional events such as “E007: Changing a vehicle tire”, “E011: Making a sandwich”, “E014: Repairing an appliance” and “E040: Tuning musical instrument.”

Fig. 4

Comparison of single feature with the proposed method for events E006-E015 on TRECVID MEDTest 2013

Fig. 5

Comparison of single feature with the proposed method for events E021-E030 (up) and E031-E040 (bottom) on TRECVID MEDTest 2014

We also present in Figs. 6 and 7 the AP values of the fused features for all comparison models, as well as the best single feature, for each event of TRECVID MEDTest 2013 and 2014, respectively. It can be concluded that: 1) The AP values of early fusion, which simply concatenates different features, trail behind those of the other methods for 22 out of 30 events, which indicates that such a naive combination of different features is likely to degrade the performance of the classifier; 2) Apart from the proposed algorithm, the kernel-based methods, i.e., RBMKL and CABMKL, are relatively effective for multi-feature combination in MED; 3) With the adaptively weighted feature fusion scheme, the proposed algorithm achieves the best or second-best performance for 25 out of 30 events. It is worth noting that our method is better than or almost equal to the best single feature, which indicates the benefit of assigning different weights to different concepts with respect to a particular event. To be specific, for events like “E022: Cleaning an appliance”, which is related to concepts contained in SIN and DIY rather than YFCC and SPORTS, our method is able to give more weight to the relevant features, whereas the other methods treat different features equally. As a result, it is reasonable that the proposed algorithm outperforms the other methods on “E022: Cleaning an appliance.”

Fig. 6

Comparison of different methods of AP performance for events E006-E015 on TRECVID MEDTest 2013

Fig. 7

Comparison of different methods of AP performance for events E021-E030 (up) and E031-E040 (bottom) on TRECVID MEDTest 2014

mAP comparison

For a fair comparison, we further compare the mAP performance among the single features as well as among the comparison models. The mAP values with respect to each single feature over the 10 events of TRECVID MEDTest 2013 and the 20 events of TRECVID MEDTest 2014 are reported in Table 2 (top). The results consistently indicate that the SPORTS feature is the best among the four features on both datasets for the MED task, followed by the SIN and YFCC features, while the DIY feature is the poorest one.

Table 2 The mAP comparison of single feature with the proposed method (top) and mAP comparison of different methods (bottom) on TRECVID MEDTest 2013 and TRECVID MEDTest 2014

We also report in Table 2 (bottom) the mAP values of the different feature fusion models over the 30 events of TRECVID MEDTest 2013 and TRECVID MEDTest 2014, respectively. The experimental results indicate that: 1) The best single feature, i.e., SPORTS, performs very well for MED, which shows the importance of feature construction and extraction in the field of multimedia analysis; 2) Fusing features after classification is better than before classification when combining multiple features for the MED task. However, the LF scheme may still harm the performance, according to the mAP values of LF and Single (average) on TRECVID MEDTest 2014; 3) Compared with EF, EF-PCA improves the mAP performance significantly because PCA reduces the dimensionality and regroups the fused features; 4) Both RBMKL and CABMKL, which are based on multiple kernel learning, are competitive classifiers for MED utilizing multiple features; 5) Our model consistently outperforms the other multi-feature methods as well as the best single feature on both TRECVID MEDTest 2013 and 2014. Specifically, our model achieves an average improvement of 16.5% in terms of mAP compared with the fast SVM over a single feature, which is widely used in MED competitions. This result indicates that the proposed adaptively weighted feature fusion framework is well suited for MED with multiple features.

Training time comparison

Finally, in order to evaluate the efficiency of the proposed algorithm, we compare it with S-LSR, where the former employs the gradient descent method while the latter solves a standard LSR problem. Specifically, we measure the training time of these two methods with the stopping condition that the variation ratio of the objective is less than \(10^{-4}\). The results are listed in Tables 3 and 4, from which it can be seen that the proposed algorithm delivers stable performance and is on average faster than its competitor on both TRECVID MEDTest 2013 and 2014. In addition, the advantage of the proposed algorithm is more obvious for high-dimensional datasets. For example, when the dimensionality of the features increases from 346 (SIN) to 1,601 (DIY) on TRECVID MEDTest 2013, the training time of the proposed algorithm increases by less than 2 times, while the training time of S-LSR increases by more than 15 times. As a result, the proposed fast SVM solver is efficient for large-scale, high-dimensional datasets.

Table 3 The training time (seconds) of our method and standard LSR solution over TRECVID MEDTest 2013 with different features
Table 4 The training time (seconds) of our method and standard LSR solution over TRECVID MEDTest 2014 with different features

Convergence analysis

In order to verify the convergence of the proposed alternating optimization algorithm, we conduct experiments on both TRECVID MEDTest 2013 and 2014 to show the efficiency of our method. As the experimental results on all 30 events are similar, we only present the convergence curves for several events, i.e., “Birthday party,” “Making a sandwich,” “Attempting a bike trick” and “Beekeeping.” Figure 8 shows these convergence curves. It can be seen that the objective function value converges within 20 iterations. As a result, the convergence experiments demonstrate the efficiency of our alternating algorithm.

Fig. 8

Convergence curves of the objective function value in (4) using Algorithm 1 for the events “Birthday party,” “Making a sandwich,” “Attempting a bike trick” and “Beekeeping.” The figures show that the objective function value monotonically decreases until convergence by applying the proposed algorithm

6 Conclusion

In this paper, we have explored how to fuse different features of multimedia data for MED effectively and efficiently. Considering the different impacts of different features on each event, we designed an adaptively weighted combination framework for multiple features to enhance the performance of complex event detection. Moreover, for the large-scale scenario, an approximate solution obtained by gradient descent is employed within the proposed alternating optimization algorithm to mitigate the heavy computational burden. Extensive experimental results on the TRECVID MEDTest 2013 and 2014 datasets demonstrate the effectiveness, efficiency and superiority of the proposed method. In the future, we intend to take into account low-level features such as SIFT and MoSIFT, as well as CNN features, in addition to the concept-based high-level features used in this paper.