1 Introduction

Multimedia event detection (MED) has attracted considerable interest, largely due to the explosive growth of user-generated videos on the Internet [5, 8, 18, 42, 44]. For example, 300 hours of video are uploaded every minute to YouTube, the most popular video-sharing website in the world. The task aims to identify videos of a particular event of interest, e.g., making a cake or landing a fish, where an event is a higher-level semantic abstraction of a long video clip consisting of multiple concepts [42]. For example, an event like landing a fish can be described by multiple concepts, such as objects (e.g., human, fish), actions (e.g., standing, pulling) and scenes (e.g., beside a river or lake). Compared with previous visual content analysis tasks such as action detection and object recognition, MED is more challenging and complicated due to dynamic content variations and uncontrolled capture conditions [33]. Techniques for recognizing such complex events are fundamental to many practical applications such as web video search, consumer video management, and user recommendation.

Multimedia data is usually represented by multiple features. Generally, these features can be divided into two categories, namely high-level and low-level features. Low-level features capture the local appearance and texture statistics of objects in the video at particular interest points, while high-level features are represented by a real number estimating the probability of observing a concept in the video [19]. Different features characterize different aspects of the multimedia data. Although high-performance feature descriptors have been developed to help characterize videos, it is still difficult for a single feature to provide enough information to discriminate between different kinds of complex events. Therefore, it is widely agreed that combining multiple types of features or video sources achieves better performance [7, 18, 19, 26, 32,33,34, 44]. For example, Chang et al. [7] proposed to investigate the varying contribution of semantic representations from different image/video sources, thus enhancing the exploitation of semantic representation at the source level. Ma et al. leveraged attributes from multiple sources to evaluate the negativity of negative examples, demonstrating better performance than an approach that exploited each attribute source separately [26].

For a multimedia data object, its multiple features, which are diverse and complementary, may assist the detection of a specific complex event to varying extents. To illustrate this point, Fig. 1 shows several features associated with two event-related videos, where the top and bottom videos correspond to the events landing a fish and birthday party, respectively. We can see that motion features such as MoSIFT [9] could be beneficial for identifying the event landing a fish, as event-related quick actions like “reeling or lifting” frequently appear in the top video. However, for detecting the event birthday party, motion features are relatively insignificant because they are not valuable indicators for this event. Different from motion features, some concepts such as scene descriptors are useful for both types of events. For example, “beside a river or lake” helps indicate a landing a fish event, while “inside a room” is an important part of the event birthday party. On the other hand, the multi-feature representation of multimedia data such as videos usually produces high-dimensional, large-volume data objects. As one of the most widely used classification tools in the MED community, the Support Vector Machine (SVM) has been proven robust and effective for the complex event detection task. However, previous research [2, 3, 23, 33, 41] mainly paid attention to feature selection or construction, and there are few studies on the problem of efficiency when applying SVM to complex event detection. As a result, multimedia event detectors suffer from a heavy computational burden and are time-consuming for real-world applications. In summary, it is challenging and complicated to involve multiple features of video data to enhance the performance of complex event detection.

Fig. 1

An example showing the varying influence of different features with respect to the detection of a specific event. A dark line denotes that the feature has an important impact on detecting the event, while a light one means the feature is relatively insignificant for indicating the target event

In light of this, there are mainly two issues to consider when employing multiple features to enhance the performance of complex event detection. The first issue is how to leverage these diverse and complementary features reasonably when combining them to detect events. The second issue is how to solve the SVM quickly to reduce the heavy computational cost caused by the high-dimensional, large-volume multi-feature representation. To address both issues, we propose a self-adaptive multi-feature learning framework with a fast SVM solver for complex event detection, which is able to combine different features effectively and efficiently. In order to utilize multiple features more reasonably, we adopt an adaptively weighted linear combination of these features. This scheme is simple yet effective, and assigns a particular significance to each feature to improve the performance of the MED task. Moreover, motivated by the inspiring progress in SVM-related research [17, 25, 30], we design a fast SVM solver to alleviate the expensive computational cost of complex event detection. Specifically, in the proposed alternating optimization algorithm, our solver employs an approximate solution obtained by gradient descent rather than the relatively costly closed-form expression. Figure 2 displays the workflow of the proposed multi-feature learning framework with the efficient SVM solver for complex event detection.

Fig. 2

The workflow of the proposed framework for complex event detection. First, our model extracts multiple features from the training data. Next, for each kind of feature, the alternating optimization is employed to update the feature-specific weight \( {\theta _{v}^{r}} \) and the detector \(W_{v}\), which is obtained by a gradient descent method. Finally, we recognize events in testing videos by using a combination of the learned weights and detectors over multiple features

We summarize our contributions as follows:

  • Considering that different features have varying influence in indicating a specific event, we design an adaptively weighted combination of multiple features, rather than fusing them directly, to enhance the performance of complex event detection.

  • In order to alleviate the heavy computational burden caused by the large-volume, high-dimensional multi-feature data, the proposed alternating optimization algorithm employs an approximate solution obtained by gradient descent in the large-scale scenario.

  • We conduct extensive experiments on the datasets of TRECVID MEDTest 2013 and 2014 for evaluation. The promising results demonstrate the effectiveness and superiority of the proposed method.

The rest of this paper is organized as follows. In Section 2, we review related work on MED with multi-feature learning and fast primal SVM solvers. Sections 3 and 4 present the details of the proposed multi-feature learning framework and the alternating optimization algorithm, respectively. The experimental settings and evaluation results are presented in Section 5. Section 6 concludes the paper.

2 Related work

With the rapid growth of web videos, how to exploit multiple features for complex event detection efficiently and effectively has been receiving increasing attention in recent years. We briefly review the existing related work from two perspectives: multi-feature learning and SVM for MED.

2.1 Multi-feature learning

Utilizing multiple types of features achieves better performance for the MED task because a single-view representation of video data can rarely capture all the information related to the target event within such complex data. Generally, there are two major categories of multi-feature combination methods [31]. The first is early fusion, which combines different features before the training process and then feeds the aggregated vector into the classifier. Spatial pyramid matching methods like [24] and [1] are the most representative early-fusion research. They encode the spatial information of an image by fusing features extracted from different tiles generated by dividing the image. The second is late fusion, which combines the predicted values after the training process. Some studies [21, 43] have concluded that early fusion of features is less effective than late fusion in multimedia content analysis when the features are independent or heterogeneous.

In light of this, plenty of research has been dedicated to late fusion mechanisms for multiple features. Canonical Correlation Analysis (CCA) [29], SVM-2K [13] and Multiple Kernel Learning (MKL) [16] are the most classical late fusion approaches. CCA maximizes the correlation between two features in a compact subspace. In SVM-2K, the authors combine two-stage learning (kernel CCA followed by SVM) into a single optimization. MKL is widely used in computer vision, but its computation is costly due to the construction of multiple kernels. Recently, a number of important low-level visual features and their combinations were evaluated for complex event detection [33], which systematically analyzed these mainstream features. Yang et al. propose a semi-supervised framework [44] to improve the performance of multimedia semantic understanding by exploiting unlabeled multi-view data. In [34], Tang et al. present a method that selects different subsets of features to combine for certain classes. Jiang et al. use a graph-based approach in [19] to diffuse scores among different videos, which makes the fusion result interpretable to humans. In [6], Chang et al. present a multi-feature learning method that embeds feature interaction into a joint framework to capture the nonlinear property within the data while simultaneously combining the linear and nonlinear effects. An unsupervised event saliency revealing framework that extracts features from multiple modalities is designed in [47] to represent each shot in a given video collection. Coşar et al. [11] propose a unified multi-modal fusion framework that incorporates the output of object trajectory analysis with pixel-based analysis to detect abnormal behaviors related to the speed and direction of object trajectories. Although it is one of the most effective approaches for the MED task, multi-feature learning also incurs a heavy computational burden, so fast optimization algorithms are desirable.

2.2 SVM for MED

SVM is widely used for classification tasks as a result of its robust performance. In the field of MED, a series of SVM-based algorithms have been proven effective for complex event detection in both practical applications [22, 23, 41, 45] and scientific research [4, 15, 36, 37]. In terms of practical applications, these methods mainly involve two steps: first, they construct proper features according to the characteristics of the specific data; then, they directly use a traditional SVM as the classifier for the final classification task. For example, Lan et al. [23] introduce a double fusion scheme, which simply combines early fusion and late fusion to incorporate their advantages, and then employ SVM as the classifier to detect the event of interest. Xu et al. [41] propose a discriminative video representation by leveraging deep convolutional neural networks, and then apply a linear SVM over the learned features to advance event detection. In short, this kind of method emphasizes feature construction or combination.

In terms of scientific research, authors focus on designing SVM-based models to enhance the performance of MED. Specifically, Gkalelis et al. [15] present a two-phase approach that combines a novel nonlinear generalized subclass discriminant analysis (GSDA) method, which identifies a discriminant subspace, with a linear SVM that efficiently learns the event in the derived subspace. In order to deal with the limited number of positive and related event videos, Tzelepis et al. [37] extend the linear SVM with Gaussian sample uncertainty (LSVM-GSU), assuming isotropic uncertainty, into a new kernel-based algorithm (KSVM-iGSU). Furthermore, they also extend KSVM-iGSU based on the relevance degree kernel SVM (RD-KSVM) proposed in [36]. As a result, related samples can be effectively exploited as positive or negative examples with automatic weighting. Recently, Chang et al. [4] present a semantic saliency and nearly-isotonic SVM framework to detect events in long videos that may last for hours. First, each shot of the event is assessed and prioritized according to its saliency score. Next, they propose a new isotonic regularizer that exploits the semantic ordering information, and the resulting nearly-isotonic SVM classifier exhibits higher discriminative power. However, research at the SVM-solver level in the MED field, aimed at coping with the growing volume of data in large-scale scenarios, is still in its infancy.

3 The proposed methodology

In this section, we explain how to construct a self-adaptive multi-view learning framework along with a generalized SVM classifier for the MED task. Suppose we have n training data represented by V different features and denote them as \(X_{v} = [\mathbf {x}_{1}^{v},\mathbf {x}_{2}^{v},\cdots ,\mathbf {x}_{n}^{v}] \in \mathbb {R}^{d_{v} \times n}\) (v = 1,2,⋯ ,V ), where \(d_{v}\) is the feature dimension of the v-th view. Let \(\mathbf{y} = [y_{1},y_{2},\cdots,y_{n}]^{\top} \in \{-1,1\}^{n \times 1}\) be the labels of the training data, where \(y_{i} = 1\) if the i-th video is a positive example and \(y_{i} = -1\) otherwise.

Generally, the low-level features are associated with the high-level events by a prediction function f. For the v-th view \( \mathbf{x}_{i}^{v} \) of the i-th video among the n samples, we have:

$$ \begin{array}{llllll} f_{i}(\mathbf{x}_{i}^{v}) = W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v}, \end{array} $$
(1)

where \(W_{v} \in \mathbb {R}^{d_{v} \times 1}\) is the event detector with respect to the v-th view, which correlates \(\mathbf{x}_{i}^{v}\) with its label \(y_{i}\), and \(b_{v}\) is the bias term, also associated with the v-th view. In order to determine \(f_{i}\), we minimize the following objective:

$$ \begin{array}{llllll} \min\limits_{f_{i}} loss\left( f(\mathbf{x}_{i}^{v}),y_{i}\right) + {\Omega}(f_{i}), \end{array} $$
(2)

where \(loss(\cdot)\) is a loss function and \({\Omega}(f_{i})\) is the regularization function on \(f_{i}\).

Considering that it has been widely used and has shown robustness, we employ the SVM in this paper for the MED task. Specifically, we use the hinge loss, i.e., \(loss(f_{i},y_{i})=\max (1-f_{i}y_{i},0)\), as the loss function and the 2-norm as the regularization term. For better extensibility, we generalize the popular hinge loss and squared hinge loss into a p-order form. Thus, for the v-th view of all n videos, we get:

$$ \begin{array}{llllll} \min\limits_{W_{v},b_{v}} \ \ {\sum}_{i=1}^{n}\left( 1-\left( W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v}\right)y_{i} \right)_{+}^{p} + \frac{C}{2}\|W_{v}\|_{2}^{2}, \end{array} $$
(3)

where the operator \((a)_{+} \overset {def}{=} \max (a,0)\) returns the scalar a if a is non-negative, and zero otherwise. Here, C is the parameter that balances the relative importance of the loss term and the regularization term, and p is a constant, typically \(1 \le p \le 2\), for the loss to be meaningful.
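As an illustration, the per-view objective (3) can be computed as in the short NumPy sketch below; the function name and interface are ours, not part of the paper:

```python
import numpy as np

def p_order_hinge_objective(W, b, X, y, p=1.0, C=1.0):
    """Objective (3) for one view: sum_i (1 - (W^T x_i + b) y_i)_+^p + (C/2)||W||_2^2.

    X is the (d, n) feature matrix of the view; y holds labels in {-1, +1}.
    """
    margins = 1.0 - (X.T @ W + b) * y        # 1 - f(x_i) y_i for every sample
    hinge = np.maximum(margins, 0.0) ** p    # p = 1: hinge loss, p = 2: squared hinge
    return hinge.sum() + 0.5 * C * np.dot(W, W)
```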

For an event, different views usually contribute differently to the detection task, owing to the complementary property of multiple features. Therefore, it is reasonable to assign different weights \(\boldsymbol{\theta} = [\theta_{1},\theta_{2},\cdots,\theta_{V}]\) to different views. A larger coefficient \(\theta_{v}\) indicates that the corresponding feature plays a more important role in generating the final detector. For ease of notation, we use \(L_{v}={\sum }_{i=1}^{n}\left (1-(W_{v}^{\top }\mathbf {x}_{i}^{v} + b_{v})y_{i} \right )_{+}^{p}\) to denote the loss of all the videos with respect to the v-th view. Thus, the multi-view optimization problem with the p-order loss based primal SVM for MED is:

$$ \begin{array}{llllll} \min\limits_{\{W_{v},b_{v},\theta_{v}\}_{v=1}^{V}} \ \ &{\sum}_{v=1}^{V}\left( {\theta_{v}^{r}}L_{v} + \frac{C}{2}\|W_{v}\|_{2}^{2}\right) \\ s.t. \ \ \ \ \ \ \ \ \ & {\sum}_{v=1}^{V}\theta_{v}=1, \ \ \theta_{v}\ge 0 \end{array} $$
(4)

where r > 1. Note that if we directly used \(\boldsymbol{\theta} = [\theta_{1},\theta_{2},\cdots,\theta_{V}]\) as the feature weights, the solution would be \(\theta_{v} = 1\) for \(v=\arg \min _{v}\{L_{v}\}\) and \(\theta_{v} = 0\) otherwise. In other words, only the best view for this event would be kept. Therefore, following the strategy in [38, 40], we adopt \(\theta_{v}^{r}\) instead of \(\theta_{v}\) in the objective function (4) to weight the v-th view. With this trick, our model avoids degenerating into a single-view method, which would consider only the best view and ignore the complementary property of multiple features for an event.

Given a testing video with V features \(\{\mathbf{x}_{t}^{v}\}_{v=1}^{V}\), we can compute the predicted score by combining the obtained detector \(W_{v}\) with its corresponding weight \({\theta_{v}^{r}}\) over each view as follows:

$$ \begin{array}{llllll} y_{t} = {\sum}_{v=1}^{V} {\theta_{v}^{r}}W_{v}^{\top}\mathbf{x}_{t}^{v}, \end{array} $$
(5)

where \(y_{t}\) is the predicted score of the testing video. Specifically, we assign a positive label to this video when \(y_{t} > 0\), and a negative label when \(y_{t} \le 0\).
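A minimal sketch of the prediction rule (5), assuming the per-view detectors \(W_{v}\) and weights \(\theta_{v}\) have already been learned (variable names are ours):

```python
import numpy as np

def predict_event(x_views, W_views, theta, r):
    """Score a testing video via Eq. (5): y_t = sum_v theta_v^r * W_v^T x_t^v.

    x_views and W_views are lists of per-view feature and detector vectors;
    theta is the (V,) weight vector and r the smoothing exponent.
    """
    y_t = sum((theta[v] ** r) * np.dot(W_views[v], x_views[v])
              for v in range(len(x_views)))
    label = 1 if y_t > 0 else -1             # positive when y_t > 0, negative otherwise
    return y_t, label
```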

In summary, by designing a multi-view learning framework for video data, which consists of multiple features with complementary properties, our model adaptively exploits different aspects of the training data for the MED task. Moreover, our model employs the p-order hinge loss and the 2-norm regularization term to obtain a more flexible SVM classifier with better generalization ability.

4 Optimization algorithm

In this section, we present how to obtain the event detector. Considering the non-smoothness of the hinge loss used in the objective function (4), we exploit an alternating optimization algorithm to solve this challenging problem effectively. We describe the alternating algorithm for optimization problem (4) in Algorithm 1.

Algorithm 1: The alternating optimization algorithm for problem (4)

Update feature weight

First, we fix \(\{W_{v},b_{v}\}_{v=1}^{V}\) to update \(\boldsymbol{\theta}\). In order to turn the objective function (4) into an unconstrained optimization problem, we introduce a Lagrange multiplier λ so that we have the Lagrange function of (4) as follows:

$$ \begin{array}{llllll} \mathcal{L}(\boldsymbol{\theta},\lambda) = {\sum}_{v=1}^{V}\left( {\theta_{v}^{r}}L_{v} + \frac{C}{2}\|W_{v}\|_{2}^{2}\right) - \lambda\left( {\sum}_{v=1}^{V}\theta_{v}-1\right). \end{array} $$
(6)

By setting the derivatives of \(\mathcal {L}(\boldsymbol {\theta },\lambda )\) with respect to \(\theta_{v}\) and λ to zero, we have:

$$ \left\{\begin{array}{llllll} \frac{\partial{\mathcal{L}(\boldsymbol{\theta},\lambda)}}{\partial{\theta_{v}}} =r\theta_{v}^{r-1}L_{v}-\lambda=0, \ v = 1,2,\cdots,V \\ \frac{\partial{\mathcal{L}(\boldsymbol{\theta},\lambda)}}{\partial{\lambda}}=\sum\limits_{v=1}^{V}\theta_{v}-1=0. \end{array}\right. $$
(7)

Thus, \(\theta_{v}\) can be obtained as

$$ \begin{array}{llllll} \theta_{v} = \frac{\left( \frac{1}{L_{v}}\right)^{\frac{1}{r-1}}}{\sum\limits_{v=1}^{V}\left( \frac{1}{L_{v}}\right)^{\frac{1}{r-1}}}. \end{array} $$
(8)

As mentioned in Section 3, we set r > 1 to make the weight \(\theta_{v}\) negatively correlated with the loss \(L_{v}\). As a result, the larger the loss \(L_{v}\) is, the smaller the weight \(\theta_{v}\) is. If \(r \rightarrow \infty \), the weights \(\theta_{v}\) of different views approach each other, which means all features play the same role in detecting the event. If \(r \rightarrow 1\), the optimal solution to \(\boldsymbol{\theta}\) is \(\theta_{v} = 1\) for \(v=\arg \min _{v}\{L_{v}\}\) and \(\theta_{v} = 0\) otherwise, which indicates that only the best feature for the event is kept. Generally, the value of r should be determined according to the complementary property of all views. Views that possess rich complementary information prefer a large r; otherwise, a small r is suitable.
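The closed-form update (8) is straightforward to implement; below is a sketch under the assumption that every per-view loss \(L_{v}\) is strictly positive:

```python
import numpy as np

def update_theta(losses, r):
    """View weights from Eq. (8): theta_v proportional to (1 / L_v)^(1 / (r - 1)).

    losses is the (V,) vector of per-view losses L_v; requires r > 1 and L_v > 0.
    """
    inv = (1.0 / np.asarray(losses, dtype=float)) ** (1.0 / (r - 1.0))
    return inv / inv.sum()                   # normalized so the weights sum to 1
```

As r grows, the computed weights flatten toward the uniform value 1/V, matching the analysis above.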

Update primal SVM variables for each feature

With the feature weight \(\boldsymbol{\theta}\) fixed, we update \(\{W_{v},b_{v}\}_{v=1}^{V}\) with the objective function below:

$$ \begin{array}{llllll} \min\limits_{\{W_{v},b_{v}\}_{v=1}^{V}} \ \ &{\sum}_{v=1}^{V}\left( {\theta_{v}^{r}}L_{v} + \frac{C}{2}\|W_{v}\|_{2}^{2}\right) \end{array} $$
(9)

which can be decomposed into V independent problems with respect to \(W_{v}, b_{v}\):

$$ \begin{array}{llllll} \min\limits_{W_{v},b_{v}} \ \ & {\theta_{v}^{r}}L_{v} + \frac{C}{2}\|W_{v}\|_{2}^{2} \\ \Leftrightarrow \ \ \min\limits_{W_{v},b_{v}} \ \ & {\theta_{v}^{r}}{\sum}_{i=1}^{n}\left( 1-\left( W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v}\right)y_{i} \right)_{+}^{p} + \frac{C}{2}\|W_{v}\|_{2}^{2} \end{array} $$
(10)

Note that \(y_{i} \in \{-1,+1\}\) in this paper, so it holds that \(1-(W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v})y_{i} = y_{i}y_{i}-(W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v})y_{i} = y_{i}(y_{i}-(W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v}))\). Inspired by [27], we introduce the auxiliary variables \( {\alpha_{i}^{v}} = y_{i}-(W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v})\), \(1 \le i \le n\), and the objective function with respect to \(W_{v}, b_{v}\) is equivalent to:

$$ \begin{array}{llllll} \min\limits_{W_{v},b_{v},\boldsymbol{\alpha}^{v}} \ \ & {\theta_{v}^{r}}{\sum}_{i=1}^{n}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} + \frac{C}{2}\|W_{v}\|_{2}^{2} \\ s.t. \ \ \ \ \ & {\alpha_{i}^{v}} = y_{i}-\left( W_{v}^{\top}\mathbf{x}_{i}^{v}+b_{v}\right), \ i=1,2,\cdots,n \end{array} $$
(11)

In order to solve the objective function (11), we adopt the Augmented Lagrangian Method (ALM) [14] to update \(W_{v}, b_{v}\) and \(\boldsymbol{\alpha}^{v}\). To begin with, by introducing a set of Lagrangian multipliers \(\boldsymbol{\beta}^{v}\) with respect to the v-th view to take the n constraints into consideration, we have the Lagrangian function of (11) as:

$$ \begin{array}{llllll} \mathcal{L}(W_{v},b_{v},\boldsymbol{\alpha}^{v},\boldsymbol{\beta}^{v})= {\theta_{v}^{r}}{\sum}_{i=1}^{n}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} \ + \ \\ \frac{C}{2}\|W_{v}\|_{2}^{2}+(\boldsymbol{\beta}^{v})^{\top}\left( X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}\right) \end{array} $$
(12)

where \(\mathbf {1}=[1,1,\cdots ,1]^{\top }\in \mathbb {R}^{n \times 1}\) and \(\boldsymbol {\alpha }^{v}=[{\alpha _{1}^{v}},{\alpha _{2}^{v}},\cdots ,{\alpha _{n}^{v}}]^{\top }\in \mathbb {R}^{n \times 1}\). The last term is the inner product of the amounts of violation of the n constraints \({\alpha_{i}^{v}} - y_{i} + (W_{v}^{\top}\mathbf{x}_{i}^{v} + b_{v}) = 0\) with the vector \(\boldsymbol {\beta }^{v}=[{\beta _{1}^{v}},{\beta _{2}^{v}},\cdots ,{\beta _{n}^{v}}]^{\top}\in \mathbb {R}^{n \times 1}\) consisting of n Lagrangian multipliers. Next, by adding a supplemental term to (12), we get the augmented Lagrangian function of (11) as:

$$ \begin{array}{llllll} \tilde{\mathcal{L}}(W_{v},b_{v},\boldsymbol{\alpha}^{v},\boldsymbol{\beta}^{v},\eta_{v}) = {\theta_{v}^{r}}{\sum}_{i=1}^{n}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} + \frac{C}{2}\|W_{v}\|_{2}^{2} \ + \ \\ (\boldsymbol{\beta}^{v})^{\top}\left( X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}\right) + \frac{\eta_{v}}{2}\|X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}\|_{2}^{2} \end{array} $$
(13)

where \(\eta _{v} \in \mathbb {R}\) is the augmented penalty parameter with respect to the v-th view, which “augments” toward infinity, and the supplemental term \(\frac {\eta _{v}}{2}\|X_{v}^{\top }W_{v}+\mathbf {1}b_{v}-\mathbf {y}+\boldsymbol {\alpha }^{v}\|_{2}^{2}\) forces the n constraints to be satisfied. By rearranging the last two terms in (13), we have the quadratic form as follows:

$$ \begin{array}{llllll} \tilde{\mathcal{L}}(W_{v},b_{v},\boldsymbol{\alpha}^{v},\boldsymbol{\beta}^{v},\eta_{v}) = {\theta_{v}^{r}}{\sum}_{i=1}^{n}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} \ + \ \\ \frac{C}{2}\|W_{v}\|_{2}^{2} + \frac{\eta_{v}}{2}\|X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}+\frac{\boldsymbol{\beta}^{v}}{\eta_{v}}\|_{2}^{2} \end{array} $$
(14)

Compared with (13), (14) adds a term \(\frac {\boldsymbol {\beta }^{v}}{\eta _{v}}\) inside the quadratic, which is constant when updating the variables \(W_{v}, b_{v}\) and \(\boldsymbol{\alpha}^{v}\) within a single iteration. Note that \(\eta _{v}\rightarrow \infty \); thus this term approaches zero and is eventually negligible.

After obtaining the augmented Lagrangian function (14), we analyze how to update the event detector \(W_{v}, b_{v}\), the auxiliary variables \(\boldsymbol{\alpha}^{v}\), and the Lagrangian multipliers \(\boldsymbol{\beta}^{v}\), as well as how to determine the augmented penalty parameter \(\eta_{v}\) with respect to the v-th view. These variables are also updated in an alternating fashion.

First, we calculate the Lagrangian multiplier vector \(\boldsymbol{\beta}^{v}\) at the t-th iteration with \(W_{v}, b_{v}\) and \(\boldsymbol{\alpha}^{v}\) fixed. Following the strategy used in [39], we update \(\boldsymbol{\beta}^{v}\) with the amounts of violation of the n constraints:

$$ \begin{array}{llllll} \boldsymbol{\beta}^{v}_{(t)} = \boldsymbol{\beta}^{v}_{(t-1)} + (\eta_{v})_{(t)}\left( X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}\right) \end{array} $$
(15)

Here, \((\eta_{v})_{(t)}\) is monotonically non-decreasing according to Lemma 3 in [27]. Because the first two terms \({\theta _{v}^{r}}{\sum }_{i=1}^{n}(y_{i}{\alpha _{i}^{v}})_{+}^{p}\) and \(\frac {C}{2}\|W_{v}\|_{2}^{2}\) in (14) would be squeezed out by an extremely large term \(\frac {\eta _{v}}{2}\|X_{v}^{\top }W_{v}+\mathbf {1}b_{v}-\mathbf {y}+\boldsymbol {\alpha }^{v}+\frac {\boldsymbol {\beta }^{v}}{\eta _{v}}\|_{2}^{2}\), \(\eta_{v}\) cannot actually go to infinity. Therefore, \(\eta_{v}\) is generated under an upper bound, which is \(10^{5}\) in this paper.
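The multiplier update (15), together with the capped growth of \(\eta_{v}\), can be sketched as follows; the growth factor rho is our own assumption, since the text only requires \((\eta_{v})_{(t)}\) to be monotonically non-decreasing and bounded by \(10^{5}\):

```python
import numpy as np

def update_beta(beta, eta, X, W, b, y, alpha, rho=1.1, eta_max=1e5):
    """Eq. (15): beta <- beta + eta * (X^T W + 1 b - y + alpha), then grow eta.

    X is (d, n); beta, y and alpha are (n,) vectors. rho is a hypothetical
    growth factor; eta stays non-decreasing and capped at 1e5.
    """
    violation = X.T @ W + b - y + alpha      # amount of constraint violation
    beta = beta + eta * violation
    eta = min(rho * eta, eta_max)
    return beta, eta
```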

Second, with \(W_{v}, b_{v}\) and \(\boldsymbol{\beta}^{v}\) fixed, the optimization problem (14) can be decomposed into n independent problems with respect to \( {\alpha_{i}^{v}} \):

$$ \begin{array}{llllll} \min\limits_{{\alpha_{i}^{v}}} \ \ &{\theta_{v}^{r}}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} + \frac{\eta_{v}}{2}\|W_{v}^{\top}\mathbf{x}_{i}^{v}+b_{v}-y_{i}+{\alpha^{v}_{i}}+\frac{{\beta_{i}^{v}}}{\eta_{v}}\|_{2}^{2} \\ \Leftrightarrow \ \ \min\limits_{{\alpha_{i}^{v}}} \ \ &\gamma_{v}\left( y_{i}{\alpha_{i}^{v}}\right)_{+}^{p} + \frac{1}{2}\left( {\alpha_{i}^{v}}-{t_{i}^{v}}\right)^{2} \end{array} $$
(16)

where \(\gamma _{v} = \frac {{\theta _{v}^{r}}}{\eta _{v}}\) and \({t_{i}^{v}} = y_{i}-W_{v}^{\top }\mathbf {x}_{i}^{v}-b_{v}-\frac {{\beta _{i}^{v}}}{\eta _{v}}\). The objective function in (16) is easy to solve, as \({\alpha_{i}^{v}}\) is the minimizer of a single-variable, two-piece piecewise function. The research in [27] gives the result of a problem similar to ours. Based on this, for the 1-order hinge loss based primal SVM (p = 1), we get:

$$ {\alpha_{i}^{v}}= \left\{\begin{array}{llllll} {t_{i}^{v}}-y_{i}\gamma_{v} &\text{ when \ \(y_{i}{t_{i}^{v}} > \gamma_{v}\)} \\ 0 &\text{ when \ \(0 \le y_{i}{t_{i}^{v}} \le \gamma_{v}\)} \\ {t_{i}^{v}} &\text{ when \ \(y_{i}{t_{i}^{v}} < 0\)} \end{array}\right. $$
(17)
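For p = 1, the three cases of (17) translate directly into vectorized code; a sketch with our own naming, where gamma stands for \(\theta_{v}^{r}/\eta_{v}\) and t holds the values \({t_{i}^{v}}\):

```python
import numpy as np

def update_alpha(t, y, gamma):
    """Per-sample minimizer of Eq. (16) for p = 1, following the cases of Eq. (17).

    t holds the values t_i^v, y the labels in {-1, +1}, gamma = theta_v^r / eta_v.
    """
    yt = y * t                                   # decides which branch of (17) applies
    return np.where(yt > gamma, t - y * gamma,   # case y_i t_i > gamma
                    np.where(yt < 0.0, t, 0.0))  # case y_i t_i < 0, else 0
```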

Finally, the loss term of (14) has no effect on the result when fixing \(\boldsymbol{\beta}^{v}\) and \(\boldsymbol{\alpha}^{v}\) to update \(W_{v}\) and \(b_{v}\). Therefore, the optimization problem becomes:

$$ \begin{array}{llllll} \min\limits_{W_{v},b_{v}} \ \frac{C}{2}\|W_{v}\|_{2}^{2} + \frac{\eta_{v}}{2}\|X_{v}^{\top}W_{v}+\mathbf{1}b_{v}-\mathbf{y}+\boldsymbol{\alpha}^{v}+\frac{\boldsymbol{\beta}^{v}}{\eta_{v}}\|_{2}^{2} \end{array} $$
(18)

Let \(\boldsymbol {\tau }=\boldsymbol {\alpha }^{v}+\frac {\boldsymbol {\beta }^{v}}{\eta _{v}}-\mathbf {y}\); it is easy to observe that τ is a constant vector when updating \(W_{v}\) and \(b_{v}\). As a result, problem (18) turns into a 2-norm regularized Least Square Regression (LSR) problem:

$$ \begin{array}{llllll} J(W_{v},b_{v}) = \min\limits_{W_{v},b_{v}} \ \frac{C}{\eta_{v}}\|W_{v}\|_{2}^{2} + \|X_{v}^{\top}W_{v}+\mathbf{1}b_{v} + \boldsymbol{\tau}\|_{2}^{2} \end{array} $$
(19)

We set \(z_{v}= \left [ \begin {array}{c} W_{v} \\ b_{v} \end {array} \right ] \), \(A_{v}= \left [ \begin {array}{cc} X_{v}^{\top } & \mathbf {1} \\ (\frac {C}{\eta _{v}})^{\frac {1}{2}}I & \mathbf {0} \end {array} \right ] \) and \(d_{v}= \left [ \begin {array}{c} -\boldsymbol {\tau } \\ \mathbf {0} \end {array} \right ] \); then the optimization problem (19) can be turned into a standard LSR problem as follows:

$$ \begin{array}{llllll} J(W_{v},b_{v}) = J(z_{v}) = \min\limits_{z_{v}} \ \|A_{v}z_{v}-d_{v}\|_{2}^{2} \end{array} $$
(20)

By setting the derivative of (20) with respect to \(z_{v}\) to zero, we have:

$$ \begin{array}{llllll} z_{v} = \left( A_{v}^{\top}A_{v}\right)^{-1}A_{v}^{\top}d_{v}. \end{array} $$
(21)

Thus, we get the closed-form solution with respect to \(W_{v}\) and \(b_{v}\).
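For reference, the exact update (21) may be sketched as below; in practice one solves the normal equations rather than inverting \(A_{v}^{\top}A_{v}\) explicitly (names and structure are ours):

```python
import numpy as np

def solve_lsr_exact(X, tau, C, eta):
    """Closed form z_v = (A_v^T A_v)^{-1} A_v^T d_v of Eq. (21); returns (W_v, b_v).

    X is the (d, n) feature matrix of the view; tau is the (n,) constant vector.
    """
    d, n = X.shape
    A = np.block([[X.T, np.ones((n, 1))],
                  [np.sqrt(C / eta) * np.eye(d), np.zeros((d, 1))]])
    rhs = np.concatenate([-tau, np.zeros(d)])
    z = np.linalg.solve(A.T @ A, A.T @ rhs)  # solve instead of an explicit inverse
    return z[:d], z[d]
```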

However, evaluating (21) is as costly as computing a matrix inverse. Considering that it is time-consuming to handle large-volume, high-dimensional multi-view video data, such exact methods are not suitable for the complex MED task because of their relatively high time complexity. In large-scale scenarios, an approximate solution of the optimization problem is usually enough to produce a good model [17]. Motivated by this, we adopt an optimal step-size gradient descent method to update \(W_{v}\) and \(b_{v}\) more efficiently and effectively. The gradients of \(J(W_{v},b_{v})\) with respect to \(W_{v}\) and \(b_{v}\) are as follows:

$$ \left\{\begin{array}{llllll} W_{v}^{\prime}=\frac{\partial{J(W_{v},b_{v})}}{\partial{W_{v}}} = \frac{C}{\eta_{v}}W_{v}+X_{v}\left( X_{v}^{\top}W_{v}+\mathbf{1}b_{v}+\boldsymbol{\tau}\right) \\ b_{v}^{\prime}=\frac{\partial{J(W_{v},b_{v})}}{\partial{b_{v}}} = \mathbf{1}^{\top}\left( X_{v}^{\top}W_{v}+\boldsymbol{\tau}\right)+nb_{v} \end{array}\right. $$
(22)

Thus, we can get the optimal step-size \(l_{v}\) with respect to the v-th view by minimizing the single-variable quadratic function:

$$ \begin{array}{llllll} \min\limits_{l_{v}} \ \frac{C}{\eta_{v}}\|W_{v}-l_{v}W_{v}^{\prime}\|_{2}^{2} + \|X_{v}^{\top}(W_{v}-l_{v}W_{v}^{\prime})+\mathbf{1}(b_{v}-l_{v}b_{v}^{\prime}) + \boldsymbol{\tau}\|_{2}^{2} \end{array} $$
(23)

which has the explicit solution:

$$ \begin{array}{llllll} l_{v} = \frac{{W_{v}^{\prime}}^{\top}W_{v}^{\prime}+{b_{v}^{\prime}}^{2}}{\left( X_{v}^{\top}W_{v}^{\prime}+\mathbf{1}b_{v}^{\prime}\right)^{\top}\left( X_{v}^{\top}W_{v}^{\prime}+\mathbf{1}b_{v}^{\prime}\right)+\frac{C}{\eta_{v}}{W_{v}^{\prime}}^{\top}W_{v}^{\prime}}. \end{array} $$
(24)
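One full gradient step with the optimal step size can then be sketched as follows, reusing the shared residual so that only three matrix-by-vector products are needed (a sketch under our own naming):

```python
import numpy as np

def gradient_step(W, b, X, tau, C, eta):
    """One optimal-step-size gradient update for (19), following Eqs. (22)-(24).

    X is (d, n); tau is (n,). Each call costs O(n d) time.
    """
    res = X.T @ W + b + tau                  # shared residual X^T W + 1 b + tau
    gW = (C / eta) * W + X @ res             # gradient w.r.t. W_v, Eq. (22)
    gb = res.sum()                           # gradient w.r.t. b_v, Eq. (22)
    Xg = X.T @ gW + gb                       # X^T W' + 1 b'
    l = (gW @ gW + gb ** 2) / (Xg @ Xg + (C / eta) * (gW @ gW))  # step size, Eq. (24)
    return W - l * gW, b - l * gb
```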

The alternating algorithm for optimization problem (14) with respect to each feature is summarized in Algorithm 2. The time complexity of the proposed Algorithm 1 consists of V parts, where V is the number of feature types. For each view, at each iteration, Algorithm 2 only needs three matrix-by-vector multiplications with complexity \(\mathcal {O}(nd_{v})\), where \(d_{v}\) is the feature dimension of the v-th view and n is the number of samples. The several pointwise additions and multiplications between two vectors in (22) and (24) have complexity either \(\mathcal {O}(d_{v})\) or \(\mathcal {O}(n)\), which is negligible compared to \(\mathcal {O}(nd_{v})\). Therefore, the entire time complexity of the proposed Algorithm 1 is \(\mathcal {O}(Vn\bar {d})\), where \(\bar {d}\) is the average feature dimension. In large-scale scenarios, high-dimensional features are usually reduced by dimension reduction methods. For each view, the proposed Algorithm 2 has linear computational cost with respect to the number of exemplars n, which is much lower than the standard LSR solution that is as costly as computing a matrix inverse.

Algorithm 2: The ALM-based fast SVM solver for problem (14) with respect to each feature

5 Experiment

To demonstrate the effectiveness and superiority of the proposed framework, in this section we conduct a thorough experimental evaluation on real-world datasets and compare with other state-of-the-art methods for complex event detection.

5.1 Datasets

We evaluate on two large-scale real-world datasets: TRECVID MEDTest 2013 and TRECVID MEDTest 2014, which were collected by NIST for the TRECVID competition. The datasets consist of about 30,000 videos covering 30 events of interest, with 100 positive examples per event. Specifically, we use the videos of events E006 to E015 in the MEDTest 2013 dataset and of events E021 to E040 in the MEDTest 2014 dataset. Please refer to Table 1 for the complete list of event names. Several examples from the datasets are illustrated in Fig. 3.

Table 1 30 Events of TRECVID MEDTest 2013 and 2014
Fig. 3

Example videos with respect to four events from the two datasets

In order to evaluate the performance of MED with multi-feature combination, we adopt four types of features:

  • SIN [28]: The SIN feature derives from the TRECVID Semantic Indexing (SIN) Task and contains 346 kinds of concepts. These concepts include objects, actions, scenes, attributes and non-visual concepts which are all the basic elements for an event, e.g., Baby, Outdoor, Sitting down.

  • YFCC [35]: The YFCC feature derives from the Yahoo Flickr Creative Commons (YFCC100M) data, which contains 0.8 million amateur videos from Flickr and 609 classes of concepts. For example, the top 5 concepts detected in the YFCC100M dataset are Outdoor, Indoor, People, Nature, and Architecture.

  • SPORTS [20]: The SPORTS feature derives from the YouTube Sports-1M dataset, which consists of 1 million YouTube videos belonging to a taxonomy of 487 classes of sports. These classes are arranged in a manually-curated hierarchical taxonomy. For example, it contains 6 different types of bowling and 7 different types of American football.

  • DIY [46]: The DIY feature derives from Do It Yourself (DIY) data, which is collected from online archives such as Creative Commons, Youku, Tudou and YouTube in an unsupervised fashion. These instructional videos are designed to facilitate learning for humans and cover 1,601 concepts such as Yoga, Juggling, and Cooking.

5.2 Comparison methods and experimental setup

We compare the proposed algorithm with the following important baselines:

  • Standard Least Square Regression (S-LSR): This algorithm turns the optimization function into a standard LSR problem when using the augmented Lagrangian method (ALM) to solve the SVM problem. An exact solution can be obtained with this algorithm; the details are presented in Section 4.

  • Early Fusion (EF) [31]: EF is a combination scheme that runs before classification. We simply concatenate the four different features into a new high-dimensional feature space. A disadvantage of this approach is the difficulty of combining features into a common representation. The proposed fast SVM solver is used for the classification.

  • Late Fusion (LF) [31]: LF happens after classification and focuses on the individual strengths of modalities. We train a classifier on each feature and then combine their predictions by averaging. This scheme needs more computational effort and may lose the correlations in the mixed feature space. The proposed fast SVM solver is used for the classification on each feature.

  • Early Fusion with Principal Component Analysis (EF-PCA): EF-PCA is a modified version of EF based on PCA. Different from EF, we employ PCA to reduce the dimensionality of the combined features. As in EF, the proposed fast SVM solver is used for the classification.

  • Rule-Based Multiple Kernel Learning (RBMKL) [12]: RBMKL obtains a valid kernel by taking the summation or multiplication of several valid kernels. We use RBMKL to train an SVM with the product of the combined kernels in the experiments.

  • Centered-Alignment-Based Multiple Kernel Learning (CABMKL) [10]: CABMKL is a two-stage learning algorithm. In the first step, CABMKL uses the analytical solution in [10] to determine the kernel weights. In the second step, CABMKL trains an SVM with the kernel calculated from these weights.

  • The Adaptively Weighted Feature Late Fusion: The proposed algorithm, designed for MED using an adaptively weighted multi-feature combination scheme. During each iteration, our model first updates the weight of each feature adaptively, and then applies the proposed fast SVM solver over each view for the classification.

Average precision (AP) and mean average precision (mAP) are well-known and popular measures in the field of video retrieval and classification. According to the literature [42], AP is a measure combining recall and precision for ranked retrieval results. The AP is the mean of the precision scores after each relevant sample is retrieved. Generally, AP can be calculated as follows:

$$ \begin{array}{llllll} AP = \frac{1}{m}{\sum}_{i=1}^{n}P_{i}r_{i} \end{array} $$
(25)

where m is the number of relevant samples in the dataset, n is the total number of samples, and \(P_{i}\) is the precision over the top-i retrieved samples. \(r_{i} = 1\) when the i-th sample is relevant; otherwise \(r_{i} = 0\). mAP is the average performance over all events, obtained as the mean of all AP values. A higher AP (mAP) value indicates better performance.
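A compact sketch of how AP in (25) can be computed from ranked prediction scores (function and variable names are ours; at least one relevant sample is assumed):

```python
import numpy as np

def average_precision(scores, relevant):
    """AP per Eq. (25): mean precision-at-i over the ranks of relevant samples.

    scores holds the predicted scores, relevant holds booleans; mAP is then
    the mean of the per-event AP values.
    """
    order = np.argsort(-np.asarray(scores))          # rank by descending score
    rel = np.asarray(relevant, dtype=float)[order]
    prec_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1.0)
    return float((prec_at_i * rel).sum() / rel.sum())
```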

We cross-validated the regularization parameters in the range \(\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10^{1}, 10^{2}, 10^{3}\}\), and the parameter r of the proposed method is tuned over {1.1, 5, 10, 15, 20, 25, 30}. For simplicity, we set p = 1 in our experiments. We report the best results for each algorithm. In particular, we report the average of the mAP values over the four features with the proposed fast SVM solver. All experiments are conducted on an 8-core Intel Xeon E5-2660 2.00 GHz Windows server with 128 GB of memory.

5.3 Experimental results analysis

We present the comparison of AP, mAP and training time in this section.

AP comparison

To begin with, we compare the AP performance of each single feature with the proposed fast SVM solver on TRECVID MEDTest 2013 and 2014, respectively. The AP performance with respect to each event is presented in Figs. 4 and 5. We observe from the experimental results that: 1) With the SPORTS feature, the proposed fast SVM solver achieves the best performance for 27 out of 30 events, indicating that sports-related concepts are useful for detecting events in TRECVID MEDTest 2013 and 2014; 2) The other features, especially SIN and YFCC, take second place on different events with varying degrees of success, which suggests that giving different weights to different features is a promising fusion strategy; 3) The DIY feature achieves better performance when detecting instructional events such as “E007: Changing a vehicle tire”, “E011: Making a sandwich”, “E014: Repairing an appliance” and “E040: Tuning musical instrument.”

Fig. 4

Comparison of single feature with the proposed method for events E006-E015 on TRECVID MEDTest 2013

Fig. 5

Comparison of single feature with the proposed method for events E021-E030 (up) and E031-E040 (bottom) on TRECVID MEDTest 2014

We also present in Figs. 6 and 7 the AP values of the fused features for all comparison models, as well as the best single feature, for each event of TRECVID MEDTest 2013 and 2014, respectively. It can be concluded that: 1) The AP values of early fusion, which simply concatenates different features, trail behind those of the other methods for 22 out of 30 events, which indicates that such a naive combination of different features is likely to degrade the performance of the classifier; 2) Apart from the proposed algorithm, the kernel-based methods, i.e., RBMKL and CABMKL, are relatively effective for multi-feature combination in MED; 3) With the adaptively weighted feature fusion scheme, the proposed algorithm achieves the best or second-best performance for 25 out of 30 events. It is worth noting that our method is better than or almost equal to the best single feature, which indicates the benefit of assigning different weights to different concepts with respect to a particular event. To be specific, for events like “E022: Cleaning an appliance”, which is related to concepts contained in SIN and DIY rather than YFCC and SPORTS, our method is able to give more weight to the relevant features, whereas the other methods treat different features equally. As a result, it is reasonable that the proposed algorithm outperforms the other methods on “E022: Cleaning an appliance.”

Fig. 6

Comparison of different methods of AP performance for events E006-E015 on TRECVID MEDTest 2013

Fig. 7

Comparison of different methods of AP performance for events E021-E030 (up) and E031-E040 (bottom) on TRECVID MEDTest 2014

mAP comparison

For a fair comparison, we further compare the mAP performance among the single features as well as among the comparison models. The mAP values with respect to each single feature over the 10 events of TRECVID MEDTest 2013 and the 20 events of TRECVID MEDTest 2014 are reported in Table 2 (top). The results consistently indicate that the SPORTS feature is the best among the four features on both datasets for the MED task, followed by the SIN and YFCC features, while the DIY feature is the poorest one.

Table 2 The mAP comparison of single feature with the proposed method (top) and mAP comparison of different methods (bottom) on TRECVID MEDTest 2013 and TRECVID MEDTest 2014

We also report in Table 2 (bottom) the mAP values of the different feature fusion models over the 30 events of TRECVID MEDTest 2013 and TRECVID MEDTest 2014, respectively. The experimental results indicate that: 1) The best single feature, i.e., SPORTS, performs very well for MED, which shows the importance of feature construction and extraction in the field of multimedia analysis; 2) Fusing features after classification is better than before classification when combining multiple features for the MED task. However, the LF scheme may still harm the performance, according to the mAP values of LF and Single (average) on TRECVID MEDTest 2014; 3) Compared with EF, EF-PCA improves the mAP performance significantly because PCA reduces the dimensionality and regroups the fused features; 4) Both RBMKL and CABMKL, which are based on multiple kernel learning, are competitive classifiers for MED utilizing multiple features; 5) Our model consistently outperforms the other multi-feature methods as well as the best single feature on both TRECVID MEDTest 2013 and 2014. Specifically, our model achieves an average improvement of 16.5% in terms of mAP compared with the fast SVM over a single feature, which is widely used in MED competitions. This result indicates that the proposed adaptively weighted feature fusion framework is well suited for MED with multiple features.

Training time comparison

Finally, in order to evaluate the efficiency of the proposed algorithm, we compare it with S-LSR, where the former employs the gradient descent method while the latter solves a standard LSR problem. Specifically, we measure the training time of these two methods with the stopping condition that the variation ratio of the objective is less than \(10^{-4}\). The results are listed in Tables 3 and 4, from which it can be seen that the proposed algorithm delivers stable performance and is on average faster than its competitor on both TRECVID MEDTest 2013 and 2014. In addition, the advantage of the proposed algorithm is more obvious for high-dimensional datasets. For example, when the dimensionality of the features increases from 346 (SIN) to 1,601 (DIY) on TRECVID MEDTest 2013, the training time of the proposed algorithm increases by less than 2 times, while the training time of S-LSR increases by more than 15 times. As a result, the proposed fast SVM solver is efficient for large-scale, high-dimensional datasets.

Table 3 The training time (seconds) of our method and standard LSR solution over TRECVID MEDTest 2013 with different features
Table 4 The training time (seconds) of our method and standard LSR solution over TRECVID MEDTest 2014 with different features

Convergence analysis

In order to verify the convergence of the proposed alternating optimization algorithm, we conduct experiments on both TRECVID MEDTest 2013 and 2014 to show the efficiency of our method. As the experimental results on all 30 events are similar, we only present the convergence curves for several events, i.e., “Birthday party,” “Making a sandwich,” “Attempting a bike trick” and “Beekeeping.” Figure 8 shows these convergence curves. It can be seen that the objective function value converges within 20 iterations. As a result, the convergence experiments demonstrate the efficiency of our alternating algorithm.

Fig. 8

Convergence curves of the objective function value in (4) using Algorithm 1 for the events “Birthday party,” “Making a sandwich,” “Attempting a bike trick” and “Beekeeping.” The figures show that the objective function value monotonically decreases until convergence by applying the proposed algorithm

6 Conclusion

In this paper, we have explored how to fuse different features of multimedia data for MED effectively and efficiently. Considering the different impacts of different features on each event, we designed an adaptively weighted combination framework for multiple features to enhance the performance of complex event detection. Moreover, for the large-scale scenario, an approximate solution obtained by gradient descent is employed within the proposed alternating optimization algorithm to mitigate the heavy computational burden. Extensive experimental results on the TRECVID MEDTest 2013 and 2014 datasets demonstrate the effectiveness, efficiency and superiority of the proposed method. In the future, we intend to take into account low-level features such as SIFT and MoSIFT, as well as CNN features, in addition to the concept-based high-level features used in this paper.