1 Introduction

Intelligent visual surveillance has made great progress in recent years [14–16]. However, crowd scene analysis still poses challenges such as emergent behaviors and self-organizing activities [20]. Following the definition in the Oxford English Dictionary, abnormal events can be regarded as irregular or rare events in contrast with normal ones. Data on normal events in video are cheap to obtain, whereas data on anomalies are expensive to collect. Thus, the anomaly detection task can be cast as a one-class learning (outlier detection) problem [10] that identifies abnormalities (outliers) based only on normal training samples.

Outlier detection approaches are usually classified into four categories based on the techniques used: distribution-based, distance-based, density-based and deviation-based approaches [29]. Representative outlier detection methods include the resolution-based outlier factor [30], influenced outlierness [31], and the angle-based outlier degree [32]. However, outlier detection remains highly challenging, mainly for three reasons: (1) defining the normal behavior or region is difficult [28]; (2) an existing notion of normal behavior might not be sufficiently representative in the future; (3) the boundary between normal and outlying behavior is often fuzzy.

Kernel methods are well known for their learning ability. Nevertheless, a predefined kernel function might not always be appropriate for a given one-class learning problem, so we face the problem of finding a suitable kernel function and the associated feature space. To cope with this, we learn a kernel matrix (also known as a “Gram matrix”) that embeds the training data into a Hilbert space where the one-class learning problem can be solved easily.

Non-negative matrix factorization (NMF) [1, 2] decomposes non-negative data as a product of two non-negative matrices. The non-negativity constraint leads NMF to a parts-based representation of the object in the sense that it only allows additive, not subtractive, combinations of the original data. NMF is widely used in computer vision applications such as face recognition [1], clustering [9] and action recognition [3]. NMF is also used for subspace learning by producing a projection bases matrix [8, 24, 27]. However, NMF has its shortcomings [8, 22], so it is almost impossible to obtain the desired feature space by NMF alone in one-class learning. To solve this intractable problem, we need a method to steer the process of learning the feature space.

The anomaly detection task involves normal examples and abnormal examples. From a human cognition viewpoint, we can easily represent a normal example by other normal ones, but representing an abnormal example, which rarely happens, by normal ones is hard. Therefore, it is reasonable to assume that normal examples have a self-similarity property; mathematically, this is equivalent to assuming that the training instances cluster well in some feature space. Thus, we utilize this clustering property to steer the learning of the feature space. In the proposed method, we use support vector data description (SVDD) [4] to measure the degree to which the instances cluster. SVDD describes a set of objects by finding a hypersphere with minimal radius that minimizes the chance of accepting outliers; it is an effective method for describing one-class data because it finds support vectors while allowing outliers.

By appending the clustering constraint to the process of learning the feature space, we obtain a suitable feature space mapping and the associated kernel function. Our algorithm converges to a locally optimal solution by applying an alternating optimization approach. Based on these considerations, a novel unsupervised framework named Unsupervised Kernel Learning with Clustering Constraint (CCUKL) is proposed to solve the unsupervised anomaly detection task in video surveillance.

The rest of the paper is organized as follows. The proposed method is introduced in Sect. 2. In Sect. 3, we formulate the CCUKL framework and present the method that solves the corresponding optimization problem. We present experimental results on several anomaly detection problems using publicly available datasets in Sect. 4. Finally, we summarize the approach and outline directions for future research.

2 Proposed method

As stated above, we utilize the self-similarity property of training samples to learn a suitable feature space. As shown in Fig. 1, the main idea is to define an unknown kernel matrix and then solve for it with a matrix learning method under a clustering constraint. During feature space learning, the self-similarity property controls the learning direction and evaluates the quality of the learned space.

Fig. 1

Unsupervised kernel learning. Normal examples have the self-similarity property in human cognition; mathematically, training instances cluster well in the obtained feature space. The clustering degree of instances in the feature space reflects the quality of the learned feature space. The center of the hypersphere converges in the process of kernel matrix learning

Given an input training set \(X\in {\mathbb {R}}^{\mathrm{m}\times \mathrm{n}}\) and an embedding space \(\mathcal{F}\), we consider a feature space mapping \(\varphi :{\mathbb {R}}^m\rightarrow \mathcal{F}\), where \(\varphi ( x)=G^\mathrm{T}x, x\in {\mathbb {R}}^\mathrm{m},G\in {\mathbb {R}}^\mathrm{{m}\times \mathrm {k}},G \succcurlyeq 0\). The corresponding kernel function can be defined as \(\kappa ( {x,y})=( {\varphi ( x),\varphi ( y)})=x^\mathrm{T}GG^\mathrm{T}y\), where \(\kappa :{\mathbb {R}}^m\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}\). \(\kappa \) is a positive definite kernel since it satisfies Mercer’s condition [5], and the symmetric positive semidefinite matrix \(GG^\mathrm{T}\) is the kernel matrix that needs to be learned.
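For concreteness, the following minimal numpy sketch (not part of the paper) shows the linear feature map \(\varphi(x)=G^\mathrm{T}x\) and the kernel it induces; the dimensions and the random non-negative \(G\) are purely illustrative, since in CCUKL \(G\) is learned.

```python
import numpy as np

# Minimal sketch: a non-negative linear feature map phi(x) = G^T x
# and the induced kernel k(x, y) = x^T G G^T y.
rng = np.random.default_rng(0)
m, k = 16, 20                                   # illustrative input/feature dimensions
G = np.abs(rng.standard_normal((m, k)))         # G >= 0; learned in practice

def phi(x, G):
    """Map an input vector into the feature space."""
    return G.T @ x

def kernel(x, y, G):
    """Kernel induced by the mapping: k(x, y) = <phi(x), phi(y)>."""
    return phi(x, G) @ phi(y, G)

x, y = np.abs(rng.standard_normal(m)), np.abs(rng.standard_normal(m))
assert np.isclose(kernel(x, y, G), x @ (G @ G.T) @ y)   # same value via the matrix G G^T
```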

We denote the clustering parameter as De, and denote the self-similarity constraint on instances in feature space \(\mathcal{F}\) as \(C_\mathcal{F} (G,De,X)\le 0\). For two given machine learning methods \(ml_1 ,ml_2 \), we minimize the sum of the cost functions \(F_{ml_1 } (G)\) and \(F_{ml_2 } (De)\) under the non-negativity constraint on \(G\) and the clustering constraint. Therefore, Unsupervised Kernel Learning with Clustering Constraint (CCUKL) is an optimization problem that can be written as:

$$\begin{aligned} \mathop \mathrm{argmin}\limits _{G,De} \lambda F_{ml_1 } ( G)+F_{ml_2 } ( {De})\,\quad s.t.\,\, C_\mathcal{F} (G,De,X) \le 0,G\succcurlyeq 0 \end{aligned}$$
(2.1)

Non-negative matrix factorization (NMF) is used as method \(ml_1\) to produce the bases matrix \(G\) serving as the projection in the mapping \(\varphi ( x)\). Let \(X\in {\mathbb {R}}^\mathrm{{m}\times \mathrm {n}}\) be a non-negative matrix with \(n\) examples in its columns. NMF aims to find two non-negative matrices, the bases matrix \(G\in {\mathbb {R}}^\mathrm{{m}\times \mathrm {k}}\) and the coefficients matrix \(H\in {\mathbb {R}}^\mathrm{{k}\times \mathrm {n}}\), such that \(X\approx GH\). The unknown matrices \(G\) and \(H\) are estimated by minimizing the reconstruction error \(\vert \vert X-\mathrm{GH}\vert \vert _F^2 \) or the Kullback–Leibler divergence \(D(X\vert \mathrm{GH})\) [2]. The bases matrix \(G\) has recently been used as the projection matrix in the mapping \(\varphi ( x)\) [8, 24], whereas classical NMF-based algorithms use \(G^\dag =(G^\mathrm{T}G)^{-1}G^\mathrm{T}\) as the projection matrix. The two choices are equally valid [8], and the former is easier to work with. Support vector data description (SVDD) is used as method \(ml_2 \) to describe the instances’ self-similarity and measure their clustering degree.
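For reference, here is a minimal numpy sketch of plain NMF with the classical multiplicative updates of [2], i.e. without the clustering constraint that CCUKL adds later; initialization and iteration count are illustrative.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Plain NMF minimizing ||X - GH||_F^2 with multiplicative updates
    (Lee-and-Seung style); a baseline sketch, not the constrained CCUKL update."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    G = np.abs(rng.standard_normal((m, k)))
    H = np.abs(rng.standard_normal((k, n)))
    for _ in range(n_iter):
        H *= (G.T @ X) / (G.T @ G @ H + eps)    # coefficients update
        G *= (X @ H.T) / (G @ H @ H.T + eps)    # bases update
    return G, H
```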

We form our optimization problem by incorporating the minimal enclosing sphere constraints into the NMF formulation. While solving the optimization problem, the kernel matrix is updated iteratively and the hypersphere enclosing the training samples shrinks. Any two instances are pushed as close together as possible in the obtained feature space; owing to the differences between points and the penalty on the NMF factorization error, the distances between points do not collapse to zero.

The proposed algorithm is solved by an alternating optimization approach. More precisely, at each iteration we solve a convex (quadratic or SVDD-type) sub-problem that contains only a subset of the unknown parameters while keeping the others fixed. Finally, the obtained feature space and hypersphere can be used for anomaly detection. In practice, the proposed method judges anomalies with a single judge function, so CCUKL remains efficient when facing massive amounts of testing data.

In the testing phase, a test sample is attracted by the hypersphere if it is similar to the training samples; otherwise, it is mapped into the feature space in an unpredictable way. Notice that the hypersphere in the feature space is very small, so the probability of an abnormal sample falling into the hypersphere is quite small.

3 Algorithm

3.1 Optimization problem

The proposed method was described in Sect. 2. In this section, we present the CCUKL algorithm. Mathematically, the optimization problem is:

$$\begin{aligned} \mathop \mathrm{argmin}\limits _{G,H,C,R,\xi _i } \lambda \vert \vert X-\mathrm{GH}\vert \vert _F^2 +R^2+\frac{1}{\upsilon n}\mathop \sum \limits _{i=1}^n \xi _i \end{aligned}$$
(3.1)

Subject to

$$\begin{aligned}&\!\!\!\!\vert \vert G^\mathrm{T}x_i -C\vert \vert _\mathcal{H}^2 \le R^2 +\xi _i ,H \succcurlyeq 0,G \succcurlyeq 0,\nonumber \\&\quad R\in {\mathbb {R}}^+,C\in {\mathbb {R}}^k,\xi _1, \xi _2 ,\ldots ,\xi _n \in {\mathbb {R}}^+ \end{aligned}$$
(3.2)

First, we fix some notation: \(\vert \vert \cdot \vert \vert _F \) denotes the \(L_2 \) norm of a vector and the Frobenius norm of a matrix. We use the notation \(H\succcurlyeq 0\) to express that the elements of the matrix \(H\) are non-negative. \(X\in {\mathbb {R}}^{m\times n}\) denotes the training data with \(n\) columns, \(X=[x_1 ,\ldots ,x_n ]\), \(x_i \in {\mathbb {R}}^{{m}\times 1}\); \(G\in {\mathbb {R}}^{{m}\times {k}}\) is the bases matrix and \(H\in {\mathbb {R}}^{k\times n}\) is the coefficients matrix. \(R\) is the radius and \(C\) the center of the hypersphere in the feature space.

The first term of the above optimization problem (\(\lambda \vert \vert X-\mathrm{GH}\vert \vert _F^2 ,\lambda >0)\) is a classical NMF-type reconstruction error, while the sum of the second and third terms (\(R^2+\mathop \sum _{i=1}^n \xi _i /\upsilon n)\) is an SVDD-type cost; the parameter \(1/\upsilon n\) controls the trade-off between the radius and the errors [4]. A smaller \(\upsilon \) means fewer points become outliers; a more detailed analysis shows that \(\upsilon n\) is in fact an upper bound on the number of outliers over the training set [13]. The hypersphere’s center \(C\) can be expressed as a linear combination \(C=\sum _{i=1}^n \varepsilon _i G^\mathrm{T}x_i \) by the representer theorem [5].

Notice that the inequality constraint \(\vert \vert G^\mathrm{T}x_i -C\vert \vert _\mathcal{H}^2 \le R^2+\xi _i \) involves slack variables \({\xi }\in {\mathbb {R}}^{1\times n}\) that absorb the misclassification errors. To address this optimization problem, inspired by the literature [8, 9], we solve for subsets of the unknown parameters \(G,H,\,R,C,\xi _i \) while keeping the remaining parameters fixed at each iteration.

3.2 Update strategy

3.2.1 Update strategy for G

In this section, we solve for \(G,{\xi }\) while keeping \(H\), \(R\) and \(C\) fixed; the optimization problem in Eq. 3.1 simplifies to:

$$\begin{aligned} \mathop \mathrm{argmin}\limits _{G,\xi _i } \vert \vert X-\mathrm{GH}\vert \vert _F^2 +\frac{1}{\lambda \upsilon n}\mathop \sum \limits _{i=1}^n \xi _i \end{aligned}$$
(3.3)

Subject to

$$\begin{aligned} \vert \vert G^\mathrm{T}x_i -C\vert \vert _\mathcal{H}^2 \le R^2+\xi _i ,G \succcurlyeq 0,\xi _1 ,\xi _2 ,\ldots ,\xi _n \in {\mathbb {R}}^+ \end{aligned}$$
(3.4)

Since the constraints in Eq. 3.4 are nonlinear, we solve this optimization problem through its dual. Its Lagrangian function is:

$$\begin{aligned} \mathcal{L}(G,\xi _i ,\alpha _i ,\beta _i )&= tr[ {-2X^\mathrm{T}GH+H^\mathrm{T}G^\mathrm{T}GH} ]\nonumber \\&+\frac{1}{\lambda \upsilon n}\mathop \sum \limits _{i=1}^n \xi _i +\mathop \sum \limits _{i=1}^n \alpha _i (x_i ^\mathrm{T}GG^\mathrm{T}x_i \nonumber \\&-2x_i ^\mathrm{T}GC+C^\mathrm{T}C-R^2-\xi _i )\quad \nonumber \\&-\mathop \sum \limits _{i=1}^n \beta _i \xi _i -tr[\varpi ^\mathrm{T}G] \end{aligned}$$
(3.5)

where \(\alpha _i ,\beta _i ,\varpi _{i,j} \ge 0\) are the Lagrangian multipliers, and \(\varpi _{i,j} \) enforces the non-negativity constraint \(G_{i,j} \ge 0\). Setting the partial derivatives to zero, new constraints are obtained:

$$\begin{aligned}&\!\!\! 0\le \alpha _i \le \frac{1}{\lambda \upsilon n},\quad 1 \le i\le n \end{aligned}$$
(3.6)
$$\begin{aligned}&\!\!\!\! \varpi ={ 2GHH}^\mathrm{T}-{ 2XH}^\mathrm{T}+2( {o_m \alpha \circ X})X^T\!G-2X\alpha ^T\!C^T\nonumber \\ \end{aligned}$$
(3.7)

where \(o_m =[1,1,\ldots ,1]^\mathrm{T}\in {\mathbb {R}}^{m\times 1}\), \(\alpha =[\alpha _1 ,\alpha _2 ,\ldots ,\alpha _n ]\in {\mathbb {R}}^{1\times n}\), \(\varpi =[\varpi _{ij} ]\in {\mathbb {R}}^{m\times k}\).

Taking the KKT conditions [11] into consideration, we get:

$$\begin{aligned}&2( {{ GHH}^\mathrm{T}-{ XH}^\mathrm{T}+( {o_m \alpha \circ X})X^\mathrm{T}G-X\alpha ^\mathrm{T}C^\mathrm{T}})_{i,j} G_{i,j}\nonumber \\&\quad =\varpi _{i,j} G_{i,j} =0 \end{aligned}$$
(3.8)

Equation 3.8 is a fixed-point equation that the solution must satisfy at convergence. \(G_{i,j} \) can then be updated; we calculate \(G\) column by column to save computation time:

$$\begin{aligned} \tilde{G}_j =G_j \circ \sqrt{\frac{( {[{XH^\mathrm{T}} ]_j +[ {X\alpha ^\mathrm{T}C^\mathrm{T}} ]_j })}{([{GHH^\mathrm{T}} ]_j +[ {( {o_m \alpha \circ X})X^\mathrm{T}G} ]_j)}} \end{aligned}$$
(3.9)

where \(G_j \) is the \(j\)th column of \(G\) and \(\circ \) denotes the Hadamard product (element-wise multiplication).

Denoting \(\tilde{G}=[\tilde{G}_{ij} ]\), substituting the values of \(G\) and \(\varpi \) into Eq. 3.7 and simplifying, we get the dual problem:

$$\begin{aligned}&\!\!\!\mathop \mathrm{argmax}\limits _\alpha tr [ {( {X-\tilde{G}H})^\mathrm{T}( {X-\tilde{G}H})} ] \nonumber \\&-2tr[(\tilde{G}HH^\mathrm{T}-XH^\mathrm{T}+( {o_m \alpha \circ X})X^\mathrm{T}\tilde{G}-X\alpha ^\mathrm{T}C^\mathrm{T})^\mathrm{T}\tilde{G}] \nonumber \\&+\alpha (X^\mathrm{T}\tilde{G}\tilde{G}^\mathrm{T}X\circ I_n )o_n -2\alpha X^\mathrm{T}\tilde{G}C+\alpha o_n (C^\mathrm{T}C-R^2)\nonumber \\ \end{aligned}$$
(3.10)

Subject to

$$\begin{aligned} 0\le \alpha _i \le \frac{1}{\lambda \upsilon n},\quad 1\le i\le n \end{aligned}$$
(3.11)

where \(I_n \) denotes the identity matrix of size \(n\).

The above problem is quadratic in \(\alpha \) and can therefore be solved with conventional quadratic programming tools. The estimated \(\alpha \) is then used to update \(G\) via Eq. 3.9; notice that large values of \(\lambda \) (compared to \(1/\upsilon n)\) result in small \(\alpha _i \).
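For illustration, assuming the dual vector \(\alpha\) has already been obtained from Eqs. 3.10–3.11 with a generic QP solver, the multiplicative update of Eq. 3.9 can be sketched in numpy as follows (all columns updated at once); this is a sketch under the notation of Sect. 3, not the authors' code.

```python
import numpy as np

def update_G(G, H, X, alpha, C, eps=1e-9):
    """Multiplicative column-wise update of the bases matrix G (Eq. 3.9).
    alpha is the 1 x n dual vector from the QP of Eq. 3.10 and C is the
    current hypersphere center."""
    m, n = X.shape
    alpha = alpha.reshape(1, n)
    C = C.reshape(1, -1)                               # 1 x k row vector
    numer = X @ H.T + X @ alpha.T @ C                  # [X H^T] + [X alpha^T C^T]
    denom = G @ H @ H.T + (alpha * X) @ X.T @ G        # [G H H^T] + [(o_m alpha o X) X^T G]
    return G * np.sqrt((numer + eps) / (denom + eps))  # Hadamard product, element-wise sqrt
```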

3.2.2 Update strategy for R, C

In this section, we solve for the minimal enclosing sphere, minimizing its radius while keeping the bases matrix \(G\) and the coefficients matrix \(H\) fixed; the optimization problem in Eq. 3.1 then reduces to a classical SVDD problem:

$$\begin{aligned} \mathop \mathrm{argmin}\limits _{C,R,\xi _i } R^2+\frac{1}{\upsilon n}\mathop \sum \limits _{i=1}^n \xi _i \end{aligned}$$
(3.12)

Subject to

$$\begin{aligned} \left| {\left| {G^\mathrm{T}x_i -C} \right| } \right| _\mathcal{H}^2 \le R^2+\xi _i ,C\in {\mathbb {R}}^k,\xi _1 ,\xi _2 ,\ldots ,\xi _n \in {\mathbb {R}}^+ \end{aligned}$$
(3.13)

We introduce the Lagrangian function:

$$\begin{aligned}&\!\!\!\!\mathcal{L}(R,\xi _i ,C,\tau _i ,\varrho _i )=R^2+\frac{1}{\upsilon n}\mathop \sum \limits _{i=1}^n \xi _i\nonumber \\&\quad +\mathop \sum \limits _{i=1}^n \tau _i (x_i ^\mathrm{T}GG^\mathrm{T}x_i -2x_i ^\mathrm{T}GC+C^\mathrm{T}C-R^2-\xi _i )-\mathop \sum \limits _{i=1}^n \varrho _i \xi _i\nonumber \\ \end{aligned}$$
(3.14)

where \(\tau _i ,\varrho _i \ge 0\) are the Lagrangian multipliers and \(\tau =[\tau _1 ,\tau _2 ,\ldots ,\tau _n ]\in {\mathbb {R}}^{1\times n}\). Setting the partial derivatives to zero, new constraints are obtained:

$$\begin{aligned}&0\le \tau _i \le \frac{1}{\upsilon n},\quad \mathop \sum \limits _{i=1}^n \tau _i =1,1\le i\le n \nonumber \\&C=\mathop \sum \limits _{i=1}^n \frac{\tau _i ( {G^Tx_i })}{\mathop \sum \nolimits _{i=1}^n \tau _i }=\mathop \sum \limits _{i=1}^n \tau _i ( {G^Tx_i }) \end{aligned}$$
(3.15)

Resubstituting gives the dual problem to be maximized with respect to \(\tau \):

$$\begin{aligned} \mathop \mathrm{argmax}\limits _\tau \tau (X^\mathrm{T}GG^\mathrm{T}X\circ I_n )o_n -\tau X^\mathrm{T}GG^\mathrm{T}X\tau ^\mathrm{T} \end{aligned}$$
(3.16)

Subject to

$$\begin{aligned} \mathop \sum \limits _{i=1}^n \tau _i =1,0\le \tau _i \le \frac{1}{\upsilon n},1\le i\le n \end{aligned}$$
(3.17)

Then, \(R\) can be updated as follows:

$$\begin{aligned} R^2=x_k ^\mathrm{T}GG^\mathrm{T}x_k -2\tau X^\mathrm{T}GG^\mathrm{T}x_k +\tau X^\mathrm{T}GG^\mathrm{T}X\tau ^\mathrm{T} \end{aligned}$$
(3.18)

where \(G^\mathrm{T}x_k \) is one of the support vectors on the sphere boundary. In this way, the minimal enclosing sphere parameters \(R\) and \(C\) are obtained.
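The SVDD sub-problem can be handled by any quadratic programming tool. The following Python sketch (ours, not the paper's implementation) uses SciPy's SLSQP solver as a stand-in to maximize Eq. 3.16 under Eq. 3.17 and then recovers \(C\) and \(R\) via Eqs. 3.15 and 3.18.

```python
import numpy as np
from scipy.optimize import minimize

def svdd_step(X, G, nu, tol=1e-6):
    """SVDD sub-problem (Eqs. 3.12-3.18) with G, H fixed: solve the dual for tau,
    then recover the center C and radius R."""
    n = X.shape[1]
    Phi = G.T @ X                          # mapped samples, k x n
    K = Phi.T @ Phi                        # kernel matrix X^T G G^T X

    def neg_dual(tau):                     # maximize tau diag(K) - tau K tau^T
        return -(tau @ np.diag(K) - tau @ K @ tau)

    cons = ({'type': 'eq', 'fun': lambda t: t.sum() - 1.0},)
    bounds = [(0.0, 1.0 / (nu * n))] * n
    tau0 = np.full(n, 1.0 / n)
    tau = minimize(neg_dual, tau0, bounds=bounds, constraints=cons, method='SLSQP').x

    C = Phi @ tau                          # C = sum_i tau_i G^T x_i (Eq. 3.15)
    # a support vector strictly inside the box lies on the sphere boundary
    on_boundary = np.where((tau > tol) & (tau < 1.0 / (nu * n) - tol))[0]
    k_idx = on_boundary[0] if len(on_boundary) else int(np.argmax(tau))
    R2 = K[k_idx, k_idx] - 2 * (tau @ K[:, k_idx]) + tau @ K @ tau   # Eq. 3.18
    return tau, C, np.sqrt(max(R2, 0.0))
```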

3.2.3 Update strategy for H

After the above two steps, the next task is to update the coefficients matrix \(H\), since we have already acquired the values of \(G\), \(R\) and \(C\). Keeping all the other variables fixed, the objective function in Eq. 3.1 simplifies to:

$$\begin{aligned} \mathop \mathrm{argmin}\limits _H \vert \vert X-\mathrm{GH}\vert \vert _F^2 \end{aligned}$$
(3.19)

Subject to

$$\begin{aligned} H\succcurlyeq 0 \end{aligned}$$

The Lagrangian of the above cost function is:

$$\begin{aligned} \mathcal{L}( {H,\eta })=tr[-2X^\mathrm{T}GH+H^\mathrm{T}G^\mathrm{T}GH-\eta ^\mathrm{T}H] \end{aligned}$$
(3.20)

where \(\eta _{i,j} \ge 0\) are the Lagrangian multipliers enforcing the non-negativity constraints \(H_{i,j} \ge 0\). Setting the partial derivatives to zero, new constraints are obtained:

$$\begin{aligned} \eta =-2G^\mathrm{T}X+2G^\mathrm{T}GH \end{aligned}$$
(3.21)

Taking the KKT conditions [11] into consideration, we get:

$$\begin{aligned} 2( {-[ {G^\mathrm{T}X} ]+[ {G^\mathrm{T}GH} ]})_{i,j} H_{i,j} =\eta _{i,j} H_{i,j} =0 \end{aligned}$$
(3.22)

Equation 3.22 is a fixed-point equation that the solution must satisfy at convergence. \(H_{i,j} \) can then be updated; notice that \(h_j \) (the \(j\)th column of \(H)\) contributes only to the \(j\)th data point \(x_j \), so the columns of \(H \) can be solved independently as follows:

$$\begin{aligned} H_j =H_j \circ \sqrt{\frac{([ {G^\mathrm{T}X} ]_j )}{([ {G^\mathrm{T}GH} ]_j )}} \end{aligned}$$
(3.23)

Experiments show that this update strategy outperforms optimizing the primal problem directly, which is more time consuming and less accurate in practice.
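A minimal numpy sketch of the multiplicative update of Eq. 3.23, applied to all columns at once, reads as follows.

```python
import numpy as np

def update_H(G, H, X, eps=1e-9):
    """Multiplicative update of the coefficients matrix H (Eq. 3.23);
    each column could be updated independently, here done jointly."""
    return H * np.sqrt((G.T @ X + eps) / (G.T @ G @ H + eps))
```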

3.3 Judge rule

In this section, we discuss the criterion for raising an anomaly alarm. Given an instance vector \(x\), which denotes a patch in a frame, we map it into the feature space through \(\varphi ( x)\) and then decide whether to alarm. The judge function is:

$$\begin{aligned} \fancyscript{F}( x)=\vert \vert G^\mathrm{T}x-C\vert \vert _\mathcal{H}^2 -R^2 \end{aligned}$$
(3.24)

The new patch’s \(t\) historical normal patches in the preceding \(t\) frames are denoted as \(X_\mathrm{h} =[x_\mathrm{h1} ,x_\mathrm{h2} ,\ldots ,x_\mathrm{ht} ]\). Given a patch \(x\) and its \(t\) historical normal patches \(X_\mathrm{h} \), we judge \(x\) using the permutation hypothesis testing method [25].

Now, we address the hypothesis testing problem by denoting \(H_0:F_\mathrm{N} =F_\mathrm{A} \), where \(F_\mathrm{N} \) and \(F_\mathrm{A} \) are the distribution functions of \(\fancyscript{F}(X_\mathrm{h} )\) and \(\fancyscript{F}(x)\), respectively, i.e. \(\fancyscript{F}(x_\mathrm{h1} ),\,\fancyscript{F}(x_\mathrm{h2} ),\ldots ,\,\,\fancyscript{F}(x_\mathrm{ht} )\sim F_\mathrm{N} \) and \(\fancyscript{F}(x)\sim F_\mathrm{A} \). To judge whether a patch is normal or not, we compute the \(p\) value, which is the minimal significance level at which \(H_0 \) is rejected. The permutation hypothesis testing method includes three steps:

  1. Compute the observed statistic \( T_\mathrm{obs} = T( {\fancyscript{F}( {x_\mathrm{h1} }),\fancyscript{F}( {x_\mathrm{h2} }),\ldots ,\fancyscript{F}( {x_\mathrm{ht} }),\fancyscript{F}( x)})\) \(=\vert \frac{\mathop \sum \nolimits _i \fancyscript{F}( {x_\mathrm{hi} })}{t}-\fancyscript{F}( x)\vert \).

  2. Permute \(\fancyscript{F}({x_\mathrm{h1} }),\fancyscript{F}\!({x_\mathrm{h2} }),\ldots ,\fancyscript{F}\!({x_\mathrm{ht} }),\fancyscript{F}\!(x)\) randomly \(B\) times, computing \(T\) each time; denote the resulting values \(T_1 ,T_2 ,\ldots ,T_B \).

  3. Compute the \(p\) value:
     $$\begin{aligned} p\approx \frac{1}{B}\mathop \sum \limits _{i=1}^B I(T_i >T_\mathrm{obs} ). \end{aligned}$$
     (3.25)

By setting the significance level \(\mathrm{a}\), we can judge a patch by computing its \(p\) value; more precisely, we raise an alarm when \(p<\mathrm{a}\). During real-time testing, we judge each patch in each frame, so we can not only detect anomalies but also localize them.
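As an illustration of the judge rule and the permutation test (Eqs. 3.24 and 3.25), the following Python sketch (not the authors' code) scores a patch against its history and returns an alarm decision; the number of permutations B and the variable names are illustrative.

```python
import numpy as np

def judge(x, X_hist, G, C, R, B=1000, alpha_level=0.05, seed=0):
    """Judge rule of Sect. 3.3: score a patch with F(x) = ||G^T x - C||^2 - R^2
    and compare it to the scores of its t historical normal patches
    with a permutation test (Eq. 3.25)."""
    rng = np.random.default_rng(seed)
    score = lambda v: np.sum((G.T @ v - C) ** 2) - R ** 2
    f_hist = np.array([score(h) for h in X_hist.T])   # F on the t history patches
    f_x = score(x)
    T_obs = abs(f_hist.mean() - f_x)

    pooled = np.append(f_hist, f_x)
    T_perm = np.empty(B)
    for b in range(B):                                # random relabelings
        perm = rng.permutation(pooled)
        T_perm[b] = abs(perm[:-1].mean() - perm[-1])
    p = np.mean(T_perm > T_obs)                       # Eq. 3.25
    return p < alpha_level                            # True -> raise an alarm
```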

3.4 CCUKL analysis

Inspired by the EM algorithm and alternating projections, we solve a set of convex sub-problems, each of which is guaranteed to converge, so the objective in Eq. 3.1, which decreases monotonically, converges to a local minimum. We show experimental results on the convergence behavior in Sect. 4.

Parameter selection is important in our framework. Traditional NMF often seeks a low-dimensional representation, so the dimension \(k\) should be small, and a small \(k\) also reduces the computational complexity, i.e. \(k<\mathrm{min}(m,n)\). However, we find that the proposed algorithm converges faster when \(k\) is larger, and a large \(k\) ensures that the mapped space has a sufficiently large dimension. Therefore, \(k\) should be chosen as a trade-off between these two factors. Large \(\lambda \) and \(\upsilon \) reduce the effect of the similarity constraint and speed up obtaining the feature space. Besides, the parameter \(\upsilon \) has a similar effect on generalization as in the original SVDD.

We select combinations of the parameter values on a finite grid, following the idea in [33], as shown in Table 1. This is sufficient for algorithm comparisons.

Table 1 A finite grid of parameter values

In practice, one has to try many parameter choices during model selection. We use tenfold cross-validation [34] to evaluate the choices of parameters. In our experiments, \(\lambda \) and \(\upsilon \) are set to \(1\), and \(\mathrm {k}\) is set to \(1.2m\). The significance level \(\mathrm{a}\) in Sect. 3.3 is set to 0.05; a small \(\mathrm{a}\) implies a low tolerance for type II errors. In this way, we avoid threshold settings, which may differ from scene to scene.

The CCUKL framework mainly comprises four parts: inputs, initialization, training and testing; note that the training part does not need many normal samples. Moreover, CCUKL does not take much time for training and testing in practice; the most time-consuming part is actually the computation of optical flow. The CCUKL algorithm is summarized below:

Algorithm 1 (CCUKL) [algorithm listing shown as a figure in the original]
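The listing can be paraphrased by the following high-level Python sketch of the alternating optimization; it assumes the update_G, svdd_step and update_H sketches given above are in scope, and solve_alpha_qp is a hypothetical placeholder for the QP of Eqs. 3.10–3.11.

```python
import numpy as np

def solve_alpha_qp(X, G, H, C, R, lam, nu):
    """Hypothetical placeholder for the QP of Eqs. 3.10-3.11; plug in any generic
    QP solver here. Returning zeros reduces the G step to a plain NMF update."""
    return np.zeros(X.shape[1])

def ccukl_train(X, k, nu=1.0, lam=1.0, n_iter=100, seed=0):
    """High-level sketch of the CCUKL alternating optimization (Sect. 3.2),
    reusing the update_G, svdd_step and update_H sketches given earlier."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    G = np.abs(rng.standard_normal((m, k)))          # non-negative initialization
    H = np.abs(rng.standard_normal((k, n)))
    tau, C, R = svdd_step(X, G, nu)                  # initial hypersphere (Sect. 3.2.2)
    for _ in range(n_iter):
        alpha = solve_alpha_qp(X, G, H, C, R, lam, nu)   # dual variables for the G step
        G = update_G(G, H, X, alpha, C)              # Sect. 3.2.1, Eq. 3.9
        tau, C, R = svdd_step(X, G, nu)              # Sect. 3.2.2, Eqs. 3.15-3.18
        H = update_H(G, H, X)                        # Sect. 3.2.3, Eq. 3.23
    return G, C, R
```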

4 Experiment

Training samples \(X\) can be quite flexible, e.g. image patches or spatio-temporal subvolumes. In fact, \(X\) corresponds to the choice of bases, such as spatial bases, temporal bases and spatio-temporal bases [6]. We illustrate the algorithm in practice using spatial bases, dividing each frame into several patches.

Optical flow [21] is used for feature extraction, and the MHOF [6] descriptor is then used to extract features from the patches. By encoding each patch as a column of \(X\) \((X=[x_1 ,x_2 ,\ldots ,x_n ])\), the whole feature data matrix \(X \) is generated.
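To illustrate how the columns of \(X\) are assembled, the sketch below splits a dense optical flow field into a patch grid and computes a magnitude-weighted histogram of flow orientations per patch. Note that the actual MHOF descriptor of [6] is multi-scale; this single-scale version is only a simplified stand-in, and the grid size and bin count are illustrative.

```python
import numpy as np

def patch_flow_histograms(flow, grid=(4, 6), n_bins=8):
    """Simplified per-patch flow features: split an H x W x 2 optical flow field
    into a grid of patches and compute a magnitude-weighted orientation histogram
    for each patch. Each histogram becomes one column x_i of X."""
    H, W, _ = flow.shape
    mag = np.linalg.norm(flow, axis=2)
    ang = np.arctan2(flow[..., 1], flow[..., 0]) % (2 * np.pi)
    rows, cols = grid
    feats = []
    for r in range(rows):
        for c in range(cols):
            ys = slice(r * H // rows, (r + 1) * H // rows)
            xs = slice(c * W // cols, (c + 1) * W // cols)
            hist, _ = np.histogram(ang[ys, xs], bins=n_bins, range=(0, 2 * np.pi),
                                   weights=mag[ys, xs])
            feats.append(hist / (hist.sum() + 1e-9))
    return np.array(feats).T              # shape: n_bins x (rows * cols)
```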

4.1 Tests on UMN dataset

The UMN dataset is a publicly available dataset of normal and abnormal crowd videos from the University of Minnesota [26], consisting of 11 video segments. Each segment begins with normal behavior and ends with sequences of abnormal behavior. There are 7739 frames in total, at a resolution of 320 \(\times \) 240. We split each frame into 4 \(\times \) 6 local patches with no pixel overlap and extract the MHOF from each sub-region.

In Fig. 2, we illustrate the proposed algorithm on a video segment containing 625 frames. We use only 10 pairs of adjacent frames for training and test the whole segment by computing the judge function values. Whether a new frame triggers an alarm is decided by considering a certain number of historical normal frames and applying the permutation hypothesis testing method.

Fig. 2

Temporal smoothing is applied because we assume that abnormal events cannot occur in only one frame. 10 normal frames are used for training and the whole video segment is used for testing

We use only 10 pairs of adjacent frames from one video segment for training and then test all 11 video segments; the result of our method is shown in Fig. 3. We also display the ground truth and the result of the sparse reconstruction cost (SRC) method [6]. The comparison shows that the proposed method performs better.

Fig. 3

The qualitative results of abnormal event detection for the 11 video segments of the UMN dataset. The top row is the ground truth bar, where green denotes normal frames and red denotes abnormal frames. The middle row is the result of the sparse reconstruction cost (SRC) method [6]. The bottom row shows the result of our proposed method

Table 2 provides quantitative comparisons with state-of-the-art methods. The AUC of our method is 0.98, which outperforms [19], and anomalies are detected earlier than with [7, 9].

Table 2 Comparison of the proposed method with state-of-the-art methods on the UMN dataset

Figure 4 shows that the factorization error in CCUKL decreases while the feature space is being searched. Figure 5 displays the hypersphere’s center point \(C\) over the iterations; the center point of the training samples in the feature space converges. Figure 6 shows the change of the hypersphere’s radius \(R\) over the iterations: the decrease of \(R\) means that the hypersphere becomes smaller and the clustering degree becomes higher. In the end, a suitable feature space and a stable hypersphere gradually emerge.

Fig. 4

Factorization error \(\vert \vert X-{ GH}\vert \vert _F^2 \) decreases with iterations; we magnify the factorization error for iterations 50 to 100

Fig. 5

Hypersphere’s center point \(C\) converges with iterations. \(C\) is expressed as a vector; we calculate the distance (using the \(L_1\) norm) between \(C_i \) and \(C_{i+1} \), and the result shows the convergence of \(C\)

Fig. 6

Hypersphere’s radius \(R\) decreases with iterations; we magnify \(R\) for iterations 50 to 100

4.2 Tests on recessive walking sequence

We downloaded a sequence of recessive walking [19] and use it as a type of local abnormal event to evaluate the model’s performance. There are 583 frames in total, at a resolution of 270 \(\times \) 480. Each frame is divided into 4 \(\times \) 6 local patches with no pixel overlap, and features are extracted from each patch. The training data consist of 10 pairs of normal frames. The normal scene can be described as pedestrians walking in the same direction. Figure 7 shows the experimental result; both the detection time and the localization are satisfactory (Table 3).

Fig. 7

Examples from the recessive walking sequence. A pedestrian who walks in the opposite direction is abnormal; the anomaly is detected quickly and localized well

Table 3 Comparison of the proposed method with state-of-the-art methods on the recessive walking dataset

4.3 Tests on UCSD Ped1 dataset

The UCSD Ped1 dataset [6, 12, 21] contains 70 short clips, each with 200 frames, at a resolution of 158 \(\times \) 238. The training set contains 34 clips for learning normal patterns. The testing set contains 36 clips, of which a subset of 10 is provided with pixel-level binary masks identifying the regions containing abnormal events. We split each frame into 7 \(\times \) 7 local patches with a resolution of 22 \(\times \) 34 and then use the judge rule in Sect. 3.3 to determine whether each local patch is normal or not.

In the testing phase, patches are mapped into the feature space. Normal ones are attracted by the hypersphere, whereas abnormal ones fall into the feature space in an unpredictable way. Since the hypersphere is a very small region compared to the feature space, the probability that an abnormal sample falls into the hypersphere is quite small. Some image results are shown in Fig. 8. The proposed method can detect and localize abnormal events that differ from the training samples, such as bikers, skaters and vehicles.

Fig. 8

Examples of abnormal event detection and localization on the UCSD Ped1 dataset. Abnormal events such as bikers, skaters and vehicles are all detected and localized well

As defined in [21], under the frame-level measurement a frame counts as a detection if it contains at least one abnormal pixel, and under the pixel-level measurement a frame is considered correctly detected if at least 40 % of the truly anomalous pixels are detected. We compare the proposed method with the sparse reconstruction cost model (SRC) [6], MDT [21], the original SVDD with RBF kernel [4], the social force model (SF) [21] and SF-MPPCA [21]. Figure 9 displays the performance of the tested approaches on the anomaly detection task and on anomaly localization with pixel-level ground truth for the UCSD Ped1 dataset. The proposed method clearly outperforms the state-of-the-art methods.

Fig. 9

The detection and localization results of UCSD Ped1 dataset. a Frame-level ROC curve for Ped1 dataset. b Pixel-level ROC curve for Ped1 dataset

In Table 4, several evaluation measures [6, 21] are reported: the equal error rate (EER), i.e. the percentage of misclassified frames when the false positive rate equals the miss rate (ours 19 % \(<\) 25 % [21]); the rate of detection (RD) in the anomaly localization experiment (ours 51 % \(>\) 46 % [6]); and the area under the curve (AUC) (ours 50.3 % \(>\) 46.1 % [6]). The results show that CCUKL outperforms the state-of-the-art methods.

Table 4 Quantitative comparison of the proposed method with state-of-the-art methods on the UCSD Ped1 dataset
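To make the frame-level EER concrete, the following sketch (not from the paper) computes an equal error rate from frame-level anomaly scores and ground-truth labels by sweeping a threshold; variable names are illustrative.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sweep a threshold over frame-level anomaly scores and return the error rate
    at the point where the false positive rate equals the miss rate
    (label 1 = abnormal frame)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best = None
    for thr in np.unique(scores):
        pred = scores >= thr                               # alarm when score exceeds threshold
        fpr = np.mean(pred[labels == 0]) if np.any(labels == 0) else 0.0
        miss = np.mean(~pred[labels == 1]) if np.any(labels == 1) else 0.0
        if best is None or abs(fpr - miss) < best[0]:
            best = (abs(fpr - miss), (fpr + miss) / 2)
    return best[1]                                         # EER as a fraction
```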

5 Conclusions and discussion

The proposed method provides a novel unsupervised kernel learning approach to the one-class learning problem by utilizing the “self-similarity” property of training samples. In human cognition, self-similarity means that similar samples cluster clearly in a certain feature space, in mathematical terms. We apply CCUKL to accomplish the anomaly detection and localization task without any abnormal samples.

CCUKL judges anomalies with a single judge function, so it remains effective when facing massive amounts of testing data. The probability of anomalies falling into the hypersphere is quite small since the hypersphere occupies only a very small region of the feature space; for instance, it is less than 0.01 in the experiment of Sect. 4.2.

In future work, the proposed algorithm can be extended to a semi-supervised setting by considering both normal and abnormal samples. Anomalies come in different types and differ from each other, so we can apply the CCUKL algorithm to learn the feature space of any anomaly of interest. Since anomaly clustering is an interesting task, we could also perform matching to recognize a specific anomaly given an anomaly database. Last but not least, extending CCUKL to an online learning algorithm by updating the parameters incrementally is also a meaningful task.