1 Introduction

Network intrusion is a set of behaviors that harm computer security, i.e., the confidentiality, integrity, and availability of network components [16, 21, 30, 31, 33, 35, 38,39,40, 49,50,51,52]. To counter this problem, intrusion detection techniques have been designed, which can be roughly categorized into two groups: misuse detection and anomaly detection. The first group recognizes intrusions by matching patterns collected from known attacks, whereas the second group identifies intrusions [1] by detecting significant deviations from normal activities [12]. In the literature, signature-based methods [27], which learn the particular features of each attack (called its signature), have been extensively utilized. These systems are highly effective in defending against known intrusions. However, they are not sufficiently effective for large-scale network anomaly detection. This low performance is caused by the well-known four V's [2]:

- Volume: The scale and complexity of network data grow beyond Moore's law; specifically, the amount of traffic observed at each terminal increases rapidly.
- Variety: Network data typically come from multiple sources, which are not described in a uniform manner.
- Value: The value density of the data is very low. Anomaly detection is usually confronted with high-dimensional network data, some of whose features are useless and should be abandoned.
- Velocity: The anomaly detection speed must be high enough to ensure a real-time response.

Furthermore, the establishment of new signatures requires manual inspection by human experts. This is not only expensive but may also lead to serious fragility in the window between the discovery of a new attack and the construction of its signature. Anomaly detection methods are further categorized into statistical, data mining-based, and machine learning-based methods [9]. Statistical methods struggle to adapt to nonstationary variations in network traffic, resulting in higher false positive rates [24]. To avoid this limitation, many ADS applications leverage data mining techniques [23, 29], which can accurately discover understandable patterns or models from known data sets [14]. This approach can effectively characterize profiles of normal network behavior and subsequently establish classifiers to search for attacks. Evidence from many experiments has shown that this approach can substantially assist in identifying abnormal network activity.

Supervised anomaly detection methods [17, 23, 57] are highly dependent on training data collected from normal activity. Since the training data contain only historical events, the learned profiles are confined to historical patterns of normal behavior. Consequently, new activities caused by changes in the network environment are treated as deviations from the previously constructed profile. In addition, it is not easy to obtain attack-free training data in the real world, and an ADS trained on data containing hidden intrusions usually lacks the ability to detect those intrusions. To overcome the limitations of supervised ADS frameworks, the research and application of unsupervised methods has become a focus [34]. Unsupervised ADS is free from attack-tagged training data. Usually, distance-based methods cluster the data set by assigning samples with small mutual distances to a few cluster centers. However, since data points are always allocated to the nearest center, these approaches may fail to detect non-spherical clusters. Density-based spatial clustering methods require the selection of a density threshold: samples in regions below this threshold are discarded as outliers, while the remaining samples are assigned to different clusters. Even worse, it is generally difficult to choose an appropriate threshold. Another challenge in ADS is feature selection. Many existing algorithms suffer from low efficiency and poor effectiveness due to intolerably high-dimensional data. Therefore, feature selection is an essential component for performance improvement. Feature selection not only helps to reduce the computational cost, but also removes irrelevant, noisy, and redundant features to improve accuracy. In the data mining domain, feature selection is typically based on the mutual information between features and labels. In practice, however, network data contain continuous variables, which makes measuring relationships between features challenging, because the results depend heavily on the discretization method. Moreover, conventional feature selection is conducted in Euclidean space, where the locality information of the data is not exploited.

In order to avoid, or at least alleviate, the aforementioned challenges, we propose a novel large-scale anomaly detection algorithm, which can effectively handle high-dimensional network data by selecting informative features on a manifold. The key contributions of this article can be summarized as follows. First, we propose a manifold-based feature selection algorithm, in which the sophisticated correlations among multimodal network features and the locality among network data are well exploited. Second, we design a graph-based clustering method for anomaly detection, which exhibits the following advantages: i) high compatibility with graph representations, and ii) robustness to outliers. Third, comprehensive empirical comparisons are made to evaluate the performance of our method.

The remainder of this paper is organized as follows. Sec. 2 briefly reviews related work. Sec. 3 introduces our anomaly detection framework, including the manifold-based feature selection and the graph-based clustering for anomaly detection. Experimental results in Sec. 4 demonstrate the effectiveness of our method. Sec. 5 concludes and suggests some future work.

2 Related Work

Generally, our proposed method is closely related to three research topics in industrial environments: 1) feature selection algorithms, 2) unsupervised anomaly detection, and 3) abnormal event detection. We briefly review representative work on these topics in the following.

2.1 Feature Selection (FS) Algorithms

Conventional FS methods broadly fall into two classes: unsupervised methods and supervised methods. Unsupervised FS algorithms, such as Principal Component Analysis (PCA) [14], do not make use of category information (class labels). As a result, features selected by these methods do not necessarily enhance classification accuracy. In order to enhance the discriminative ability, researchers have found that performing feature selection on nonlinear features mapped from the original features is a good choice, as in isometric feature mapping (ISOMAP) [28].

Supervised linear FS algorithms, such as Linear Discriminant Analysis (LDA) [13] and Multiple Discriminant Analysis (MDA) [13], take advantage of category information in the original feature space. Features with the best discriminative ability can be acquired, and the recognition rate is better than that of approaches based on the original features or on unsupervised FS. However, LDA and MDA project the full feature space into a feature subspace, so there is no time reduction in the feature extraction stage, because all features must be extracted before the projection. Other supervised FS methods, such as the Fast Correlation-based Filter (FCBF) [10], have been presented to select among the original features so as to obtain good discriminative ability. Motivated by unsupervised nonlinear FS, supervised nonlinear FS methods, such as Kernel Discriminant Analysis (KDA) [32] and Kernel Gram-Schmidt Process PCA (FSKSPCA) [3], focus on selecting the more discriminative features in the nonlinearly mapped feature space in a supervised manner. However, none of these conventional FS methods take time consumption into account as a feature selection criterion, so they may not noticeably increase recognition speed if they select features whose extraction is time-consuming.

2.2 Anomaly Event Detection

Presently, most network anomaly detection systems follow supervised learning paradigms. However, it is generally acknowledged that labeled training data is expensive, and adopting unsupervised anomaly detection technology allows the system to be trained on unlabeled data. The competitiveness of unsupervised methods lies in their capability to detect attacks that have never been seen before. Clustering is a ubiquitous unsupervised learning method, with the objective of grouping objects into a pre-specified number of categories. Network data resulting from different attack mechanisms or normal activities have distinctive characteristics and can therefore be well distinguished from each other. K-means, a well-known clustering method, has been used to detect unknown attacks by effectively partitioning the network data space. Noticeably, both the performance and the computational cost of K-means are sensitive to the predefined number of clusters and the initialization of the cluster centers. To alleviate this problem, Wei et al. [19] proposed an improved FCM algorithm to calculate an optimal K. The authors in [11] designed a new spectral clustering method for anomaly detection, focusing on a graph-based framework over wireless sensor networks: graphs are adopted to obtain useful measurements of neighborhood information, and the data are used to project graph signals into heterogeneous subspaces. In [18], an anomaly detection framework based on the self-organizing map (SOM) is proposed, in which high-dimensional data are mapped to a low-dimensional representation while preserving the clustering and topological relations. Notably, the algorithm is sensitive to inherent parameters such as the number of neurons.

In [6], Chang et al. defined a novel notion of semantic saliency that assesses the relevance of each shot to the event of interest. They prioritized the shots according to their saliency scores, since shots that are semantically more salient are expected to contribute more to the final event analysis. In [7], the authors proposed a bi-level semantic representation analysis method. At the source level, the method learns weights for the semantic representations attained from different multimedia archives; at the concept level, it restrains the negative influence of noisy or irrelevant concepts. The authors particularly focused on efficient multimedia event detection with few positive examples, which is highly desirable in real-world scenarios. In [8], Chang et al. tackled event detection by proposing a linear algorithm augmented by feature interaction. The linear property guarantees its speed, whereas the feature interaction captures higher-order effects in the data to enhance its accuracy. The Schatten-p norm is leveraged to integrate the main linear effect and the higher-order nonlinear effect by mining the correlation between them. The resulting classification model is a desirable combination of speed and accuracy. In [4], Chang et al. proposed a novel semi-supervised feature selection framework that mines correlations among multiple tasks, and applied it to different multimedia applications. Instead of independently computing the importance of features for each task, their algorithm leverages knowledge shared across multiple related tasks, thus improving the performance of feature selection. Note that the proposed algorithm is built upon the assumption that different tasks share some common structures. In [5], Chang et al. proposed a novel compound rank-k projection (CRP) algorithm for bilinear analysis. CRP deals with matrices directly without transforming them into vectors; it therefore preserves the correlations within the matrix and decreases the computational complexity.

3 Our Anomaly Detection System

Each network data sample may have an intolerably high dimensionality. In our ADS, we first design a manifold-based selection scheme to select a few refined features from the high-dimensional representation. Thereafter, a graph-based clustering algorithm efficiently searches for the abnormal network data that are distinguishable from the others.

3.1 Manifold-based Feature Selection

In many cases, network data are unlabeled, and labeling is tedious and expensive, especially when the number of samples is large. It is therefore necessary to select informative network features without labels. To quantify the correlation between multimodal network features, two measures are defined in our approach, based on classic linear correlation and on information theory, respectively.

Given a pair of features X and Y, the linear correlation coefficient is given by the formula:

$$ R\left(X,Y\right)=\frac{\sum_{i}\left({x}_{i}-\bar{x}\right)\left({y}_{i}-\bar{y}\right)}{\sqrt{\sum_{i}{\left({x}_{i}-\bar{x}\right)}^2}\sqrt{\sum_{i}{\left({y}_{i}-\bar{y}\right)}^2}}, $$
(1)

where xi is the value of feature X for the ith sample, and \( \bar{x} \) is the mean value of feature X. Note that the value of R is restricted to lie between −1 and 1: if X and Y are completely correlated, R takes the value 1 or −1; if X and Y are totally independent, R takes the value 0. Under the assumption that a pair of multimodal features is linearly related, the linear correlation coefficient is an optimal choice to represent their correlation. However, this assumption does not always hold, since some of the correlations are nonlinear.
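
For concreteness, Eq. (1) can be sketched in a few lines of Python; the function name and the toy data below are our own illustration, not part of the original system.

```python
import numpy as np

def linear_correlation(x, y):
    """Pearson linear correlation coefficient R(X, Y) of Eq. (1)."""
    xc = x - x.mean()  # center feature X
    yc = y - y.mean()  # center feature Y
    return np.sum(xc * yc) / (np.sqrt(np.sum(xc ** 2)) * np.sqrt(np.sum(yc ** 2)))

# Toy example: two nearly linearly related features
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])
print(linear_correlation(x, y))  # close to +1
```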

To overcome the limitation of linear correlation, an information-theoretic correlation measure is introduced in our approach, based on the uncertainty of a multimodal feature. Given a feature X, its entropy is computed as:

$$ \mathrm{S}\left(\mathrm{X}\right)=-{\sum}_{\mathrm{i}=1}\mathrm{p}\left({\mathrm{x}}^{\mathrm{i}}\right){\mathrm{log}}_2\left(\mathrm{p}\left({\mathrm{x}}^{\mathrm{i}}\right)\right), $$
(2)

where p(xi) is the probability that the value xi occurs in the training samples. The conditional entropy of feature X given Y is computed as:

$$ \mathrm{S}\left(\mathrm{X}|\mathrm{Y}\right)=-{\sum}_{\mathrm{j}=1}\mathrm{p}\left({\mathrm{y}}^{\mathrm{j}}\right){\sum}_{\mathrm{i}=1}\mathrm{p}\left({\mathrm{x}}^{\mathrm{i}}|{\mathrm{y}}^{\mathrm{j}}\right){\mathrm{log}}_2\left(\mathrm{p}\left({\mathrm{x}}^{\mathrm{i}}|{\mathrm{y}}^{\mathrm{j}}\right)\right), $$
(3)

Thus, we can compute the correlation between features in terms of information gain [26]:

$$ \mathrm{G}\left(\mathrm{X}|\mathrm{Y}\right)=\mathrm{S}\left(\mathrm{X}\right)-\mathrm{S}\left(\mathrm{X}|\mathrm{Y}\right), $$
(4)

The symmetrical uncertainty is obtained by normalizing G(X|Y):

$$ \mathrm{U}\left(\mathrm{X},\mathrm{Y}\right)=\frac{\mathrm{G}\left(\mathrm{X}|\mathrm{Y}\right)}{\mathrm{S}\left(\mathrm{X}\right)+\mathrm{S}\left(\mathrm{Y}\right)}, $$
(5)
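
For discrete features, Eqs. (2)-(5) can be sketched directly in Python (continuous features would first require discretization, the difficulty noted in the Introduction). The helper names are our own, and Eq. (5) is implemented exactly as written above.

```python
import numpy as np
from collections import Counter

def entropy(x):
    """S(X), Eq. (2), for a discrete feature vector x."""
    p = np.array(list(Counter(x).values()), dtype=float) / len(x)
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    """S(X|Y), Eq. (3): expected entropy of x within each value of y."""
    x, y = np.asarray(x), np.asarray(y)
    return sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))

def symmetrical_uncertainty(x, y):
    """U(X, Y), Eq. (5): information gain G(X|Y), Eq. (4), normalized."""
    gain = entropy(x) - conditional_entropy(x, y)  # G(X|Y), Eq. (4)
    return gain / (entropy(x) + entropy(y))        # assumes non-constant features
```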

Based on the symmetrical uncertainty, given D modalities, each containing a number of features, the inter-group feature correlation between a pair of modalities is defined as:

$$ C\left(i,j\right)=\delta \sum_{{X}_{i}\in {M}_{i},\,{X}_{j}\in {M}_{j}} U\left({X}_{i},{X}_{j}\right), $$
(6)

where Xi and Xj are features belonging to modalities i and j, respectively, and δ is a factor that normalizes C(i, j) to lie between −1 and +1. The intra-group feature correlation within modality i is defined as:

$$ \mathrm{C}\left(\mathrm{i}\right)=\mathrm{C}\left(\mathrm{i},\mathrm{i}\right), $$
(7)

Based on the definition of symmetrical uncertainty, a criterion is defined to allocate the multimodal features into D groups: the inter-group feature correlation is minimized while the intra-group feature correlation is maximized. The criterion can be formulated as the following objective function, minimized over all allocations of features into groups:

$$ \mathrm{argmin}\ {\sum}_{i=1}^{D}{\sum}_{j>i}\left[\frac{C\left(i,j\right)}{C(i)}+\frac{C\left(i,j\right)}{C(j)}\right], $$
(8)

Notably, the objective function yields D modalities with minimal inter-modality correlation and balanced features in each modality. Letting N denote the number of multimodal features, the computational complexity is O(N^2), which is computationally efficient.
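
To make the criterion concrete, the sketch below scores a candidate grouping with Eq. (8) and improves it by a naive local search. This greedy sweep is purely our own illustration under the stated definitions; it is not the paper's allocation procedure and does not match the O(N^2) complexity mentioned above.

```python
import itertools
import numpy as np

def group_cost(U, groups):
    """Objective of Eq. (8). U is the pairwise symmetrical-uncertainty matrix;
    groups is a list of feature-index arrays, one per modality."""
    def C(a, b):  # summed SU between groups a and b (epsilon avoids 0-division)
        return U[np.ix_(groups[a], groups[b])].sum() + 1e-12
    return sum(C(i, j) / C(i, i) + C(i, j) / C(j, j)
               for i, j in itertools.combinations(range(len(groups)), 2))

def greedy_grouping(U, D, n_sweeps=10):
    """Hypothetical local search: move each feature to the group minimizing Eq. (8)."""
    n = U.shape[0]
    assign = np.arange(n) % D  # round-robin initialization
    for _ in range(n_sweeps):
        for f in range(n):
            costs = []
            for g in range(D):
                assign[f] = g
                groups = [np.where(assign == k)[0] for k in range(D)]
                costs.append(group_cost(U, groups))
            assign[f] = int(np.argmin(costs))
    return assign
```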

Principal angles on the Grassmannian manifold [41]

As illustrated in Fig. 1, the Grassmann manifold G(m, D) [41] is formed by the set of m-dimensional linear subspaces of R^D. Each m-dimensional linear subspace corresponds to a point on the Grassmann manifold, and can be represented as a matrix of size D × m.

Fig. 1 An illustration of samples on the Grassmannian manifold

Let M1, M2 be two matrices of size D × m. There are m principal angles between the subspaces spanned by their columns, and the ith principal angle is defined as:

$$ {\theta}_{i}={\cos}^{-1}\left({\max}_{{u}_{i}\in \beta \left({M}_1\right)}{\max}_{{v}_{i}\in \beta \left({M}_2\right)}{u}_{i}^{T}{v}_{i}\right),\quad i=1,\cdots, m; $$
(9)

where β(⋅) denotes an orthonormal basis of the column space of a matrix. Alternatively, the principal angles can be computed via the SVD of the two bases Y1 and Y2, i.e., \( {Y}_1^{T}{Y}_2=U\cos \Theta\, {V}^{T} \).
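
A minimal sketch of this SVD route follows, assuming M1 and M2 span their column spaces; the QR step supplies the orthonormal bases β(⋅).

```python
import numpy as np

def principal_angles(M1, M2):
    """Principal angles between span(M1) and span(M2), Eq. (9)."""
    Q1, _ = np.linalg.qr(M1)  # orthonormal basis of span(M1)
    Q2, _ = np.linalg.qr(M2)  # orthonormal basis of span(M2)
    # The singular values of Q1' Q2 are the cosines of the principal angles.
    sigma = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))

# Example: two random 2-dimensional subspaces of R^5
print(principal_angles(np.random.randn(5, 2), np.random.randn(5, 2)))
```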

Unsupervised feature selection by Laplacian score

In a network environment, it is common to face a large number of features, which leads to low recognition accuracy due to the curse of dimensionality. In addition, more features mean more computational cost. Thus, it is necessary to select a few representative features for multimodal recognition. As noted above, feature selection is usually performed in an unsupervised manner due to the absence of labels. Let Lr denote the Laplacian score of the rth feature, and let fri denote the value of the rth feature on the ith sample, i = 1, …, m. The graph Laplacian [15] is an m × m matrix obtained as follows.

First, a nearest-neighbor graph G with m vertices is constructed. Specifically, the ith node corresponds to x i, and an edge is constructed between vertices i and j if x i and x j are close in the kernel space, i.e., x i is among the k nearest neighbors of x j or x j is among the k nearest neighbors of x i. To describe the local structure of the data space, an m × m matrix S is constructed: if vertices i and j are connected, set \( {\mathrm{S}}_{\mathrm{i}\mathrm{j}}=\exp \left(-\frac{{\mathrm{D}}_{\mathrm{K}}\left({\mathrm{x}}_{\mathrm{i}},{\mathrm{x}}_{\mathrm{j}}\right)}{\mathrm{t}}\right) \), where t is a parameter to be tuned; otherwise Sij = 0. Given the matrix S, the graph Laplacian L is computed as:

$$ \mathrm{L}=\mathrm{D}-\mathrm{S}, $$
(10)

where D is an m × m diagonal matrix given by D = diag(S1), with 1 = [1, …, 1]^T. Letting \( {\tilde{\boldsymbol{f}}}_r={\boldsymbol{f}}_r-\frac{{{\boldsymbol{f}}_r}^TD1}{1^TD1}1 \), the Laplacian score of the rth feature is:

$$ {L}_r=\frac{{{\tilde{\boldsymbol{f}}}_r}^TL{\tilde{\boldsymbol{f}}}_r}{{{\tilde{\boldsymbol{f}}}_r}^TD{\tilde{\boldsymbol{f}}}_r}, $$
(11)

As proved in [15], the Laplacian score of a feature can be deemed as the degree to which it is consistent with the structure of the graph Laplacian. Specifically, a "good" feature is one on which a pair of corresponding multimodal samples are close to each other if and only if there is an edge between them. We therefore employ the Laplacian score as the quality measure of a feature. Consequently, a small set of informative features is selected to characterize each network data sample.
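
The whole computation of Eqs. (10)-(11) fits in a short sketch; k and t are the neighborhood size and kernel width to be tuned, and the implementation below is our own illustration rather than the exact code used in the experiments.

```python
import numpy as np
from scipy.spatial.distance import cdist

def laplacian_scores(X, k=5, t=1.0):
    """Laplacian score L_r for every feature (column) of X, Eqs. (10)-(11).
    Smaller scores indicate features more consistent with the graph structure."""
    m = X.shape[0]
    dist = cdist(X, X, 'sqeuclidean')
    # Symmetrized k-nearest-neighbor graph with heat-kernel weights
    idx = np.argsort(dist, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(m), k)
    S = np.zeros((m, m))
    S[rows, idx.ravel()] = np.exp(-dist[rows, idx.ravel()] / t)
    S = np.maximum(S, S.T)           # edge if either point is a k-NN of the other
    D = np.diag(S.sum(axis=1))       # D = diag(S 1)
    L = D - S                        # Eq. (10)
    d = np.diag(D)
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_t = f - (f @ d) / d.sum()  # remove the degree-weighted mean of f_r
        scores.append((f_t @ L @ f_t) / (f_t @ D @ f_t))  # Eq. (11)
    return np.array(scores)
```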

3.2 Abnormal Network Data Detection Using Dense Subgraph Clustering

Abnormal network data are typically small in number yet densely distributed in their feature space. In our approach, a dense subgraph clustering algorithm is proposed to discover distinguishable network data belonging to different incidents, following the pipeline shown in Fig. 2.

Fig. 2 The affinity graph constructed from network data and the abnormal behaviors (differently colored) detected

Affinity graph construction

To construct an affinity graph that describes the similarity between network data, a similarity measure is required. In our system, the Gaussian kernel is utilized to capture this relationship, i.e., \( {A}_{ij}\propto \exp \left(-{\left\Vert {y}_i-{y}_j\right\Vert}_2^2\right) \), where y denotes the refined network features selected using the Laplacian score.
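
A minimal sketch of this construction is given below; the bandwidth sigma is our own addition, since the proportionality in the kernel above leaves the scale unspecified.

```python
import numpy as np
from scipy.spatial.distance import cdist

def affinity_graph(Y, sigma=1.0):
    """Gaussian-kernel affinity matrix over the selected features Y (one row per
    sample): A_ij proportional to exp(-||y_i - y_j||^2); sigma is an assumed bandwidth."""
    A = np.exp(-cdist(Y, Y, 'sqeuclidean') / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 1.0)  # diagonal elements are one, as required in Sec. 3.2
    return A
```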

Mining subgraphs by graph shift

To effectively discover dense subgraphs from an affinity graph, two conditions are required:

1) Compatibility with graph representation: many similarity metrics, such as our multimodal feature-based similarity, are defined on binary relationships; only graph-based clustering can utilize such pairwise relations directly.

2) Robustness to outliers: many samples, such as background and highly noisy ones, may not belong to any abnormal network behavior. Methods that insist on partitioning all input network data into coherent groups, without allowing explicit outliers, may fail to preserve the structure of the data manifold.

Conventional clustering algorithms, e.g., K-means, are not suitable here because they insist on partitioning all the input data. By contrast, graph shift, which is efficient and robust for graph mode seeking, is particularly suitable for mining abnormal network data: it works directly on the graph, supports an arbitrary number of clusters, and leaves outlier points ungrouped.

Formally, we define an individual graph G = (Y, A) for each network label, where Y = {y1, y2, ⋯, yn} is the set of vertices corresponding to the network data of a category, and A is a symmetric matrix with non-negative elements. The diagonal elements of A are one, while each off-diagonal element measures the similarity between two network data samples, as detailed above. The modes of the graph G are defined as the local maximizers of the graph density function g(y) = yT Ay, where y ∈ Δn and Δn = {y ∈ Rn : y ≥ 0 and ∥y∥1 = 1}. More specifically, the similarity between network data is expressed as the edge weights of G, and the vertices represent the network data of a category. Abnormal network data therefore correspond to the vertices of strongly connected subgraphs. It is worth emphasizing that strongly connected subgraphs correspond to large local maxima of g(y) over the simplex, where g(y) approximates the average affinity score of a subgraph.

The target patterns are the local maximizers of g(y), which are detected by solving the following quadratic optimization problem:

$$ \underset{y}{\max }g(y)={y}^T\mathbf{A}y\kern0.5em \mathrm{s}.\mathrm{t}.\kern0.75em \mathrm{y}\in {\Delta}^{\mathrm{n}}, $$
(12)

Obtaining an analytic solution of (12) is difficult. Therefore, we employ replicator dynamics to find the local maxima of (12). Given an initialization y(0), a local solution y can be iteratively computed by the discrete-time version of the first-order replicator equation:

$$ {y}_i\left(t+1\right)={y}_i(t)\frac{{\left(\boldsymbol{A}y(t)\right)}_i}{y{(t)}^T\boldsymbol{A}y(t)}, $$
(13)
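
Eq. (13) can be sketched directly; the barycentric initialization and the stopping tolerance below are our own choices.

```python
import numpy as np

def replicator_dynamics(A, y0=None, n_iter=500, tol=1e-8):
    """Iterate the discrete-time replicator equation, Eq. (13), towards a local
    maximizer of g(y) = y' A y on the simplex."""
    n = A.shape[0]
    y = np.ones(n) / n if y0 is None else y0 / y0.sum()  # start at the barycenter
    for _ in range(n_iter):
        Ay = A @ y
        y_new = y * Ay / (y @ Ay)  # Eq. (13); the update stays on the simplex
        if np.abs(y_new - y).sum() < tol:
            break
        y = y_new
    return y  # vertices with non-negligible mass form the dense subgraph
```

In practice, restarting the iteration from different initializations uncovers different modes, while outlier vertices receive negligible mass in every returned y.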

Finally, to summarize Sec. 3: our ADS first selects a compact set of informative features via the manifold-based feature selection, and then detects abnormal network data by mining dense subgraphs of the affinity graph built from those features.

4 Experimental Results and Analysis

This section validates the performance of our proposed ADS through three experiments. We first evaluate the usefulness of our manifold-based FS. Then, we verify the effectiveness of the developed graph mining-based clustering algorithm. Lastly, we use KDDCup99, a standard benchmark data set, to compare our ADS with a series of FS + classifier combinations.

4.1 Manifold-based FS Evaluation

The ETH-80 image dataset [20] consists of color images of 80 objects from 8 categories: apples, tomatoes, pears, toy-cows, toy-horses, toy-dogs, toy-cars, and cups. Each category contains 10 objects with 41 views per object, spaced equally over the viewing hemisphere. The whole dataset contains 3280 images of size 128 × 128, and each color image comes with a high-quality figure-ground segmentation mask. Two types of features, the RGB-domain spin image and the PCA mask, are extracted as multimodal features for the object recognition task. As a local image descriptor, the RGB-domain spin image is extracted independently on each channel. In detail, for each channel, we build a two-dimensional histogram with bins indexed by two parameters: d, the distance from the center pixel of the patch, and i, the intensity. The d × i spin image feature from each RGB channel is extracted and stacked into a 3 × d × i dimensional feature vector. In this experiment, we set d = 2 and i = 20, and obtain a set of 120-dimensional feature vectors as the local image representation. These feature vectors are then averaged to form a global image representation. The PCA mask is a feature vector extracted by conducting principal component analysis (PCA) on the high-dimensional segmentation mask; for each image, the first 100 principal components are adopted.

As shown in Fig. 3, both with and without supervised feature selection, the recognition accuracy increases with the number of subspaces when the number of subspaces is less than 7, but decreases when it becomes larger than 7. In comparison with a single subspace, the 7-modal feature fusion brings nearly a 6% increase in recognition accuracy, which demonstrates the advantage of employing multimodal features. The curse of dimensionality is alleviated thanks to the Grassmannian-manifold-based feature selection. Apart from the supervised feature selection, the unsupervised feature selection is also evaluated through k-means clustering. Two metrics, the clustering accuracy and the mutual information, are used to measure the performance of the selected features. Specifically, given a data point x i, let r i be the obtained cluster label and s i the label provided by the ground truth. The accuracy A is defined as follows:

$$ \mathrm{A}=\frac{\sum_{\mathrm{i}=1}^{\mathrm{n}}\upomega \left({\mathrm{s}}_{\mathrm{i}},\mathrm{map}\left({\mathrm{r}}_{\mathrm{i}}\right)\right)}{\mathrm{n}}, $$
(14)

where n is the total number of data points; ω(x, y) is an indicator function that equals 1 if x = y and 0 otherwise; and map(ri) is the permutation mapping function that maps each cluster label ri to the equivalent label from the data corpus.
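
Assuming, as is common for this metric, that map(⋅) is realized with the Hungarian algorithm (the text does not specify the mapping procedure), Eq. (14) can be sketched as follows.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """Accuracy A of Eq. (14), with map(.) found by the Hungarian algorithm."""
    true_ids, clus_ids = np.unique(true_labels), np.unique(cluster_labels)
    # Contingency table: overlap between each cluster and each true class
    overlap = np.array([[np.sum((cluster_labels == c) & (true_labels == t))
                         for t in true_ids] for c in clus_ids])
    row, col = linear_sum_assignment(-overlap)  # permutation maximizing overlap
    return overlap[row, col].sum() / len(true_labels)
```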

Fig. 3 Recognition accuracy under different numbers of subspaces (with and without FS)

Fig. 4 The clustering accuracy under different numbers of subspaces (with and without supervised feature selection)

The clustering accuracies for different numbers of subspaces are depicted in Fig. 4. Despite the absence of class labels, the features obtained by our unsupervised feature selection still provide competitive discriminative ability. Let C denote the set of clusters obtained from the ground truth and C′ the set obtained from our approach. The mutual information is computed as follows:

$$ \mathrm{\mathcal{E}}\left(C,{C}^{\prime}\right)={\sum}_{{c}_{i}\in C,\,{c}_{j}^{\prime}\in {C}^{\prime}}p\left({c}_{i},{c}_{j}^{\prime}\right)\cdot {\log}_2\frac{p\left({c}_{i},{c}_{j}^{\prime}\right)}{p\left({c}_{i}\right)\cdot p\left({c}_{j}^{\prime}\right)}, $$
(15)

where p(ci) and \( p\left({c}_j^{\prime}\right) \) are the probabilities that a data point arbitrarily selected from the corpus belongs to clusters ci and \( {c}_j^{\prime} \), respectively, and \( p\left({c}_i,{c}_j^{\prime}\right) \) is the probability that the arbitrarily selected data point belongs to both clusters at the same time. To compensate for the mutual information's bias toward features with more values, we use the normalized mutual information ℰnor:

$$ {\mathrm{\mathcal{E}}}_{\mathrm{nor}}=\frac{\mathrm{\mathcal{E}}\left(C,{C}^{\prime}\right)}{\max \left(H(C),H\left({C}^{\prime}\right)\right)}, $$
(16)

where H(C) and H(C′) are the entropies of C and C′, respectively. The denominator acts as a normalization factor that scales ℰnor between 0 and 1: if the two sets of clusters are identical, ℰnor = 1; if they are independent, ℰnor = 0. The normalized mutual information is presented in Fig. 5. In almost all cases, the features selected by our unsupervised method yield better clustering results.
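
Eqs. (15)-(16) can be evaluated directly from the two label vectors. The sketch below is our own and assumes both labelings contain at least two clusters, so that neither entropy vanishes.

```python
import numpy as np

def normalized_mutual_info(C, C_prime):
    """NMI of Eqs. (15)-(16), normalized by the larger of the two entropies."""
    C, C_prime = np.asarray(C), np.asarray(C_prime)

    def H(labels):  # entropy of a clustering
        p = np.unique(labels, return_counts=True)[1] / len(labels)
        return -np.sum(p * np.log2(p))

    mi = 0.0
    for ci in np.unique(C):
        for cj in np.unique(C_prime):
            p_ij = np.mean((C == ci) & (C_prime == cj))  # joint probability
            if p_ij > 0:
                mi += p_ij * np.log2(p_ij / (np.mean(C == ci) * np.mean(C_prime == cj)))
    return mi / max(H(C), H(C_prime))  # Eq. (16)
```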

Fig. 5 The normalized mutual information under different numbers of subspaces (with and without supervised feature selection)

4.2 Advantage of the Graph-based Clustering

To evaluate the effectiveness of the second component, we compare the affinity graphs constructed with different feature refinement techniques, i.e., PCA, KPCA, LDA, and KLDA. As the scatter plots in Fig. 6 show, abnormal network data points are densely distributed in our affinity graph, and these distinguishable patterns can be efficiently discovered by graph shift. Moreover, the affinity graphs generated using the other four schemes are suboptimal, as different object parts are mixed. Besides the qualitative analysis in Fig. 6, we calculate the ratio of scatter within and between normal/abnormal network data. As shown in Table 1, on all subsets of KDDCup99, the lowest ratio is achieved by our constructed affinity graph. This observation clearly demonstrates the competitiveness of our method.

Fig. 6 A comparison of the affinity graphs generated using different FS schemes on the four groups of KDDCup99

Table 1 The ratio of within/between-category scatter under affinity graphs constructed using different schemes (each of the three subsets contains 10% of KDDCup99, selected at random)

4.3 Effectiveness of our designed ADS

In order to evaluate the effectiveness of our proposed ADS, simulations are presented here. Experiments were carried out on a desktop PC equipped with an Intel i5 CPU at 3.20 GHz and 16 GB RAM, with a 256 GB SSD. The algorithm is implemented in Python 2.7.9 using WinPython (64-bit), together with several packaging utilities and an open-source Python machine learning library [25]. In the feature selection phase, we report the classification accuracy and the time cost. The inherent parameters of the algorithm are set as follows. We use a randomly selected 10% of KDDCup99. The selected discrete features are drawn from {2, 3, 4, 12}, and the selected continuous features from {1, 8, 10, 23, 24, 25, 26, 27, 28, 29, 32, 33}. Our experimental settings are described as follows:

1) Our unsupervised FS approach is compared with a set of counterparts, including supervised FS methods such as RFE and the extra trees classifier (ETC). 2) Five classification algorithms are used to classify the network data: the decision tree classifier, the extra trees classifier, the random forest (RF) classifier, an AdaBoost-based classifier, and a support vector machine (SVM). 3) We sampled three categories to obtain a balanced data set of about 20,000 samples in total. Toward a fair comparison, we carried out 100 comparative experiments on the same machine and report average measurements.

From the experimental results reported in Fig. 7, the following observations can be made. First, anomaly detection with the full feature set achieves near-best performance, as all information is preserved; however, our method is very competitive, reflecting the benefit of exploiting feature relationships on the manifold. Second, most FS methods achieve performance close to that obtained with the original data. Noticeably, the random forest and AdaBoost classifiers achieve better detection accuracy than the other models. Compared with the other supervised feature selection methods, our feature selection acquires relatively high detection accuracy, very close to that of the extra trees classifier. Moreover, UFS-MIC achieves a remarkable performance gain over the supervised method RFE. These results show that, even in the absence of labels, the performance of our FS remains comparable to supervised approaches.

Fig. 7 Performance under different FS algorithms on KDDCup99

At the same time, the computational time cost of each classifier is reported in Table 2. Our FS method effectively reduces the running time of the classification methods: after conducting FS, the average anomaly detection time is significantly reduced, which clearly shows the advantage of our ADS.

Table 2 Computational Time Cost Comparison

5 Conclusions and Future Work

In this paper, a novel ADS framework is proposed [22, 25, 36, 37, 42,43,44,45,46,47,48, 53,54,55,56]. Its advantages are two-fold. First, a manifold-based FS algorithm is designed to obtain a succinct set of features to describe each network data sample; the FS algorithm is unsupervised and optimally preserves the locality among neighboring samples. Second, based on this, a high-performance dense subgraph mining algorithm is proposed to search for abnormal patterns in the affinity graph constructed from the refined features. Extensive experiments on two data sets demonstrate the efficiency and effectiveness of our system.

In the future, we plan to exploit the high-order relationships among network features, and to further validate our ADS on larger-scale data sets.