1 Introduction

The problem of detecting anomalies in data streams captured by large-scale sensor networks has received much interest (Aggarwal 2005; Chatzigiannakis et al. 2006; Huang et al. 2007; Noto et al. 2011; Thottan and Ji 2003) over the past decade. As large-scale networks become prevalent, there is an increasing need to develop approaches that can address the challenges arising from large amounts of data. The problem affects a wide range of applications, as the data captured by sensor networks can range from multimedia content from the web, video from surveillance camera networks, and satellite imagery to typical network traffic.

Approaches to anomaly detection vary significantly in the scope of the detection, the underlying statistical methods, and the assumptions made about the data. As the definition of an anomaly varies between settings, it is usually difficult to directly compare the methods in the literature. Examples include Bayesian methods (Janakiram et al. 2006; Jiang and Cooper 2010), SVM-based methods (Fujimaki 2008), example-based methods (Zhu et al. 2005), spectral methods (Barnett and Lewis 1984), and methods for mixed-type data (Koufakou and Georgiopoulos 2010). A more complete survey of anomaly detection can be found in Chandola et al. (2009).

We restrict our attention to spectral methods for anomaly detection, in particular to residual subspace analysis. This method was originally developed for control system theory (Jackson and Mudholkar 1979; Jackson 1959; Jackson 1980). It decomposes data into the principal subspace that characterizes the normal behavior of data, and the residual subspace where anomalies are to be found. Under the null hypothesis that the data is normal, the squared prediction error (SPE) which is the \({l}_{2}\)-norm of the residual vector, follows a non-central Chi-square distribution. Hence, the rejection of the null hypothesis can be based on whether the norm of the residual vector exceeds a certain threshold corresponding to a desired false alarm rate. The threshold is computed based on a statistical measure called \(Q\)-statistic (Jackson and Mudholkar 1979; Li et al. 2000; Huang et al. 2007), which can be computed as a function of residual eigenvalues. Recently, this method has found use in some network anomaly detection problems (Lakhina et al. 2004).

Let \(\mathbf{x }_i\in \mathbb R ^N\) be an \(N\)-dimensional vector that represents the status of a data network with \(N\) sensors at time \(i\), and denote as \(\mathbf{X }=[\mathbf{x }_1,\ldots ,\mathbf{x }_L]\) the network data matrix. We consider two cases:

Case 1: It is difficult to collect all rows of the data matrix \(\mathbf{X }\) at a central monitor. This could be because of limited bandwidth between the central node and the sensors, such that only \(M\) readings can be sent over the communications link, where typically \(M\ll N\). In other circumstances, many sensors of the network may be physically far away from the central node, so that direct communication with the central node is impossible. Furthermore, the central node may not have enough storage for all sensor readings due to memory limitations. In this situation, it is desirable to summarize the status of the sensor network with an \(M\)-dimensional reading that meets both the bandwidth and memory constraints.

Case 2: It is difficult to collect all columns of the data matrix \(\mathbf{X }\) centrally. This is equivalent to sub-sampling the temporal stream. This is useful in situations where anomalies have to be found retrospectively. For example, the video data of a network of surveillance cameras may be fully available at remote nodes, and when an incident occurs, the authorities want to access the data centrally. It is however impossible to transmit the entire temporal stream to a central location. In this case, the temporal stream must be sub-sampled.

Inspired by the recent compressed sensing (CS) theory (Donoho 2006; Candes and Tao 2006), we propose a new framework for the detection of anomalies in such large-scale networks. The proposed framework consists of a strategy to handle large data using CS/random projection and a traditional spectral-based anomaly detection algorithm. We formalize the application of CS to effectively acquire the data in a compressed way that conforms to the physical constraints. This compressed data acquisition permits sub-sampling of either the number of sensors or the number of frames in a temporal stream, and the result is then used for anomaly detection. Though for detection purposes the acquisition part is essentially random projection (Vempala 2004), we note that the CS theory also enables post-detection tasks such as finding the origins of detected anomalies. As the CS theory is more general, hereinafter we refer to the data compressed using random projection as CS data.

To address the issues, we propose to obtain the compressed data and then perform anomaly detection using residual subspace analysis on it. We show how such a linear transformation can be implemented in some large-scale networks. Our theoretical contributions are:

  • First, we extend the theory of random projection/CS by establishing the relation between the spectral properties of the original and compressed data. Specifically, we show that the principal subspace is approximately preserved under the random projection with high probability and we derive two-sided theoretical bounds. This demonstrates that the intrinsic structure of the data is preserved under a random projection, and yields the intuition that anomalies can be detected in the residual subspace of the compressed data.

  • Second, we derive the theoretical bounds on the false alarm rate with compressed data relative to complete data. The result shows that the bound is directly related to the dimension of random projection \(M\).

We emphasize that the strategy using CS/random projection that we analyze in this work is mostly suitable for spectral-based anomaly detection algorithms, which mainly depend on the singular values of the data. This scalability strategy might also be useful for other non-spectral anomaly detection algorithms; however, such a study is beyond the scope of the current work.

We validate our method by considering both network and surveillance anomalies. For the network data, we evaluate the proposed method on real traffic traces collected from the Abilene network (http://www.abilene.iu.edu/) over four weeks and on synthetic network data. Our experiments verify that on the real dataset, the proposed method using compressed data achieves performance equivalent to that with complete data, at a detection rate of more than 94 %. For the synthetic data, our experiments show that the principal component analysis (PCA) technique performs even better in the compressed domain than in the original domain for high-dimensional data. Importantly, the proposed method requires less memory and storage and can be as much as 100 times faster than the original spectral method using raw data.

For the surveillance data, we validate our method on both a benchmark dataset (http://www.cvg.rdg.ac.uk/pets2007/data.html/) (64 MB) and real-world data collected from multiple surveillance cameras at three train stations over a whole week, resulting in over 83 GB of video. To the best of our knowledge, the latter is the largest dataset mentioned in the video surveillance literature. It contains anomalous events that were not artificially created and that were ground-truthed in conjunction with the transport authorities. The satisfactory performance of our method over different datasets offers promise for deployment in real-world scenarios.

The significance of our contributions is the demonstration that spectral-based methods can be applied to CS data, and that anomaly detection can be effectively performed without an explicit reconstruction of the input signal. Thus, anomaly detection is equivalent to the uncompressed case, but with the advantage of working with a lower number of measurements. Accordingly, the computational cost is also reduced.

In terms of novelty, the framework we present integrates anomaly detection and random projection into a deployable paradigm to overcome the problem of partial data, a reality in most real-world situations. Though it is intuitive from the approximate geometry preservation of random projection that the eigenvalues should be “similar”, establishing the precise bounds on the eigenvalues and on the false alarm rates is new and significant. The closest theoretical work on the bounds of eigenvalues due to random projection is given in Vempala (2004, Sect. 8.2). However, Lemma 8.4 in Vempala (2004) only provides the upper bound, whilst our result provides both upper and lower bounds using the theory of invariant subspaces. Furthermore, the result in Vempala (2004, Sect. 8.2) is not probabilistic, and thus ignores the essence of random projections. Some theoretical results similar to Vempala (2004) are presented in Liu et al. (2006), but the application is concentrated on using random projection as a privacy-preserving mechanism rather than anomaly detection in large-scale data. This work extends our preliminary investigation (Budhaditya et al. 2009) with theoretical results, detailed proofs, and a more extensive experimental evaluation to validate our claims.

The paper is organized as follows. In Sect. 2 we discuss related prior work. Section 3 describes the problem in detail and provides some relevant background. Section 4 explains our proposed method and its analysis. Section 5 describes the data sets, experimental setup and results, while conclusions are presented in Sect. 6.

2 Related work

There are three main approaches to addressing scalability issues in large-scale networks that are generally applicable to many problems.

The first approach is column sampling (Drineas et al. 2004; Drineas et al. 2006), which is only suitable for static database applications. In this approach, an empirical distribution over the columns of \(\mathbf{X }\) is constructed and a small number of columns of \(\mathbf{X }\) are selected by sampling from the empirical distribution. Because full knowledge of the empirical distribution is required to perform this selective sampling, the approach appears unsuitable for online applications.

The second approach is decentralization, where nodes in the network actively make decisions about communication and processing so as to reduce bandwidth consumption. For example, Huang et al. (2007) propose a decentralization method for streaming data in which the sensors only send information to the fusion point if the observed value falls outside the normal range, which is typically a pre-defined window. If a sensor does not send any data, the fusion point assumes a nominal value. The essence of Huang et al. (2007) is an optimized trade-off between the pre-defined window length (which determines the amount of reduction in communication overhead) and the changes in the detection performance of the matching spectral method. However, there is still a likelihood that the communication overhead exceeds the bandwidth, and the central node would still need to store a data matrix of the same size as \(\mathbf{X }\).

The third approach is dimensionality reduction, where the data is transformed to a (much) lower dimension. Within this approach, there are supervised methods that require a complex optimization problem to be solved, such as Yan et al. (2006). A recent work (Giatrakos et al. 2010), which addresses a slightly different problem, uses a locality-sensitive hashing scheme to reduce the dimensionality of wireless sensor readings. Our proposed framework falls in the realm of unsupervised linear transformations, in particular random projection (Vempala 2004) (we note that from CS theory, some deterministic linear transformations might work equivalently, but such a discussion is beyond the scope of this paper). This dimensionality reduction technique exploits a special statistical property of high-dimensional data wherein the geometry is approximately preserved under such a transformation. Whilst applications of random projection, and more generally CS, have been found in a wide range of domains, such as preserving privacy (Liu et al. 2006) [see Vempala (2004) for a more comprehensive list], to the best of our knowledge there is no prior work applying it to anomaly detection, especially with residual subspace analysis.

3 Background

3.1 Residual subspace projection and anomaly detection

Let the information about a network be represented by a matrix \(\mathbf{X } = [\mathbf{x }_1,\mathbf{x }_2,\ldots ,\mathbf{x }_L]\) where each data instance \(\mathbf{x }_i\in \mathbb R ^{N}\). For notational simplicity, we assume the data matrix has been centered (i.e. the mean has been subtracted). If \(\mathbf{X }\) is available, the residual method performs the eigenvalue decomposition of the sample covariance matrix as:

$$\begin{aligned} {\varvec{\Sigma }}_{\mathbf{x }} = (1/L)\mathbf{X }\mathbf{X }^T = \mathbf{U }{\varvec{\Lambda }}\mathbf{U }^T \end{aligned}$$
(1)

from which the \(K\) principal eigenvectors \(\mathbf{U }\) corresponding to the largest \(K\) eigenvalues can be found. The projection of any data instance \(\mathbf{x }\) onto the residual subspace is given as:

$$\begin{aligned} \mathbf{z } = (\mathbf{I }-\mathbf{U }\mathbf{U }^T)\mathbf{x }. \end{aligned}$$
(2)

In residual subspace analysis (Jackson and Mudholkar 1979), the error signal \(\mathbf{z }\) is assumed to be multivariate normally distributed, and hence the SPE \(\Vert \mathbf{z }\Vert ^{2}_{2}\) follows a non-central Chi-square distribution under the null hypothesis that the data is normal. Hence, rejection of the null hypothesis can be based on whether the norm of the error vector exceeds a certain threshold corresponding to a desired false alarm rate. The threshold is called the \(Q\)-statistic, and it is a function of the non-principal eigenvalues in the residual subspace. For a significance level \(\beta \), the \(Q\)-statistic is usually computed as:

$$\begin{aligned} Q_{\beta } = \theta _{1}\left[\frac{c_{\beta }\sqrt{2\theta _{2}h_{0}^{2}}}{\theta _{1}} + 1 + \frac{\theta _{2}h_{0}(h_{0}-1)}{\theta ^{2}_{1}}\right]^{\frac{1}{h_{0}}}, \end{aligned}$$
(3)

where \(h_{0} = 1 - \frac{2\theta _{1}\theta _{3}}{3\theta _{2}^{2}}\), \(\theta _{i} = \sum _{j = K + 1}^{N}\lambda _{j}^{i}\) for \(i = 1, 2, 3\), \(c_{\beta }\) is the (\(1 - \beta \)) percentile of the standard normal distribution, and \(\lambda _j\), \(j=1,\ldots ,N\), are the eigenvalues of \({\varvec{\Sigma }}_{\mathbf{x }}\). An anomaly is detected when \(\Vert \mathbf{z }\Vert ^{2}_{2}>Q_{\beta }\) (see Fig. 1 for an illustration).

Fig. 1
figure 1

Anomaly detection with residual subspace analysis. Here the data samples \(\mathbf{x }_1\) to \(\mathbf{x }_6\) mostly align with the principal subspace, as their projection onto the residual subspace, which is illustrated by a plane, is smaller than a threshold. However, the projection of \(\mathbf{x }_7\) onto the residual subspace is large, implying an anomaly

In practice, it is important to select a suitable value for \(K\). As with most other spectral methods, the general principle is to select \(K\) as the smallest number of principal components that capture most of the energy. For residual subspace methods, the selection of \(K\) is always a trade-off. Selecting a small \(K\) makes the residual subspace large, which can improve detection but may increase false positives. On the other hand, selecting a large \(K\) makes the residual subspace small, which reduces false positives but may increase false negatives. In our work, we select \(K\) to capture about 90 % of the energy in the principal subspace.
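For illustration, the following is a minimal NumPy/SciPy sketch of the residual-subspace detector described in this subsection; the 90 % energy rule for choosing \(K\) and the significance level \(\beta = 0.005\) are illustrative defaults rather than prescriptions of the original method.

```python
import numpy as np
from scipy.stats import norm

def q_statistic(residual_eigvals, beta=0.005):
    """Q-statistic threshold (3) computed from the residual eigenvalues."""
    theta1 = residual_eigvals.sum()
    theta2 = (residual_eigvals ** 2).sum()
    theta3 = (residual_eigvals ** 3).sum()
    h0 = 1.0 - (2.0 * theta1 * theta3) / (3.0 * theta2 ** 2)
    c_beta = norm.ppf(1.0 - beta)              # (1 - beta) percentile of N(0, 1)
    return theta1 * (c_beta * np.sqrt(2.0 * theta2 * h0 ** 2) / theta1
                     + 1.0
                     + theta2 * h0 * (h0 - 1.0) / theta1 ** 2) ** (1.0 / h0)

def fit_residual_detector(X, energy=0.90):
    """X is N x L with centered observations as columns; returns (U_K, Q_beta)."""
    N, L = X.shape
    Sigma = (X @ X.T) / L                      # sample covariance, eq. (1)
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    K = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), energy)) + 1
    return eigvecs[:, :K], q_statistic(eigvals[K:])

def is_anomaly(x, U, Q):
    """Flag x when its residual energy (SPE) exceeds the Q-statistic threshold."""
    z = x - U @ (U.T @ x)                      # projection onto the residual subspace, eq. (2)
    return float(z @ z) > Q
```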

3.2 From random projection to compressed sensing

It has been observed in the literature that though the dimension of the data may be large, the intrinsic dimension which carries most of the information about the data is typically much smaller. This has motivated a large number of works on dimensionality reduction. They all aim at yielding compressed data that is easier to work with and, in most cases, contains less noise than the original data. A particularly common class of dimensionality reduction is linear transformation, wherein the compressed data \(\mathbf{y }\) is obtained through a linear transformation \(\mathbf{y }={\varvec{\Phi }}\mathbf{x }\). Here \({\varvec{\Phi }}\in \mathbb R ^{M\times N}\) effectively reduces the dimension of the data from \(N\) to \(M\) and its columns are normalized to unit norm. If the intrinsic dimension is \(K\) then obviously \(M\ge K\). In CS theory, suppose that \(\mathbf{x }\) is a \(K\)-sparse vector; then a linear transformation \({\varvec{\Phi }}\) is characterized by a so-called restricted isometry constant (RIC) \(\delta _{K}\), which satisfies

$$\begin{aligned} (1-\delta _K) \Vert \mathbf{x }\Vert _2^2 \le \Vert {\varvec{\Phi }}\mathbf{x }\Vert _2^2 \le (1+\delta _K) \Vert \mathbf{x }\Vert _2^2. \end{aligned}$$
(4)

These inequalities describe the approximate geometry preservation property of \({\varvec{\Phi }}\). Ideally, a good linear transformation corresponds to a small \(\delta _{K}\). To achieve this, the columns of \({\varvec{\Phi }}\) need to be as close to orthogonal as possible. Under CS theory, \({\varvec{\Phi }}\) does not have to be a random matrix and in fact there are published works that construct \({\varvec{\Phi }}\) deterministically. However, many classes of random matrices often have small RIC and can be easily generated, such as Bernoulli random matrices, database-friendly random matrices, and Gaussian random matrices (Candes and Tao 2006; Achlioptas 2001; Vempala 2004). It also follows from CS theory that the dimension of the random projection is \(M=\mathcal O (K\log N) \ll N\) for large \(N\). This implies that by using random projection, the compressed data can have a much smaller dimension than the original data without losing its geometrical properties. Because of its non-adaptive nature, such compression is suitable for large-scale problems [for further details of these random matrices, see Candes and Tao (2006), Achlioptas (2001), and Vempala (2004)].
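As an illustration of the random ensembles mentioned above, the sketch below (our own, with arbitrary sizes \(N\), \(M\), \(K\)) draws a database-friendly matrix and empirically checks the geometry-preservation property (4) on \(K\)-sparse vectors; a dense Gaussian ensemble is included for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_matrix(M, N):
    # Entries N(0, 1/M), so each column has unit expected squared norm.
    return rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))

def database_friendly_matrix(M, N):
    # Sparse Achlioptas-style entries: 0 w.p. 2/3, +/- sqrt(3/M) w.p. 1/6 each.
    vals = rng.choice([0.0, 1.0, -1.0], size=(M, N), p=[2 / 3, 1 / 6, 1 / 6])
    return np.sqrt(3.0 / M) * vals

N, M, K = 10_000, 400, 5
Phi = database_friendly_matrix(M, N)
ratios = []
for _ in range(100):
    x = np.zeros(N)
    x[rng.choice(N, size=K, replace=False)] = rng.normal(size=K)
    ratios.append(np.linalg.norm(Phi @ x) ** 2 / np.linalg.norm(x) ** 2)
print(f"||Phi x||^2 / ||x||^2 ranges over [{min(ratios):.3f}, {max(ratios):.3f}]")
```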

Before proceeding, we note that the above geometry preserving projection holds for data \(\mathbf{x }\) that is sparse via some unitary transform, i.e. \(\mathbf{x }={\varvec{\Psi }}{\varvec{\alpha }}\) and that \(\alpha \) is \(K\)-sparse. In this case \({\varvec{\Phi }}{\varvec{\Psi }}\) is effectively the linear projection. Because \({\varvec{\Phi }}\) is random and \({\varvec{\Psi }}\) is unitary, the statistical property remains unchanged (this can be proved easily). Secondly, though we only focus on anomaly detection in this work, the CS theory states that the original data \(\mathbf{x }\) can be reconstructed from the compressed data \(\mathbf{y }\) by solving the following optimization problem:

$$\begin{aligned} \hat{\mathbf{x }} = \arg \min _{\mathbf{x }} \Vert \mathbf{y }-{\varvec{\Phi }}\mathbf{x }\Vert _2^2 + \lambda \Vert \mathbf{x }\Vert _1. \end{aligned}$$
(5)

Here \(\lambda \) is a regularization parameter, which controls the sparsity of the solution and is typically found by cross-validation; further details can be found in the CS literature. From an anomaly detection point of view, this implies that post-processing tasks of anomaly detection, such as identification of anomalies, might be possible under CS theory. However, we leave this for future work. In what follows, we focus on obtaining an equivalent random projection in large-scale networks by borrowing some concepts from the CS literature.
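Although reconstruction is left for future work, for completeness we sketch how (5) could be solved with an off-the-shelf \(\ell _1\) solver. Note that scikit-learn's Lasso scales the data-fit term by \(1/(2M)\), so its alpha parameter corresponds to \(\lambda \) only up to that factor; the problem sizes and the value of alpha below are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
N, M, K = 512, 128, 8

# K-sparse ground truth and compressed measurements y = Phi x.
x_true = np.zeros(N)
x_true[rng.choice(N, size=K, replace=False)] = rng.normal(size=K)
Phi = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))
y = Phi @ x_true

# l1-regularised least squares, cf. (5).
lasso = Lasso(alpha=1e-3, max_iter=50_000)
lasso.fit(Phi, y)
x_hat = lasso.coef_

print("relative reconstruction error:",
      np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```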

4 Proposed framework

4.1 System setup

In the first step of the proposed framework, we obtain compressed data using random projection. Mathematically speaking, we denote the complete data matrix as \(\mathbf{X } \in \mathbb R ^{N \times L}\) and the actual data matrix available for processing as \(\mathbf{Y }\in \mathbb R ^{N^{\prime }\times L^{\prime }}\) after applying random projection on a large network. The reduction in either \(N^{\prime }\) or \(L^{\prime }\) depends on whether this linear compression is deployed for reducing the feature dimension or time instances to meet the network constraints. We revisit the two cases considered previously:

Fig. 2
figure 2

Sensor subsampling. Here the synchronized sensors effectively perform cross-network compressed measurements via gossips, indicated by the red arrows. After a number of gossips, the final compressed measurements will arrive at the center node (Color figure online)

Case 1: Sensor sub-sampling: We seek a linear transformation of the data, \(\mathbf{y } = {\varvec{\Phi }}\mathbf{x }\), where the random matrix \({\varvec{\Phi }}\in \mathbb R ^{M\times N}\) has random variables as entries. There are some known classes of random matrices suitable for large-scale networks. For example, in the database-friendly matrices (Achlioptas 2001) the entries take the value 0 with probability 2/3, or \(\pm 1\) each with probability 1/6. If all sensors have synchronized clocks and the same random generator, a rule can be set up so that each sensor sends its reading pre-modulated by \(\pm 1\), depending on the value of the random generator. Alternatively, when the sensors cannot directly reach the central node, the random gossip algorithm (Rabbat et al. 2006) can be applied to propagate the projection \(\mathbf{y }\) to the central node (see Fig. 2 for an illustration). The additional advantage over the decentralization approach is that the central node can now perform the analysis using the residual method on the compressed data \(\mathbf{y }\). We show both theoretically and experimentally in subsequent sections that the performance of the detector is nearly the same as if the whole data matrix \(\mathbf{X }\) were available.
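One way such a synchronized acquisition could be realised is sketched below. The shared-seed column generation and the explicit summation loop are our illustrative stand-ins for the in-network aggregation (e.g. via gossip); they are not the protocol of Rabbat et al. (2006).

```python
import numpy as np

def column_of_phi(sensor_id, M, seed=1234):
    """Sensor-side generation of the sensor_id-th column of the shared random
    matrix.  Every sensor uses the same seed, so the columns are reproducible
    without ever materialising Phi at a single node."""
    rng = np.random.default_rng([seed, sensor_id])
    vals = rng.choice([0.0, 1.0, -1.0], size=M, p=[2 / 3, 1 / 6, 1 / 6])
    return np.sqrt(3.0 / M) * vals

def compressed_reading(x, M, seed=1234):
    """What the central node ends up with after all sensor contributions are
    accumulated (here a plain loop stands in for gossip): y = Phi x."""
    y = np.zeros(M)
    for i, xi in enumerate(x):          # in practice each term is formed at sensor i
        y += xi * column_of_phi(i, M, seed)
    return y

x = np.random.default_rng(0).normal(size=1000)   # readings of N = 1000 sensors
y = compressed_reading(x, M=64)                  # only M = 64 values cross the link
```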

Case 2: Temporal stream frame sub-sampling: In this scheme, the operator can request the server to generate random numbers taking values \(\pm 1\) and modulate the data with these random numbers, accumulating the values over \(L^{\prime }\) different iterations where \(L^{\prime }\ll L\) (see Fig. 3 for an illustration). We show that with such sub-sampling, limited bandwidth and storage can be efficiently utilized to detect anomalies as successfully as if the full data \(\mathbf{X }\) were available.

In the second step, we perform anomaly detection using compressed data. Instead of using \(\mathbf{X }\) which is not available, we now apply the residual method on the compressed data \(\mathbf{Y }\), i.e. compute its eigenvalues and hence obtain the \(Q\)-statistic to determine the presence of an anomaly.
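Putting the two steps together, a toy end-to-end run could look as follows; it reuses the fit_residual_detector and is_anomaly helpers from the sketch in Sect. 3.1, and the synthetic data and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, L, K = 1000, 64, 2000, 3

# Synthetic "normal" data with a K-dimensional principal structure plus small noise.
U_true, _ = np.linalg.qr(rng.normal(size=(N, K)))
X_train = U_true @ rng.normal(scale=5.0, size=(K, L)) + 0.1 * rng.normal(size=(N, L))

# Step 1: compressed acquisition Y = Phi X.
Phi = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))
Y_train = Phi @ X_train

# Step 2: residual method fitted directly on the compressed data.
U, Q = fit_residual_detector(Y_train)

# A reading that departs from the learned structure should be flagged.
x_anom = U_true @ rng.normal(scale=5.0, size=K) + 3.0 * rng.normal(size=N)
print(is_anomaly(Phi @ x_anom, U, Q))            # expected: True
```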

4.2 Theoretical analysis

With the following theoretical analysis, we aim to show that anomaly detection with residual subspace analysis using compressed data obtained from random projection is approximately equivalent to that using complete data. Our theoretical analysis is based on performance relative to the complete data \(\mathbf{X }\). Even though this complete data \(\mathbf{X }\) is not practically available, a relative comparison can provide a guarantee of near optimal performance for any spectral-based detection method. To do this, we first study the changes in the eigenvalues (spectral properties) reflected in the compressed data, as they constitute an important factor for detection as shown in (3). The bounds on the eigenvalues of the compressed data then allow us to study the bounds on false alarm rates when the residual subspace method is applied to the compressed data for anomaly detection.

4.2.1 Case 1: \(M\) readings from \(N\) sensors

Random projection is used to compress the columns of \(\mathbf{X }\) from dimension \(N\) to \(M\); the relation with the compressed data is given by

$$\begin{aligned} \mathbf{y }_{i} = \mathbf{\Phi }\mathbf{x }_{i}, \quad \mathbf{y }_{i} \in \mathbb R ^{M}, \quad i = 1 \ldots L \end{aligned}$$
(6)

Denote the eigenvalues of the complete data \(\mathbf{X }\) as \(\lambda _1,\ldots ,\lambda _N\), the eigenvalues of the compressed data \(\mathbf{Y }\) as \(\xi _i\), \(i=1,\ldots ,M\), and \(K\) as the number of principal eigenvalues in the complete data \(\mathbf{X }\), such that \(K < M \ll N\). For simplicity, we assume that the CS matrix is a random Gaussian matrix; similar results can also be obtained for other random matrices.

Fig. 3
figure 3

Temporal subsampling. Here the data \(\mathbf{x }_1,\ldots ,\mathbf{x }_7\) reside in a database. The data are \(\pm 1\)-randomly modulated through a random generator, and the summation is taken over all data points to obtain temporally-compressed data \(\mathbf{y_1 },\ldots \)

Theorem 1

With a probability of at least \(1 - \delta \), the changes in the eigenvalues are bounded by

$$\begin{aligned} |\lambda _i - \xi _i| \le 4\sqrt{2} \lambda _{1}\left(\sqrt{\frac{K}{M}} + \sqrt{\frac{2\ln {\frac{1}{\delta }}}{M}}\right) \end{aligned}$$
(7)

for \(i=1,\ldots ,K\), where \(\lambda _1\) is the largest eigenvalue of \({\varvec{\Sigma }}_{\mathbf{x }}\).

The theorem is a direct consequence of the concentration property of Gaussian ensembles and the proof is detailed in the Appendix.

Theorem 1 suggests that the principal subspace spanned by \(\mathbf{X }\) is approximately preserved in the compressed domain with high probability, and that the intrinsic structure of the data in the original input domain is unchanged under random projection (see Fig. 4 for an illustration).

Fig. 4
figure 4

Illustration of Theorem 1. The top plot depicts the eigenvalue distribution of data in the original domain, whilst the bottom one shows that of the compressed domain. Theorem 1 essentially quantifies the variation of the distribution for the principal (largest) eigenvalues

We now discuss the implication of this result for anomaly detection in compressed data. The detection of volume anomalies using the residual subspace method is based entirely on the total power of the residuals, i.e. \(\Vert \mathbf{z }\Vert ^2\), rather than on the actual residual subspace itself, as long as it remains noise-like, i.e. has no salient spectral features. It can easily be shown that when the random matrix \({\varvec{\Phi }}\) is normalized (each column to unit norm), the total power is unchanged. Thus, a small variation in the principal subspace directly translates to a small change in the total power of the residual subspace. This means that as far as the statistic \(t=\Vert \mathbf{z }\Vert ^2\) is concerned, its distribution also experiences only a small change when the compressed data is used. This intuitive argument is stated more formally by the following result, which forms the basis for our proposed framework.

Theorem 2

If the residual method is applied to the compressed data, with a probability of at least \(1-\delta \), the change in the false alarm rate is bounded by:

$$\begin{aligned} \varDelta FA\le \mathcal O \left(\sqrt{\frac{K}{M}}+ \sqrt{\frac{2\ln ({1}/{\delta })}{M}}\right). \end{aligned}$$
(8)

The proof is detailed in the Appendix.

We note that the result is based on a very rough estimate of the bound and its primary goal is to justify the use of compressed data in anomaly detection in an asymptotic sense. In other words, if we assume \(N\rightarrow \infty \) whilst \(K\) is kept fixed, then the deviation in the false alarm rate should be small if \(K/M\) is small. The constant in (8) is shown in (80); for non-asymptotic values this might be large, and thus the theoretical bound might be loose in the non-asymptotic case. This is partly because the bound is derived with minimal assumptions on the distributional statistics of the data. If further assumptions can be made on the distribution of the eigenvalues, then a tighter result might be obtained; this is discussed in the Appendix. However, as demonstrated in Sect. 5, the actual deviation in many realistic numerical studies is much smaller.

4.2.2 Case 2: sub-sampling the number of data instances

In the previous case, we used random projection to reduce the number of readings in data streams. Effectively, this reduces the number of rows in the data matrix \(\mathbf{X }\), which is useful when \(N\) is large. In a similar manner, we now show that the proposed framework can be applied to the case when the number of instances is large. Effectively, we use random projection to compress each \(L\)-dimensional row of the complete data matrix \(\mathbf{X }\) to an \(M\)-dimensional row of the matrix \(\mathbf{Y }\) using a random matrix \({\varvec{\Phi }}\in \mathbb R ^{M\times L}\), where \(M < L\). Mathematically, the relation between \(\mathbf{Y }\) and \(\mathbf{X }\) can be written as:

$$\begin{aligned} \mathbf{Y }^T = {\varvec{\Phi }}\mathbf{X }^T. \end{aligned}$$
(9)

In this case, \(N^{\prime }=N\) and \(L^{\prime }=M\). We now show that the results for the previous case are applicable in this case. To see this, we start from the basic result in linear algebra that:

$$\begin{aligned} \lambda _i(\mathbf{X }\mathbf{X }^T) = \lambda _i(\mathbf{X }^T\mathbf{X }), \quad i = 1,\ldots ,\min (N,L). \end{aligned}$$
(10)

This implies that the changes in the principal eigenvalues of \(\mathbf{Y }\mathbf{Y }^T\) relative to \(\mathbf{X }\mathbf{X }^T\) are the same as the changes in the eigenvalues of \(\mathbf{Y }^T\mathbf{Y }\) relative to \(\mathbf{X }^T\mathbf{X }\), and as \(\mathbf{Y }^T\) and \(\mathbf{X }^T\) are related in a similar manner as shown in (9), the previous result applies. The only minor difference is that \(N\) should be replaced by \(L\), as the reduction is performed on the rows of \(\mathbf{X }\). The changes in the principal eigenvalues are bounded by:

$$\begin{aligned} |\lambda _i - \xi _i| \le 4\sqrt{2} \lambda _{1}\left(\sqrt{\frac{K}{M}} + \sqrt{\frac{2\ln {\frac{1}{\delta }}}{M}}\right), \end{aligned}$$
(11)

whilst the change in the false alarm rate is bounded by:

$$\begin{aligned} \varDelta FA\le \mathcal O \left(\sqrt{\frac{K}{M}}+ \sqrt{\frac{2\ln ({1}/{\delta })}{M}}\right), \end{aligned}$$
(12)

with a probability of at least \(1-\delta \).
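A small numerical check of this case (our own, with arbitrary sizes) is given below: after the temporal compression (9), the principal spectrum of \(\mathbf{Y }\mathbf{Y }^T\) stays close to that of \(\mathbf{X }\mathbf{X }^T\).

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, M, K = 50, 5000, 500, 4

# Data with K dominant components plus small noise.
U, _ = np.linalg.qr(rng.normal(size=(N, K)))
A = rng.normal(size=(K, L)) * np.array([10.0, 8.0, 6.0, 4.0])[:, None]
X = U @ A + 0.1 * rng.normal(size=(N, L))

# Temporal compression, eq. (9): each length-L row of X is reduced to length M.
Phi = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, L))
Y = (Phi @ X.T).T                                  # Y is N x M

lam = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]   # spectrum of X X^T
xi = np.sort(np.linalg.eigvalsh(Y @ Y.T))[::-1]    # spectrum of Y Y^T
print("top-K relative deviations:", np.abs(lam[:K] - xi[:K]) / lam[:K])
```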

4.2.3 Complexity analysis

If the complete data \(\mathbf{X }\) were available, the covariance matrix formation and eigenvalue decomposition would require a computational cost of \(\mathcal O (N^{3})\) and memory storage of \(\mathcal O (N^{2})\) in the case of PCA. In a similar fashion, the complexity of the SVD computation is \(\mathcal O (LN^{2} + L^{2}N)\). In contrast, the computational and storage complexities of the proposed framework are only \(\mathcal O (M^{3})\) and \(\mathcal O (M^{2})\) respectively, where \(M = \mathcal O (K\log {N})\). As previously discussed, when the intrinsic dimension of the complete data is small relative to its size, a significant reduction in both storage and complexity is achieved with the proposed method. We also note that if the data is sparse in the original domain, then sparse SVD or PCA may be directly applicable. However, it is much more likely for the data to be sparse only through an (unknown) orthogonal transformation. Since data is generally dense in the original domain, sparse SVD is generally not applicable.

4.3 Justification of theoretical bounds

In what follows we examine whether the theoretical bounds derived previously are tight enough to serve as general guidance for practical purposes. In other words, are the bounds approximately of the same order as the real deviation?

As the theoretical bound on the false alarm rate naturally depends on the bound on the eigenvalues, we restrict the discussion to the bounds in Theorem 1. In this case, the theorem states that the deviation of the eigenvalues should not exceed \(4\sqrt{2}\lambda _1 \varDelta \), where \(\varDelta \) depends on \(M^{-0.5}\) as

$$\begin{aligned} \varDelta = \frac{\sqrt{K}+\sqrt{2\ln (1/\delta )}}{\sqrt{M}}. \end{aligned}$$
(13)

For a 90 % confidence level, the term \(\sqrt{2\ln (1/\delta )}\) is 2.1460, which is small. Thus, the dependence is approximately \(\sqrt{K/M}\). This dependence has the following implications for the application of CS/random projection:

  • The bounds only make sense if the intrinsic dimension \(K\) is sufficiently small compared with the reduced dimension \(M\)

  • Compression-error trade-off. For a fixed intrinsic dimension \(K\), better compression is achieved with a smaller \(M\); however, it also results in a larger possible deviation. Conversely, the error is made small by letting \(M/N \rightarrow 1\), but this defeats the goal of compression. This means that in practice a value of \(M\) such that \(K \ll M \ll N\) provides a natural trade-off. In CS theory, such a value of \(M\) is typically chosen as \(M = \mathcal O (K\ln (N))\), which appears to be suitable for the CS recovery problem. In large-scale network problems, there are two possible scenarios:

    • The designer is given a maximum affordable \(M\): in that case, the bounds serve as a rough estimate of the possible deviation due to using compressed data

    • The designer is given a tolerance on the deviation: in this case, the bounds serve as a rough estimate of the compression required to attain accuracy within the allowable tolerance.

The theoretical bounds are derived under the assumption of distinct principal components. In other words, the principal \(K\) eigenvalues are assumed to be sufficiently larger than the residual eigenvalues. There are two practical issues that we emphasize:

  • How do we determine \(K\), especially since we are unable to work on the original data if it is too large? The theoretical results indicate that for sufficiently large \(M\), one may determine \(K\) using the compressed data instead. This means that one can start with the maximum affordable value of \(M\) and analyze the compressed covariance matrix to determine \(K\).

  • In practice, the eigenvalues may in many cases follow a decaying distribution. When this happens, there is no clear choice of \(K\). As with standard PCA in statistics, one typically chooses a cut-off point at which, for example, at least 90 % of the energy is retained. This is what we use in our work, and it works rather well.

We now turn the discussion to the tightness of the eigenvalue deviation bound. Upon examining the proof, we found that the bounds are reasonably tight and difficult to improve further. Our bounds depend on the concentration result for Gaussian random matrices, which states that for a Gaussian random matrix \({\varvec{\Phi }}_K\) of size \(M\times K\), where each entry follows \(\mathcal N (0,1/\sqrt{M})\), the singular values lie in \(1\pm \varDelta \) with a probability of at least \(1-\delta \), where \(\varDelta \) is defined in (13). This implies that the spectral norm of \({\varvec{\Phi }}_K^T{\varvec{\Phi }}_K - {\varvec{I}}_K\) is bounded by approximately \(2\varDelta \). How tight is this bound? To examine this, we study the case where \(K=5, N=10^4, \delta =0.1\) and vary \(M\). The theoretical and actual values of \(\Vert {\varvec{\Phi }}_K^T{\varvec{\Phi }}_K - {\varvec{I}}_K\Vert _2\) are shown in Fig. 5. From this plot, we observe that the theoretical bound appears to be reasonable, though it is slightly conservative: it is about twice the actual value on average. Thus, when this bound is applied several times, we expect the bounds on the eigenvalues to also be a few times larger than the actual values. To verify this, we consider the same setting and create a covariance matrix whose five principal eigenvalues are 100. Then we directly compute the average eigenvalues of the compressed covariance matrix obtained from random projection whilst \(M\) is varied.
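The following short simulation (ours, with an arbitrary number of trials) reproduces the spirit of this comparison, pitting the empirical spectral norm of \({\varvec{\Phi }}_K^T{\varvec{\Phi }}_K - {\varvec{I}}_K\) against the value \(2\varDelta \) implied by (13).

```python
import numpy as np

rng = np.random.default_rng(3)
K, delta, trials = 5, 0.1, 50

for M in (50, 100, 200, 400, 800):
    # Deviation term Delta from (13); the bound discussed above is roughly 2 * Delta.
    Delta = np.sqrt(K / M) + np.sqrt(2.0 * np.log(1.0 / delta) / M)
    devs = []
    for _ in range(trials):
        Phi_K = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, K))
        devs.append(np.linalg.norm(Phi_K.T @ Phi_K - np.eye(K), 2))
    print(f"M={M:4d}  bound ~ {2 * Delta:.3f}  empirical mean {np.mean(devs):.3f}")
```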

Fig. 5
figure 5

Theoretical and actual values of \(\Vert {\varvec{\Phi }}_K^T{\varvec{\Phi }}_K - {\varvec{I}}_K\Vert _2\)

Figure 6 shows the maximum average deviation of the compressed eigenvalues, whilst Fig. 7 shows the deviation of all principal eigenvalues at a particular \(M=N/2\). The plots indicate that the theoretical bounds are about 4–5 times larger than the actual values, but this is as expected. Obviously, if the bound on the spectral norm of Gaussian random matrices were tighter, our theoretical bound would also be tighter. However, we note that we have used the best bounds available for Gaussian random matrices in the literature to date.

Fig. 6
figure 6

Maximum average principal eigenvalue deviations

Fig. 7
figure 7

Principal eigenvalue deviation at \(M/N=0.5\)

Whilst improving the bounds might be of future interest, the theoretical results importantly give us a justification to the scalable framework that deals with large-scale network data using random projection/CS via spectral methods. In Sect. 5, we shall illustrate this more clearly.

4.4 Discussion

From the above cases, one may also apply sub-sampling in both dimensions if both are large. In this case, it is natural to assert that the bound on the change in the false alarm rate is the sum of the bounds derived in the above theorems. Thus, the bound is unavoidably increased when trading off in both the feature and instance dimensions. Nevertheless, it grows only linearly with the reductions.

We note that our proposed method is rather general, and has not yet taken into account the specific structure of a particular network. Without specialization, it is most suited to abnormality detection in general centralized networks. It is of interest to extend the general theory here to the case of a particular network structure. For example, in the case of sensor networks where the sensors are organized in clusters or a tree (Giatrakos et al. 2010), it might be more desirable to do in-network processing to further reduce the bandwidth. We thus leave such specialization as a future extension of the developed theory.

Finally, even though it is intuitive that random projection approximately preserves the geometry, and thus one should expect the eigenvalues to follow suit, our theoretical results are the first in the literature to provide two-sided bounds on all principal eigenvalues, which is a significant contribution. In comparison, we note the limitations of some previous work:

  • The textbook result (Vempala 2004) only provides a non-probabilistic one-sided bound for the eigenvalues. There, the bound is expressed in terms of the \(\varepsilon \) deviation due to random projection, but this value is difficult to compute for a given projection matrix.

  • The early KDD paper (Bingham and Mannila 2001) only gives a practical demonstration that PCA works with random projection, without the theoretical justification that we present here.

  • The recent work in Fowler (2009) studies only the first eigenvalue, and with a deterministic orthogonal projection matrix. Here, we use random Gaussian matrices that are only approximately orthogonal, and we provide probabilistic bounds for all principal eigenvalues.

Moreover, we also derive the bound on the deviation of the false alarm, which is specifically tailored to the anomaly detection problem.

5 Experimental results

We evaluate our framework using experiments on both real-world and synthetic datasets from two application areas: network traffic analysis and video surveillance analysis. We evaluate the performance of our framework for the sensor sub-sampling case with network data, while the temporal frame sub-sampling case is studied using video data.

In all cases, we compare anomaly detection using the proposed strategy (random projection) with the relevant alternatives, including using the full data (Lakhina et al. 2004) and the decentralization approach of Huang et al. (2007). We emphasize the following:

  • As this work is concerned with the scalability strategy rather than the outlier detection algorithm, we compare different strategies for the same outlier detection algorithm, residual subspace analysis, to verify the scalability solution analyzed in this work. As mentioned in the introduction, the analyzed random projection strategy might also work with other non-spectral algorithms, but this is beyond the scope of the current work. Furthermore, some non-spectral algorithms are application specific, requiring additional assumptions and settings, so comparing them directly is almost impossible. Thus, we restrict our attention to residual subspace analysis as the outlier detection algorithm for consistency.

  • Our main goal is to demonstrate that in a wide range of circumstances in data mining, the proposed framework is indeed useful. In other words, the trade-off between compression and deviation can be satisfactorily achieved. Thus, the compression parameter \(M\) is not meant as a tuning hyperparameter, but should be viewed as a trade-off parameter in this context. Also, for consistency, we determine the parameter \(K\) by using the 90 % principal energy as the guiding principle wherever we do not know \(K\) a priori.

5.1 Sensor subsampling results

The anomaly detection capability of the proposed framework is evaluated on a real-world benchmark dataset, Abilene (http://www.abilene.iu.edu/), and a synthetic dataset specifically designed to simulate a large network.

5.1.1 Experiment on the Abilene data

Abilene dataset. The Abilene dataset (http://www.abilene.iu.edu/) is a well-known dataset for network research, captured from a real-world backbone network. Its first use for volume anomaly detection is documented in Lakhina et al. (2004). Here, we are primarily interested in anomalies resulting from abnormal changes of the network traffic. The changes arise because of events such as abnormal DNS transactions, network equipment failures, flash crowds, and distributed denial-of-service (DDoS) attacks (Lakhina et al. 2004). Importantly, these changes cannot be detected by simple thresholding due to the varying characteristics of normal traffic during a day. In Figs. 17a and 8, we plot the total network traffic for a period with and without volume anomalies (highlighted in red). As can be seen, volume anomalies are often hidden under normal network traffic. The preliminary investigation in Lakhina et al. (2004) reveals that the spectral approach is capable of detecting these anomalies. Our purpose in this experiment is to extend (Lakhina et al. 2004) to the case where the data is not complete and to verify the proposed method.

Fig. 8
figure 8

Typical network link data: Normal (left) and Abnormal (right). Horizontal axis denotes the aggregated window index, vertical axis denotes the \(\ell _1\)-norm of the vector of total traffic (Color figure online)

Data collection. In the Abilene network, the traffic flow is the amount of traffic flowing in between each pair of ingress and egress nodes in the network. It is also known as an origin–destination (OD) flow, which is the traffic that enters the backbone at the origin point of presence (PoP) and exits at destination PoP (Lakhina et al. 2004). The Abilene dataset consists of the readings collected from 41 network links over a period of several months. The OD trace contains the measurement from each link for each 10-s interval. We use a subset of the data which covers a period of 2 weeks (1,008 measurements per week). Most of the data reflects normal network conditions with only six real anomalies (verified manually) in the original dataset. In addition, we inject 45 synthetic anomalies of different magnitudes following the procedure described in Lakhina et al. (2004). We use the first week data for training and the second week data for testing.

Improving random projection. As discussed earlier, the random projection is represented by a matrix \({\varvec{\Phi }}\) from which the compressed data is obtained via \(\mathbf{y }={\varvec{\Phi }}\mathbf{x }\). A good random projection matrix should have columns as close to orthogonal as possible (in other words, a small RIC) so that the geometry of the data in the high-dimensional space is better preserved in the low-dimensional space. Equivalently, the mutual coherence of the overcomplete system \({\varvec{\Phi }}=[{\varvec{\phi }}_1,\ldots ,{\varvec{\phi }}_N]\), which is defined as:

$$\begin{aligned} \mu ({\varvec{\Phi }}) = \max _{j\ne k} |\langle {\varvec{\phi }}_{j},{\varvec{\phi }}_{k}\rangle |, \end{aligned}$$
(14)

must be as small as possible. For a real-valued matrix \({\varvec{\Phi }}\), the lower bound on the mutual coherence is known as the Welch bound (Strohmer and Heath 2003):

$$\begin{aligned} \mu ({{\varvec{\Phi }}}) \ge \sqrt{(N-M)/(M(N-1))}. \end{aligned}$$
(15)

For many classes of random matrices, the mutual coherence can be small with high probability. In cases where the problem size is not very large, such as in the Abilene dataset where \(N\) is only 41, a random Gaussian matrix might not have good approximate orthogonality properties, which necessitates improvement in practice. To further improve the random projection matrix, we start with a random Gaussian matrix and then apply the recently proposed algorithm of Elad (2007). This algorithm exploits the fact that the mutual coherence of \({\varvec{\Phi }}\), with each column normalized to unit norm, is the maximum magnitude of the off-diagonal elements of the Gram matrix \(\mathbf{G }={\varvec{\Phi }}^T{\varvec{\Phi }}\), where the Gram matrix has rank \(M\). Hence, by iteratively shrinking the entries of the Gram matrix, forcing its rank to \(M\), and taking its square root, a smaller mutual coherence for \({\varvec{\Phi }}\) with a specified rank \(M\) is achieved. Though the algorithm can be sensitive to the parameter setting and its convergence is yet to be studied, we found that in practice this method improves the mutual coherence considerably. In practice, the actual signal \(\mathbf{x }\) might not be sparse in the basis \(\mathbf{I }\) but in some basis \({\varvec{\Psi }}\). In this case, \(\mu ({\varvec{\Phi }}{\varvec{\Psi }})\) needs to be small instead. If \({\varvec{\Phi }}_\mathrm{o}\) is the optimal sensing matrix for the basis \(\mathbf{I }\), then the optimal matrix \({\varvec{\Phi }}\) for the basis \({\varvec{\Psi }}\) is found from \({\varvec{\Phi }}={\varvec{\Phi }}_\mathrm{o}{\varvec{\Psi }}^{-1}\), assuming that \({\varvec{\Psi }}\) is invertible. For the Abilene network data, \(N = 41\), \(M = 16\), \(K=6\), and the Welch bound of the sensing matrix is 0.1976. Using Elad's algorithm (Elad 2007), we reduce the mutual coherence from an initial value of \(0.55\) to \(0.36\).
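To make these quantities concrete, the sketch below computes the mutual coherence (14) and the Welch bound (15) for the Abilene setting, and includes a very rough shrink-the-Gram iteration in the spirit of Elad (2007); the shrinkage factor, threshold and iteration count are our illustrative guesses, not the settings used in our experiments.

```python
import numpy as np

def mutual_coherence(Phi):
    """Maximum absolute inner product between distinct unit-norm columns, eq. (14)."""
    Phi = Phi / np.linalg.norm(Phi, axis=0, keepdims=True)
    G = Phi.T @ Phi
    np.fill_diagonal(G, 0.0)
    return np.abs(G).max()

def welch_bound(M, N):
    return np.sqrt((N - M) / (M * (N - 1)))

def reduce_coherence(Phi, n_iter=100, gamma=0.9, t=0.2):
    """Rough Elad-style iteration: shrink large off-diagonal Gram entries,
    force rank M, take a matrix square root, renormalise the columns."""
    M, N = Phi.shape
    Phi = Phi / np.linalg.norm(Phi, axis=0, keepdims=True)
    for _ in range(n_iter):
        G = Phi.T @ Phi
        mask = (np.abs(G) >= t) & ~np.eye(N, dtype=bool)
        G[mask] *= gamma                         # shrink the largest correlations
        w, V = np.linalg.eigh(G)
        w[:-M] = 0.0                             # keep only the M largest eigenvalues
        Phi = (V * np.sqrt(np.clip(w, 0.0, None))).T[-M:]   # rank-M square root of G
        Phi = Phi / np.linalg.norm(Phi, axis=0, keepdims=True)
    return Phi

M, N = 16, 41                                    # Abilene setting from the text
Phi0 = np.random.default_rng(4).normal(size=(M, N))
print("Welch bound:", welch_bound(M, N))
print("initial mu :", mutual_coherence(Phi0))
print("final mu   :", mutual_coherence(reduce_coherence(Phi0)))
```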

Anomaly detection. Residual subspace analysis is applied using different strategies: complete, compressed, and decentralized data. Figures 9 and 10 show the similarity between the principal eigenvalues and the residual vectors using these strategies. We then conduct experiments to obtain the receiver operating characteristics (ROC) curve which is shown in Fig. 11. We notice that the performance is very similar with all three strategies.

Fig. 9
figure 9

Residual vector plots for Abilene data

Fig. 10
figure 10

Eigenvalue distribution for Abilene data

Fig. 11
figure 11

ROC curves for the Abilene data

The performance on the compressed data is very close to that of the uncompressed and decentralized strategies. To further quantify this, we also compare the ROC curves using (i) the area under the ROC curve (AUC) and (ii) the equal error rate (EER). An effective classifier should achieve an AUC close to 1 and a small EER. From the ROC curves, we determine that the AUC/EER values are 0.95/0.09, 0.96/0.11, and 0.95/0.10 for the original, compressed, and decentralized data respectively.

5.1.2 Experiment on synthetic network traffic data

Synthetic data generation. We generate synthetic network traffic data following the procedure in Lakhina et al. (2004). In particular, we consider a network where the number of local monitors \(N\) ranges from \(500\) to 2,000 and the number of time instances is \(L=2{,}000\). The network traffic signal \(\mathbf{x }\in \mathbb R ^N\) is modeled as \(\mathbf{x } = \mathbf{s } + \mathbf{n }\). It consists of two parts: \(\mathbf{s }\) characterizes the long-term structure in the data and \(\mathbf{n }\) represents the local temporal variation. For the long-term network traffic signal, \(\mathbf{s } ={\varvec{\Psi }}_s{\varvec{\alpha }}_s\). Here, \({\varvec{\Psi }}_s\) is the basis for the intrinsic network data. Due to its daily periodic characteristics, we select the discrete cosine transform (DCT) matrix as the basis. The number of principal components is \(K=4\). We model the noise as zero-mean Gaussian with variance \(\sigma ^2=0.01\). To simulate abnormal network conditions, we inject 70 anomalies of different magnitudes.
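A sketch of how such data could be generated is given below; the use of the \(K\) lowest-frequency DCT atoms, the coefficient scale, and the anomaly-injection scheme (single-link spikes of random magnitude) are our own assumptions consistent with the description above.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(5)
N, L, K = 2000, 2000, 4
sigma = 0.1                                     # noise std, i.e. variance sigma^2 = 0.01

# Long-term structure s = Psi_s alpha_s: K active atoms of an orthonormal DCT basis.
Psi = dct(np.eye(N), norm='ortho', axis=0).T    # inverse DCT matrix; columns are cosine atoms
alpha = rng.normal(scale=10.0, size=(K, L))     # coefficients of the K active atoms
X = Psi[:, :K] @ alpha + sigma * rng.normal(size=(N, L))

# Inject 70 volume anomalies of varying magnitude at random time instances.
for t in rng.choice(L, size=70, replace=False):
    X[rng.integers(N), t] += rng.uniform(5.0, 20.0)
```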

Fig. 12
figure 12

Trade-off: computation (left) and error rate (right)

Anomaly detection. When specifying the dimension \(M\) for random projection, we need to consider the trade-off between performance and error rates. Selecting a smaller value of \(M\) reduces the computational complexity at the cost of potentially lower performance due to the increase in the mutual coherence of the random projection matrix. In the CS literature, the value \(\mathcal O (K\log N)\) has been frequently suggested. We set \(N=2{,}000\), vary \(M\) between \(100\) and \(1{,}000\), and measure the EER and computational time. The results are shown in Fig. 12a, b. Selecting \(M\) in the range 250–300 gives moderately low error rates at a large reduction in computational time. If \(M\) is too low, the error rate becomes larger. If \(M\) is too large, the reduction in error rate is not significant whilst the computational time increases somewhat quadratically. Therefore, we determine that suitable values of \(M\) are 118, 280, and 450 when the number of nodes is 500, 1,000 and 2,000 respectively. The random projection matrices are improved from random Gaussian matrices, with final mutual coherences of 0.37, 0.35 and 0.20 respectively.

Fig. 13
figure 13

Residual vector plots for synthetic data

Fig. 14
figure 14

Eigenvalue distribution for synthetic data

With the specified random projection matrix, we then examine the behavior of the residual vectors using complete and compressed data. Figures 13 and 14 show the results. Once again, we observe that the patterns of the eigenvalue distribution and the residual vectors are similar in both the complete and compressed data cases. We then explore the detection performance by comparing residual subspace analysis using compressed data with that using complete data (Lakhina et al. 2004), and the decentralized version presented in Huang et al. (2007). The ROC curves for these three cases are shown in Fig. 15. In terms of AUC, detection with the compressed data is approximately equivalent to (and even slightly better than) the other cases.

Fig. 15
figure 15

ROC performance on synthetic data

Fig. 16
figure 16

Communication, computational and storage complexity. The strategies are: complete data (1), decentralization (2), and compressed data (3). Setting \(N=1{,}000, M=280\)

Next, Fig. 16 compares these three strategies in terms of communication, computation and storage overhead. It shows that by using compressed data we can reduce the communication bandwidth by 45–60 %, the computational cost by 80–90 %, and the storage requirement by 45–70 % compared with the other two strategies.

Table 1 Anomaly detection performance on synthetic data

Finally, we vary the size of the network from 100 to 2,000 and the results are shown in Table 1. The AUC and EER indicators are as competitive as those using complete data, with better computational performance: the proposed method is from 6 to 1,000 times faster.

5.2 Temporal frame subsampling

5.2.1 Problem background and datasets

Next we evaluate the performance of our framework when dealing with temporal sampling (Sect. 4.2.2) and determine the suitability of our work in addressing a real-world surveillance problem faced by a local public transport authority. The local train network monitoring system consists of over 3,000 cameras operating 24 h a day. Constant human operator supervision of the video is impossible. The problem of detecting anomalies in the video is challenging because (1) most anomalous patterns occur in the presence of normal patterns (a majority of people behave normally) and (2) there is no predefined description of anomalous behavior: the anomaly changes with the context of the scene (a person walking at normal pace on the rail tracks is considered anomalous, whereas the same behavior would be normal if it occurred on the station platform).

We use two video surveillance datasets: one provided by the local public transport authority and the PETS 2007 benchmark dataset (http://www.cvg.rdg.ac.uk/pets2007/data.html/). The PETS 2007 data was used to demonstrate the effectiveness of our work on an established dataset as the video sequences are freely available from the PETS archive (http://www.cvg.rdg.ac.uk/pets2007/data.html/). Both video datasets are pre-processed to extract motion features. In both cases, the ground truth data is available. Training is done offline and testing is performed on the incoming data streams.

5.2.2 Video data pre-processing and feature extraction

We use optical flow (Lucas and Kanade 1981) to define the motion in the scene. The advantage of such low-level features, collected in a grid superimposed on the images, is that they provide good information about motion whilst alleviating the need for object tracking (Adam et al. 2008). The drawback of this type of feature is that it captures only motion anomalies. In many practical situations, this is sufficient.

5.2.3 Motion-based representation

Each image is divided into grid-based regions and the motion statistics of each cell are computed over a pre-defined time bin. The motion distribution of each cell is simply calculated as the number of optic flow vectors in that cell. Figure 17a, b shows the amalgamated motion flows over 300 frames for two sequences from the data collected from surveillance cameras in a train station. In Fig. 17a, we plot the volume of the motion flows, which is the squared norm of the vector of motion statistics from all cells. Each point shows the motion statistics amalgamated over 1 min. Generally, the normal motion volume is “high” when a train is in the station and “low” otherwise, giving rise to a periodic rise and fall.

Fig. 17
figure 17

Amalgamated optical flows over 1 min interval. Horizontal axis denotes aggregated window index, vertical axis denotes normalized \(\ell _2\)-norm of the aggregated motion count vector (Color figure online)

The majority of past relevant work (Brand et al. 1997; Medioni et al. 2001; Phung et al. 2005) has treated the whole scene as either “normal” or “abnormal”, but these examples suggest that a framework which can detect abnormality in the presence of normal behavior is needed. Our intuition is to capture the structure of the overall normal pattern in the principal subspace. We now give some analytical arguments to justify the proposed method. In the subway example, passengers would normally follow the “walk” path to enter and exit. The normal activities induce a distribution of motion vectors over the cells in the subway. Importantly, this distribution also signifies the relationship between the cells. For example, some cells tend to be highly correlated due to the average flow of the traffic through the cells. This dependence gives us important information about the structural pattern of normal events. Thus, if, for example, a person or a group crosses the subway tunnel in an unusual manner, the observed motion distribution will carry totally new structural information. This is the basis for our abnormality detection technique. Similarly, a loitering person is likely to lead to the observation that the cells covering the loitering trajectory become more correlated than normal. If this structural information is known, we can separate the normal activity by projecting the observed motion pattern onto the space induced by the structure, so that abnormal activities can be easily investigated in the residual subspace.

Fig. 18
figure 18

Schematic description of the proposed method

Localized bag-of-visual-words: We are motivated by Niebles et al. (2008) to use bag-of-visual-words for representing the optic flow counts in the cells. Niebles et al. (2008) derive visual words from human activity in the spatio-temporal domain. Using a grid-based approach we extract optic flow counts for each cell. We consider each cell analogous to a term and the motion statistics (i.e. number of motion flows) of each cell as equivalent to the word frequency. Hence, the number of terms is equal to the number of cells in the image. We construct the feature-frame matrix in an analogous manner to the term-document matrix. Denote the number of cells as \(N\) and the motion statistics of cell \(i\) at frame \(l\) as \(x_i(l)\). The vector of motion statistics is defined as \(\mathbf{x }_l=[x_1(l),\ldots ,x_N(l)]^T\). For a sequence of \(L\) frames, the feature-frame matrix is defined as \(\mathbf{X }=[\mathbf{x }_1,\ldots ,\mathbf{x }_L]\). In document analysis, semantic variables (topics) govern the probabilistic occurrence of the terms. Similarly, in our case, the structural variables of the normal behavior govern the distribution of motion vectors over cells.
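A sketch of this feature-frame matrix construction is shown below; the grid size and the number of frames per time bin are illustrative, and the optic-flow extraction itself (e.g. Lucas-Kanade) is assumed to have been done elsewhere.

```python
import numpy as np

def motion_feature_matrix(flow_points, frame_shape, grid=(8, 8), frames_per_bin=300):
    """Build the feature-frame matrix X = [x_1, ..., x_L] from optic-flow locations.

    flow_points : list indexed by video frame; each entry is an iterable of (row, col)
                  positions at which a motion (optic flow) vector was detected.
    Returns an (n_cells x L) matrix of per-cell motion counts per time bin.
    """
    H, W = frame_shape
    rows, cols = grid
    n_bins = len(flow_points) // frames_per_bin
    X = np.zeros((rows * cols, n_bins))
    for b in range(n_bins):
        for f in range(b * frames_per_bin, (b + 1) * frames_per_bin):
            for (r, c) in flow_points[f]:
                cell = (min(int(r * rows / H), rows - 1) * cols
                        + min(int(c * cols / W), cols - 1))
                X[cell, b] += 1                  # "word count" of this cell in this bin
    return X
```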

For detection, we project the compressed data onto the compressed residual subspace and use the \(Q\)-statistic as described in Sect. 3.1. Figure 18 shows the schematic description of the proposed method.
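For illustration, a minimal uncompressed sketch of this detection step is given below (the compressed-domain version applies the same steps to \(\mathbf{Y }={\varvec{\Phi }}\mathbf{X }\)). The threshold is derived from the Jackson–Mudholkar approximation restated in (53)–(54) of the appendix; the function names, the use of SciPy, and the mean-centering of the data are our assumptions, not the authors' implementation.

import numpy as np
from scipy.stats import norm

def fit_residual_model(X_train, K):
    # PCA on the N x L feature-frame matrix: returns the mean, the projector onto
    # the residual subspace, and the eigenvalues of the sample covariance.
    mu = X_train.mean(axis=1, keepdims=True)
    evals, evecs = np.linalg.eigh(np.cov(X_train - mu))
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    P_K = evecs[:, :K]                                   # principal subspace
    R = np.eye(X_train.shape[0]) - P_K @ P_K.T           # residual projector
    return mu, R, evals

def q_threshold(evals, K, beta):
    # Q_beta from the residual eigenvalues, cf. (53)-(54) in the appendix.
    lam = evals[K:]
    t1, t2, t3 = lam.sum(), (lam ** 2).sum(), (lam ** 3).sum()
    h0 = 1.0 - 2.0 * t1 * t3 / (3.0 * t2 ** 2)
    c_beta = norm.ppf(1.0 - beta)                        # (1 - beta) normal percentile
    z = c_beta * h0 * np.sqrt(2.0 * t2) / t1 + 1.0 + t2 * h0 * (h0 - 1.0) / t1 ** 2
    return t1 * z ** (1.0 / h0)

def detect(X_test, mu, R, Q_beta):
    # Flag the columns whose squared residual norm (SPE) exceeds Q_beta.
    spe = ((R @ (X_test - mu)) ** 2).sum(axis=0)
    return spe, spe > Q_beta

For example, fitting the model with \(K=3\) and thresholding at q_threshold(evals, 3, 0.005) produces the kind of squared-residual trace plotted in Fig. 20.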

5.2.4 Results on the PETS2007 benchmark video data

The PETS2007 (http://www.cvg.rdg.ac.uk/pets2007/data.html/) dataset consists of video footage obtained from a multiple-camera surveillance system. For processing the PETS datasets, the time granularity for aggregating the motion count was set to 100 frames. For training, we use the S0 sequences, which consist of 4,500 frames captured at a resolution of \(720\times 576\). These sequences contain no unusual events and no externally injected "actors", and the crowd density is typically medium (depending on the camera).

The first PETS2007 test sequence used was S3 captured by camera 1. The sequence consists of 2,970 frames and the anomaly is a theft event. It involves two actors walking normally towards the middle of the scene where, after a brief stop, they pick up a bag and leave the area. As this event takes place, there is a significant flow of people in the top part of the scene. Hence, for our approach to produce the correct results, it needs to detect the anomaly and correctly highlight the time interval over which the anomalous event takes place. As the sequence is short, PCA was applied directly to the features extracted from the training sequence and the eigenvalues are plotted in Fig. 19. We chose the largest three eigenvalues (i.e. \(K = 3\)) for the principal subspace, while the rest of the eigenvectors span the residual subspace. The threshold \(Q_{\beta }\) was computed according to the \(1 - \beta \) confidence level and we chose \(\beta =0.005\). Figure 20i, ii shows the projection of each column of \(\mathbf{X }_\mathrm{Train}\) and \(\mathbf{X }_\mathrm{Test}\) into the residual subspace, with the horizontal line denoting the threshold \(Q_{\beta }\). The theft event is clearly highlighted in the residual subspace as the threshold \(Q_{\beta }\) is exceeded. It should be noted that the plot for the residual domain shows two peaks, which correspond to events that are 50 frames apart and are thus considered to be part of the same anomalous event. A number of corresponding frames within the area where the anomaly takes place are shown in Fig. 21.

Fig. 19 The magnitude of the eigenvalues computed from matrix \(\mathbf{X }\) for \(\mathbf{X }_\mathrm{Train}\) (sequence S0, camera 1)

Fig. 20 Plots of the residual vectors over train data (subfigure i: sequence S0, camera 1) and test data (subfigure ii: sequence S3, camera 1) from the PETS 2007 dataset. Horizontal axis denotes the aggregated window index, vertical axis denotes the squared \(\ell _2\)-norm

Fig. 21 The detected "anomalous" activity in sequence 3 from camera 1 (PETS2007 dataset)

Fig. 22 Plots of the residual vectors over train data (subfigure i: sequence S0, camera 3) and test data (subfigure ii: sequence S3, camera 3) from the PETS 2007 dataset. Horizontal axis denotes the aggregated window index, vertical axis denotes the squared \(\ell _2\)-norm

Similar results were obtained for sequence S6 from camera 1 (the anomaly involves more than three actors) and sequence S3 from camera 2. To test the robustness of the approach under different environmental conditions, we use sequence S3 captured by camera 3 as a test set (with sequence S0 from camera 3 used for training). Figure 22 shows the residual vectors of the training and testing sets. Despite the different lighting and camera angles, our approach detects the anomalous event successfully. The results from the four sequences are summarized in the top four entries of Table 2.

5.2.5 Results on public transport authority dataset

The third set of experiments uses video footage from four different train stations. The combined set consists of surveillance video data collected over a week. Importantly, it contains several types of anomalous events that were not artificially created and that were ground-truthed in conjunction with the transport authorities. In the previous experiment on the PETS dataset, we showed the effectiveness of the volume anomaly detection framework. Here, we demonstrate the scalability of our proposed approach for this type of data.

For the first evaluation, we used the video data captured from the corridors of the train station during the peak hours of the day (7 a.m. to 11 a.m.) over a week. The 25 fps video data at resolution \(570\times 720\) is collected by two different cameras at the entry and exit points of the train station. For the training set \(\mathbf{X }_\mathrm{Train}\) we used video from five consecutive days, with 4 h of continuous video per day; days 6 (\(\mathbf{X }_\mathrm{Test1}\)) and 7 (\(\mathbf{X }_\mathrm{Test2}\)) are used for testing only. For training, the original number of aggregated time bins is \(L=7{,}200\), the number of grid cells is \(N=100\), and the window length is 10 s.

As mentioned in Sect. 4.2.2, we sub-sample the temporal stream data so that the number of snapshots is reduced from \(L\) to \(M\), where \(M \ll L\). The challenge is to select the value of \(M\) for optimal performance. Figure 23 shows the plots of the false positive rate (FPR) and the rate of anomaly detection (both normalized to 1) when \(M\) varies from \(100\) to \(300\) for the above-mentioned datasets. When \(M\) is in the range of 190–230, the FPR is at a minimum and the detection rate is maximized. Hence, we have used \(M = 220\) for this experiment.
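A minimal sketch of the temporal sub-sampling step is given below. Uniform random selection of the \(M\) snapshots is our simplifying assumption for illustration; the actual scheme is the one described in Sect. 4.2.2.

import numpy as np

def subsample_snapshots(X, M, seed=0):
    # Keep M of the L columns (temporal snapshots) of the N x L data matrix X.
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[1], size=M, replace=False)
    return X[:, np.sort(idx)]

# e.g. X_sub = subsample_snapshots(X_train, M=220)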

Table 2 Summary of anomaly detection results: temporal subsampling case
Fig. 23 Plots of the FPR and the rate of anomaly detection when \(M\) varies from \(100\) to \(300\) for the above-mentioned full datasets

Fig. 24 Eigenvalue distribution plot using PTA dataset with different strategies

We first examine the behavior of the eigenvalue distribution and the residual vectors using different strategies to verify the theoretical contributions. Figure 24 shows the results for complete, compressed, and decentralized data. We select \(K=4\) to cover 90 % of the energy in the principal subspace. The threshold \(Q_{\beta }\) was computed in a similar way to the previous experiment with the desired false alarm rate being \(\beta = 0.005\). The residual plots are similar in all three cases (Fig. 25). We detect two real anomalies out of three from the test data, with the detected anomalies corresponding to (1) an adult rubbing a small child against the wall and (2) a group of people loitering (Fig. 26). The missed anomaly is due to the fact that it takes place far from the camera, where the motion features are unreliable. We repeat the same experiment with the second test set (\(\mathbf{X }_\mathrm{Test2}\)) and detect the anomalous event "group loitering" (shown in Fig. 26), which occurred during "off-peak" hours.
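The choice of \(K\) from the eigenvalue spectrum can be sketched with a simple cumulative-energy rule (the energy criterion is the one used above; the implementation details are our assumptions):

import numpy as np

def choose_K(evals, energy=0.90):
    # Smallest K such that the top-K eigenvalues capture the requested energy fraction.
    lam = np.sort(np.asarray(evals))[::-1]
    frac = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(frac, energy) + 1)

# e.g. K = choose_K(evals, energy=0.90)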

Fig. 25 Squared residual plot using PTA dataset with different strategies

Fig. 26 Anomaly detection in public surveillance data: leaning on the wall (left), hanging out in groups (center) and group loitering (right)

For the second evaluation, we use five video sequences captured from cameras covering the stairs (one sequence), an automated vending machine (one sequence) and the rail tracks at two different stations (three sequences). Both the stairs and vending machine sequences are long (8 and 16 h respectively). In the case of the rail track data, two of the sequences are short, while the third is again very long (18 h). In all cases the video was captured at 25 frames per second with a resolution of \(570\times 720\). For the training set \(\mathbf{X }_\mathrm{Train2}\), we use a total of 27 h of continuous video (without any anomaly) and 55 h of video for testing (\(\mathbf{X }_\mathrm{Test3}\)—some of the video involving zoom actions was removed, as we restrict the evaluation to static views).

For \(\mathbf{X }_\mathrm{Train2}\), the parameters were \(L = 36{,}294\) and \(N = 100\), while the \(M\) value is set to 220 for the long video sequences. For the two shorter video sequences, the value of \(M\) is set to 50. For all five sequences, the threshold \(Q_{\beta }\) is computed in a similar way to the previous experiment with \(\beta = 0.005.\) A total of 14 anomalies are present in the video streams and our approach is able to identify 13 anomalies correctly, while producing one false negative and eight false positives. One false positive is due to the difficulty of differentiating between a person breaking into the vending machine (who opened the machine with a cordless drill) and the maintenance person (who also opened the vending machine with a drill). The false negative is due to movement in the camera's far field.

6 Conclusions

We have presented a framework for detecting anomalies in data streams captured by large-scale sensor networks. The work addresses the key problem of dealing with incomplete data due to the physical constraints imposed in large-scale networks. The framework further establishes the usefulness of random projection/CS as an effective solution for anomaly detection, both in the case when the number of sensors and in the case when the number of data instances exceeds the communication bandwidth of the network. The work exploits the fact that the intrinsic dimension of the data in typical sensor network applications is generally small relative to the raw dimension, and the fact that CS is capable of capturing most information with few measurements. We have shown that spectral methods used for anomaly detection can be applied directly to the compressed data with guarantees on performance, and we have demonstrated the effectiveness of the framework using both real and synthetic datasets.

7 Proof of Theorem 1

First, we can assume without loss of generality that the covariance matrix in the original domain \({\varvec{\Sigma }}_{\mathbf{x }}\) is diagonal. Indeed, suppose that its eigenvalue decomposition is

$$\begin{aligned} {\varvec{\Sigma }}_{\mathbf{x }} = {\varvec{\Psi }}{\varvec{\Lambda }}{\varvec{\Psi }}^T, \end{aligned}$$
(16)

then the covariance matrix in the compressed domain is

$$\begin{aligned} {\varvec{\Sigma }}_{\mathbf{y }}&= {\varvec{\Phi }}{\varvec{\Sigma }}_{\mathbf{x }}{\varvec{\Phi }}^T \nonumber \\&= {\varvec{\Phi }}{\varvec{\Psi }}{\varvec{\Lambda }}{\varvec{\Psi }}^T{\varvec{\Phi }}^T \nonumber \\&= \left({\varvec{\Phi }}{\varvec{\Psi }}\right){\varvec{\Lambda }}\left({\varvec{\Phi }}{\varvec{\Psi }}\right)^T. \end{aligned}$$
(17)

But as \({\varvec{\Psi }}\) is a unitary matrix by the definition of the eigenvalue decomposition, it follows from Lemma 1 that \({\varvec{\Phi }}{\varvec{\Psi }}\) is also a random Gaussian matrix with the same statistical properties as \({\varvec{\Phi }}\). Thus, in the study of the eigenvalues in the compressed domain, we can safely assume \({\varvec{\Psi }}=\mathbf{I }\) to simplify the maths. This means we can express

$$\begin{aligned} {\varvec{\Sigma }}_{\mathbf{x }} = \left[\begin{array}{cc} {\varvec{\Lambda }}_{K}&\mathbf{0 }\\ \mathbf{0 }&{\varvec{\Lambda }}_{R} \end{array}\right] \end{aligned}$$
(18)

where \({\varvec{\Lambda }}_{K} = \text{ diag}(\lambda _1,\ldots ,\lambda _K)\) is the diagonal matrix of the \(K\) principal eigenvalues in the original domain, with \(\lambda _1\ge \lambda _2\ge \cdots \ge \lambda _K\), and \({\varvec{\Lambda }}_{R}\) is the diagonal sub-matrix of the residual eigenvalues. Assume that the residual eigenvalues are sufficiently small compared with the principal eigenvalues, and denote the first \(K\) columns of the projection matrix \({\varvec{\Phi }}\) as \({\varvec{\Phi }}_K\); then the compressed covariance matrix can be written as

$$\begin{aligned} {\varvec{\Sigma }}_{\mathbf{y }} = {\varvec{\Phi }}_K {\varvec{\Lambda }}_{K} {\varvec{\Phi }}_K^T. \end{aligned}$$
(19)

We shall now show that \({\varvec{\Sigma }}_{\mathbf{y }}\) also has a matching principal subspace, in the sense that the \(K\) principal eigenvalues of \({\varvec{\Sigma }}_{\mathbf{y }}\), which we denote by \(\xi _1,\ldots ,\xi _K\), are close to the \(K\) principal eigenvalues of \({\varvec{\Sigma }}_{\mathbf{x }}\) (equivalently, of \({\varvec{\Lambda }}_K\)), while the remaining eigenvalues are small. To do so, we define the intermediate covariance matrix

$$\begin{aligned} {\varvec{\Sigma }}_{\mathbf{z }}&= {\varvec{\Phi }}_K^T {\varvec{\Sigma }}_{\mathbf{y }} {\varvec{\Phi }}_K \nonumber \\&= {\varvec{\Phi }}_K^T {\varvec{\Phi }}_K \ \ {\varvec{\Lambda }}_{K} \ \ {\varvec{\Phi }}_K^T {\varvec{\Phi }}_K. \end{aligned}$$
(20)

Denote as \(\kappa _1,\kappa _2,\ldots ,\kappa _K\) the eigenvalues of \({\varvec{\Sigma }}_{\mathbf{z }}\). Our strategy is first to show that \(|\xi _i-\kappa _i|, i=1,\ldots ,K\) are small, and \(|\lambda _i-\kappa _i|, i=1,\ldots ,K\) are also small. Then we deduce the bound on \(|\xi _i-\lambda _i|, i=1,\ldots ,K\).

The mathematical foundation of our proof consists of

  • The concentration bound on the spectral norm of Gaussian random matrices [see Geman (1980) and Candes et al. (2006) for example]. It follows from this theory that for the Gaussian random matrix \({\varvec{\Phi }}_K\) of size \(M\times K\) where each entry follows \(\mathcal N (0,1/\sqrt{M})\), the extreme singular values satisfy, for any \(t>0\):

    $$\begin{aligned}&\Pr \left(\sigma _\mathrm{max}({\varvec{\Phi }}_K) < 1 + \sqrt{K/M} + t\right) \ge 1 - e^{-M\frac{t^{2}}{2}}, \end{aligned}$$
    (21)
    $$\begin{aligned}&\Pr \left(\sigma _\mathrm{min}({\varvec{\Phi }}_K) > 1 - \sqrt{K/M} - t\right) \ge 1 - e^{-M\frac{t^{2}}{2}}. \end{aligned}$$
    (22)

    Let \(\delta = e^{-M\frac{t^{2}}{2}}\), or equivalently \(t = \sqrt{2\ln (1/\delta )/M}\); then the following hold with a probability of at least \(1 - \delta \),

    $$\begin{aligned}&\sigma _\mathrm{max}({\varvec{\Phi }}_K) \le 1 + \sqrt{K/M} + \sqrt{2\ln (1/\delta )/M}, \end{aligned}$$
    (23)
    $$\begin{aligned}&\sigma _\mathrm{min}({\varvec{\Phi }}_K) \ge 1 - \sqrt{K/M} - \sqrt{2\ln (1/\delta )/M}. \end{aligned}$$
    (24)

    Denote \(\varDelta = \sqrt{K/M} + \sqrt{2\ln (1/\delta )/M}\); then the concentration bound implies that the Gaussian random matrix \({\varvec{\Phi }}_K\) is approximately unitary, in the sense that its singular values are close to 1 when \(K\ll M\): with a probability of at least \(1 - \delta \),

    $$\begin{aligned} 1 - \varDelta \le \sigma _\mathrm{min}({\varvec{\Phi }}_K) \le \sigma _\mathrm{max}({\varvec{\Phi }}_K) \le 1 +\varDelta . \end{aligned}$$
    (25)

    It is also useful to note that when \(K\ll M\), the variation \(\varDelta \) is sufficiently small and thus we can deduce

    $$\begin{aligned} 1 - 2\varDelta \le \sigma _\mathrm{min}({\varvec{\Phi }}_K^T{\varvec{\Phi }}_K) \le \sigma _\mathrm{max}({\varvec{\Phi }}_K^T{\varvec{\Phi }}_K) \le 1 +2\varDelta . \end{aligned}$$
    (26)
  • The approximate invariant subspace Theorem 8.1.11 in Golub and Van Loan (1996). This theorem bounds the change in the singular values of a symmetric matrix when it is projected from a high-dimensional to a low-dimensional space using an approximately unitary transformation. Suppose \({\varvec{\Sigma }}_1 \in \mathbb R ^{n\times n}\) is a symmetric matrix, and \(\mathbf{T } \in \mathbb R ^{n\times k}\) is an approximately unitary transformation matrix. Then the \(k\) largest singular values of \({\varvec{\Sigma }}_2 = {\varvec{T}}^T{\varvec{\Sigma }}_1{\varvec{T}}\) are close to those of \({\varvec{\Sigma }}_1\) in the following sense

    $$\begin{aligned} |\sigma _i({\varvec{\Sigma }}_2 ) - \sigma _i({\varvec{\Sigma }}_1 ) | \le \sqrt{2}\left( \frac{\Vert {\varvec{\Sigma }}_1{\varvec{T}}-{\varvec{T}}{\varvec{\Sigma }}_2 \Vert _2 }{\sigma _k({\varvec{T}})} + \Vert {\varvec{T}}^T{\varvec{T}} - {\varvec{I}}_k\Vert _2 \Vert {\varvec{\Sigma }}_1 \Vert _2 \right), \end{aligned}$$
    (27)

    where \(\Vert \bullet \Vert _2\) denotes the matrix 2-norm.

With the above results, we are now ready to obtain the bounds as follows.

For  \({\varvec{\Sigma }}_{{\varvec{y}}}\) and \({\varvec{\Sigma }}_{{\varvec{z}}} = {\varvec{\Phi }}_K^T {\varvec{\Sigma }}_{{\varvec{y}}} {\varvec{\Phi }}_K \):

$$\begin{aligned} |\xi _i-\kappa _i| \le \sqrt{2} \left(\frac{\Vert {\varvec{\Sigma }}_{{\varvec{y}}}{\varvec{\Phi }}_K-{\varvec{\Phi }}_K{\varvec{\Sigma }}_{{\varvec{z}}}\Vert _2 }{\sigma _K({\varvec{\Phi }}_K)} + \Vert {\varvec{\Phi }}_K^T{\varvec{\Phi }}_K-{\varvec{I}}_K\Vert _2 \Vert {\varvec{\Sigma }}_{{\varvec{y}}}\Vert _2 \right). \end{aligned}$$
(28)

Using (25) and (26), we bound each term with a probability of at least \(1-\delta \) as follows

$$\begin{aligned} \Vert {\varvec{\Sigma }}_{{\varvec{y}}}{\varvec{\Phi }}_K-{\varvec{\Phi }}_K{\varvec{\Sigma }} _{{\varvec{z}}}\Vert _2&= \Vert {\varvec{\Phi }}_K{\varvec{\Lambda }}_K{\varvec{\Phi }}_K^T{\varvec{\Phi }} _K-{\varvec{\Phi }}_K{\varvec{\Phi }}_K^T{\varvec{\Sigma }}_{{\varvec{y}}}{\varvec{\Phi }}_K \Vert _2 \end{aligned}$$
(29)
$$\begin{aligned}&= \Vert {\varvec{\Phi }}_K{\varvec{\Lambda }}_K{\varvec{\Phi }}_K^T{\varvec{\Phi }}_K-{\varvec{\Phi }}_K{\varvec{\Phi }}_K^T{\varvec{\Phi }}_K{\varvec{\Lambda }}_K{\varvec{\Phi }}_K^T{\varvec{\Phi }}_K \Vert _2 \end{aligned}$$
(30)
$$\begin{aligned}&= \Vert {\varvec{\Phi }}_K ({\varvec{I}}_K-{\varvec{\Phi }}_K^T{\varvec{\Phi }}_K ){\varvec{\Lambda }}_K {\varvec{\Phi }}_K^T{\varvec{\Phi }}_K\Vert _2 \end{aligned}$$
(31)
$$\begin{aligned}&\le \Vert {\varvec{\Phi }}_K\Vert _2 \Vert ({\varvec{I}}_K-{\varvec{\Phi }}_K^T{\varvec{\Phi }}_K )\Vert _2 \Vert {\varvec{\Lambda }}_K \Vert _2 \Vert {\varvec{\Phi }}_K^T{\varvec{\Phi }}_K\Vert _2 \end{aligned}$$
(32)
$$\begin{aligned}&\le (1+\varDelta ) \times 2\varDelta \times \lambda _1 \times (1+2\varDelta ) \approx 2\varDelta \lambda _1, \end{aligned}$$
(33)
$$\begin{aligned} \Vert {\varvec{\Phi }}_K^T{\varvec{\Phi }}_K-{\varvec{I}}_K\Vert _2&\le 2\varDelta , \end{aligned}$$
(34)
$$\begin{aligned} \Vert {\varvec{\Sigma }}_{{\varvec{y}}}\Vert _2&= \Vert {\varvec{\Phi }}_K{\varvec{\Lambda }}_K{\varvec{\Phi }}_K^T \Vert _2 \end{aligned}$$
(35)
$$\begin{aligned}&\le \Vert {\varvec{\Phi }}_K \Vert _2 \Vert {\varvec{\Lambda }}_K \Vert _2 \Vert {\varvec{\Phi }}_K^T \Vert _2 \end{aligned}$$
(36)
$$\begin{aligned}&\le (1+\varDelta ) \times \lambda _1 \times (1+\varDelta ) \approx (1+2\varDelta )\lambda _1, \end{aligned}$$
(37)
$$\begin{aligned} \sigma _K({\varvec{\Phi }}_K)&\ge 1 - \varDelta . \end{aligned}$$
(38)

Using these bounds, it follows that with a probability of at least \(1-\delta \)

$$\begin{aligned} |\xi _i-\kappa _i| \le \sqrt{2} \left(\frac{2\varDelta \lambda _1 }{1-\varDelta }+2\varDelta \lambda _1 \ \right) \approx 4\sqrt{2}\varDelta \lambda _1, \ \ \ i=1,2,\ldots ,K. \end{aligned}$$
(39)

For  \({\varvec{\Sigma }}_{{\varvec{x}}}={\varvec{\Lambda }}_K\) and \({\varvec{\Sigma }}_{{\varvec{z}}} = {\varvec{\Phi }}_K^T{\varvec{\Phi }}_K {\varvec{\Lambda }}_K {\varvec{\Phi }}_K^T{\varvec{\Phi }}_K \): Let \({\varvec{T}} ={\varvec{\Phi }}_K^T{\varvec{\Phi }}_K\) then

$$\begin{aligned} |\lambda _i-\kappa _i| \le \sqrt{2} \left(\frac{ \Vert {\varvec{\Lambda }}_K {\varvec{T}} - {\varvec{T}} {\varvec{\Sigma }}_{{\varvec{z}}}\Vert _2}{\sigma _K({\varvec{T}})} + \Vert {\varvec{T}}^T{\varvec{T}}-\mathbf{I }_K \Vert _2 \Vert {\varvec{\Lambda }}_K \Vert _2 \right). \end{aligned}$$
(40)

Again, we bound each term with a probability of at least \(1-\delta \) as follows (note that \(\mathbf{T }^T=\mathbf{T }\))

$$\begin{aligned} \Vert {\varvec{\Lambda }}_K {\varvec{T}} - {\varvec{T}} {\varvec{\Sigma }}_{{\varvec{z}}}\Vert _2&= \Vert {\varvec{\Lambda }}_K {\varvec{T}} - {\varvec{T}}{\varvec{T}} {\varvec{\Lambda }}_K{\varvec{T}} \Vert _2 \end{aligned}$$
(41)
$$\begin{aligned}&= \Vert (\mathbf{I }_K - \mathbf{T }^T\mathbf{T }){\varvec{\Lambda }}_K{\varvec{T}}\Vert _2 \end{aligned}$$
(42)
$$\begin{aligned}&\le \Vert (\mathbf{I }_K - \mathbf{T }^T\mathbf{T }) \Vert _2 \Vert {\varvec{\Lambda }}_K\Vert _2 \Vert {\varvec{T}}\Vert _2 \end{aligned}$$
(43)
$$\begin{aligned}&\le 4\varDelta \times \lambda _1 \times (1+2\varDelta ) \approx 4\varDelta \lambda _1, \end{aligned}$$
(44)
$$\begin{aligned} \sigma _K({\varvec{T}})&\ge 1-2\varDelta , \end{aligned}$$
(45)
$$\begin{aligned} \Vert {\varvec{T}}^T{\varvec{T}}-\mathbf{I }_K \Vert _2&\le 4\varDelta . \end{aligned}$$
(46)

Using these bounds, it follows that with a probability of at least \(1-\delta \)

$$\begin{aligned} |\lambda _i-\kappa _i| \le \sqrt{2} \left(\frac{4\varDelta \lambda _1 }{1-2\varDelta }+4\varDelta \lambda _1 \ \right) \approx 8\sqrt{2}\varDelta \lambda _1, \ \ \ i=1,2,\ldots ,K. \end{aligned}$$
(47)

Thus, from (39) and (47) we can deduce the bound on \(|\lambda _i-\xi _i|, i=1,\ldots ,K\). We note that the triangle inequality immediately gives

$$\begin{aligned} |\lambda _i-\xi _i| \le |\lambda _i-\kappa _i| + |\xi _i-\kappa _i|. \end{aligned}$$
(48)

However, such a bound can still be improved. This is because, if we take \({\varvec{\Lambda }}_K\) as a reference, the singular values of \({\varvec{\Sigma }}_{\mathbf{y }}\) are also the singular values of \({\varvec{\Phi }} \left[ \begin{array}{cc}{\varvec{\Lambda }}_K&{\varvec{0}} \\ {\varvec{0}}&{\varvec{0}} \end{array} \right] {\varvec{\Phi }}^T\). Thus, every time we perform an approximately orthogonal projection by \({\varvec{\Phi }}_K\), the singular values move further away from those of the original matrix. Also, due to the construction of \({\varvec{\Sigma }}_{\mathbf{z }} = {\varvec{\Phi }}_K^T {\varvec{\Sigma }}_{\mathbf{y }} {\varvec{\Phi }}_K = {\varvec{\Phi }}_K^T{\varvec{\Phi }}_K {\varvec{\Lambda }}_K{\varvec{\Phi }}_K^T{\varvec{\Phi }}_K\), we conclude that the singular values of \({\varvec{\Sigma }}_{\mathbf{z }}\) move even further away from those of \({\varvec{\Lambda }}_K\). This can be clearly seen for the largest singular value, for which we have shown that

$$\begin{aligned} \xi _1&\le (1+2\varDelta )\lambda _1, \end{aligned}$$
(49)
$$\begin{aligned} \kappa _1&\le (1+2\varDelta )\xi _1 \le (1+4\varDelta )\lambda _1. \end{aligned}$$
(50)

Thus, every time an approximately orthogonal transformation is applied, the bound on the singular values becomes larger. This implies that the tightest bound on \(|\lambda _i-\xi _i|\) can be obtained as the difference between the bounds on \(|\lambda _i-\kappa _i|\) and \(|\xi _i-\kappa _i|\). It follows that with a probability of at least \(1-\delta \)

$$\begin{aligned} |\lambda _i-\xi _i| \le 4\sqrt{2} \lambda _1 \left(\sqrt{\frac{K}{M}} + \sqrt{\frac{2\ln (1/\delta )}{M}} \right) \ \ \ i=1,2,\ldots ,K. \end{aligned}$$
(51)
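The bound (51) is easy to check numerically. The sketch below draws a Gaussian projection matrix, compares the top-\(K\) eigenvalues of \({\varvec{\Sigma }}_{\mathbf{y }}={\varvec{\Phi }}{\varvec{\Sigma }}_{\mathbf{x }}{\varvec{\Phi }}^T\) with those of \({\varvec{\Sigma }}_{\mathbf{x }}\), and prints the right-hand side of (51); the particular choices of \(N\), \(M\), \(K\), \(\delta \) and the eigenvalue profile are arbitrary assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
N, M, K, delta = 2000, 400, 4, 0.05

# Diagonal covariance: K dominant eigenvalues, small residual eigenvalues.
lam = np.concatenate([np.array([10.0, 8.0, 6.0, 4.0]), 0.01 * np.ones(N - K)])
Sigma_x = np.diag(lam)

# Gaussian projection with entries of standard deviation 1 / sqrt(M).
Phi = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))
Sigma_y = Phi @ Sigma_x @ Phi.T

xi = np.sort(np.linalg.eigvalsh(Sigma_y))[::-1][:K]      # top-K compressed eigenvalues
Delta = np.sqrt(K / M) + np.sqrt(2.0 * np.log(1.0 / delta) / M)

print("max_i |lambda_i - xi_i|      :", np.abs(lam[:K] - xi).max())
print("bound 4*sqrt(2)*lambda_1*Delta:", 4.0 * np.sqrt(2.0) * lam[0] * Delta)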

8 Proof of Theorem 2 (bound on false alarm rate)

The normalized residual statistic follows a normal distribution, and the false alarm rate depends strongly on its tail. The previous proof has shown that the principal eigenvalues experience only a small change under a random projection. Next, we show that there is also only a small deviation in the false alarm rate. Our strategy is based on perturbation analysis of the tail of the distribution of the decision statistic. Figure 27 illustrates the tail behavior in the original and compressed domains. Here, \(z_{\beta }^X\) denotes the detection threshold for the normalized statistic in the original domain. The false alarm rate for the original data is the area under the distribution curve from \(z_{\beta }^X\) to \(\infty \). Due to compression, suppose that the threshold associated with the compressed data moves to \(z_{\beta }^X\pm \delta z_{\beta }^X\). Then the change in the false alarm rate can be calculated as the change of the tail area, which gives

$$\begin{aligned} \varDelta \Pr (FA) = \mathcal N (z_{\beta }^{x})\delta z_{\beta }^{x}. \end{aligned}$$
(52)

In what follows, we use the results of the previous proof to evaluate this change in the false alarm rate. We note that in the above expression, \(\mathcal N (z_{\beta }^{x})\) is the value of the normal density at \(z_{\beta }^{x}\) and is assumed known for a given desired false alarm rate \(\beta \). For example, with a desired false alarm rate of 1 %, this is approximately \(0.0267\). It then remains to compute the change \(\delta z_{\beta }^{x}\) due to the random projection used to obtain the compressed data.

Fig. 27 Tail behavior for complete data and compressed data. The solid curve represents the tail of the distribution of the normalized statistic when complete data is used. The two dashed curves represent two possible tails of the distribution of the normalized statistic when compressed data is used instead. The shaded areas represent the change in the false alarm rate in the two cases

We start with the result in Sect. 3 of Jackson and Mudholkar (1979), which states that the residual statistic \(Q\), suitably transformed, follows a normal distribution:

$$\begin{aligned} \left(\frac{Q}{\theta _{1}}\right)^{h_{0}} = Z \sim \mathcal N \left(\mu ,\sigma \right), \end{aligned}$$
(53)

where

$$\begin{aligned} \mu = 1 + \frac{\theta _{2}h_{0}(h_{0} - 1)}{\theta _{1}^{2}}, \ \ \sigma ^2 = \frac{2\theta _2 h_0^2}{\theta _1^2} \end{aligned}$$
(54)

and \(\theta _{1} = \sum _{i=K+1}^{N}{\lambda _{i}}, \theta _{2} = \sum _{i= K+1}^{N}{\lambda _{i}^{2}}, \theta _{3} = \sum _{i=K+1}^{N}{\lambda _{i}^{3}}, h_{0} = 1 - \frac{2\theta _{1}\theta _{3}}{3\theta _{2}^{2}} \) and \(K\) being the number of principal components.
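For reference, the detection threshold used in the experiments follows directly from (53) and (54): since the \(1-\beta \) quantile of \(Z\) is \(\mu + C_{\beta }\sigma \), where \(C_{\beta }\) is the \(1-\beta \) percentile of the standard normal distribution, the corresponding threshold on \(Q\) is

$$\begin{aligned} Q_{\beta } = \theta _1\left(1 + \frac{\theta _2 h_0 (h_0-1)}{\theta _1^{2}} + C_{\beta }\frac{h_0\sqrt{2\theta _2}}{\theta _1}\right)^{1/h_0}, \end{aligned}$$

which is the \(Q\)-statistic threshold of Jackson and Mudholkar (1979) referred to in Sect. 3.1.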

Denote by \(z_{\beta }^x, \mu _x, \sigma _x\) and \(z_{\beta }^y, \mu _y, \sigma _y\) the detection threshold, mean, and standard deviation of the respective distributions in the original and compressed domains, and by \(C_{\beta }\) the \(1-\beta \) percentile of the normal distribution for a desired false alarm rate \(\beta \). Here,

$$\begin{aligned} \mu _{x} = 1 + \frac{\theta _{2}^{x}h_{0}^{x}(h_{0}^{x} - 1)}{(\theta _{1}^{x})^{2}},&\mu _{y} = 1 + \frac{\theta _{2}^{y}h_{0}^{y}(h_{0}^{y} - 1)}{(\theta _{1}^{y})^{2}},\nonumber \\ \sigma _{x}^{2} = \frac{2\theta _{2}^{x}(h_{0}^{x})^{2}}{(\theta _{1}^{x})^{2}},&\sigma _{y}^{2} = \frac{2\theta _{2}^{y}(h_{0}^{y})^{2}}{(\theta _{1}^{y})^{2}} \end{aligned}$$
(55)

Then it follows that

$$\begin{aligned} \frac{z_{\beta }^{x} - \mu _{x}}{\sigma _{x}} = \frac{z_{\beta }^{y} - \mu _{y}}{\sigma _{y}} = C_{\beta }. \end{aligned}$$
(56)

By writing \(z_{\beta }^{x} - z_{\beta }^{y} = ( \mu _{x} - \mu _{y}) + C_{\beta }(\sigma _{x} - \sigma _{y})\) and applying the triangle inequality, we obtain

$$\begin{aligned} \delta z_{\beta }^{x} = |z_{\beta }^{x} - z_{\beta }^{y}| \le |\mu _{x} - \mu _{y}| + C_{\beta }|\sigma _{x} - \sigma _{y}|. \end{aligned}$$
(57)

We next bound each term in (57). Once again, we use perturbation analysis by considering the functions

$$\begin{aligned} f_{\mu }(\theta _1,\theta _2,\theta _3)&= 1 + \frac{\theta _2 h_0 (h_0-1)}{\theta _1^2} = 1 - \frac{\theta _2}{\theta _1^2}\left(1-\frac{2\theta _1\theta _3}{3\theta _2^2}\right)\frac{2\theta _1\theta _3}{3\theta _2^2} \end{aligned}$$
(58)
$$\begin{aligned}&= 1 - \frac{2\theta _3}{3\theta _1\theta _2} + \frac{4\theta _3^2}{9\theta _2^3}, \end{aligned}$$
(59)
$$\begin{aligned} f_{\sigma }(\theta _1,\theta _2,\theta _3)&= \frac{\sqrt{2\theta _2}h_0}{\theta _1} =\frac{\sqrt{2\theta _2}}{\theta _1}-\frac{2\sqrt{2}\theta _3}{3\theta _2^{3/2}}. \end{aligned}$$
(60)

These functions allow the computation of the changes in the mean and standard deviation. For example,

$$\begin{aligned} |( \mu _{x} - \mu _{y})| \approx \sum _{i=1}^{3} \left|\frac{\partial f_{\mu }(\theta _1,\theta _2,\theta _3)}{ \partial \theta _i} \right| |\theta _i^x-\theta _i^y|. \end{aligned}$$
(61)

We note that the partial derivatives are straightforward, so it remains to derive \(|\theta _i^x-\theta _i^y|\). To do so, we use the result in Jackson and Mudholkar (1979) that

$$\begin{aligned} \theta _{i}^{x} = \mathsf{tr }(\mathbf{\Sigma _{x} }^{i}) - \sum _{j=1}^{K} (\lambda _{j}^{x})^{i}, \ \ i = 1,2,3, \end{aligned}$$
(62)
$$\begin{aligned} \theta _{i}^{y} = \mathsf{tr }(\mathbf{\Sigma _{y} }^{i}) - \sum _{j=1}^{K} (\xi _{j}^{y})^{i}, \ \ i = 1,2,3. \end{aligned}$$
(63)

From (62) and (63), we obtain

$$\begin{aligned} \delta _{1} = \theta _{1}^{x} - \theta _{1}^{y} = (\mathsf{tr }(\mathbf{\Sigma _{x} })- \mathsf{tr }(\mathbf{\Sigma _{y} })) - \sum _{j=1}^{K}(\lambda _{j}^{x} - \xi _{j}^{y}). \end{aligned}$$
(64)

We now attempt to bound each RHS term of (64). First, for the trace terms we recall that

$$\begin{aligned} \mathsf{tr }(\mathbf{\Sigma _{x} })- \mathsf{tr }(\mathbf{\Sigma _{y} }) = \mathsf{tr }(\mathbf{\Sigma _{x} })- \mathsf{tr }(\mathbf{\Phi }\mathbf{\Sigma _{x} }\mathbf{\Phi }^{T}) \end{aligned}$$
(65)

From the previous remark, we can assume that \(\mathbf{\Sigma }_{\mathbf{x }}\) is a diagonal matrix, \(\mathbf{\Sigma _{x} }= \text{ diag}(\lambda _1,\ldots ,\lambda _N)\). Let \({\varvec{\phi }}_{i}\) be the \(i\)th column of the matrix \(\mathbf{\Phi }^{T}\); then

$$\begin{aligned} \mathsf{tr }(\mathbf{\Phi }\mathbf{\Sigma _{x} }\mathbf{\Phi ^{T} }) =\sum _{i=1}^{M}\left(\sum _{j=1}^{N}{\phi }_{ij}^{2}\lambda _{j}\right) = \sum _{j=1}^{N}\left[ \lambda _{j} \left(\sum _{i=1}^{M}{\phi }_{ij}^{2}\right) \right] \end{aligned}$$
(66)

Since each column of the matrix \(\mathbf{\Phi }\) is normalized to unit norm,

$$\begin{aligned} \mathsf{tr }(\mathbf{\Phi }\mathbf{\Sigma _{x} }\mathbf{\Phi ^{T} }) = \mathsf{tr }(\mathbf{\Sigma _{x} }). \end{aligned}$$
(67)

Thus, the trace term in (64) is zero. Meanwhile, it follows from (51) with a probability of at least \(1-\delta \) that

$$\begin{aligned} |\theta _{1}^{x} - \theta _{1}^{y}| \le K 4\sqrt{2} \varDelta \lambda _1, \end{aligned}$$
(68)

where \(\varDelta = \sqrt{K/M} + \sqrt{2\ln (1/\delta )/M}\) is a small number. Similarly, for the second- and third-order terms

$$\begin{aligned} |\theta _{2}^{x} - \theta _{2}^{y}|&\le K 4\sqrt{2} \times 2\varDelta \lambda _1^2 = 8\sqrt{2}K\lambda _1^2 \varDelta , \end{aligned}$$
(69)
$$\begin{aligned} |\theta _{3}^{x} - \theta _{3}^{y}|&\le K 4\sqrt{2} \times 3\varDelta \lambda _1^3 = 12\sqrt{2}K\lambda _1^3 \varDelta . \end{aligned}$$
(70)

Thus, we can bound the variation in the mean as follows

$$\begin{aligned} |\mu _x -\mu _y| \le 4\sqrt{2}K\lambda _1 \left( \left|\frac{\partial f_{\mu }}{\partial \theta _1}\right| + 2\left|\frac{\partial f_{\mu }}{\partial \theta _2}\right|\lambda _1 + 3\left|\frac{\partial f_{\mu }}{\partial \theta _3}\right|\lambda _1^2 \right) \varDelta , \end{aligned}$$
(71)

where the partial derivatives are

$$\begin{aligned} \frac{\partial f_{\mu }}{\partial \theta _1}&= \frac{2\theta _3}{3\theta _1^2\theta _2} \end{aligned}$$
(72)
$$\begin{aligned} \frac{\partial f_{\mu }}{\partial \theta _2}&= \frac{2\theta _3}{3\theta _1\theta _2^2}- \frac{4\theta _3^2}{3\theta _2^4} \end{aligned}$$
(73)
$$\begin{aligned} \frac{\partial f_{\mu }}{\partial \theta _3}&= -\frac{2}{3\theta _1\theta _2}+ \frac{8\theta _3}{9\theta _2^3} \end{aligned}$$
(74)

Similarly,

$$\begin{aligned} |\sigma _x -\sigma _y| \le 4\sqrt{2}K\lambda _1 \left( \left|\frac{\partial f_{\sigma }}{\partial \theta _1}\right| + 2\left|\frac{\partial f_{\sigma }}{\partial \theta _2}\right|\lambda _1+ 3\left|\frac{\partial f_{\sigma }}{\partial \theta _3}\right|\lambda _1^2 \right) \varDelta . \end{aligned}$$
(75)

where the partial derivatives are

$$\begin{aligned} \frac{\partial f_{\sigma }}{\partial \theta _1}&= -\frac{\sqrt{2\theta _2}}{\theta _1^2}\end{aligned}$$
(76)
$$\begin{aligned} \frac{\partial f_{\sigma }}{\partial \theta _2}&= \frac{1}{\theta _1\sqrt{2\theta _2}} + \frac{\sqrt{2}\theta _3}{\theta _2^{5/2}} \end{aligned}$$
(77)
$$\begin{aligned} \frac{\partial f_{\sigma }}{\partial \theta _3}&= -\frac{2\sqrt{2}}{3\theta _2^{3/2}}. \end{aligned}$$
(78)

As \(\theta _i,C_{\beta },K\) are constants when we study the bound on the change in the false alarm as a function of \(M\), all the derivatives are constant. The above results indicate that the change in the false alarm depends linearly on \(\varDelta \), which implies with a probability of at least \(1-\delta \):

$$\begin{aligned} \varDelta \Pr (FA) \le C\left(\sqrt{\frac{K}{M}} + \sqrt{\frac{2\ln (1/\delta )}{M}} \right), \end{aligned}$$
(79)

where the constant \(C\) is

$$\begin{aligned} C&= \mathcal N (z_{\beta }^x)\times 4\sqrt{2}K\lambda _1 \left( \left|\frac{\partial f_{\mu }}{\partial \theta _1}\right| + 2\left|\frac{\partial f_{\mu }}{\partial \theta _2}\right|\lambda _1 + 3\left|\frac{\partial f_{\mu }}{\partial \theta _3}\right|\lambda _1^2 \right. \nonumber \\&\left. + C_{\beta }\left|\frac{\partial f_{\sigma }}{\partial \theta _1}\right| + 2C_{\beta }\left|\frac{\partial f_{\sigma }}{\partial \theta _2}\right|\lambda _1 + 3C_{\beta }\left|\frac{\partial f_{\sigma }}{\partial \theta _3}\right|\lambda _1^2 \right) \nonumber \\&= \mathcal N (z_{\beta }^x)\times 4\sqrt{2}K\lambda _1 \left( \frac{2\theta _3}{3\theta _1^2\theta _2} + 2\left|\frac{2\theta _3}{3\theta _1\theta _2^2}- \frac{4\theta _3^2}{3\theta _2^4} \right|\lambda _1 + 3\left|\frac{2}{3\theta _1\theta _2}- \frac{8\theta _3}{9\theta _2^3} \right|\lambda _1^2 \right. \nonumber \\&\left. + C_{\beta }\left|\frac{\sqrt{2\theta _2}}{\theta _1^2} \right| + 2C_{\beta }\left|\frac{1}{\theta _1\sqrt{2\theta _2}} + \frac{\sqrt{2}\theta _3}{\theta _2^{5/2}} \right|\lambda _1 + 3C_{\beta }\left| \frac{2\sqrt{2}}{\theta _2^{3/2}}\right|\lambda _1^2 \right), \nonumber \\&\end{aligned}$$
(80)

where the following quantities are as defined previously

  • \(\mathcal N (z_{\beta }^x)\) is the value of the normal distribution at \(z_{\beta }^x\) for a given desired false alarm rate \(\beta \).

  • \(\lambda _1,\lambda _2,\ldots ,\lambda _N\) are the \(N\) eigenvalues of the data covariance matrix, and \(\theta _i, i=1,2,3\) are defined as

    $$\begin{aligned} \theta _{1} = \sum _{i=K+1}^{N}{\lambda _{i}}, \ \ \theta _{2} = \sum _{i=K+1}^{N}{\lambda _{i}^{2}}, \ \ \theta _{3} = \sum _{i=K+1}^{N}{\lambda _{i}^{3}}. \end{aligned}$$
    (81)
  • \(C_{\beta }\) is the \(1-\beta \) percentile of the normal distribution.

We note that the bound derived might not be tight because the approximations used in (69) and (70) are very rough. We show that tighter bounds can be obtained if we make the following assumptions:

  • All principal eigenvalues are the same and all residual eigenvalues are the same. Furthermore, we assume that \(\lambda _1=\lambda _2=\cdots =\lambda _K=1\) (because scaling the data does not change the result) and \(\lambda _{K+1}=\cdots =\lambda _{N} = \sigma ^2_X\).

  • The energy is concentrated in the principal subspace such that \(K\lambda _1 = \rho (N-K)\sigma ^2_X\). Here, \(\rho \) is the ratio of the principal energy to the residual energy. For example, with 80 % energy in the principal subspace, we have \(\rho =4\). We assume that \(\rho \gg 1\).

Denote by \(\sigma _X^2\) the residual eigenvalue of the complete data, and by \(\sigma _Y^2\) the residual eigenvalue of the compressed data. It follows from the definition of \(\rho \) (with \(\lambda _1=1\)) that \(K=\rho (N-K)\sigma _X^2\), and thus \(\sigma _X^2\approx \frac{K}{N\rho }\) for \(K\ll N\). We next obtain a bound on \(\sigma _Y^2\). To do so, we rely on (68), which yields

$$\begin{aligned} |(N-K)\sigma _X^2-(M-K)\sigma _Y^2| \le K 4\sqrt{2}\varDelta , \end{aligned}$$
(82)

where \(\varDelta \) is as defined in the proof of Theorem 1. Under the assumption that \(K\ll M\) and \(K \ll N\), we can approximate this as \(|N\sigma _X^2-M\sigma _Y^2| \le K 4\sqrt{2}\varDelta \), from which we deduce the bound on \(\sigma _Y^2\) as

$$\begin{aligned} \frac{K}{M\rho }(1-4\sqrt{2}\varDelta \rho ) \le \sigma _Y^2 \le \frac{K}{M\rho }(1+4\sqrt{2}\varDelta \rho ). \end{aligned}$$
(83)

This allows us to derive better bounds on \(|\theta _2^X-\theta _2^Y|\) and \(|\theta _3^X-\theta _3^Y|\). In particular,

$$\begin{aligned} |\theta _2^X-\theta _2^Y|&= |(N-K)\sigma _X^4-(M-K)\sigma _Y^4| \end{aligned}$$
(84)
$$\begin{aligned}&\approx |N\sigma _X^4 - M \sigma _Y^4| \end{aligned}$$
(85)
$$\begin{aligned}&\le \max \left\{ \left|\frac{K^2}{N\rho ^2} - \frac{K^2}{M\rho ^2}(1\pm 4\sqrt{2}\varDelta \rho )^2\right| \right\} , \end{aligned}$$
(86)
$$\begin{aligned} |\theta _3^X-\theta _3^Y|&= |(N-K)\sigma _X^6-(M-K)\sigma _Y^6| \end{aligned}$$
(87)
$$\begin{aligned}&\approx |N\sigma _X^6 - M \sigma _Y^6| \end{aligned}$$
(88)
$$\begin{aligned}&\le \max \left\{ \left|\frac{K^3}{N^2\rho ^3} - \frac{K^3}{M^2\rho ^3}(1\pm 4\sqrt{2}\varDelta \rho )^3\right| \right\} . \end{aligned}$$
(89)

This yields a better probabilistic bound as

$$\begin{aligned}&\!\!\!\!\!\varDelta \Pr (FA) \le \mathcal N (z_{\beta }^x) \left( \left(\frac{2\theta _3}{3\theta _1^2\theta _2}+ \frac{\sqrt{2\theta _2}}{\theta _1^2}\right) K4\sqrt{2}\varDelta \right. \nonumber \\&\quad + \left(\left|\frac{2\theta _3}{3\theta _1\theta _2^2}- \frac{4\theta _3^2}{3\theta _2^4} \right|+ \left|\frac{1}{\theta _1\sqrt{2\theta _2}} + \frac{\sqrt{2}\theta _3}{\theta _2^{5/2}} \right| \right)\nonumber \\&\quad \times \max \left\{ \left|\frac{K^2}{N\rho ^2} - \frac{K^2}{M\rho ^2}(1\pm 4\sqrt{2}\varDelta \rho )^2\right| \right\} \nonumber \\&\quad + \left. \left(\left|\frac{2}{3\theta _1\theta _2}- \frac{8\theta _3}{9\theta _2^3} \right| + \left| \frac{2\sqrt{2}}{\theta _2^{3/2}}\right| \right) \max \left\{ \left|\frac{K^3}{N^2\rho ^3} - \frac{K^3}{M^2\rho ^3}(1\pm 4\sqrt{2}\varDelta \rho )^3\right| \right\} \right).\nonumber \\ \end{aligned}$$
(90)

Unfortunately, this expression is a rather complex function of \(M\) and its dependency on \(M\) is not as easy to see as in the previous bound. However, we note that it is actually decreasing in \(M\). In Fig. 28 we plot the actual average deviation in the false alarm rate over 10 runs for a synthetic case with \(\rho =4\) (80 % principal energy), \(N=2{,}000\), and a desired false alarm rate of 1 %. We note that the theoretical bound gets tighter as \(M\) increases, and is tightest when \(M\approx N\).

Fig. 28 Theoretical vs actual false alarm rate deviation using compressed data

9 Preservation of Gaussianity under unitary transformation

The following lemma may appear in standard statistical texts. For completeness, the result and its proof are given here to support the claim in the main theorem.

Lemma 1

Suppose that \({\varvec{\Phi }}\in \mathbb R ^{M\times N}\) is an iid random matrix whose entries follow a zero-mean Gaussian distribution with variance \(\sigma ^2\). Let \(\mathbf{U }\in \mathbb R ^{N\times N}\) be a unitary matrix. Then \({\varvec{\Phi }}^{\prime }={\varvec{\Phi }}\mathbf{U }\) is also an iid Gaussian random matrix with the same variance \(\sigma ^2\).

First we prove that \(\mathsf{E }[\phi _{ij}^{\prime }]=\mathsf{E }[\phi _{ij}]=0\) and \(\mathsf{Var }[\phi _{ij}^{\prime }]=\sigma ^2\). We start from \(\phi _{ij}^{\prime } = \sum _{k=1}^{N}\phi _{ik}u_{kj}.\) Thus \( \mathsf{E }[\phi _{ij}^{\prime }] = \sum _{k=1}^{N}\mathsf{E }[\phi _{ik}]u_{kj} = 0\), while, due to the iid assumption,

$$\begin{aligned} \mathsf{Var }[\phi _{ij}^{\prime }] = \sum _{k=1}^{N}\mathsf{Var }[\phi _{ik}]u_{kj}^2 = \sum _{k=1}^{N}\sigma ^2u_{kj}^2 = \sigma ^2\sum _{k=1}^{N}u_{kj}^2=\sigma ^2. \end{aligned}$$
(91)

Next, we prove that distinct entries are uncorrelated (and hence, being jointly Gaussian, independent) in a similar way. For \((i,j)\ne (m,n)\),

$$\begin{aligned} \mathsf{E }[\phi _{ij}^{\prime } \phi _{mn}^{\prime }]&= \mathsf{E }\left[\sum _{k=1}^{N}\phi _{ik}u_{kj} \sum _{k^{\prime }=1}^{N}\phi _{mk^{\prime }}u_{k^{\prime }n}\right] \end{aligned}$$
(92)
$$\begin{aligned}&= \sum _{k=1}^{N} \sum _{k^{\prime }=1}^{N}\mathsf{E }[\phi _{mk^{\prime }}\phi _{ik}]u_{k^{\prime }n}u_{kj} = \sigma ^2\delta _{im}\sum _{k=1}^{N}u_{kn}u_{kj} \end{aligned}$$
(93)
$$\begin{aligned}&= 0, \end{aligned}$$
(94)

since \(\mathsf{E }[\phi _{mk^{\prime }}\phi _{ik}]=\sigma ^2\delta _{im}\delta _{kk^{\prime }}\) by the iid assumption, and, when \(i=m\) and \(j\ne n\), the orthonormality of the columns of \(\mathbf{U }\) gives \(\sum _{k=1}^{N}u_{kn}u_{kj}=0\). Hence the entries of \({\varvec{\Phi }}^{\prime }\) are uncorrelated jointly Gaussian variables and are therefore independent.
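As a quick numerical sanity check of Lemma 1 (purely illustrative; the dimensions, \(\sigma \) and number of trials are arbitrary assumptions), one can verify that the entries of \({\varvec{\Phi }}\mathbf{U }\) have zero mean, variance \(\sigma ^2\) and negligible cross-correlation:

import numpy as np

rng = np.random.default_rng(1)
M, N, sigma, trials = 4, 6, 0.7, 20000

# Random unitary (orthogonal) U from the QR decomposition of a Gaussian matrix.
U, _ = np.linalg.qr(rng.normal(size=(N, N)))

samples = np.empty((trials, M, N))
for t in range(trials):
    Phi = rng.normal(0.0, sigma, size=(M, N))
    samples[t] = Phi @ U                                  # Phi' = Phi U

print("sample mean (should be ~ 0)          :", samples.mean())
print("sample variance (should be ~ sigma^2):", samples.var(), "vs", sigma ** 2)
print("corr. of two distinct entries (~ 0)  :",
      np.corrcoef(samples[:, 0, 0], samples[:, 1, 2])[0, 1])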