
1 Introduction

Anomaly detection (AD) is one of the crucial tasks of data mining, besides clustering and classification, which automatically detects anomalous data points in a data set. Anomalies (also referred to as outliers) are data points that are significantly different from the other points in the data set. AD has applications in various domains such as fraud detection, medical or public health, intrusion detection, and machine fault detection [2, 13]. While there is a wide range of algorithms proposed in the literature to detect anomalies/outliers, they cannot explain why those outliers are flagged.

Lately, as an attempt to provide such explanations, researchers have become interested in Outlying Aspect Mining (OAM) [3, 9,10,11,12, 14, 15], where the task is to identify in which aspects (subsets of features) a given anomaly/outlier exhibits the most outlying characteristics. In other words, OAM is the task of identifying feature subset(s) in which a given query is significantly inconsistent with the rest of the data.

Fig. 1. University students performance on SSC %, HSC %, Degree %, MBA % and Employment test %. The red square point represents candidate A. (Color figure online)

OAM has many real-world applications. For example, when evaluating job applications, the recruitment team wants to know the strengths/weaknesses of each candidate, i.e., they want to know in which aspect a candidate is outstanding among the applicants. Let us look at an example data set of 215 candidates (Footnote 1) with their scores in percentage (%) in secondary school (SSC), higher secondary (HSC), undergraduate degree, MBA and the Employment test. As shown in Fig. 1(c), Candidate A scored a very high percentage in the Employment test while having quite a low percentage in SSC.

As another example, assume that you are a football coach or commentator and you want to highlight the strengths and/or weaknesses of a player in the most recent game. Similarly, a doctor may want to know in which aspects a given patient's condition is not normal [10].

An OAM algorithm requires two components to detect the outlying aspects of a given query: (i) a scoring measure and (ii) a subspace search technique. Distance or density-based scoring measures are widely used in OAM algorithms. The main drawbacks of these types of measures in OAM are that they are computationally expensive and dimensionality biased. Vinh et al. [14] proposed Z-score normalization to make density-based scoring measures dimensionality unbiased so that they can be used to rank subspaces of different dimensionality. However, this requires computing the outlier score of every data point in every subspace, which adds significant computational overhead and makes OAM algorithms infeasible to run on large and/or high-dimensional datasets.

To summarise, most OAM scoring measures do not work well in practical applications due to the following main reasons:

  • High time complexity: Existing scoring measures use distance or density, and a known weakness of these measures is that they are computationally expensive.

  • Dimensionality unbiasedness: In OAM, subspaces with different dimensionalities are compared to find the best subspace. Thus, we need a dimensionally unbiased measure to rank those subspaces. Existing density or distance-based measures are dimensionally biased.

This paper makes the following contributions:

  • We propose a new scoring measure for outlying aspect mining based on the sGrid density estimator. We extend sGrid to make it a dimensionally unbiased measure, so it does not require any additional normalization. We call the proposed measure sGrid++.

  • We compare sGrid++ against three existing OAM scoring measures using synthetic and real-world datasets.

  • Through our empirical evaluation, we demonstrate that sGrid++ is a dimensionality unbiased measure. In addition to that, it is faster than existing scoring measures.

The rest of the paper is organized as follows. Section 2 provides a summary of previous work on outlying aspect mining. The proposed scoring measure is presented in Sect. 3. Empirical evaluation results are provided in Sect. 4. Finally, conclusions are provided in Sect. 5.

2 Related Work

Let \(\mathcal {O} = \{o_1, o_2, \cdots , o_N\}\) be a collection of N data objects in an M-dimensional real domain. Each object \(\textbf{o}\) is represented as an M-dimensional vector \(\langle o.1, o.2, \cdots , o.M \rangle \). The feature set \(\mathcal {F}\) = \(\{F_1, F_2, \cdots , F_{M}\}\) denotes the full feature space and \(\mathcal {S}_{\mathcal {F}}\) = \(\{ S_1, S_2, \cdots , S_n \}\) is the set of all possible subspaces (i.e., \(|\mathcal {S}_{\mathcal {F}}|=2^M\)). The problem of OAM is to identify the subspace(s) \(S_i \in \mathcal {S}_{\mathcal {F}}\) in which a given query object \(\textbf{q} \in \mathcal {O}\) is significantly different from the rest of the data.

2.1 Problem Formulation

Definition 1 (Problem definition)

Given a set of N instances \(\mathcal {O}\) (\(|\mathcal {O}|\) = N) in an M-dimensional space and a query \(\textbf{q} \in \mathcal {O}\), a subspace S is called an outlying aspect of \(\textbf{q}\) iff,

  • the outlyingness of \(\textbf{q}\) in subspace S is higher than in other subspaces; and

  • there is no other subspace with the same or higher outlyingness.

The main aim of outlying aspect mining is to identify a minimal outlying aspect of a given query.

2.2 Outlying Aspect Mining Techniques

Duan et al. (2015) [3] proposed OAMiner, which employs a depth-first search [8] with a kernel density estimation (KDE) based outlyingness measure. For a given query \({\textbf {q}}\) and data set \(\mathcal {O}\), OAMiner computes the outlyingness as follows:

$$ f_{S}({\textbf {q}}) = \frac{1}{N(2 \pi )^{\frac{m}{2}} \prod \limits _{i \in S} h.{i}} \sum \limits _{x \in \mathcal {O}} e^ {- \sum \limits _{i\in S} \frac{({\textbf {q}}.i - x.i)^2}{2 {h.i}^2}} $$

where \(f_S({\textbf {q}})\) is the kernel density estimate of \({\textbf {q}}\) in subspace S (\(|S|=m\)) and h.i is the kernel bandwidth in dimension i.
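For illustration, the sketch below computes this score directly from the formula above (a minimal Python rendering with hypothetical names kde_score, data and h; it is not the OAMiner implementation).

```python
import numpy as np

def kde_score(q, data, subspace, h):
    """Gaussian KDE of query q in a subspace (list of feature indices).

    q: 1-D array of length M; data: (N, M) array; h: per-dimension bandwidths.
    Illustrative sketch of the formula above, not the original OAMiner code.
    """
    N, m = data.shape[0], len(subspace)
    diff = data[:, subspace] - q[subspace]                    # (x.i - q.i); sign irrelevant after squaring
    expo = np.exp(-np.sum(diff ** 2 / (2 * h[subspace] ** 2), axis=1))
    norm = N * (2 * np.pi) ** (m / 2) * np.prod(h[subspace])  # normalizing constant
    return expo.sum() / norm
```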

OAMiner first ranks the query based on the densities of all data points in each subspace. It then sorts the subspaces in ascending order of this score and returns the top-ranked subspace(s) as the outlying aspects of the given query \({\textbf {q}}\).

Vinh et al. (2016) [14] discussed the issue of using density rank as an outlying measure in OAM and provided some examples where it can be counterproductive. They suggested using the Z-score normalized density to compare subspaces of different dimensionalities. The Z-score computes the outlyingness of query \({\textbf {q}}\) in subspace S as:

$$ \text{ Z } (f_S({\textbf {q}})) \triangleq \frac{f_S({\textbf {q}}) -\mu _{f_S}}{\sigma _{f_S}} $$

where \(\mu _{f_S}\) and \(\sigma _{f_S}\) are the mean and standard deviation of densities of all data instances in subspace S, respectively.
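A minimal sketch of this normalization, reusing the hypothetical kde_score above (computing the density of every instance in the subspace is exactly the overhead discussed in the introduction):

```python
import numpy as np

def z_score_density(q, data, subspace, h):
    """Z-score normalized KDE of q in a subspace (illustrative sketch)."""
    # Densities of all N instances in subspace S -- O(N^2) kernel evaluations.
    dens = np.array([kde_score(x, data, subspace, h) for x in data])
    return (kde_score(q, data, subspace, h) - dens.mean()) / dens.std()
```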

They formulated the concept of dimensionality unbiasedness and proposed Z-score normalization to convert any dimensionality-biased measure into an unbiased one. The authors combined Beam search with the density Z-score measure to identify outlying aspects of a given query.

Wells and Ting (2019) [15] proposed the sGrid density estimator, a smoothed variant of the traditional grid-based estimator (a.k.a. histogram). They also used Z-score normalization to make the score dimensionality unbiased. Because sGrid density can be computed faster than KDE, it allows Beam search OAM to run orders of magnitude faster.

Samariya et al. (2020) [9] proposed the Simple Isolation score using Nearest Neighbor Ensemble (SiNNE in short) measure. SiNNE constructs an ensemble of t models (\(\mathcal {M}_1, \mathcal {M}_2, \cdots , \mathcal {M}_t\)). Each model \(\mathcal {M}_i\) is constructed from a randomly chosen sub-sample (\(\mathcal {D}_i \subset \mathcal {O}\), \(|\mathcal {D}_i| = \psi < N\)). Each model consists of \(\psi \) hyperspheres, where the radius of the hypersphere centred at \(a \in \mathcal {D}_i\) is the Euclidean distance from a to its nearest neighbour in \(\mathcal {D}_i\).

The outlying score of \(\textbf{q}\) in model \(\mathcal {M}_i\) is \(I({\textbf {q}}\Vert \mathcal {M}_i) = 0\) if \({\textbf {q}}\) falls in any of the hyperspheres and 1 otherwise. The final outlying score of \(\textbf{q}\) using t models is:

$$ \text{ SiNNE }({\textbf {q}}) = \frac{1}{t} \sum \limits _{i=1}^t I({\textbf {q}}\Vert \mathcal {M}_i) $$
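The sketch below illustrates this construction (a hedged Python rendering with a brute-force nearest-neighbour search and illustrative names; the authors' SiNNE implementation is in Java):

```python
import numpy as np

def sinne_score(q, data, subspace, t=100, psi=8, seed=None):
    """Average isolation score of q over t models, as in the equation above."""
    rng = np.random.default_rng(seed)
    X, qs = data[:, subspace], q[subspace]
    isolated = 0
    for _ in range(t):
        D = X[rng.choice(len(X), size=psi, replace=False)]   # sub-sample D_i
        # Radius of the hypersphere centred at each a in D_i: distance to its
        # nearest neighbour within D_i (excluding itself).
        dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
        np.fill_diagonal(dists, np.inf)
        radii = dists.min(axis=1)
        # I(q | M_i) = 0 if q falls inside any hypersphere, 1 otherwise.
        inside = np.any(np.linalg.norm(D - qs, axis=1) <= radii)
        isolated += 0 if inside else 1
    return isolated / t
```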

2.3 Desired Properties of Outlying Scoring Measure

In this section, we describe some desired properties of an ideal outlying aspect mining scoring measure. Table 1 summarizes these properties for the existing scoring measures.

Dimensionality Unbiasedness. As we are comparing subspaces with different dimensionality, a scoring measure needs to be unbiased w.r.t. dimensionality. An example of a dimensionally biased scoring measure is density, which decreases as the number of dimensions increases. As a result, density is biased towards higher-dimensional subspaces.

Efficiency. To find an outlying aspect of a given query, OAM algorithms are expected to search through a large number of subspaces. Thus, it is essential to have a scoring measure that can evaluate each subspace efficiently. The efficiency of a scoring measure can be analyzed in terms of its time complexity.

Effectiveness. To find the most outlying aspects of a given query, the OAM algorithm ranks each subspace based on its score. Thus, the scoring measure should be effective in identifying the true outlying aspects. The effectiveness of a scoring measure can be analyzed in terms of the quality of the discovered subspaces.

Table 1. Summary of desired properties for scoring measures. The time complexity of scoring one query in a subspace is shown. (N = data size; m = subspace dimensionality; \(\psi \) = sub-sample size; t = ensemble size; w = block size for bit set operations).

3 sGrid++: The New Proposed Measure

Wells and Ting (2019) [15] introduced a simple and effective alternative to the kernel density estimator called sGrid, which allows systematic search methods to run faster in the outlying aspect mining domain. sGrid is a smoothed variant of the grid-based (a.k.a. histogram) density estimator. It computes the density in a multi-dimensional subspace over a multi-dimensional grid, whose width is set based on the bin width in each single dimension.

sGrid computes the outlying score of a given data point based on the grid cell in which the point falls and its neighbouring cells. The sGrid measure is two orders of magnitude faster than kernel density estimation [15]. However, sGrid is dimensionally biased, i.e., sGrid density tends to decrease as the number of dimensions increases. Thus, the authors used Z-score normalization on top of sGrid, which makes it computationally expensive. In addition, Samariya et al. [9] showed that Z-score normalization is biased towards subspaces having high variance. Moreover, Z-score normalization adds computational overhead to the measure. Thus, sGrid with Z-score normalization is not effective, and, due to its dimensionality biasedness, raw sGrid cannot be used directly.

Motivated by these limitations, we propose sGrid++, a simple yet effective variant of sGrid that is dimensionally unbiased in its raw form and thus does not require any additional normalization.

The proposed method consists of two stages. In the first stage (the training stage), g grid cells are generated. The second stage is the evaluation stage, which evaluates the outlyingness of a given data point \(\textbf{q}\) in each subspace. Let \(\mathcal {O}\) be an M-dimensional data set in \(\Re ^M\), and \(S \in \mathcal {S}_\mathcal {F}\) be a subspace of dimensionality \(m = |S|\), where \(m \le M\).

For each dimension, the proposed measure first creates equal-width univariate bins. We used the Freedman-Diaconis rule [4] to set the bin width and the number of bins in each dimension automatically for a given data set; that is, sGrid++ creates b equal-width bins over the value range.
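As an illustration, the bin count under the Freedman-Diaconis rule can be computed as in the sketch below (per dimension; fd_bins is an illustrative name and the exact edge-case handling in our implementation may differ).

```python
import numpy as np

def fd_bins(values):
    """Freedman-Diaconis rule: bin width = 2 * IQR * n^(-1/3)."""
    n = len(values)
    iqr = np.subtract(*np.percentile(values, [75, 25]))
    width = 2 * iqr * n ** (-1 / 3)
    b = int(np.ceil((values.max() - values.min()) / width)) if width > 0 else 1
    return b, width
```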

Definition 2 (Histogram)

A histogram is a set of b equal width bins, \(H = \{B_1, B_2, \cdots , B_b\}\).

Once the histograms are created, the proposed measure calculates the mass of each bin.

Definition 3

The mass of a region (bin) is defined as the number of data points that fall into it.

Definition 4

The mass of a data instance \(o \in \Re ^M\) with respect to bin \(B_i\) is estimated as follows.

$$ \bar{mass}\big (B_i(o)\big ) = \begin{cases} \textbf{mass}, & \text {where } \textbf{mass} \text { is the mass of bin } B_i \text { in which } o \text { falls} \\ 0, & \text {otherwise} \end{cases} $$

sGrid++ computes density as follows.

$$\begin{aligned} \text{ sGrid++ }(\textbf{q}) = \frac{ \bar{mass}(\mathcal {G}_S({\textbf {q}}))}{v(\mathcal {G}_S({\textbf {q}}))} \end{aligned}$$
(1)

where \(\mathcal {G}_S({\textbf {q}})\) is the grid (cell) in which \({\textbf {q}}\) falls and \(v(\mathcal {G}_S({\textbf {q}}))\) is the volume of grid \(\mathcal {G}_S({\textbf {q}})\).

Assuming that the data is normalized to the range [0, 1], sGrid++ creates an equal-width histogram in each dimension; thus it creates \(b^m\) grid cells of the same volume, because all dimensions share the same range ([0, 1]). So, the volume of each cell in subspace S is

$$\begin{aligned} v(\mathcal {G}_S({\textbf {q}})) = 1/ b^m \end{aligned}$$
(2)

Plugging Eq. 2 into Eq. 1, sGrid++ density of a query \(\textbf{q}\) is estimated as follows.

$$\begin{aligned} \text{ sGrid++ }({\textbf {q}}) = \bar{mass}(\mathcal {G}_S({\textbf {q}})) \cdot b^m \end{aligned}$$
(3)

For example, in a 3-dimensional subspace, let \(\mathcal {G}_{(i,j,k)}(\textbf{q})\) be the grid cell in which \({\textbf {q}}\) falls, with bin indices \((i,j,k) \in \{(1,1,1), \cdots , (b_1, b_2, b_3)\}\) in the respective dimensions. To estimate the final outlying score of query \(\textbf{q}\), sGrid++ uses the mass of the grid cell in which \({\textbf {q}}\) falls. This mass is computed by the bit set intersection operation \(\mathcal {G}_{(i,j,k)} = B_i \cap B_j \cap B_k\), where each \(B\) denotes the bit set of the bin of \({\textbf {q}}\) in the corresponding dimension. In general:

$$\begin{aligned} \bar{mass}(\mathcal {G}_S({\textbf {q}})) = \Big | \bigcap \limits _{i \in S} B_i({\textbf {q}}) \Big | \end{aligned}$$
(4)
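For illustration, the bit set idea can be sketched as follows, using Python integers as bitsets (a stand-in for the Java bit sets with block size w used in the actual implementation; bin_bitsets and cell_mass are illustrative names, and values are assumed normalized to [0, 1]).

```python
def bin_bitsets(column, b):
    """One bitset per bin: bit j is set iff point j falls into that bin."""
    masks = [0] * b
    for j, v in enumerate(column):
        k = min(int(v * b), b - 1)        # bin index; the value 1.0 goes to the last bin
        masks[k] |= 1 << j
    return masks

def cell_mass(q, data, subspace, b):
    """Mass of q's cell = popcount of the intersection of per-dimension bitsets (Eq. 4)."""
    acc = (1 << len(data)) - 1            # start with all N points (data is an (N, M) array)
    for i in subspace:
        masks = bin_bitsets(data[:, i], b)
        acc &= masks[min(int(q[i] * b), b - 1)]
    return bin(acc).count("1")
```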

The final outlying score of query \({\textbf {q}}\) in subspace S is computed as:

$$\begin{aligned} grid_S({\textbf {q}}) = \bar{mass}(\mathcal {G}_S({\textbf {q}})) \cdot b^m \end{aligned}$$
(5)

where \(\mathcal {G}_S(\textbf{q})\) is the grid cell in which \(\textbf{q}\) falls in subspace S, and \(\bar{mass}(\mathcal {G}_S(\cdot ))\) is the mass of grid cell \(\mathcal {G}_S(\cdot )\), computed as shown in Eq. 4.
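Putting the pieces together, a minimal end-to-end sketch of the score is given below (reusing the hypothetical cell_mass above; the actual implementation chooses the number of bins per dimension with the Freedman-Diaconis rule rather than a fixed b).

```python
def sgridpp_score(q, data, subspace, b):
    """sGrid++ score of q in a subspace (Eq. 5): cell mass times b^m."""
    return cell_mass(q, data, subspace, b) * b ** len(subspace)

# A lower score means q falls into a sparser cell, i.e. q is more outlying in S.
```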

A working example of the proposed measure in a two-dimensional space is shown in Fig. 2. Here, \(grid_S (x)\) is 3 while \(grid_S (y)\) is 1; thus, point y is considered more outlying, since a lower score corresponds to a sparser cell.

Fig. 2. Example of the new proposed method. The red highlighted regions show the bins used to estimate the outlyingness of two instances x and y. (Color figure online)

Proposition 1

If point q falls within a grid cell \(\mathcal {G}_S\) that has a high mass, then subspace S is not an outlying aspect of the query q.

Theorem 1

The proposed measure \(grid_S({\textbf {q}})\) is dimensionally-unbiased as per dimensionality unbiasedness definition [14, Definition 4].

Proof

Given a data set \(\mathcal {O}\) of N data instances drawn from a uniform distribution \(\mathcal {U}([0,1]^M)\).

As the data is drawn from a uniform distribution, in any subspace S with \(|S| = m\), each grid cell has the same mass,

$$ \bar{mass}(\cdot ) = \frac{N}{g} $$

where g is the total number of grid cells in S (\(g = b^m\)).

Substituting this mass into Eq. 5, the final outlying score of query \({\textbf {q}}\) is

$$ grid({\textbf {q}}) = \frac{N}{g} \cdot b^m = \frac{N}{g} \cdot g = N $$

Thus, the average value of the sGrid++ scoring measure is

$$ E[grid({\textbf {q}})\mid {\textbf {q}}\in \mathcal {O}] = \frac{1}{N} \sum \limits _{{\textbf {q}}\in \mathcal {O}} grid({{\textbf {q}}}) = \frac{1}{N} \sum \limits _{{\textbf {q}}\in \mathcal {O}} N = \frac{1}{N} \cdot N \cdot N = N, $$

which is constant w.r.t. |S|.

The proposed measure is scalable to both huge datasets and high dimensionality, as we demonstrate in our empirical evaluation.

4 Experiments

A series of experiments were performed to answer the following questions:

  • Dimensionality unbiasedness: Is the proposed measure dimensionally unbiased?

  • Effectiveness: How accurate is the proposed method?

  • Efficiency: Is the proposed method scalable compared to its competitors w.r.t. data set size and dimensionality?

We first provide the experimental setup details before presenting our findings.

4.1 Experimental Setup

Algorithm Implementation and Parameters. All measures were implemented in Java using WEKA [5]. We implemented the proposed measure by making the required changes in the Java implementation of sGrid provided by the authors. We used sGrid and SiNNE Java implementations made available by [15] and [9], respectively.

Z(KDE) uses a Gaussian kernel with the default bandwidth. sGrid uses the recommended default block size for bit set operations, w = 64. For SiNNE, the sub-sample size \(\psi \) = 8 and the ensemble size t = 100. sGrid++ also uses the block size w = 64 for bit set operations.

All experiments were conducted on a macOS machine with a 2.3 GHz 8-core Intel Core i9 processor and 16 GB memory running macOS Monterey 12.4. We ran all jobs for up to 24 h and killed all uncompleted jobs.

Fig. 3. Dimensionality unbiasedness.

4.2 Desired Properties

Dimensionality Unbiasedness.

We generated 19 synthetic datasets, each containing 1,000 data points drawn from a uniform distribution \(\mathcal {U}([0,1]^M)\), where M varies from 2 to 20. We computed the average score of all instances using sGrid++, SiNNE, sGrid, and KDE. The results are presented in Fig. 3. The flat lines for the proposed measure and SiNNE show that both measures are dimensionally unbiased, whereas sGrid and KDE (without Z-score normalization) are not. Note that [14] showed that rank and Z-score normalization make any score dimensionally unbiased; hence, we did not include the normalized variants in this experiment.

Table 2. Comparison of the proposed measure and its three contenders on five synthetic datasets. \(\textbf{q}\)-id represents the query index, GT represents the ground truth. The numbers in brackets are feature indices (i.e., subspaces).

4.3 Mining Outlying Aspects on Synthetic Datasets

We evaluate the performance of sGrid++ and the three contending scoring measures on 5 synthetic datasets (Footnote 2), where the number of instances (N) is 1,000 and the number of dimensions (M) ranges from 10 to 100.

Table 2 summarizes the subspaces discovered for three queries (Footnote 3) by the contending measures on the five synthetic datasets. In terms of exact matches, sGrid++ and SiNNE are the best performing measures, detecting the ground truth for all 15 queries, whereas Z(sGrid) and Z(KDE) produced exact matches for 14 queries.

Table 3. Comparison of the proposed measure and its three contenders on the student performance data set. \({\textbf {q}}\)-id represents the query index. The numbers in brackets are feature indices (i.e., subspaces).
Table 4. Visualization of discovered subspaces by sGrid++, Z(KDE), Z(sGrid) and SiNNE in the student performance data set.

4.4 Mining Outlying Aspects on Student Performance Data Set

While evaluating job applications, the recruiting team wants to know in which aspects an applicant differs from the other applicants. Taking this example as a case study, we detect the outlying aspects of all top k outlier/anomaly students (Footnote 4). We used the campus placement data set (Footnote 5), which has 15 features; we removed non-numerical features and all data points with missing values.

For each query \({\textbf {q}}\), we apply the Beam search strategy with four different scoring measures. Table 3 summarizes the outlying aspects found by four different scoring measures on the student data set.

In the absence of a better quality measure for outlying aspects, we visually present 3 queries in Table 4. Visually, sGrid++ and SiNNE detect better subspaces than the Z-score based scoring measures, Z(KDE) and Z(sGrid).

4.5 Mining Outlying Aspects on NBA2020 Data Set

Table 5. Comparison of sGrid++ and its three contenders on NBA 2020 technical statistics. \({\textbf {q}}\)-id represents the query index. The numbers in brackets are feature indices (i.e., subspaces).
Table 6. Visualization of discovered subspaces by sGrid++, Z(KDE), Z(sGrid) and SiNNE in the NBA2020 data set.

Assume that you are an NBA coach, commentator, or agent and want to know the strengths or weaknesses of a particular player; using OAM, one can find them easily. We mined data from Foxsports (Footnote 6) to prepare technical statistics on shooting, assists and defence for NBA 2020.

Table 5 summarizes the outlying aspects found by sGrid++ and the three contending measures on the 3 NBA datasets – assists, defence and shooting. We visually present the results of one query from each data set in Table 6. Visually, sGrid++ detects better subspaces, whereas the Z-score based measures Z(KDE) and Z(sGrid) are unable to detect the best subspaces. SiNNE detects subspaces of similar quality to sGrid++; however, SiNNE is slower than sGrid++.

4.6 Scale-up Test

We conducted a scale-up test of the four measures w.r.t. (i) increasing data size (N) and (ii) increasing dimensionality (M), using synthetic datasets. We generated three equal-sized Gaussian clusters with random means in the range [-10, 10] and variance \(\sigma \) = 1.0 in each dimension using the Python Scikit-Learn [7] library. For each data set, we randomly picked 10 data points as queries and report the average runtime. Note that, for a fair comparison, we set the maximum subspace dimensionality (\(\ell \)) to 3.
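For reference, a minimal sketch of this data generation using scikit-learn's make_blobs is given below (the sample size, dimensionality and random seed shown are illustrative; the actual values vary across the scale-up experiments as described above).

```python
from sklearn.datasets import make_blobs

# Three equal-sized Gaussian clusters with random centres in [-10, 10] and
# unit variance per dimension; n_samples and n_features are placeholders.
X, _ = make_blobs(n_samples=10_000, n_features=5, centers=3,
                  cluster_std=1.0, center_box=(-10, 10), random_state=0)
```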

Increasing Data Size. In this scale-up test, we examined the efficiency of the contending scoring measures w.r.t. the data set size (N). A wide range of N values from 100 to 5 million is used, and the dimensionality M is fixed to 5. Figure 4a shows the average runtime over 10 queries of the contending measures w.r.t. increasing data set sizes. Note that both the runtime and the data set size are plotted on a logarithmic scale. sGrid++ and SiNNE are the only measures to finish the scale-up test for every data set; however, SiNNE is an order of magnitude slower than sGrid++. Z(sGrid) is unable to finish the data set with 5 million data points within 24 h, whereas Z(KDE) is only able to finish up to the data set with 50 thousand data points. Overall, sGrid++ is an order of magnitude faster than SiNNE, two orders of magnitude faster than Z(sGrid) and four orders of magnitude faster than Z(KDE).

Fig. 4. Scale-up test.

Increasing Dimensionality. In this scale-up test, we examined the efficiency of the contending measures w.r.t. the increasing number of dimensions. A wide range of M values from 2 to 1000 and a data set size of 10,000 were used. Figure 4b shows the average runtime over 10 queries of the contending measures w.r.t. increasing data set dimensions. Except for Z(KDE), all measures are able to finish the scale-up test for every data set within 24 h. sGrid++ is the fastest measure among its contenders, followed by SiNNE and Z(sGrid); it is an order of magnitude faster than both SiNNE and Z(sGrid).

5 Conclusion

In this paper, we discussed the issues with existing scoring measures, specifically the sGrid density estimator. We proposed a simple yet effective way of making an existing dimensionally biased measure unbiased. sGrid++ creates a univariate histogram in each dimension of a subspace; the mass of the grid cell into which a given data point falls is then used to compute the outlyingness of the query in that subspace. Our extensive experiments show that the proposed scoring measure is dimensionally unbiased and the fastest measure compared to all its competitors.