1 Introduction

In many application scenarios, a user may wish to investigate a specific object, in particular, the aspects where the object is most unusual compared to the rest of the data. For example, when a commentator mentions an NBA player, the commentator may want to name the most distinguishing features of the player, even though the player may not be top ranked on those aspects, or on any others, among all players. Taking the technical statistics of the 220 guards on assist, personal foul and points/game in the NBA Season 2012–2013 as an example (Fig. 1), an answer for Joe Johnson may be “the most distinguishing feature of Joe Johnson is his scoring ability with respect to his performance on personal foul” (by comparing Fig. 1a, b and c based on the notion of density).

Fig. 1 Performance of NBA guards on assist, personal foul and points/game in the 2012–2013 Season (the solid circle \((\bullet )\) represents Joe Johnson)

As another example, when evaluating an applicant to a university program, who herself/himself may not necessarily be outstanding among all applicants, one may want to know the strength or weakness of the applicant, such as “the student’s strength is the combination of GPA and volunteer experience, ranking her/him in the top \(15\,\%\) using these combined aspects”. Moreover, in an insurance company, a fraud analyst may collect the information about various aspects of a claim, and wonder in which aspects the claim is most unusual. Furthermore, in commercial promotion, when designing an effective advertisement, it may be useful for marketers to know the most distinctive set of features characterizing the product. Similar examples can easily be found in other analytics applications.

The questions illustrated in the above examples are different from traditional outlier detection. Specifically, instead of searching for outliers in a data set, here we are given a query object and want to find the outlying aspects whereby the object is most unusual. The query object itself may or may not be an outlier in the full space or in any specific subspace. In this problem, we are not interested in other individual outliers or inliers. Such outlying aspect questions cannot be answered directly by existing outlier detection methods.

We emphasize that investigating specific objects is a common practice in anomaly and fraud detection and analysis. Specifying query objects is an effective way to explicitly express analysts’ background knowledge about data. Moreover, finding outlying aspects extends and generalizes the popular exercise of checking a suspect of anomaly or fraud. Currently, more often than not an analyst has to check the features of an outlying object one by one to find outlying features, but still cannot identify combinations of features where the object is unusual.

Motivated by these interesting applications about analyzing outlying aspects of a query object, in this paper, we model and tackle the problem of mining outlying aspects on numeric data, which is related to, but critically different from traditional outlier detection. Specifically, traditional outlier detection finds outlier objects in a set, while the problem of outlying aspect mining studied in this paper finds the subspaces best manifesting the unusualness of a specified query object, using the other objects as the background in comparing different subspaces. We address several technical challenges and make solid contributions on several fronts.

First, we identify and formulate the problem of outlying aspect mining on numeric data. Although Angiulli et al. (2009, 2013) recently studied detecting outlying properties of exceptional objects, their methods find contextual rule based explanations. We will discuss the differences between our model and theirs in detail in Sect. 3. As illustrated, outlying aspect mining has immediate applications in data analytics practice.

Second, how can we compare the outlyingness of an object in different subspaces? While comparing the outlyingness of different objects in the same subspace is well studied and straightforward, comparing outlyingness of the same object in different subspaces is subtle, since different subspaces may have different scales and distribution characteristics. We propose a simple yet principled approach. In a subspace, we rank all objects in the ascending order of probability density. A smaller probability density and thus a better rank indicates that the query object is more outlying in the subspace. Then, we compare the rank statistics of the query object in different subspaces, and return the subspaces of the best rank as the outlying aspects of the query object. To avoid redundancy, we only report minimal subspaces. That is, if a query object is ranked the same in subspaces \(S\) and \(S'\) such that \(S\) is a proper subspace of \(S'\) (i.e., \(S \subset S'\)), then \(S'\) is not reported since \(S\) is more general. Our model can be extended to many outlyingness measures other than probability density, which we leave for future work.

Third, how can we compute the outlying aspects fast, particularly on high dimensional data sets? A naïve method using the definition of outlying aspects directly has to calculate the probability densities of all objects and rank them in every subspace. This method incurs heavy cost when the dimensionality is high. On a data set of 100 dimensions, \(2^{100}-1=1.27 \times 10^{30}\) subspaces have to be examined, which is unfortunately computationally prohibitive using the state-of-the-art technology. To tackle the problem, we systematically develop a heuristic method that is capable of searching data sets with dozens of dimensions efficiently. Specifically, we develop pruning techniques that can avoid computing the probability densities of many objects in many subspaces. These effective pruning techniques enable our method to mine outlying aspects on data sets with tens of dimensions, as demonstrated later in our experiments.

Last, to evaluate outlying aspect mining, we conduct an extensive empirical study on both real and synthetic data sets. We illustrate the characteristics of discovered outlying aspects, and justify the value of outlying aspect mining. Moreover, we examine the effectiveness of our pruning techniques and the efficiency of our methods.

The rest of the paper is organized as follows. We formulate the problem of outlying aspect mining in Sect. 2, and review related work in Sect. 3. In Sect. 4, we recall the basics of kernel density estimation, which is used to estimate the probability density of objects, and present the framework of our method OAMiner (for Outlying Aspect Miner). In Sect. 5, we discuss the critical techniques in OAMiner. We report a systematic empirical study in Sect. 6, and conclude the paper in Sect. 7.

2 Problem definition

Let \(D=\{D_1, \ldots , D_d\}\) be a \(d\)-dimensional space, where the domain of \(D_i\) is \({\mathbb {R}}\), the set of real numbers. A subspace \(S \subseteq D\,(S \ne \emptyset )\) is a subset of \(D\). We also call \(D\) the full space.

Consider a set \(O\) of \(n\) objects in space \(D\). For an object \(o \in O\), denote by \(o.D_i\) the value of \(o\) in dimension \(D_{i}\,(1 \le i \le d)\). For a subspace \(S=\{D_{i_1}, \ldots , D_{i_l}\} \subseteq D\), the projection of \(o\) in \(S\) is \(o^S=(o.D_{i_1}, \ldots , o.D_{i_l})\). The dimensionality of \(S\), denoted by \(|S|\), is the number of dimensions in \(S\).

In a subspace \(S \subseteq D\), we assume that we can define a measure of outlyingness degree \(OutDeg(\cdot )\) such that for each object \(o \in O\), \(OutDeg(o)\) measures the outlyingness of \(o\). Without loss of generality, we assume that the lower the outlyingness degree \(OutDeg(o)\), the more outlying the object \(o\).

In this paper, we assume the generative model. That is, the set of objects \(O\) are generated (i.e., sampled) from an often unknown probability distribution. Thus, we can use the probability density of an object \(o\), denoted by \(f(o)\), as the outlyingness degree. The smaller the value of \(f(o)\), the more outlying the object \(o\). We discuss how to estimate the probability densities in Sect. 4.1.

How can we compare the outlyingness of an object in different subspaces? Unfortunately, we cannot compare the outlyingness degree or probability density values directly, since the outlyingness degree and the probability density values depend on the properties of specific subspaces, such as their scales. For example, it is well known that probability density tends to be low in subspaces of higher dimensionality, since such subspaces often have a larger “volume” and thus are sparser.

To tackle this issue, we propose to use rank statistics. Specifically, in a subspace \(S\), we rank all objects in \(O\) in their outlyingness degree ascending order. For an object \(o \in O\), we denote by

$$\begin{aligned} rank_S(o)=|\left\{ o'\mid o' \in O, OutDeg(o') < OutDeg(o)\right\} |+1 \end{aligned}$$
(1)

the outlyingness rank of \(o\) in subspace \(S\). The smaller the rank value, the more outlying the object is compared to the other objects in \(O\) in subspace \(S\). We can compare the outlyingness of an object \(o\) in two subspaces \(S_1\) and \(S_2\) using \(rank_{S_1}(o)\) and \(rank_{S_2}(o)\). Object \(o\) is more outlying in the subspace where it has the smaller rank. Clearly, by Eq. 1, objects with the same outlyingness degree (probability density value) receive the same outlyingness rank.
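As a minimal illustration (ours, not part of the original method description), Eq. 1 can be computed directly from the outlyingness degrees; the function and variable names below are illustrative only, and density values would play the role of \(OutDeg\).

```python
import numpy as np

def outlyingness_rank(out_deg, query_idx):
    """Eq. 1: one plus the number of objects with a strictly smaller
    outlyingness degree (smaller degree = more outlying)."""
    out_deg = np.asarray(out_deg, dtype=float)
    return int(np.sum(out_deg < out_deg[query_idx])) + 1

# toy usage: the object with the smallest degree gets rank 1; ties share a rank
print(outlyingness_rank([0.7, 0.4, 0.1, 0.4], 2))  # -> 1
print(outlyingness_rank([0.7, 0.4, 0.1, 0.4], 1))  # -> 2
print(outlyingness_rank([0.7, 0.4, 0.1, 0.4], 3))  # -> 2
```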

Suppose for object \(o\) there are two subspaces \(S\) and \(S'\) such that \(S \subset S'\) and \(rank_S(o)=rank_{S'}(o)\). Since \(S\) is more general than \(S'\), \(S\) is more significant in manifesting the outlyingness of \(o\) at the rank of \(rank_S(o)\) relative to the other objects in the data set. Therefore, \(S'\) is redundant given \(S\) in terms of outlying aspects. Note that we use rank statistics instead of the absolute outlyingness degree values to compare the outlyingness of an object in different subspaces.

Rank statistics allows us to compare outlyingness in different subspaces, which is an advantage. At the same time, in high dimensional subspaces where the probability density values of objects are very small, comparing the ranks may not be reliable, since the subtle differences in probability density values may be due to noise or sensitivity to parameter settings in the density estimation. Ranking such objects may be misleading. Moreover, more often than not, users do not want to see high dimensional subspaces as answers, since high dimensional subspaces are hard to understand. Thus, we assume a maximum dimensionality threshold \(\ell > 0\), and consider only subspaces whose dimensionality is not greater than \(\ell \). Please note that the problem cannot be solved using a minimum density threshold, since the density values are space and dimensionality sensitive, as explained before.

Based on the above discussion, we formalize the problem as follows.

Definition 1

(Problem definition) Given a set of objects \(O\) in a multidimensional space \(D\), a query object \(q \in O\) and a maximum dimensionality threshold \(0 < \ell \le |D|\), a subspace \(S \subseteq D\,(0 < |S| \le \ell )\) is called a minimal outlying subspace of \(q\) if

  1. (Rank minimality) there does not exist another subspace \(S' \subseteq D\,(S' \ne \emptyset )\), such that \(rank_{S'}(q) < rank_{S}(q)\); and

  2. (Subspace minimality) there does not exist another subspace \(S'' \subset S\) such that \(rank_{S''}(q) = rank_{S}(q)\).

The problem of outlying aspect mining is to find the minimal outlying subspaces of \(q\).

Apparently, given a query object \(q\), there exists at least one, and possibly more than one, minimal outlying subspace.

3 Related work

Outlier analysis is a well studied subject in data mining. A comprehensive review of the abundant literature on outlier analysis is clearly beyond the capacity of this paper. Several recent surveys on the topic (Aggarwal 2013; Chandola et al. 2009; Zimek et al. 2012), as well as dedicated chapters in classical data mining textbooks (Han et al. 2011) provide thorough treatments.

Given a set of objects, traditional outlier detection focuses on finding outlier objects that are significantly different from the rest of the data set. There are different ways to measure the differences between an object and the other objects, such as proximity, distance, and probability density. Many existing methods, such as (Knorr and Ng 1999; Ramaswamy et al. 2000; Bhaduri et al. 2011), only return outliers, without focusing on explaining why those objects are outlying.

Recently, some studies attempt to explain outlying properties of outliers. The explanation may be a byproduct of outlier detection. For example, Böhm et al. (2013) and Keller et al. (2012) proposed statistical approaches CMI and HiCS to select subspaces for a multidimensional database, where there may exist outliers with high deviations. Both CMI and HiCS are fundamentally different from our method. They choose highly contrasting subspaces for all possible outliers in a data set, while our method chooses subspaces based on the query object.

Kriegel et al. (2009) introduced SOD, a method to detect outliers in axis-parallel subspaces. There are two major differences between SOD and our work. First, SOD is still an outlier detection method, and the hyperplane is a byproduct of the detection process. Our method does not detect outliers at all. Second, the models to identify the outlying subspaces in the two methods are very different. When calculating the outlyingness score, SOD only considers the nearest neighbors as references in the full space. Our method considers all objects in the database and their relationship with the query object in subspaces.

Müller et al. (2012b) presented a framework, called OutRules, to find explanations for outliers in different contexts. For each outlier, OutRules finds a set of rules \(A \rightarrow B\), where \(A\) and \(B\) are subspaces, and the outlier is normal in subspace \(A\) but deviates substantially in subspace \(B\). The deviation degree can be computed using some outlier score, such as LOF (Breunig et al. 2000). Then, a ranked list of rules is output as the explanation of the outlier. Tang et al. (2013) proposed a framework to identify contextual outliers in a given multidimensional database. Only categorical data is considered. The methods in (Müller et al. 2012b; Tang et al. 2013) find outliers and their explanations at the same time, and are not designed for finding outlying aspects for an arbitrary query object. Moreover, those two methods focus on finding “conditional outliers”, while our method does not assume this constraint.

Müller et al. (2012a) computed an outlier score for each object in a database, providing a single global measure of how outlying an object is across different subspaces. The method ranks different outliers instead of the outlying behavior of one query object. In contrast, our approach investigates all possible subspaces for an object and identifies the minimal ones where the object has the lowest density rank (where it appears most unusual), and does not use the notion of subspace clusters.

Given a multidimensional categorical database and an object, which is preferably an outlier in the database, Angiulli et al. (2009) found the top-\(k\) subsets of attributes (i.e., subspaces) from which the outlier receives the highest outlyingness scores. The outlyingness score for a given object in a subspace is calculated based on the frequency of the value that the outlier takes in the subspace. It tries to find subspaces \(E\) and \(S\) such that the outlier is frequent in one and much less frequent than expected in the other. Searching all such rules is computationally costly. To reduce the cost within a manageable scope, the method takes two parameters, \(\sigma \) and \(\theta \), to constrain the frequencies of the given object in subspaces \(E\) and \(S\), respectively. Therefore, if a query object is not outlying compared to the other objects, no outlying properties may be detected.

To the best of our knowledge, (Angiulli et al. 2009, 2013) are the only studies on finding explanations of outlying aspects and thus are most relevant to our paper. There are several essential differences between (Angiulli et al. 2009, 2013) and this study. First, (Angiulli et al. 2009, 2013) find contextual rule based explanations, while our method returns individual subspaces where the query object is most outlying compared to the other subspaces. The meaning of the two types of explanation is fundamentally different. Second, (Angiulli et al. 2009) focuses on categorical data, while our method targets numeric data. Although (Angiulli et al. 2013) considers numeric data, its mining target is substantially different from this work. Specifically, given a set of objects \(O\) in a multi-dimensional space \(D\) and a query object \(q \in O\),  (Angiulli et al. 2013) finds the pairs \((E, d)\) satisfying \(E \subseteq D\) and \(d \in D \setminus E\), such that there exists a subset \(O' \subseteq O\), including \(q\), in which objects are similar on \(E\) (referred to as explanation), while \(q\) is essentially different from the other objects in \(O'\) on \(d\) (referred to as property). Besides one-dimensional attributes, our method can find outlying subspaces with arbitrary dimensionality.

To some extent, outlyingness is related to uniqueness and uniqueness mining. Paravastu et al. (2008) discovered the feature-value combinations that make a particular record unique. Their task formulation is reminiscent of infrequent itemset mining, and uses a level-wise Apriori enumeration strategy (Agrawal and Srikant 1994). It requires a discretization step, while our method works natively on continuous data.

Müller et al. (2011) proposed the OUTRES approach, which aims to assess the contribution of some selected subspaces where an object deviates from its neighborhood. OUTRES employs kernel density estimation. Different from our approach, OUTRES uses the Epanechnikov kernel rather than the Gaussian kernel. Our approach calibrates densities across subspaces using rank statistics, rather than using an adaptive neighborhood. The emphasis of OUTRES is mainly on finding outliers, rather than exploring subspaces where a query object may or may not be an outlier. Consequently, OUTRES only considers subspaces that satisfy a statistical test for non-uniformity. Moreover, for a chosen object, OUTRES computes an aggregate outlier score that incorporates only the contribution of subspaces where the object has significantly low density.

Our method uses probability density to measure outlying degree in a subspace. There are a few density-based outlier detection methods, such as (Breunig et al. 2000; Kriegel et al. 2008; He et al. 2005; Aggarwal and Yu 2001). Our method is inherently different from those, since we do not find outlier objects at all.

4 The framework

In this section, we first review the essentials of kernel density estimation techniques. Then, we present the framework of our OAMiner method.

4.1 Kernel density estimation

We use kernel density estimation (Scott 1992; Silverman 1986) to estimate the probability density given a set of objects \(O\). Given a random sample \(\{o_1, o_2, \ldots , o_n\}\) drawn from some distribution with an unknown probability density \(f\) in space \({\mathbb {R}}\), the probability density \(f\) at a point \(o \in {\mathbb {R}}\) can be estimated by

$$\begin{aligned} \hat{f}_h(o) = \frac{1}{n} \sum \limits ^n_{i=1} K_h(o-o_i) = \frac{1}{nh}\sum \limits ^n_{i=1}K\left( \frac{o-o_i}{h}\right) \end{aligned}$$

where \(K(\cdot )\) is a kernel, and \(h\) is a smoothing parameter called the bandwidth. A widely adopted approach to estimate the bandwidth is Silverman’s rule of thumb (Silverman 1986), which suggests \(h = 1.06 \sigma n^{-\frac{1}{5}}\), \(\sigma \) being the standard deviation of the sample. To further reduce the sensitivity to outliers, in this work, we use a better rule of thumb (Härdle 1990) and set

$$\begin{aligned} h=1.06\min \left\{ \sigma , \frac{R}{1.34}\right\} n^{-\frac{1}{5}} \end{aligned}$$
(2)

where \(R=X_{[0.75n]}-X_{[0.25n]}\), and \(X_{[0.25n]}\) and \(X_{[0.75n]}\), respectively, are the first and the third quartiles.
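For concreteness, here is a small sketch (ours) of the bandwidth rule in Eq. 2. The quartiles are computed with NumPy's default interpolation, which is an assumption since the paper does not specify how the quartiles are taken.

```python
import numpy as np

def bandwidth(values):
    """Rule of thumb of Eq. 2: h = 1.06 * min(sigma, R / 1.34) * n^(-1/5),
    where R is the interquartile range of the sample."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    sigma = x.std()                       # standard deviation of the sample
    q1, q3 = np.percentile(x, [25, 75])   # first and third quartiles
    return 1.06 * min(sigma, (q3 - q1) / 1.34) * n ** (-1.0 / 5.0)
```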

For the \(d\)-dimensional case \((d \ge 2)\), \(o=(o.D_1, \ldots , o.D_d)^T\), and \(o_i=(o_i.D_1, \ldots , o_i.D_d)^{T}\,(1 \le i \le n)\). Then, the probability density \(f\) at a point \(o \in {\mathbb {R}}^d\) can be estimated by

$$\begin{aligned} \hat{f}_H(o) = \frac{1}{n} \sum \limits ^n_{i=1} K_H(o-o_i) \end{aligned}$$

where \(H\) is a bandwidth matrix.

The product kernel, which consists of the product of one-dimensional kernels, is a good choice for a multivariate kernel density estimator in practice (Scott 1992; Härdle et al. 2004). We have

$$\begin{aligned} \hat{f}_H(o) = \frac{1}{n\prod \limits ^d_{j=1} h_{D_j}}\sum \limits ^n_{i=1}\left\{ \prod \limits ^d_{j=1}K\left( \frac{o.D_j - o_i.D_j}{h_{D_j}}\right) \right\} \end{aligned}$$
(3)

where \(h_{D_i}\) is the bandwidth of dimension \(D_{i}\,(1 \le i \le d)\).

Note that the product kernel does not assume that the dimensions are independent. Otherwise, the density estimation would be

$$\begin{aligned} \hat{f}_H(o) = \prod \limits ^d_{j=1}\left( \frac{1}{n\cdot h_{D_j}} \sum \limits ^n_{i=1}K\left( \frac{o.D_j - o_i.D_j}{h_{D_j}}\right) \right) \end{aligned}$$

In this paper, we adopt the Gaussian kernel, which is widely used. The distance between two objects is measured by Euclidean distance. The kernel function is

$$\begin{aligned} K\left( \frac{o-o_i}{h}\right) = \frac{1}{\sqrt{2\pi }}e^{-\frac{(o-o_i)^2}{2h^2}} \end{aligned}$$
(4)

Note that other kernel functions and distance functions may be used in our framework.

Plugging Eq. 4 into Eq. 3, the density of a query object \(q \in O\) in subspace \(S\) can be estimated as

$$\begin{aligned} \hat{f}_S(q) = \hat{f}_S\left( q^S\right) = \frac{1}{n(2\pi )^{\frac{|S|}{2}}\prod \limits _{D_i \in S} h_{D_i}}\sum \limits _{o \in O} e^{-\sum \limits _{D_i \in S} \frac{(q.D_i-o.D_i)^2}{2 h_{D_i}^2}} \end{aligned}$$
(5)

Since we are interested in only the rank of \(q\), that is, \(rank_S(q)\), and

$$\begin{aligned} c=\frac{1}{n(2\pi )^{\frac{|S|}{2}}\prod \limits _{D_i \in S} h_{D_i}} \end{aligned}$$
(6)

is a factor common to every object in subspace \(S\) and thus does not affect the ranking at all, we can rewrite Eq. 5 as

$$\begin{aligned} \hat{f}_S(q) \sim \tilde{f}_S(q) = \sum \limits _{o \in O} e^{-\sum \limits _{D_i \in S} \frac{(q.D_i-o.D_i)^2}{2h_{D_i}^2}} \end{aligned}$$
(7)

where symbol “\(\sim \)” means equivalence for ranking.

For the sake of clarity, we call \(\tilde{f}_S(q)\) the quasi-density of \(q\) in \(S\). Please note that, using \(\tilde{f}_S(q)\) instead of \(\hat{f}_S(q)\) not only simplifies the description, but also saves computational cost for calculating \(rank_S(q)\). We will illustrate the details in Sect. 5.
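A minimal sketch (ours) of the quasi-density in Eq. 7 with the Gaussian product kernel. Here `X` is an \(n \times d\) array holding the objects, `h` the per-dimension bandwidths (e.g., from the `bandwidth` sketch above), `dims` the column indices of subspace \(S\), and `q` the query point; the names are ours, not the paper's.

```python
import numpy as np

def quasi_density(X, h, dims, q):
    """Quasi-density f~_S(q) of Eq. 7:
    sum over all objects o of exp(-sum_{D_i in S} (q.D_i - o.D_i)^2 / (2 h_{D_i}^2))."""
    X, q, h = np.asarray(X, dtype=float), np.asarray(q, dtype=float), np.asarray(h, dtype=float)
    diff = (q[dims] - X[:, dims]) / h[dims]            # normalized differences, one row per object
    return float(np.exp(-0.5 * (diff ** 2).sum(axis=1)).sum())
```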

We can show an interesting property – invariance of ranking under linear transformation. The proof can be found in Appendix 1.

Proposition 1

(Invariance) Given a set of objects \(O\) in space \(S=\{D_1, \ldots , D_d\}\), define a linear transformation \(g(o)=(a_1 o.D_1+b_1, \ldots , a_d o.D_d+b_d)\) for any \(o \in O\), where \(a_1, \ldots , a_d\) and \(b_1, \ldots , b_d\) are real numbers. Let \(O'=\{g(o)|o \in O\}\) be the transformed data set. For any objects \(o_1, o_2 \in O\) such that \(\tilde{f}_S(o_1) > \tilde{f}_S(o_2)\) in \(O\), \(\tilde{f}_S(g(o_1)) > \tilde{f}_S(g(o_2))\) if the product kernel is used and the bandwidths are set using Härdle’s rule of thumb (Eq. 2).

Using quasi-density estimation (Eq. 7), we can have a baseline algorithm for computing the outlyingness rank in a subspace \(S\), as shown in Algorithm 1. The baseline method estimates the quasi-density of each object in a data set, and ranks them. Let the total number of objects be \(n\). The baseline method essentially has to compute the distance between every pair of objects in every dimension of \(S\). Therefore, the time complexity is \(O(n^2|S|)\) in each subspace \(S\).

Algorithm 1 (baseline method for computing the outlyingness rank in a subspace; pseudo-code figure not reproduced)
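A sketch of our reading of the baseline method (not the verbatim Algorithm 1): estimate the quasi-density of every object in \(S\) and apply Eq. 1. It reuses the `quasi_density` sketch above and runs in \(O(n^2|S|)\) time.

```python
def baseline_rank(X, h, dims, query_idx):
    """Outlyingness rank of X[query_idx] in subspace `dims` (baseline, O(n^2 |S|))."""
    densities = [quasi_density(X, h, dims, X[i]) for i in range(len(X))]
    f_q = densities[query_idx]
    # Eq. 1: objects with a strictly smaller quasi-density rank ahead of the query
    return sum(1 for f in densities if f < f_q) + 1
```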

4.2 The framework of OAMiner

Algorithm 2 (the framework of OAMiner; pseudo-code figure not reproduced)

To reduce the computational cost, we present Algorithm 2, the framework of our method OAMiner (for Outlying Aspect Miner).

First of all, OAMiner removes the dimensions where all values of objects are identical, since no object is outlying in such dimensions. As a result, the standard deviations of all dimensions involved in outlying aspect mining are greater than 0.

In order to ensure that OAMiner can find the most outlying subspaces, we have to enumerate all possible subspaces in a systematic way. Here, we adopt the set enumeration tree approach (Rymon 1992), which has been popularly used in many data mining methods. Conceptually, a set enumeration tree takes a total order on the set, the dimensions in the context of our problem, and then enumerates all possible combinations systematically. For example, Fig. 2 shows a set enumeration tree that enumerates all subspaces of space \(D=\{D_1, D_2, D_3, D_4\}\).

Fig. 2 A set enumeration tree
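A minimal sketch (ours) of a depth-first traversal of such a set enumeration tree: each node extends its subspace only with dimensions that come after its last dimension in the chosen total order, so every non-empty subspace is visited exactly once, and the maximum dimensionality threshold \(\ell \) bounds the depth.

```python
def enumerate_subspaces(ordered_dims, max_dim):
    """Depth-first enumeration of all non-empty subspaces with at most max_dim dimensions."""
    def dfs(subspace, start):
        for i in range(start, len(ordered_dims)):
            child = subspace + [ordered_dims[i]]
            yield child
            if len(child) < max_dim:
                yield from dfs(child, i + 1)
    yield from dfs([], 0)

# e.g. for D = {D1, D2, D3, D4}: [D1], [D1, D2], [D1, D2, D3], [D1, D2, D4], [D1, D3], ...
for s in enumerate_subspaces(["D1", "D2", "D3", "D4"], 3):
    print(s)
```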

OAMiner searches subspaces by traversing the subspace enumeration tree in a depth-first manner. Given a set of objects \(O\), a query object \(q \in O\), and a subspace \(S\), if \(rank_S(q)=1\), then every super-space of \(S\) cannot be a minimal outlying subspace and thus can be pruned.

Pruning Rule 1

If \(rank_S(q) = 1\), according to the subspace minimality condition in the problem definition (Definition 1), all super-spaces of \(S\) can be pruned.

In the case of \(rank_S(q) > 1\), OAMiner prunes subspaces according to the current best rank of \(q\) in the search process. More details will be discussed in Sect. 5.3.

Heuristically, we want to find subspaces early where the query object \(q\) has a low rank, so that the pruning techniques take better effect. Motivated by this observation, we compute the outlyingness rank of \(q\) in each dimension \(D_i\), and order all dimensions in the ascending order of \(rank_{D_i}(q)\).

In general, the outlyingness rank does not have any monotonicity with respect to subspaces. That is, for subspaces \(S_1 \subset S_2\), neither \(rank_{S_1}(q) \le rank_{S_2}(q)\) nor \(rank_{S_1}(q) \ge rank_{S_2}(q)\) holds in general. Example 1 illustrates this situation with a toy data set.

Example 1

Consider a set of objects \(O = \{o_1, o_2, o_3, o_4\}\) with two numeric attributes \(D_1\) and \(D_2\). The values of each object in \(O\) are listed in Table 1. Using Eq. 7, we estimate the quasi-density values of each object in different subspaces (Table 2). We can see that \(\tilde{f}_{\{D_1\}}(o_2) > \tilde{f}_{\{D_1\}}(o_4)\) and \(\tilde{f}_{\{D_2\}}(o_2) >\tilde{f}_{\{D_2\}}(o_4)\), which indicate \(rank_{\{D_1\}}(o_2) > rank_{\{D_1\}}(o_4)\) and \(rank_{\{D_2\}}(o_2) > rank_{\{D_2\}}(o_4)\). However, for subspace \(\{D_1, D_2\}\), since \(\tilde{f}_{\{D_1, D_2\}}(o_2) < \tilde{f}_{\{D_1, D_2\}}(o_4)\), \(rank_{\{D_1, D_2\}}(o_2) < rank_{\{D_1, D_2\}}(o_4)\).

Table 1 A numeric data set example
Table 2 Quasi-density values of objects in Table 1

To make the situation even more challenging, probability density itself does not have any monotonicity with respect to subspaces. Consider a query object \(q\) and subspaces \(S_1 \subset S_2\). According to Eq. 5, we have

$$\begin{aligned} \frac{\hat{f}_{S_1}(q)}{\hat{f}_{S_2}(q)}&= \frac{\sum \limits _{o \in O} e^{-\sum \limits _{D_i \in S_1} \frac{(q.D_i-o.D_i)^2}{2 h_{D_i}^2}}}{n(2\pi )^{\frac{|S_1|}{2}}\prod \limits _{D_i \in S_1} h_{D_i}} \Bigg / \ \frac{\sum \limits _{o \in O} e^{-\sum \limits _{D_i \in S_2} \frac{(q.D_i-o.D_i)^2}{2 h_{D_i}^2}}}{n(2\pi )^{\frac{|S_2|}{2}}\prod \limits _{D_i \in S_2} h_{D_i}} \\&= (2\pi )^{\frac{|S_2|-|S_1|}{2}} \ \prod \limits _{D_i \in S_2 \setminus S_1}h_{D_i} \ \frac{\sum \limits _{o \in O} e^{-\sum \limits _{D_i \in S_1} \frac{(q.D_i-o.D_i)^2}{2 h_{D_i}^2}}}{\sum \limits _{o \in O} e^{-\sum \limits _{D_i \in S_2} \frac{(q.D_i-o.D_i)^2}{2 h_{D_i}^2}}} \end{aligned}$$

Since \(S_1 \!\subset \! S_2\), \(\sum \limits _{o \in O} e^{-\sum \limits _{D_i \in S_1} \frac{(q.D_i-o.D_i)^2}{2 h_{D_i}^2}}{/} \sum \limits _{o \in O} e^{-\sum \limits _{D_i \in S_2} \frac{(q.D_i-o.D_i)^2}{2 h_{D_i}^2}} \!\ge \! 1\) and \((2\pi )^{\frac{|S_2|\!-\!|S_1|}{2}} \!>\! 1\). However, in the case \(\prod \limits _{D_i \in S_2 \setminus S_1}h_{D_i} < 1\), there is no guarantee that \(\frac{\hat{f}_{S_1}(q)}{\hat{f}_{S_2}(q)} > 1\) always holds. Thus, neither \(\hat{f}_{S_1}(q) \le \hat{f}_{S_2}(q)\) nor \(\hat{f}_{S_1}(q) \ge \hat{f}_{S_2}(q)\) holds in general.

5 Critical techniques in OAMiner

In this section, we present a bounding-pruning-refining algorithm to efficiently compute the outlyingness rank of an object in a subspace, and discuss the critical techniques to prune subspaces.

Table 3 lists the frequently used notations in this section.

Table 3 Summary of notations

5.1 Bounding probability density

In order to obtain the rank statistics about outlyingness, OAMiner has to compare the density of the query object with the densities of other objects. To speed up density estimation of objects, we observe that the contributions from remote objects to the density of an object are very small, and the density of an object can be bounded. Technically, we can derive upper and lower bounds of the probability density of an object using a neighborhood. Again, we denote by \(\tilde{f}_S(o)\) the quasi-density of object \(o\) in subspace \(S\).

For the sake of clarity, we introduce two notations at first. Given objects \(o, o' \in O\), a subspace \(S\), and a subset \(O' \subseteq O\), we denote by \(dc_S(o, o')\) the quasi-density contribution of \(o'\) to \(o\) in \(S\), and \(\tilde{f}^{O'}_S(o)\) the sum of quasi-density contributions of objects in \(O'\) to \(o\). That is,

$$\begin{aligned} dc_S\left( o, o'\right)&= e^{-\sum \limits _{D_i \in S} \frac{\left( o.D_i-o'.D_i\right) ^2}{2h_{D_i}^2}}\\ \tilde{f}_{S}^{O'}(o)&= \sum \limits _{o' \in O'} e^{-\sum \limits _{D_i \in S} \frac{\left( o.D_i-o'.D_i\right) ^2}{2h_{D_i}^2}} \end{aligned}$$

To efficiently estimate the bounds of \(\tilde{f}_S(o)\), we define two kinds of neighborhoods. For an object \(o \in O\), a subspace \(S\), and \(\{\epsilon _{D_i}\mid \epsilon _{D_i}>0, D_i \in S\}\), the \(\epsilon \) -tight neighborhood of \(o\) in \(S\), denoted by \(TN_S^{\epsilon ,o}\), is \(\{o' \in O \mid \forall D_i \in S, |o.D_i - o'.D_i| \le \epsilon _{D_i}\}\); the \(\epsilon \) -loose neighborhood of \(o\) in \(S\), denoted by \(LN_S^{\epsilon ,o}\), is \(\{o' \in O \mid \exists D_i \in S, |o.D_i - o'.D_i| \le \epsilon _{D_i}\}\). An object is called an \(\epsilon \) -tight (loose) neighbor if it is in the \(\epsilon \)-tight (loose) neighborhood. We will illustrate how to efficiently compute \(TN_S^{\epsilon ,o}\) and \(LN_S^{\epsilon ,o}\) in Sect. 5.2.
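As a sketch under our own naming (`X` an \(n \times d\) array of objects, `eps` a vector of per-dimension neighborhood distances), the two neighborhoods reduce to simple per-dimension comparisons.

```python
import numpy as np

def tight_and_loose_neighbors(X, eps, dims, o):
    """Indices of the epsilon-tight and epsilon-loose neighborhoods of object o in subspace dims."""
    X = np.asarray(X, dtype=float)
    close = np.abs(np.asarray(o, dtype=float)[dims] - X[:, dims]) <= np.asarray(eps, dtype=float)[dims]
    tight = np.where(close.all(axis=1))[0]   # within eps in every dimension of S
    loose = np.where(close.any(axis=1))[0]   # within eps in at least one dimension of S
    return tight, loose
```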

According to the definitions of \(TN_S^{\epsilon ,o}\) and \(LN_S^{\epsilon ,o}\), we obtain the following properties.

Property 1

\(TN_S^{\epsilon ,o} \subseteq LN_S^{\epsilon ,o}\).

Property 2

\(TN_S^{\epsilon ,o} = LN_S^{\epsilon ,o}\) if \(|S| = 1\).

Based on \(TN_S^{\epsilon ,o}\) and \(LN_S^{\epsilon ,o}\), \(O\) can be divided into three disjoint subsets: \(TN_S^{\epsilon ,o}\), \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\) and \(O \setminus LN_S^{\epsilon ,o}\). For any object \(o' \in O\), we obtain a lower bound and an upper bound of \(dc_S(o,o')\) as follows.

Theorem 1

(Single quasi-density contribution bounds) Given an object \(o \in O\), a subspace \(S\), and a set \(\{\epsilon _{D_i}\mid \epsilon _{D_i}>0, D_i \in S\}\). Then, for any object \(o' \in TN_S^{\epsilon ,o}\),

$$\begin{aligned} dc_S^\epsilon \le dc_S\left( o, o'\right) \le dc^{max}_S(o) \end{aligned}$$

for any object \(o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\),

$$\begin{aligned} dc^{min}_S(o) \le dc_S\left( o, o'\right) \le dc^{max}_S(o) \end{aligned}$$

for any object \(o' \in O \setminus LN_S^{\epsilon ,o}\),

$$\begin{aligned} dc^{min}_S(o) \le dc_S\left( o, o'\right) < dc_S^\epsilon \end{aligned}$$

where

$$\begin{aligned} dc_S^\epsilon&= e^{- \sum \limits _{D_i \in S} \frac{\epsilon _{D_i}^2}{2h_{D_i}^2}} \\ dc^{max}_S(o)&= e^{- \sum \limits _{D_i \in S} \frac{\min \limits _{o' \in O}\left\{ |o.D_i - o'.D_i|\right\} ^2}{2h_{D_i}^2}}\\ dc^{min}_S(o)&= e^{- \sum \limits _{D_i \in S} \frac{\max \limits _{o' \in O}\left\{ |o.D_i - o'.D_i|\right\} ^2}{2h_{D_i}^2}} \end{aligned}$$

The proof of Theorem 1 is given in Appendix 2.

Using the size of \(TN_S^{\epsilon ,o}\) and \(LN_S^{\epsilon ,o}\), we obtain a lower bound and an upper bound of \(\tilde{f}_S(o)\) as follows. The proof can be found in Appendix 3.

Corollary 1

(Bounds by neighborhood size) For any object \(o \in O\),

$$\begin{aligned}&|TN_S^{\epsilon ,o}| \ dc_S^\epsilon + \left( |O| - |TN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o) \le \tilde{f}_{S}(o)\\&\tilde{f}_{S}(o) \le |LN_S^{\epsilon ,o}| \ dc_S^{max}(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) \ dc_S^\epsilon \end{aligned}$$

Corollary 1 allows us to compute the quasi-density bounds of an object without computing the quasi-density contributions of other objects to it.
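A sketch (ours) of the Corollary 1 bounds: \(dc_S^\epsilon \), \(dc_S^{min}(o)\) and \(dc_S^{max}(o)\) follow Theorem 1 and only need the per-dimension extreme distances from \(o\), so no pairwise quasi-density contribution is evaluated.

```python
import numpy as np

def density_bounds_by_size(X, h, eps, dims, o):
    """Lower and upper bounds of f~_S(o) by Corollary 1, using only |TN|, |LN| and
    the quantities dc_S^eps, dc_S^min(o), dc_S^max(o) of Theorem 1."""
    X, o = np.asarray(X, dtype=float), np.asarray(o, dtype=float)
    h_s, eps_s = np.asarray(h, dtype=float)[dims], np.asarray(eps, dtype=float)[dims]
    absdiff = np.abs(o[dims] - X[:, dims])                        # per-dimension distances to o
    dc_eps = np.exp(-0.5 * np.sum((eps_s / h_s) ** 2))
    dc_max = np.exp(-0.5 * np.sum((absdiff.min(axis=0) / h_s) ** 2))
    dc_min = np.exp(-0.5 * np.sum((absdiff.max(axis=0) / h_s) ** 2))
    n_tight = int((absdiff <= eps_s).all(axis=1).sum())           # |TN|
    n_loose = int((absdiff <= eps_s).any(axis=1).sum())           # |LN|
    lower = n_tight * dc_eps + (len(X) - n_tight) * dc_min
    upper = n_loose * dc_max + (len(X) - n_loose) * dc_eps
    return lower, upper
```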

Moreover, by Theorem 1, we can obtain the following corollaries.

Corollary 2

(Bounds by \(\epsilon \)-tight neighbors) For any object \(o \in O\) and \(O' \subseteq TN_S^{\epsilon ,o}\),

$$\begin{aligned}&\tilde{f}_{S}^{O'}(o)+ \left( |TN_S^{\epsilon ,o}| - |O'|\right) \ dc_S^\epsilon + \left( |O| - |TN_S^{\epsilon ,o}|\right) \ dc_S^{min}(o) \le \tilde{f}_{S}(o)\\&\tilde{f}_{S}(o) \le \tilde{f}_{S}^{O'}(o)+ \left( |LN_S^{\epsilon ,o}| - |O'|\right) \ dc_S^{max}(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) \ dc_S^\epsilon \end{aligned}$$

Corollary 3

(Bounds by \(\epsilon \)-loose neighbors) For any object \(o \in O\) and \(TN_S^{\epsilon ,o} \subset O' \subseteq LN_S^{\epsilon ,o}\),

$$\begin{aligned}&\tilde{f}_{S}^{O'}(o)+ (|O|-|O'|) \ dc_S^{min}(o) \le \tilde{f}_{S}(o)\\&\tilde{f}_{S}(o) < \tilde{f}_{S}^{O'}(o)+ \left( |LN_S^{\epsilon ,o}| - |O'|\right) \ dc_S^{max}(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) \ dc_S^\epsilon \end{aligned}$$

Corollary 4

(Bounds by a super-set of \(\epsilon \)-loose neighbors) For any object \(o \in O\) and \(LN_S^{\epsilon ,o} \subset O' \subseteq O\),

$$\begin{aligned}&\tilde{f}_{S}^{O'}(o)+ (|O|-|O'|) \ dc_S^{min}(o) \le \tilde{f}_{S}(o)\\&\tilde{f}_{S}(o) \le \tilde{f}_{S}^{O'}(o)+ (|O|-|O'|) \ dc_S^\epsilon \end{aligned}$$

The proofs of Corollary 2, Corollary 3 and Corollary 4 can be found in Appendices 4, 5 and 6, respectively.

Since the density of \(o\) is the sum of the density contributions of all objects in \(O\), and the density contribution decreases with the distance, OAMiner first computes the quasi-density contributions from the objects in \(TN_S^{\epsilon ,o}\), then from the objects in \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\), and last from the objects in \(O \setminus LN_S^{\epsilon ,o}\).

By computing the bounds of \(\tilde{f}_S(o)\), OAMiner takes a bounding-pruning-refining method, shown in Algorithm 3, to efficiently perform density comparison in subspace \(S\). Initially, OAMiner estimates the quasi-density of query object \(q\), which is denoted by \(\tilde{f}_S(q)\). Then, for an object \(o\), OAMiner first computes the bounds of \(\tilde{f}_S(o)\) by the sizes of \(TN_S^{\epsilon ,o}\) and \(LN_S^{\epsilon ,o}\) (Corollary 1), and compares the bounds with \(\tilde{f}_S(q)\) (Steps 1–8). If the relation between \(\tilde{f}_S(q)\) and the bounds can be determined, that is, either \(\tilde{f}_S(q) < \tilde{f}_S(o)\) or \(\tilde{f}_S(q) > \tilde{f}_S(o)\), then Algorithm 3 ends. Otherwise, OAMiner updates the lower and upper bounds of \(\tilde{f}_S(o)\) by involving the quasi-density contributions of objects in \(TN_S^{\epsilon ,o}\) (Steps 10–20), in \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\) (Steps 21–31), and in \(O \setminus LN_S^{\epsilon ,o}\) (Steps 32–42) one by one, and repeatedly compares the updated bounds with \(\tilde{f}_S(q)\), until the relationship between \(\tilde{f}_S(q)\) and \(\tilde{f}_S(o)\) is fully determined.

Algorithm 3 (bounding-pruning-refining density comparison; pseudo-code figure not reproduced)
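A much simplified sketch (ours) of the bounding-pruning-refining idea: it reuses the earlier `density_bounds_by_size` and `quasi_density` sketches, and collapses the staged refinement over \(TN_S^{\epsilon ,o}\), \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\) and \(O \setminus LN_S^{\epsilon ,o}\) into a single exact evaluation when the Corollary 1 bounds are inconclusive; it is not the exact Algorithm 3.

```python
import numpy as np

def ranks_ahead_of_query(X, h, eps, dims, o_idx, f_q):
    """Return True if object X[o_idx] certainly ranks ahead of the query in subspace dims,
    i.e., f~_S(o) < f~_S(q) = f_q, and False otherwise."""
    o = np.asarray(X, dtype=float)[o_idx]
    lower, upper = density_bounds_by_size(X, h, eps, dims, o)   # Corollary 1 bounds
    if upper < f_q:
        return True            # o is certainly more outlying than the query
    if lower > f_q:
        return False           # o is certainly less outlying than the query
    return quasi_density(X, h, dims, o) < f_q                   # bounds inconclusive: refine exactly
```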

In OAMiner, the neighborhood distance in dimension \(D_i\), denoted by \(\epsilon _{D_i}\), is defined as \(\alpha \sigma _{D_i}\), where \(\sigma _{D_i}\) is the standard deviation in dimension \(D_i\) and \(\alpha \) is a parameter. Our experiments show that OAMiner is not sensitive to \(\alpha \), which can be set in the range of \(0.8 \sim 1.2\) for OAMiner to run efficiently. How to set the best neighborhood distance for bounding remains an open question and a direction for future research. Theorem 2 guarantees that no matter how the neighborhood distance is set, the ranking results remain unchanged.

Theorem 2

Given an object \(o \in O\), and a subspace \(S\), for any neighborhood distances \(\epsilon _1\) and \(\epsilon _2\), \(rank^{\epsilon _1}_S(o) = rank^{\epsilon _2}_S(o)\), where \(rank^{\epsilon _1}_S(o)\) \((rank^{\epsilon _2}_S(o))\) is the outlyingness rank of \(o\) in \(S\) computed using \(\epsilon _{1}\,(\epsilon _2)\).

The proof of Theorem 2 can be found in Appendix 7.

5.2 Efficiently estimating density bounds

In this subsection, we present the strategies used in Algorithm 3 to efficiently estimate the lower and upper bounds of the quasi-density.

Consider a candidate subspace \(S \subseteq D\), and an object \(o \in O\). To estimate lower and upper bounds of \(\tilde{f}_S(o)\), OAMiner has to compute \(TN_S^{\epsilon ,o}\), \(LN_S^{\epsilon ,o}\), \(dc_S^\epsilon \), \(dc_S^{min}(o)\), \(dc_S^{max}(o)\) and \(dc_S(o,o')\), where \(o' \in O\).

In the case \(|S| = 1\), we compute \(TN_S^{\epsilon ,o}\), \(dc_S^\epsilon \), \(dc_S^{min}(o)\), \(dc_S^{max}(o)\) and \(dc_S(o,o')\) based on their definitions directly. As pointed out in Sect. 5.1, \(TN_S^{\epsilon ,o} = LN_S^{\epsilon ,o}\) in this case. Moreover, the density contribution is symmetrical, so that the computational cost for \(dc_S(o',o)\) can be saved if \(dc_S(o,o')\) is available.

Please recall that OAMiner searches subspaces by traversing the subspace enumeration tree in a depth-first manner. For a subspace \(S\) satisfying \(|S| \ge 2\), denote by \(par(S)\) the parent subspace of \(S\). Suppose \(S \setminus par(S) = D'\,(|D'| = 1)\). Then, we have

$$\begin{aligned} TN_S^{\epsilon ,o}&= TN_{par(S)}^{\epsilon ,o} \cap TN_{D'}^{\epsilon ,o} \end{aligned}$$
(8)
$$\begin{aligned} LN_S^{\epsilon ,o}&= LN_{par(S)}^{\epsilon ,o} \cup LN_{D'}^{\epsilon ,o} \end{aligned}$$
(9)
$$\begin{aligned} dc_S^\epsilon&= e^{- \sum \limits _{D_i \in S} \frac{\epsilon _{D_i}^2}{2h_{D_i}^2}} = e^{- \left( \sum \limits _{D_i \in par(S)} \frac{\epsilon _{D_i}^2}{2h_{D_i}^2} + \frac{\epsilon _{D'}^2}{2h_{D'}^2}\right) } = dc_{par(S)}^\epsilon \cdot dc_{D'}^\epsilon \end{aligned}$$
(10)
$$\begin{aligned} dc^{min}_S(o)&= e^{- \left( \sum \limits _{D_i \in par(S)} \frac{\max \limits _{o' \in O}\left\{ |o.D_i - o'.D_i|\right\} ^2}{2h_{D_i}^2} + \frac{\max \limits _{o' \in O}\left\{ |o.D' - o'.D'|\right\} ^2}{2h_{D'}^2}\right) } \nonumber \\&= dc^{min}_{par(S)}(o) \cdot dc^{min}_{D'}(o) \end{aligned}$$
(11)
$$\begin{aligned} dc^{max}_S(o)&= e^{- \left( \sum \limits _{D_i \in par(S)} \frac{\min \limits _{o' \in O}\left\{ |o.D_i - o'.D_i|\right\} ^2}{2h_{D_i}^2} + \frac{\min \limits _{o' \in O}\left\{ |o.D' - o'.D'|\right\} ^2}{2h_{D'}^2}\right) } \nonumber \\&= dc^{max}_{par(S)}(o) \cdot dc^{max}_{D'}(o) \end{aligned}$$
(12)
$$\begin{aligned} dc_S\left( o,o'\right)&= e^{-\sum \limits _{D_i \in S} \frac{\left( o.D_i-o'.D_i\right) ^2}{2h_{D_i}^2}} = e^{-\left( \sum \limits _{D_i \in par(S)} \frac{\left( o.D_i-o'.D_i\right) ^2}{2h_{D_i}^2} + \frac{(o.D'-o'.D')^2}{2h_{D'}^2}\right) } \nonumber \\&= dc_{par(S)}\left( o,o'\right) \cdot dc_{D'}\left( o,o'\right) \end{aligned}$$
(13)

Thus, it is efficient for OAMiner to estimate the bounds of \(\tilde{f}_S(o)\) using \(par(S)\) and \(S \setminus par(S)\).
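Since every quantity in Eqs. 8–13 factorizes over dimensions, the statistics of \(S\) can be assembled from cached statistics of \(par(S)\) and of the single added dimension \(D'\). A sketch (ours; the caching layout is an assumption):

```python
def extend_subspace_stats(parent, added_dim):
    """Combine cached statistics of par(S) with those of the added dimension D' (Eqs. 8-13).
    Each argument is a dict with keys:
      'tight', 'loose'             : sets of object indices (TN, LN)
      'dc_eps', 'dc_min', 'dc_max' : scalars of Theorem 1
      'dc'                         : numpy array of per-object contributions dc(o, o')."""
    return {
        'tight':  parent['tight'] & added_dim['tight'],    # Eq. 8
        'loose':  parent['loose'] | added_dim['loose'],    # Eq. 9
        'dc_eps': parent['dc_eps'] * added_dim['dc_eps'],  # Eq. 10
        'dc_min': parent['dc_min'] * added_dim['dc_min'],  # Eq. 11
        'dc_max': parent['dc_max'] * added_dim['dc_max'],  # Eq. 12
        'dc':     parent['dc']     * added_dim['dc'],      # Eq. 13
    }
```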

5.3 Subspace pruning

Recall that OAMiner searches subspaces by traversing the subspace enumeration tree in a depth-first manner. During the search process, let \({\mathcal {S}}_1\) be the set of subspaces that OAMiner has searched, and \({\mathcal {S}}_2\) the set of subspaces that OAMiner has not searched yet. Clearly, \(|{\mathcal {S}}_1 \cup {\mathcal {S}}_2| = 2^{|D|} - 1\). Given a query object \(q\), let \(r_{best} = \min \limits _{S \in {\mathcal {S}}_1}\{rank_S(q)\}\) be the best rank that \(q\) has achieved so far. We can use \(r_{best}\) to prune some subspaces not searched yet. Specifically, for a subspace \(S \in {\mathcal {S}}_2\), once we can determine that \(rank_S(q) > r_{best}\), then \(S\) cannot be an outlying aspect, and thus can be pruned.

Observation 1

When subspace \(S\) is met in a depth-first search of the subspace set enumeration tree, let \(r_{best}\) be the best rank of \(q\) in all the subspaces searched so far. Given object \(q\) with \(rank_S(q) \ge 1\), if for every proper super-space \(S' \supset S\), \(rank_{S'}(q) > r_{best}\), then all proper super-spaces of \(S\) can be pruned.

For the case that \(rank_S(q) = 1\), all super-spaces of \(S\) can be pruned directly due to the subspace minimality condition in the problem definition (Pruning Rule 1). Thus, we only consider the case \(rank_S(q) > 1\) here.

To implement Observation 1, in a subspace \(S\) where \(rank_S(q) > 1\), we check whether there are at least \(r_{best}\) objects that are ranked better than \(q\) in every super-space of \(S\). If so, all the super-spaces of \(S\) can be pruned. Please note that the condition OAMiner checks is sufficient, but not necessary.

Recall that the common factor \(c\) (Eq. 6) does not affect the outlyingness rank. For simplicity, OAMiner computes the quasi-density \(\tilde{f}_S(o)\) (Eq. 7) instead of probability density \(\hat{f}_S(o)\) (Eq. 5) for ranking. Then, we have the following monotonicity of \(\tilde{f}_S(o)\) with respect to subspaces.

Lemma 1

Consider a set of objects \(O\), and two subspaces \(S\) and \(S'\) satisfying \(S' \supset S\). Let \(D_i \in S'\!\setminus \! S\). If the standard deviation of \(O\) in \(D_i\) is greater than 0, then for any object \(o \in O\), \(\tilde{f}_S(o) > \tilde{f}_{S'}(o)\).

Proof

Consider \(D_i \in S'\!\setminus \! S\). For any object \(o' \in O\), we have \(\frac{(o.D_i-o'.D_i)^2}{2h_{D_i}^2} \ge 0\). Since the standard deviation of \(O\) in \(D_i\) is greater than 0, there exists at least one object \(o'' \in O\) such that \(\frac{(o.D_i-o''.D_i)^2}{2h_{D_i}^2} > 0\), that is, \(e^{-\frac{(o.D_i-o''.D_i)^2}{2h_{D_i}^2}} < 1\). Thus,

$$\begin{aligned} \tilde{f}_S(o)&= \sum \limits _{o' \in O} e^{-\sum \limits _{D_i \in S} \frac{\left( o.D_i-o'.D_i\right) ^2}{2h_{D_i}^2}} \\&> \sum \limits _{o' \in O} e^{- \left( \sum \limits _{D_i \in S} \frac{\left( o.D_i-o'.D_i\right) ^2}{2h_{D_i}^2} + \sum \limits _{D_i \in S'\setminus S} \frac{\left( o.D_i-o'.D_i\right) ^2}{2h_{D_i}^2} \right) } = \tilde{f}_{S'}(o) \end{aligned}$$

\(\square \)

Recall that OAMiner removes the dimensions with standard deviation 0 in the preprocessing step (Step 2 in Algorithm 2). Thus, the standard deviation of any dimension \(D_i \in S' \setminus S\) is greater than 0.

OAMiner sorts all dimensions in \(D\) in the ascending order of \(rank_{D_i}(q)\,(D_i \in D)\), and traverses the subspace set enumeration tree in the depth-first manner. Denote by \(R\) this ordered list of dimensions. For a subspace \(S = \{D_{i_1}, \dots , D_{i_m}\}\) whose dimensions are listed in the order of \(R\), let \(R(S) = \{D_j \mid D_j\) comes after \(D_{i_m}\) in \(R\}\). By Lemma 1, for any subspace \(S'\) such that \(S \subset S' \subseteq S \cup R(S)\), the minimum quasi-density of \(q\), denoted by \(\tilde{f}^{min}_{sup(S)}(q)\), is \(\tilde{f}_{S \cup R(S)}(q)\). An object \(o \in O\) is called a competitor of \(q\) in \(S\) if \(\tilde{f}_S(o) < \tilde{f}^{min}_{sup(S)}(q)\). The set of competitors of \(q\) in \(S\) is denoted by \(Comp_S(q)\). Clearly, for any \(o \in Comp_S(q)\), by Lemma 1 we have \(\tilde{f}_{S'}(o) < \tilde{f}_S(o) < \tilde{f}^{min}_{sup(S)}(q) \le \tilde{f}_{S'}(q)\). Thus, \(rank_{S'}(o) < rank_{S'}(q)\). Moreover, we have the following property of \(Comp_S(q)\).

Property 3

Given a query object \(q\) and a subspace \(S\), for any subspace \(S'\) such that \(S \subset S'\), \(Comp_S(q) \subseteq Comp_{S'}(q)\).

Proof

Since \(S \subset S'\), by Lemma 1, for any \(o \in Comp_S(q)\),

$$\begin{aligned} \tilde{f}_{S'}(o) < \tilde{f}_S(o) < \tilde{f}^{min}_{sup(S)}(q). \end{aligned}$$

Since \(\tilde{f}^{min}_{sup(S)}(q) \le \tilde{f}^{min}_{sup(S')}(q)\), we have

$$\begin{aligned} \tilde{f}_{S'}(o) < \tilde{f}_S(o) < \tilde{f}^{min}_{sup(S)}(q) \le \tilde{f}^{min}_{sup(S')}(q). \end{aligned}$$

Thus, \(o \in Comp_{S'}(q)\). That is, \(Comp_S(q) \subseteq Comp_{S'}(q)\). \(\square \)

Correspondingly, OAMiner performs subspace pruning based on the number of competitors.

Pruning Rule 2

When \(S\) is met in a depth-first search of the subspace set enumeration tree, let \(r_{best}\) be the best rank of \(q\) in all the subspaces searched so far. If there are at least \(r_{best}\) competitors of \(q\) in \(S\), i.e., \(|Comp_S(q)| \ge r_{best}\), then all proper super-spaces of \(S\) can be pruned.

Next, we discuss how to compute \(\tilde{f}^{min}_{sup(S)}(q)\) when the maximum dimensionality threshold of an outlying aspect, \(\ell \), is less than \(|S| + |R(S)|\). In this situation, \(|S| < |S'| \le \ell < |S| + |R(S)|\). Clearly, it is unsuitable to use \(\tilde{f}_{S \cup R(S)}(q)\) as \(\tilde{f}^{min}_{sup(S)}(q)\). Intuitively, we can set \(\tilde{f}^{min}_{sup(S)}(q)\) to \(\min \{\tilde{f}_{S'}(q) \mid |S'| = \ell , S \subset S' \subset S \cup R(S)\}\). However, the computational cost may be high, since the number of candidates is \(\left( {\begin{array}{c}|R(S)|\\ \ell -|S|\end{array}}\right) \). Alternatively, we suggest a method to efficiently compute \(\tilde{f}^{min}_{sup(S)}(q)\), which uses a lower bound of \(\tilde{f}_{S'}(q)\).

For object \(o'\), the quasi-density contribution of \(o'\) to \(q\) in \(S\), denoted by \(\tilde{f}_S(q, o')\), is \(e^{-\sum \limits _{D_i \in S}\frac{(q.D_i - o'.D_i)^2}{2h_{D_i}^2}}\). Let \(R(S, o')\) be the set of \((\ell -|S|)\) dimensions in \(R(S)\) with the largest values of \(\frac{|q.D_j - o'.D_j|}{h_{D_j}}\,(D_j \in R(S))\). Then, the minimum quasi-density contribution of \(o'\) to \(q\) in \(S'\) (\(S \subset S'\)) is \(\tilde{f}_{S \cup R(S, o')}(q, o')\). Since \(\tilde{f}_{S'}(q) = \sum \limits _{o' \in O} \tilde{f}_{S'}(q, o')\), we have \(\tilde{f}^{min}_{sup(S)}(q) = \sum \limits _{o' \in O}\tilde{f}_{S \cup R(S, o')}(q, o') \le \tilde{f}_{S'}(q)\).
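A sketch (ours) of the per-object lower bound just described: for each \(o'\), the \(\ell - |S|\) dimensions of \(R(S)\) with the largest normalized distances from \(q\) play the role of \(R(S, o')\), and the resulting minimum contributions are summed over all objects.

```python
import numpy as np

def min_super_quasi_density(X, h, dims, remaining_dims, q, max_dim):
    """Lower bound f~^min_sup(S)(q) of f~_S'(q) over all S' with S strictly between S and
    S union R(S) and |S'| <= max_dim.  `dims` is S and `remaining_dims` is R(S),
    both given as lists of column indices."""
    X, q, h = np.asarray(X, dtype=float), np.asarray(q, dtype=float), np.asarray(h, dtype=float)
    k = max_dim - len(dims)                                      # dimensions that may still be added
    base = ((q[dims] - X[:, dims]) / h[dims]) ** 2               # squared normalized distances in S
    extra = ((q[remaining_dims] - X[:, remaining_dims]) / h[remaining_dims]) ** 2
    # per object o', keep the k largest terms of R(S), i.e., the dimensions of R(S, o')
    worst = -np.sort(-extra, axis=1)[:, :k].sum(axis=1)
    return float(np.exp(-0.5 * (base.sum(axis=1) + worst)).sum())
```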

Please note that if we compared \(\tilde{f}^{min}_{sup(S)}(q)\) with the quasi-density values of all objects in \(O\), the computational cost for density estimation would be considerably high, especially when the size of \(O\) is large. For the sake of efficiency, we make a tradeoff between subspace pruning and object pruning. Specifically, when we are searching a subspace \(S\), once we can determine that \(rank_S(q) > r_{best}\), we terminate the search of \(S\) immediately.

Algorithm 4 gives the pseudo-code of computing outlyingness rank and pruning subspaces in OAMiner. Theorem 3 guarantees that Algorithm 4 can find all minimal outlying subspaces.

Algorithm 4 (computing the outlyingness rank and pruning subspaces in OAMiner; pseudo-code figure not reproduced)

Theorem 3

(Completeness of OAMiner) Given a set of objects \(O\) in a multi-dimensional space \(D\), a query object \(q \in O\) and a maximum dimensionality threshold \(0 < \ell \le |D|\), OAMiner finds all minimal outlying subspaces of \(q\).

The proof of Theorem 3 is given in Appendix 8.

6 Empirical evaluation

In this section, we report a systematic empirical study using several real data sets and synthetic data sets to verify the effectiveness and efficiency of our method. All experiments were conducted on a PC with an Intel Core i7-3770 3.40 GHz CPU and 8 GB main memory, running the Windows 7 operating system. The algorithms were implemented in Java and compiled by JDK 7. Since subspaces of dimensionality greater than 5 are likely too hard for a user to interpret, we set \(\ell = 5\) and \(\alpha = 1.0\) by default in OAMiner.

6.1 Mining outlying aspects on real data sets

NBA coaches, sport agents, and commentators may want to know in which aspects a player is most unusual. Using this application scenario as a case study, we first investigate the outlying aspects of all NBA guards, forwards and centers in the 2012–2013 Season. We collect the technical statistics on 20 numerical attributes from http://sports.yahoo.com/nba/stats. Table 4 shows the names of the dimensions. The statistics for centers on 3-points (items 6, 7 and 8) are removed since the statistics for most centers are 0. In addition, we apply OAMiner to several real-world data sets from the UCI repository (Bache and Lichman 2013). In our experiments, we remove non-numerical attributes and all instances containing missing values. Table 5 shows the data characteristics.

Table 4 The 20 technical statistics
Table 5 Data set characteristics

For each data set, we take each record as a query object \(q\), and apply OAMiner to discover the outlying aspects of \(q\). Figure 3 shows the distributions of the best outlyingness ranks of objects on the data sets. Surprisingly, the best outlyingness ranks of most objects are small, that is, most objects are ranked very well in outlyingness in some subspaces. For example, 90 guards \((40.9\,\%)\), 81 forwards \((50.6\,\%)\) and 32 centers \((69.6\,\%)\) have an outlyingness rank of 5 or better. Most players have some subspaces where they are substantially different from the others. This observation justifies the need for outlying aspect mining.

Figure 4 shows the distributions of the number of the minimal outlying subspaces where the objects achieve the best outlyingness rank on the data sets. For most objects, the number of outlying aspects is small, which is also surprising. As shown in Fig. 4a, 150 (68.2 %) objects in Guards have only 1 outlying aspect. This indicates that most objects can be distinguished from the others using a small number of factors.

Fig. 3 Distributions of outlyingness ranks \((\ell = 5)\). (a) Guards, (b) Forwards, (c) Centers, (d) Breast cancer, (e) Climate model, (f) Concrete slump, (g) Parkinsons, and (h) Wine

Fig. 4 Distributions of total number of outlying aspects \((\ell = 5)\). (a) Guards, (b) Forwards, (c) Centers, (d) Breast cancer, (e) Climate model, (f) Concrete slump, (g) Parkinsons, and (h) Wine

Table 6 summarizes the mining results of OAMiner on the real data sets for \(\ell = 4, 5, 6\), respectively. Not surprisingly, the smallest values of the outlyingness rank, the number of outlying aspects, and the dimensionality are all 1. With a larger value of \(\ell \), the average outlyingness rank decreases, while the average number of outlying aspects and the average dimensionality increase. In addition, we can see that more outlying aspects with a higher dimensionality can be found on data sets with more attributes and more instances. For example, the average number of outlying aspects discovered from Breast cancer is the largest.

Table 6 Sensitivity of OAMiner’s effectiveness w.r.t. parameter \(\ell \)

6.2 Outlying aspects discovery on synthetic data sets

Keller et al. (2012) provided a collection of synthetic data sets, each consisting of 1,000 data objects. Each data set contains some subspace outliers, which deviate from all clusters in at least one 2–5 dimensional subspace. As stated in Keller et al. (2012), an object can be an outlier in multiple subspaces independently. We perform tests on the data sets of 10, 20, 30, 40 and 50 dimensions, and denote the data sets by \(Synth\_10D\), \(Synth\_20D\), \(Synth\_30D\), \(Synth\_40D\), \(Synth\_50D\), respectively.

For an outlier \(q\) in a data set, let \(S\) be the ground-truth outlying subspace of \(q\). Please note that \(S\) may not be an outlying aspect of \(q\) if there exists another outlier more outlying than \(q\) in \(S\), since OAMiner finds the subspaces whereby the query object is most outlying. To verify the effectiveness of OAMiner using the known ground truth of outlying subspaces, in the case of multiple implanted outliers in \(S\), we keep \(q\), remove the other outliers, and take \(q\) as the query object. Since \(q\) is then the only implanted strong outlier in subspace \(S\), OAMiner is expected to find the ground-truth outlying subspace \(S\) where \(q\) takes rank 1 in outlyingness, that is, \(rank_S(q)=1\).

We divide the mining results of OAMiner into the following 3 cases:

  • Case 1: only the ground truth outlying subspace is discovered by OAMiner with outlyingness rank 1.

  • Case 2: besides the ground truth outlying subspace, OAMiner finds other outlying aspects with outlyingness rank 1.

  • Case 3: instead of the ground truth outlying subspace, OAMiner finds a subset of the ground truth as an outlying aspect with outlyingness rank 1.

Table 7 lists the mining results on \(Synth\_10D\). For all outliers (query objects), outlying aspects with outlyingness rank 1 are discovered. Moreover, we can see that for objects 183, 315, 577, 704, 754, 765 and 975, OAMiner finds not only the ground-truth outlying subspace, but also some other outlying subspaces (Case 2). For object 245, the outlying aspect discovered by OAMiner is a subset of the ground-truth outlying subspace (Case 3). For the other 11 objects, the outlying aspects discovered by OAMiner are identical to the ground-truth outlying subspaces (Case 1).

Table 7 Outlying aspects on \(Synth\_10D\)
Fig. 5 Outlying aspects for objects 245 and 315 in \(Synth\_10D\)

To further demonstrate the effectiveness of OAMiner, for object 245 in Case 3, we illustrate the outlying aspect \(\{2,5\}\) in Fig. 5a, and for object 315 in Case 2, we illustrate the outlying aspect \(\{3,4\}\) in Fig. 5b. Visually, the objects show outlying characteristics in the corresponding outlying aspects.

Table 8 summarizes the mining results of OAMiner on the synthetic data sets of 10, 20, 30, 40 and 50 dimensions. As OAMiner finds all subspaces in which the outlyingness rank of the query object is the minimum, we can see that the number of Case 2 results increases with higher dimensionality. In other words, more outlying aspects can be found on data sets with more attributes. Please note that this observation is consistent with the experimental observations on the real data sets (Sect. 6.1). In addition, the number of Case 3 results increases slightly, since OAMiner applies the subspace minimality condition to outlying aspect mining.

6.3 Outlying aspects discovery on NBA data sets

As a real case study, we verified the usefulness of outlying aspect mining by analyzing the outlying aspects of some NBA players.

Please note that “outlying” is different from “outstanding”. A player receives a good outlyingness rank in a subspace if very few other players are close to him in the subspace, regardless of whether the performance is “good” or not. Table 9 lists the 10 guards who have the largest number of rank-1 outlying aspects, where the dimensions are represented by their serial numbers in Table 4. (Due to space limits, Table 9 only lists the outlying aspects whose dimensionality is not greater than 3.)

Table 8 Statistics on the mining results of OAMiner on synthetic data sets
Table 9 The guards having the most rank-1 outlying aspects

In Table 9, the first several players are not well known. Their low outlyingness ranks arise because no other players have similar statistics. For example, Quentin Richardson, who has 18 outlying aspects, played just one game in which he performed very well at rebounds but poorly at field goals. Will Conroy played four games and his shooting performance is poor. Brandon Rush played two games, and his number of personal fouls is large. Ricky Rubio performs well at stealing. Rajon Rondo's ability to assist is impressive, but his turnover count is also large. Scott Machado did not commit any personal fouls in the six games he played. The last four players in Table 9 are famous. Their overall performance on every aspect is much better than that of most other guards. For example, Kobe Bryant is a great scorer, Jamal Crawford's personal fouls are very low, James Harden is excellent at free throws, and Stephen Curry leads in 3-point scoring.

Please note that different objects may share some outlying aspects with the same outlyingness rank. For example, both Quentin Richardson and Will Conroy are ranked number 1 in subspace \(\{5, 8\}\). There are two reasons for this situation: first, the values of the objects may be identical in these subspaces; second, the difference between the outlyingness degrees may be so tiny that it is below the numerical precision of the program.

Table 10 lists the guards who have poor outlyingness ranks overall (i.e., there is no subspace in which they are ranked particularly well). Their performance statistics are middle of the road, and they do not have any obvious shortcomings. They may still be important to a team as “the sixth man”, even though they are not star performers.

Table 10 The guards having poor ranks in outlying aspects

As mentioned in Sect. 3, subspace outlier detection is fundamentally different from outlying aspect mining, since subspace outlier detection finds contrast subspaces for all possible outliers. However, we can use the results of subspace outlier ranking to verify, to some extent, our discovered outlying aspects. Specifically, we look at the objects that are ranked best by either HiCS (Keller et al. 2012) or SOD (Kriegel et al. 2009), and check their outlyingness ranks. As HiCS randomly selects subspace slices, we ran it 3 times independently on each data set with the default parameters. The number of nearest neighbors in both LOF (the outlier score that HiCS uses to rank objects in its selected subspaces) and SOD was varied over 5, 10 and 20, and the best ranks are reported. In SOD (Kriegel et al. 2009), the parameter \(l\) specifying the size of the reference sets cannot be larger than the number of nearest neighbors; we set it to the number of nearest neighbors. For a given object, we denote by \(rank_{HL}\) and \(rank_{SOD}\) the ranks computed by HiCS (with LOF) and by SOD, respectively, and by \(rank_S\) the outlyingness rank computed by OAMiner. Table 11 shows the results. Every player ranked top by either HiCS or SOD has some outlying subspace in which he is ranked number 1, so the results of outlying aspect mining are consistent with those of subspace outlier ranking. At the same time, we notice that the rankings of HiCS and SOD are not always consistent with each other, for example for Kobe Bryant, Brandon Roy and Andrew Bogut.
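For reference, the sketch below shows how the LOF side of this comparison could be reproduced with scikit-learn; it scores the full space and varies the number of neighbors over 5, 10 and 20, which is a simplification of HiCS (HiCS aggregates LOF scores over its selected high-contrast subspaces). The SOD ranks would be obtained analogously with an SOD implementation such as the one in ELKI. The function name is illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def best_lof_rank(X, q, k_values=(5, 10, 20)):
    """Best (smallest) LOF-based outlier rank of object q over several settings."""
    best = None
    for k in k_values:
        lof = LocalOutlierFactor(n_neighbors=k)
        lof.fit(X)
        scores = -lof.negative_outlier_factor_        # larger score = more outlying
        rank = int((scores > scores[q]).sum()) + 1    # rank 1 = most outlying object
        best = rank if best is None else min(best, rank)
    return best
```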

Table 11 The outlyingness ranks of players ranked top in HiCS or SOD

6.4 Efficiency

To the best of our knowledge, there is no other method tackling exactly the same problem as OAMiner. Therefore, we only evaluate the efficiency of OAMiner and its variations. Specifically, we implemented the baseline method (Algorithm 1 with Pruning Rule 1). Recall that OAMiner uses both upper and lower bounds of quasi-density to speed up the computation of outlyingness ranks. To evaluate the efficiency of our techniques for quasi-density comparison, we implemented a version, OAMiner-part, that uses neither the bounds in quasi-density estimation nor the strategies presented in Sect. 5.2. Moreover, we implemented the full version, OAMiner-full, which uses all techniques.

Once again, we used a synthetic data set from Keller et al. (2012), which consists of 1,000 data points in 50 dimensions. We randomly chose 10 data points (non-outliers) from the data set as query objects and report the average runtime. Again, we set \(\ell = 5\) for all three methods and \(\alpha = 1.0\) for OAMiner-full by default.
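The timing protocol can be summarized by a small harness like the one below, assuming each variant (baseline, OAMiner-part, OAMiner-full) is exposed as a Python callable; the callable names and parameter names are placeholders, not the authors' implementation.

```python
import time
import numpy as np

def average_runtime(mine, X, query_indices, **params):
    """Average wall-clock runtime of one mining routine over several query objects.

    mine          : callable(X, q, **params), a placeholder for one OAMiner variant
    query_indices : indices of the randomly chosen (non-outlier) query objects
    """
    elapsed = []
    for q in query_indices:
        start = time.perf_counter()
        mine(X, q, **params)
        elapsed.append(time.perf_counter() - start)
    return float(np.mean(elapsed))

# e.g. average_runtime(oaminer_full, X, queries, max_dim=5, alpha=1.0), where
# max_dim plays the role of the maximum dimensionality threshold and alpha is
# the quasi-density bounding parameter of Sect. 5.2.
```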

Figure 6a shows the runtime with respect to data set size, plotted on a logarithmic scale. The baseline method is time consuming, which is consistent with our analysis, while our pruning techniques achieve a roughly linear runtime in practice. Both versions of OAMiner are substantially faster than the baseline method, and OAMiner-full is more efficient than OAMiner-part.

Figure 6b shows the runtime with respect to dimensionality, also on a logarithmic scale. As dimensionality increases, the runtime increases exponentially; however, our heuristic pruning techniques speed up the search in practice. Again, OAMiner-full is more efficient than OAMiner-part.

Figure 6c shows the runtime with respect to the maximum dimensionality threshold \(\ell\), again on a logarithmic scale. As \(\ell\) increases, more subspaces are enumerated, and the runtime increases correspondingly. Once more, both versions of OAMiner are considerably faster than the baseline method, and OAMiner-full is more efficient than OAMiner-part.

Fig. 6 Efficiency test: runtime w.r.t. (a) data set size, (b) dimensionality, and (c) maximum dimensionality threshold \(\ell\)

We also notice that the runtime of OAMiner is related to the outlyingness rank of the query object. Figure 7 shows the runtime with respect to the outlyingness rank on each real data set. Not surprisingly, objects with a large outlyingness rank require more runtime, since OAMiner prunes subspaces based on the rank of the query object via Pruning Rule 1 or Pruning Rule 2.

Fig. 7 Runtime w.r.t. outlyingness rank. (a) Guards, (b) Forwards, (c) Centers, (d) Breast cancer, (e) Climate model, (f) Concrete slump, (g) Parkinsons, and (h) Wine

Last, we test the sensitivity of the parameter \(\alpha \), which controls the \(\epsilon \)-neighborhood distance used for bounding quasi-density. Table 12 lists the average runtime of OAMiner for a query object on each real data set under different values of \(\alpha \). In general, the runtime of OAMiner is not sensitive to \(\alpha \); experimentally, the shortest runtimes (bold values in Table 12) occur when \(\alpha \) is in [0.8, 1.2].

Table 12 Average runtime of OAMiner w.r.t. parameter \(\alpha \)

7 Discussions and conclusions

In this paper, we studied the novel and interesting problem of finding outlying aspects of a query object on multidimensional numeric data. We systematically developed a model and a heuristic method. Using both real and synthetic data sets, we verified that mining outlying aspects is interesting and useful. Moreover, our experiments show that our outlying aspect mining method is effective and efficient.

There are several interesting issues that deserve research effort in the future. First, to further examine the quality of outlying aspects, we plan to compute a statistical confidence interval on the rank via bootstrap sampling, and select the subspace with the tighter confidence interval on the rank (a minimal sketch of this idea is given below). Second, since OAMiner ranks the query object among all objects by their probability densities estimated with a Gaussian kernel, it is interesting to consider other kernel functions or outlyingness degree measures proposed by outlier detection methods, such as SOD (Kriegel et al. 2009) and LOF (Breunig et al. 2000). Third, OAMiner discovers outlying aspects for a given object. Selecting appropriate query points clearly requires domain knowledge, as does selecting an appropriate background data set to contrast against the query point; it is therefore interesting to explore strategies for incorporating domain knowledge into outlying aspect mining. In practice, a user may also want to study a set of objects rather than a single one. In addition, we will explore parallel computation approaches to improve the efficiency of OAMiner, and extend OAMiner to mixed data containing both numerical and non-numerical values. Finally, it is also interesting to investigate concise representations of outlying aspects, such as maximal outlying aspects, to explore relations among outlying aspects, and to investigate how to measure the interpretability and interestingness of an outlying aspect.
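As a hedged illustration of the first direction, the sketch below bootstraps the background objects and recomputes the rank with the outlyingness_rank routine sketched in Sect. 6.3; it is our reading of the proposed idea, not an implemented part of OAMiner.

```python
import numpy as np

def rank_confidence_interval(X, q, subspace, n_boot=200, level=0.95, seed=0):
    """Bootstrap confidence interval for the outlyingness rank of object q.

    Resamples the background objects with replacement (always keeping q itself)
    and recomputes the rank with outlyingness_rank() from the earlier sketch.
    """
    rng = np.random.default_rng(seed)
    others = np.delete(np.arange(X.shape[0]), q)
    ranks = []
    for _ in range(n_boot):
        sample = rng.choice(others, size=others.size, replace=True)
        idx = np.concatenate(([q], sample))               # query sits at position 0
        ranks.append(outlyingness_rank(X[idx], 0, subspace))
    lower, upper = np.quantile(ranks, [(1 - level) / 2, 1 - (1 - level) / 2])
    return lower, upper
```

Under this scheme, a subspace whose bootstrap interval is narrow would be preferred over one whose rank fluctuates widely across resamples.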