
1 Introduction

Outlying aspect mining aims to discover an aspect (i.e., a set of features or attributes) in which a query point has the most significant outlyingness. In many real-life application scenarios, it can provide interpretable information and decision support for downstream tasks [13]. For example, a recruitment team may be highly interested in identifying the most outstanding merits or shortcomings of a particular candidate compared to others.

It is worth noting that the distribution of query points in different spaces varies significantly. As Fig. 1(a) shows, all the data samples are scattered in a 3-dimensional space. The outlyingness of the query point (red triangle) is not significant in the full space. After projecting the data into various 2-dimensional subspaces, the red triangle is more distinguishable from the other points (blue dots) in Fig. 1(b) and Fig. 1(c). The space that can exhibit significant differences between the query point and others is called the outlying aspect.

Technically, it is not feasible to enumerate all aspects because the number of subspaces grows exponentially with the data dimensionality. To find the optimal outlying aspect efficiently, we consider the following two challenges:

  • (C1) How to search aspects efficiently and comprehensively? To avoid the enormous computational cost of enumerating all aspects, the search process should select representative aspects. However, the distributions of different subspaces are irregular, which makes it challenging to find all representative aspects.

  • (C2) How to measure the outlyingness of different aspects impartially? A scoring function needs to be designed to quantify the outlyingness of a query point in different aspects. However, comparing outlyingness across aspects of different dimensionality may lead to biased results.

Fig. 1. An example of data distribution in different spaces. (a) Data distribution in the full space; (b)(c)(d) data distributions in three projection subspaces.

Current approaches have limitations in addressing the above challenges. Concerning the search strategy (C1), the most advanced and general approach is the beam algorithm [10], which relies on heuristic rules. This search strategy assumes that if a point scores high in an l-dimensional aspect, that aspect can generally be generated by adding one dimension to a well-behaved (\(l-1\))-dimensional aspect. However, Fig. 1(b) and Fig. 1(c) show extreme cases that disprove this assumption: higher-dimensional aspects with such extreme distributions are unreachable by the search. As for the outlyingness scoring function (C2), Zhang et al. [17] used the distance from surrounding points to the query point as a metric, which biases results towards higher-dimensional aspects. Subsequent methods chose density as the evaluation criterion, but they still have shortcomings: density ranking [5] loses the absolute degree of deviation, Z-score normalization [10] favors aspects with high variance, and simulating densities with hyperspheres generated by random sampling [12] is highly unstable.

In this paper, we propose a score-and-search method OSIER (short for outlying aspect mining based on a genetic algorithm), which can effectively identify candidate outlying aspects. OSIER overcomes the shortcomings of existing methods from two perspectives. For C1, it retains the strategy of generating high-dimensional aspects from well-behaved lower-dimensional ones, while increasing the diversity of searched aspects by replacing dimensions through a mutation operation. This search method therefore has a chance of escaping local optima. For C2, it uses a staged comparison strategy: the absolute density of the query point is compared across aspects of the same dimensionality, and the optimal aspects of each dimensionality are then compared with a normalized score. In this way, it retains the absolute density information of all data, avoids biased results, and improves computational efficiency. The main contributions of this paper are summarized as follows:

  • We propose a novel genetic algorithm-based method, named OSIER, to increase the diversity of searched aspects and generate more representative aspects efficiently.

  • We use a simplified density estimation function in a multi-dimensional space and analyze an improved outlyingness measure guided by prior knowledge.

  • We calculate the impact of each dimension on the outlying aspect by generating a maximum interval hyperplane, which can be used to guide the direction of evolution in genetic algorithms. Besides, we use a new comparison strategy preserving the absolute density of all data to avoid biased results.

  • We demonstrate OSIER on multiple real-world and synthetic datasets. The experimental results show that OSIER has higher search efficiency and stability, which can be applied to some extreme cases.

The rest of the paper is organized as follows. The related work is reviewed in Sect. 2. Section 3 presents the details of OSIER. The experiments and results are provided in Sect. 4, and Sect. 5 shows the conclusion and future work.

2 Related Work

2.1 Outlying Aspect Mining

Assume that there exists a d-dimensional space \(D=\{D_{1},D_{2},...,D_{d}\}\) and a set of data points \(X \in \mathbb {R}^d\). For a point \(X_{i} \in X\), its feature under the full space is represented as \( \{X_{i}.D_{1},X_{i}.D_{2},...,X_{i}.D_{d}\} \).

We call a combination of multiple dimensions an aspect (a single dimension can also be considered an aspect), and the space composed of these dimensions is a subspace of the full space D. For an aspect \(\mathcal {S}=\{D_{i_{1}},D_{i_{2}},...,D_{i_{|S|}}\}\), we can define an outlyingness scoring function \(\rho _{\mathcal {S}}(q)\), which measures the outlyingness of query point q in the aspect \(\mathcal {S}\). Based on the above definitions, we can formalize the Outlying Aspect Mining problem as follows:

Definition 1 (Outlying Aspect Mining)

Given a set X of n instances in a d-dimensional space D and a query point \(q \in X\), outlying aspect mining identifies the non-empty aspect \(\mathcal {S} \subseteq D\) in which the query point q's outlying degree \(\rho _{\mathcal {S}}(q)\) is larger than in any other aspect.
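To make this formalization concrete, the following minimal sketch enumerates every non-empty aspect and returns the one maximizing \(\rho _{\mathcal {S}}(q)\). The score callback and the interfaces are hypothetical stand-ins, not any published implementation; the point is the \(O(2^d)\) cost that motivates smarter search strategies.

```python
from itertools import combinations

import numpy as np

def brute_force_oam(X, q, score):
    """Exhaustive outlying aspect mining over all non-empty subspaces.

    X: (n, d) data matrix; q: (d,) query point;
    score(X_sub, q_sub): outlyingness of q in the projected subspace.
    Enumerates O(2^d) aspects -- feasible only for very small d.
    """
    d = X.shape[1]
    best_aspect, best_score = None, -np.inf
    for k in range(1, d + 1):
        for aspect in combinations(range(d), k):
            cols = list(aspect)
            s = score(X[:, cols], q[cols])
            if s > best_score:
                best_aspect, best_score = aspect, s
    return best_aspect, best_score
```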

Existing OAM methods fall mainly into two categories: feature selection methods [4, 9] and score-and-search methods [5, 10, 17]. Feature selection methods transform the OAM problem into a classical feature selection problem for classification. More specifically, the two classes are defined as the query point (positive class) and the rest of the data (negative class). Consequently, there is a data imbalance problem when the model is trained. Besides, the interpretability of these methods is poor. Score-and-search methods have therefore been more widely and deeply studied than feature selection methods.

The framework of score-and-search methods is to search each candidate aspect, calculate the outlyingness of the query point in it, and select the aspect with the highest outlyingness as the optimal result. Zhang et al. [17] proposed a metric based on the idea of kNN, called the outlying degree. Duan et al. [5] applied kernel density estimation to multidimensional space by using a product of univariate Gaussian kernels. Meanwhile, they used a boundary-pruning-based search strategy.

Vinh et al. [10] considered that the use of density ranking would lose important information about the degree of absolute deviation. Thus they designed a standard scoring function and proposed the concept of dimensionality unbiasedness for outlying aspect mining measures.

Definition 2 (Dimensionality Unbiasedness)

If a density scoring function \(\rho _{\mathcal {S}}(\cdot )\) satisfies the formula:

$$\begin{aligned} \frac{1}{n}\sum _{X_{i} \in X}\rho _{\mathcal {S}}(X_{i})=\mathrm {const.}\quad \text {w.r.t.}\ |\mathcal {S}| \end{aligned}$$
(1)

the function can be used to directly compare the outlyingness of query points across spaces of different dimensionality.

Dimensionality unbiasedness provides a desirable property for designing density scoring functions, avoiding bias due to different dimensionalities and making OAM more interpretable. Meanwhile, to prevent the combinatorial explosion in the number of high-dimensional aspects, a beam search method was proposed that keeps the number of searched subspaces within a specific range via heuristic pruning. Wells et al. [15] analyzed the shortcomings of kernel density search and proposed sGrid density estimation instead, considerably speeding up the computation. Samariya et al. [12] generate hyperspheres by random sampling to evaluate the outlyingness of query points, sidestepping the expensive density computation in high-dimensional space.
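Returning to Eq. 1, a quick illustration of why Z-score normalization is dimensionality-unbiased (a sketch, not code from any of the cited papers): normalizing the densities of all points in an aspect forces their mean to zero for every aspect, whatever \(|\mathcal {S}|\) is.

```python
import numpy as np

def z_normalize(densities):
    """Z-score normalize per-point densities within one aspect.
    The returned scores always have mean 0, so
    (1/n) * sum_i Z(rho_S(X_i)) is constant w.r.t. |S| (Eq. 1)."""
    return (densities - densities.mean()) / densities.std()

rho = np.random.rand(1000)                       # stand-in densities for one aspect
print(np.isclose(z_normalize(rho).mean(), 0.0))  # True, regardless of |S|
```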

It is worth noting that outlier detection and outlying aspect mining are different. Outlier detection aims to detect anomalous data that are exceptional with respect to the majority of objects in a database. It can be applied in various fields, such as disease detection [6], social media monitoring [19] and network intrusion supervision [1]. However, it is difficult for outlier detection to provide a reasonable and intuitive explanation for the identified objects. The OAM task was proposed for discovering aspects in which the query instance exhibits the most outlying characteristics.

2.2 Genetic Algorithm

In genetic algorithms, each individual consists of a gene string that represents a feasible solution to the problem. Fitness is the metric used to evaluate an individual, and the fitness function is usually derived from the objective function. The selection operation selects parents based on fitness and passes their genes to the next generation of individuals. Selected parents undergo a crossover operation with a certain probability to produce the next individuals. After many generations, the genetic algorithm terminates once a defined threshold or number of iterations is reached and returns a high-quality solution. Genetic algorithms have been used to solve a large variety of problems efficiently, including classification [3], credit risk assessment [8] and time-series analysis [11].
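The loop just described can be sketched as follows. This is a generic skeleton under assumed interfaces: fitness, crossover, and mutate are placeholders, and the population handling and rates are illustrative defaults, not values used by OSIER.

```python
import random

def genetic_search(init_pop, fitness, crossover, mutate,
                   generations=100, mutation_rate=0.05):
    """Generic GA loop: rank by fitness, keep the top half as parents,
    recombine pairs, and mutate children with a small probability."""
    population = list(init_pop)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]   # selection
        children = []
        while len(children) < len(population):
            p1, p2 = random.sample(parents, 2)
            child = crossover(p1, p2)                  # recombination
            if random.random() < mutation_rate:
                child = mutate(child)                  # mutation
            children.append(child)
        population = children
    return max(population, key=fitness)
```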

Zhu et al. [18] used genetic algorithms for the outlier detection problem. They used cell-based segmentation techniques, which incur a high outlyingness computation cost in high-dimensional space. Zhang et al. [16] devised a method that does not depend on the upper and lower bound closure properties. Similar to [17], they chose distance as the outlyingness measure, which is used to guide the evolution.

The genetic algorithm is an efficient, simple, and robust optimization technique for intelligent global search. We exploit these characteristics to discover outlying aspects efficiently.

3 Design of OSIER

In this section, we discuss the details of OSIER. It takes a query point q together with a dataset X of n points \(\{X_{1},...,X_{n}\}\) as input, \(X_{i} \in \mathbb {R}^d \), and outputs an aspect in which the given point has the highest outlyingness.

3.1 Outlying Scoring Function

In the choice of scoring function, we use a simplified version of the multidimensional density estimation function:

$$\begin{aligned} \rho _{\mathcal {S}}(q)=\frac{1}{nh^{|\mathcal {S}|}}\sum _{i=1}^{n}K(\frac{\Vert q-X_{i} \Vert _p}{h}) \end{aligned}$$
(2)

where p denotes the order of the norm. In the absence of any prior knowledge, the most general choice is the Euclidean norm (\(p=2\)), the Gaussian kernel as the kernel function \(K(\cdot )\), and a bandwidth h computed by Silverman's rule of thumb [14].
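A direct sketch of Eq. 2 under these defaults (Gaussian kernel, \(p=2\), Silverman bandwidth). This is our reading of the formula, not the authors' released code; in particular, averaging the per-dimension standard deviations into a single bandwidth is a simplifying assumption.

```python
import numpy as np

def silverman_bandwidth(X_sub):
    """Multivariate Silverman rule: h = (4/(d+2))^(1/(d+4)) * n^(-1/(d+4)) * sigma,
    with sigma taken here as the mean per-dimension standard deviation."""
    n, d = X_sub.shape
    sigma = X_sub.std(axis=0).mean()
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4)) * sigma

def kde_score(X_sub, q_sub, p=2):
    """Eq. 2: rho_S(q) = 1/(n h^|S|) * sum_i K(||q - X_i||_p / h),
    with a radial Gaussian kernel K."""
    n, d = X_sub.shape
    h = silverman_bandwidth(X_sub)
    dist = np.linalg.norm(X_sub - q_sub, ord=p, axis=1)
    kernel = np.exp(-0.5 * (dist / h) ** 2) / (2.0 * np.pi) ** (d / 2.0)
    return kernel.sum() / (n * h ** d)
```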

This default density estimation parameterization can be improved with prior knowledge. One type of prior knowledge derived from the data description is the bound of each dimension. Suppose the dataset is restricted in a dimension to a range of values; it is not reasonable to assign density to points outside that range. For example, age is a non-negative number, so it is unreasonable for the density estimation function to place probability mass in the region where age is negative. We use a reflection strategy to solve this problem. If the data has a minimum value boundary b in dimension \(D_{i}\), for query point \(q=\{q.D_{1},...,q.D_{i},...,q.D_{|\mathcal {S}|}\}\), we set the symmetry point \(q_{sym}=\{q.D_{1},...,2b-q.D_{i},...,q.D_{|\mathcal {S}|}\}\). The optimized scoring function is:

$$\begin{aligned} \rho ^{\prime }_{\mathcal {S}}(q)=\rho _{\mathcal {S}}(q)+\rho _{\mathcal {S}}(q_{sym}) \end{aligned}$$
(3)

For the case where there are boundaries on both sides, we only consider the first reflection point, because an appropriate bandwidth ensures that the density contribution after multiple reflections is zero or negligibly close to zero.
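A sketch of the reflection correction in Eq. 3, reusing the hypothetical kde_score above (bound_dim and b denote the bounded dimension's index and its lower bound):

```python
def reflected_score(X_sub, q_sub, bound_dim, b):
    """Eq. 3: rho'_S(q) = rho_S(q) + rho_S(q_sym), where q_sym mirrors q
    across the lower bound b in dimension bound_dim, so that probability
    mass that would leak past the bound is folded back."""
    q_sym = q_sub.copy()
    q_sym[bound_dim] = 2 * b - q_sym[bound_dim]
    return kde_score(X_sub, q_sub) + kde_score(X_sub, q_sym)
```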

In addition, if a dataset is composed of multiple datasets, resulting in data far from normally distributed, Silverman's rule of thumb [14] will yield a poor density estimate. In that case, we prefer an improved Sheather algorithm [2], which achieves better results when the data are distributed across multiple dense regions.

3.2 Dimensions Correlation Analysis

Before formally calculating the query point's outlyingness, each dimension of the dataset needs to be analyzed. Different dimensions contribute differently to the generation of outlying aspects, so analyzing single dimensions helps guide the subsequent search process.

To make the outlying aspect results more credible, we perform a deeper analysis of the density estimation method. For a query point, if its projection on a dimension lies in a remote region with minimal probability density, that dimension can have a large impact on aspect generation. Such a dimension is called a trivial outlying dimension, defined as follows:

Definition 3 (Trivial Outlying Dimension)

Given a query point q, an outlyingness scoring function \(\rho (\cdot )\) and a threshold \(\epsilon \), a dimension \(D_{i}\) is called trivial outlying dimension if \(\rho _{D_{i}}(q) \le \epsilon \).

An intuitive fact is that when a trivial outlying dimension is coupled with another dimension, the generated aspect may still have a good outlyingness score for the query point. Therefore, trivial outlying dimensions should be pre-processed to reduce their impact on the search results; otherwise, aspects that could exhibit superior outlyingness are crowded out by the many combinations of a trivial outlying dimension with other dimensions.

Table 1. Shooting statistics of six active players of the Los Angeles Lakers

Example 1

Table 1 shows the shooting statistics of the Los Angeles Lakers. We employ OAMiner [5] to find the outlying aspects of LeBron James. The result shows that the top-5 outlying aspects are \(\{3PM\}\), \(\{3PM,3PA\}\), \(\{3PM,FGM\}\), \(\{3PM,FTM\}\) and \(\{3PM,3PA,FGM\}\). When we ignore the effect of feature 3PM, the outlying aspect \(\{FG\%,FT\%,2P\%\}\) is revealed, which offers more hidden information.

Afterwards, we need to evaluate the correlation between dimensions and analyze the contribution of each dimension to the query point's outlyingness, represented as a set of weights \(\mathbf{{w}}=\{w_{1},w_{2},...,w_{d}\}\). To calculate w, we construct a hyperplane in the full space that separates the query point from the other points as far as possible. Since the query point and the rest of the sample points form two extremely imbalanced classes, we choose the one-class SVM method and set the query point as the origin. Solving for this maximum geometric margin hyperplane is essentially a convex optimization problem:

$$\begin{aligned} \begin{aligned} \min&\frac{1}{2}{||\textbf{w}||}^{2}-\tau +\frac{1}{\upsilon n}\sum \limits _{i=1}^{n} \xi _{i} \\ s.t.&\ {\mathbf{{w}}}^{T}X_{i} \ge \tau - \xi _{i},\quad \xi _{i} \ge 0 \end{aligned} \end{aligned}$$
(4)

where \(X_{i}\) denotes spatial vectors of the i-th point, \(\tau \) denotes the hyperplane bias, \(\xi _{i}\) denotes the relaxation variable of the i-th point and \(\upsilon \) denotes a trade-off parameter. We solve the quadratic programming problem by Lagrange Multiplier Method:

$$\begin{aligned} \mathcal {L}(\mathbf{{w}},\tau ,\xi ,\alpha ,\beta )= \frac{1}{2}{||\mathbf{{w}}||}^{2}-\tau +\frac{1}{\upsilon n}\sum \limits _{i=1}^n \xi _{i}-\sum \limits _{i=1}^{n}\alpha _{i}(\mathbf{{w}}^{T}X_{i}-\tau +\xi _{i})-\sum \limits _{i=1}^{n}\beta _{i}\xi _{i} \end{aligned}$$
(5)

To obtain a specific form for solving the dual problem, let the partial derivatives of \(\mathcal {L}(\mathbf{{w}},\tau , \xi ,\alpha ,\beta )\) with respect to \(\mathbf{{w}}\), \(\tau \), and \(\xi \) equal zero. We can obtain the following three conditions:

$$\begin{aligned} \mathbf{{w}}-\sum \limits _{i=1}^n \alpha _{i}X_{i}=0 \end{aligned}$$
(6)
$$\begin{aligned} \sum \limits _{i=1}^n \alpha _{i}-1=0 \end{aligned}$$
(7)
$$\begin{aligned} 0 \le \alpha _{i} \le \frac{1}{\upsilon n} \end{aligned}$$
(8)

Equation 6 can be solved for \(\textbf{w}\), where \(X=\{X_{1},X_{2},...,X_{n}\}\) is known, and the optimal solution for \(\alpha \) can be obtained by substituting Eqs. (6)–(8) into Eq. (5):

$$\begin{aligned} \begin{aligned} \min&\ \frac{1}{2}\sum \limits _{i=1}^{n}\sum \limits _{j=1}^{n}\alpha _{i}{X_{i}}^{T}X_{j}\alpha _{j} \\ s.t.&\ \sum \limits _{i=1}^n \alpha _{i}=1,\quad 0 \le \alpha _{i} \le \frac{1}{\upsilon n} \end{aligned} \end{aligned}$$
(9)
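In practice, the weight vector \(\textbf{w}\) can be recovered with an off-the-shelf linear one-class SVM rather than a hand-rolled QP solver. A sketch with scikit-learn follows; this is our substitution, not the authors' stated implementation. The parameter nu plays the role of \(\upsilon \), and centering the data on q is our reading of "set the query point as the origin".

```python
import numpy as np
from sklearn.svm import OneClassSVM

def dimension_weights(X, q, nu=0.1):
    """Solve the one-class SVM of Eq. 4 with a linear kernel and return
    per-dimension weights |w_i| normalized to sum to 1."""
    Xc = X - q                                   # query point becomes the origin
    svm = OneClassSVM(kernel="linear", nu=nu).fit(Xc)
    w = np.abs(svm.coef_.ravel())                # coef_ exists for linear kernels
    return w / w.sum()
```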
Fig. 2. Significance of the weights generated by the hyperplane.

Intuitively, the more perpendicular this hyperplane is to a dimension, the greater that dimension's contribution to the separation. As Fig. 2(a) shows, the red triangle indicates the query point, and the red dashed line indicates the generated hyperplane. The hyperplane corresponds to a weight vector \(\mathbf{{w}}=\left[ 3,1 \right] \), indicating that the dimension corresponding to the x-axis has a greater influence on the query point becoming an outlier. Figure 2(b) and Fig. 2(c) show the density distributions of the data projected on the x-axis and y-axis, respectively, which also justify this analysis.

Algorithm 1. The search strategy of OSIER.

3.3 Outlying Aspect Generation

The strategy for searching candidate aspects is the core problem of computing outlying aspects in a high-dimensional dataset. When the dimensionality is large enough, traversing every aspect brings an unbearable computational cost, which grows exponentially with the number of dimensions. Taking OAMiner [5] as an example, it takes over 24 h on a dataset with 30 dimensions and 10,000 points, which is impracticable for many real-world high-dimensional datasets. OSIER uses a genetic algorithm, with recombination, mutation, and selection operations, to search for representative aspects efficiently. The search strategy is given in Algorithm 1.

Procedure Recombination\((\cdot )\) generates new individuals in Step 4 by recombining parts of the structure of multiple parent individuals. If an aspect performs well in the parental generation, the dimensions that make up this aspect are more likely to participate in the generation of new individuals. Recombination can discover most of the high-scoring aspects through this heuristic search strategy.

Procedure Mutation\((\cdot )\) in Step 5 can handle extreme cases (e.g. Fig. 1(b) and Fig. 1(c)). OSIER uses sampling to select mutation dimensions, where the probability of each dimension participating in the mutation is calculated as:

$$\begin{aligned} pro(D_{i})=\frac{W_{D_{i}}}{\sum \limits _{D_{j} \in H} W_{D_{j}}} \end{aligned}$$
(10)

where H is the set of dimensions not involved in the recombination, and \(W_{D_{i}}\) indicates the weight of dimension \(D_{i}\) on the outlyingness in the full space, as described above. A dimension with a larger weight has a larger chance of being selected for the next generation to reproduce with modification. OSIER uses bit-wise mutation, which randomly replaces one of the dimensions that make up an individual; a sketch of this step is given below.
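The sketch follows Eq. 10, reusing the weight vector from the hypothetical dimension_weights above. Two simplifications are ours: H is taken to be the dimensions outside the current aspect, and the replaced position is picked uniformly.

```python
import numpy as np

def mutate_aspect(aspect, weights, d, rng=None):
    """Bit-wise mutation: replace one random dimension of the aspect with a
    dimension drawn from H (here, dimensions outside the aspect) with
    probability proportional to its full-space weight (Eq. 10)."""
    rng = rng or np.random.default_rng()
    H = [i for i in range(d) if i not in aspect]
    p = weights[H] / weights[H].sum()       # pro(D_i) from Eq. 10
    new_dim = int(rng.choice(H, p=p))
    out = list(aspect)
    out[rng.integers(len(out))] = new_dim   # uniform pick of the replaced slot
    return tuple(sorted(out))
```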

Moreover, to ensure a fair comparison between aspects of different dimensionality, the outlyingness score needs to be normalized after calculation. A well-known normalization method in the OAM task is the Z-score [10]:

$$\begin{aligned} Z(\rho _{\mathcal {S}}(q)) \triangleq \frac{\rho _{\mathcal {S}}(q)-\mu _{\rho _{\mathcal {S}}}}{\sigma _{\rho _{\mathcal {S}}}} \end{aligned}$$
(11)

where \(\mu _{\rho _{\mathcal {S}}}\) is the mean of the density of all points in the aspect \(\mathcal {S}\), and \(\sigma _{\rho _{\mathcal {S}}}\) is the standard deviation. The score obtained by this transformation satisfies the dimensionality unbiasedness requirement of Eq. 1. However, the computational cost of normalizing every searched aspect is still large, so we adopt a staged comparison: when generating the aspects of each dimensionality, the optimal aspect is obtained by comparing the original density scores (Steps 6–18); after obtaining the optimal aspect of each dimensionality, the outlyingness scores of these aspects are normalized (Step 19) and compared (Step 20), as sketched below. Compared to OAMiner [5] and Density Z-score [10], the overall search complexity is reduced from \(O(n^{2}d \cdot Wd)\) to \(O(nd \cdot Wd+n^{2}d \cdot d)\), where n and d are the size and dimensionality of the dataset, and W is the average search width per dimensionality.
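A sketch of the staged comparison, reusing the hypothetical kde_score above. It reflects our reading of Steps 6–20: lower density means more outlying, so stage 1 keeps the minimum-density aspect per dimensionality and stage 2 Z-score normalizes only those winners.

```python
import numpy as np

def staged_compare(candidates_by_dim, X, q):
    """Stage 1: per dimensionality k, keep the aspect where q's raw density
    is lowest. Stage 2: Z-score normalize only those winners (Eq. 11) and
    return the aspect with the lowest normalized score."""
    winners = {}
    for k, aspects in candidates_by_dim.items():          # stage 1: raw densities
        winners[k] = min(
            aspects, key=lambda a: kde_score(X[:, list(a)], q[list(a)]))
    best, best_z = None, np.inf
    for a in winners.values():                            # stage 2: Z-scores
        cols = list(a)
        rho = np.array([kde_score(X[:, cols], x[cols]) for x in X])
        z = (kde_score(X[:, cols], q[cols]) - rho.mean()) / rho.std()
        if z < best_z:
            best, best_z = a, z
    return best
```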

Table 2. Characteristics of the datasets

4 Experiments and Result Discussion

4.1 Experimental Setting

Datasets. We use four real-world datasets from the UCI machine learning repository. We also use the synthetic datasets provided by Keller et al. [7]. Table 2 shows the characteristics of these datasets. All attribute values are real numbers. For ease of subsequent description, we index the attributes of each dataset with incremental subscripts.

Baselines. Four OAM methods are selected as baselines to demonstrate the efficiency and stability of OSIER: kernel density rank (OAMiner) [5], Z-score normalized kernel density (ZKDE) [10], sGrid [15] and SINNE [12]. Table 3 briefly summarizes the outlyingness calculation used by each method, where \(\psi \) denotes the size of the random subsample, t denotes the number of ensemble models, and Individual Complexity denotes the complexity of computing the outlyingness of a query point in one aspect. Three additional notes: (1) Although OAMiner introduces a bound-based pruning method, the size of its search space is uncertain and usually much larger than that of the other methods; the exact number depends on the true distribution of the data. (2) SINNE is a sampling-based method: given a query point and an aspect, its results vary between runs. Its heuristic search approach can likewise make the searched aspects volatile. (3) The stability of OSIER depends on the mutation rate, which is usually small. By accepting, with some probability, an aspect worse than the current one, OSIER can escape local optima. In the absence of extreme cases, the searched aspects are stable for the same dataset.

Table 3. Summary of the characteristics of the baselines

Experimental Setup. We use the parameters mentioned in Sect. 3.1. For the aforementioned baselines, we apply their default parameters. For the scoring function, OAMiner and ZKDE use the Gaussian kernel, and we set \(\psi \) to 8 and t to 100 in SINNE. For the search strategy, the epsilon neighborhood range of OAMiner is set to 1, and the search width W and the maximum dimensionality of searched aspects \(d_{max}\) in beam search are set to 120 and 3, respectively.

Experiments were conducted on a PC with four Intel Xeon E5-2698 CPUs, four GeForce RTX 2080 Ti GPUs and 512 GB memory, running Ubuntu 20.04. The algorithms were implemented in Java and compiled with Java 13.

Fig. 3. Visualization results on four real-world datasets.

4.2 Effectiveness

We use four real-world datasets from different domains to demonstrate the effectiveness of OSIER on the OAM problem. Since there is no ground-truth standard for measuring the quality of the found aspects, we visualize the query points in the found aspects, as shown in Fig. 3. For better visualization, we choose query points whose aspects have three dimensions. The red triangles indicate the query points, which exhibit good outlyingness in each obtained subspace. The visualization results show that OSIER can obtain well-behaved aspects.

Table 4. Overall performance comparison with baselines

We also compare our method with the baseline methods on 10, 20, 30, 40, 50, 75, and 100-dimensional synthetic datasets. The original datasets provide 19–136 outliers together with the ground truth of their outlying aspects. For the fairness of the experimental criteria, we augment the number of outliers. For example, suppose there is an outlier \(X_{1}\) and a normal point \(X_{2}\) in a d-dimensional dataset, and the outlying aspect of \(X_{1}\) is \(\{D_{1},D_{2}\}\). By replacing the dimensions that are not in the outlying aspect of \(X_{1}\), we can generate a new outlier \(X_{3}=\{X_{1}.D_{1},X_{1}.D_{2},X_{2}.D_{3},...X_{2}.D_{d}\}\), as sketched below. Since OAMiner is committed to finding aspects ranked first, it may find more than one aspect; we consider the correct aspect found if the result contains the ground truth.
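The augmentation step can be sketched as follows (our reading of the construction; the dimension indices are illustrative):

```python
import numpy as np

def augment_outlier(x_out, x_norm, outlying_dims):
    """Build a new outlier (e.g. X3 in the text): keep x_out's values on its
    ground-truth outlying dimensions and copy x_norm everywhere else."""
    x_new = np.array(x_norm, copy=True)
    cols = list(outlying_dims)
    x_new[cols] = x_out[cols]
    return x_new
```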

The results are shown in Table 4. The experiments show that OSIER achieves state-of-the-art performance on all datasets of different dimensionalities. There is little difference in the effectiveness of the methods in low-dimensional space; as the number of dimensions rises, the accuracy advantage of OSIER grows more pronounced. This trend shows that the search strategy plays a greater role in high-dimensional space. Besides, the heuristic rule pruning strategy (OSIER and ZKDE) can find more representative aspects than bound-based pruning (OAMiner). ZKDE performs better than sGrid and SINNE, which indicates that using only partial data makes the searched aspects unstable.

Figure 4 shows the efficiency test on the synthetic datasets with varying numbers of dimensions d and data sizes n. The basic OAMiner method [5] is chosen as the baseline, which reduces the impact of differing outlyingness functions. We select several query points whose result subspaces have no more than 3 dimensions and report the average running time. The experiments show that our method searches much faster; moreover, its execution time rises more slowly than the baseline's as the dimensionality and size of the dataset grow.

4.3 Parameter Analysis

For ease of understanding, we select two-dimensional slices of the synthetic datasets and visualize the results with contour lines.

Fig. 4. Efficiency test w.r.t. the number of dimensions d and data size n.

The choice of norm comes into play when \(d \ge 2\). The commonly used norms are \(p=1\) (Manhattan distance), \(p=2\) (Euclidean norm), and \(p=\infty \) (maximum norm). As shown in Fig. 5(a), the value of p has only a tiny effect on the density distribution in a dense region, and as the number of data points increases, the choice of p matters even less. We recommend the 2-norm for its stronger symmetry.

Figure 5(b) shows that the bandwidth cannot be fixed; a reasonable bandwidth should depend on the distribution of the data. A small bandwidth yields relatively isolated density estimates (e.g. \(N=100\), \(h=0.1\)), while a large bandwidth yields an over-dispersed density distribution that makes differences hard to see. Silverman's rule of thumb [14] is a variance-based bandwidth selection method; it gives a value of h close to 0.25 when \(N=100\), and \(h=0.25\) is visibly more representative than the other settings in the visualization results. The bandwidth essentially rescales the kernel density estimation function in each dimension to obtain better experimental results.
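For reference, a worked instance of the univariate rule (the sample standard deviation \(\hat{\sigma }\approx 0.6\) is an assumed value chosen to match the reported \(h \approx 0.25\), not a statistic from the paper):

$$\begin{aligned} h = 1.06\,\hat{\sigma }\,N^{-1/5}, \qquad N=100,\ \hat{\sigma }\approx 0.6 \;\Rightarrow \; h \approx 1.06 \times 0.6 \times 100^{-1/5} \approx 0.25 \end{aligned}$$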

Fig. 5. Influence of different norm p and bandwidth h.

5 Discussion and Conclusion

In this paper, we study the outlying aspect mining problem and propose OSIER, which addresses the shortcomings of existing methods effectively and provides more interpretable and credible results. We analyze the application of kernel density estimation methods to outlying aspect mining and design an adaptive scoring function. In addition, we improve the commonly used aspect search strategy: we introduce the idea of a genetic algorithm, obtaining the fitness of individuals during genetic inheritance by analyzing the correlation among dimensions. Also, the mutation operation in the genetic algorithm can handle some extreme cases during the search, thus avoiding getting trapped in a local optimum. Experimental results on several real and synthetic datasets demonstrate the effectiveness of the proposed method for outlying aspect mining.

Our future work will focus on applying OAM to hybrid or time-series data. We plan to design a rational outlying aspect structure for interpretable results.