
1 Introduction

Outlying aspect mining aims to discover an aspect (i.e., a set of features or attributes) in which a query point has the most significant outlyingness. In many real-life application scenarios, it can provide interpretable information and decision support for downstream tasks [13]. For example, a recruitment team may be highly interested in identifying the most outstanding merits or shortcomings of a particular candidate compared to others.

It is worth noting that the distribution of query points in different spaces varies significantly. As Fig. 1(a) shows, all the data samples are scattered in a 3-dimensional space. The outlyingness of the query point (red triangle) is not significant in the full space. After projecting the data into various 2-dimensional subspaces, the red triangle is more distinguishable from the other points (blue dots) in Fig. 1(b) and Fig. 1(c). The space that can exhibit significant differences between the query point and others is called the outlying aspect.

Technically, it is not feasible to enumerate all aspects because the number of subspaces grows exponentially with the data dimensionality. To find the optimal outlying aspect efficiently, we consider the following two challenges:

  • (C1) How to search aspects efficiently and comprehensively? To avoid the enormous computational cost of enumerating all aspects, the search process should select representative aspects. However, the distributions of different subspaces are irregular, which makes it challenging to find all representative aspects.

  • (C2) How to measure the outlyingness of different aspects impartially? A scoring function needs to be designed to quantify the outlyingness of a query point in different aspects. However, comparing outlyingness across aspects of different dimensionality may lead to biased results.

Fig. 1. An example of data distribution in different spaces. (a) Data distribution in the full space; (b)(c)(d) data distributions in three projection subspaces.

Current approaches have limitations in addressing the above challenges. Concerning the search strategy (C1), the most advanced and general approach is the beam algorithm [10], which relies on heuristic rules. This search strategy assumes that if a point scores high in an l-dimensional aspect, that aspect can generally be generated by adding one dimension to a well-behaved (\(l-1\))-dimensional aspect. However, Fig. 1(b) and Fig. 1(c) show extreme cases that disprove this assumption: higher-dimensional aspects with such extreme distributions are unreachable by the search. As for the outlyingness scoring function (C2), Zhang et al. [17] used the distance from surrounding points to the query point as a metric, which biases results towards higher-dimensional aspects. Subsequent methods chose density as the evaluation criterion, but they still have shortcomings: density ranking [5] loses the absolute degree of deviation, Z-score normalization [10] favors aspects with high variance, and simulating densities with hyperspheres generated by random sampling [12] is highly unstable.

In this paper, we propose a score-and-search method OSIER (short for outlying aspect mining based on a genetic algorithm), which can effectively identify candidate outlying aspects. OSIER overcomes the shortcomings of existing methods from two perspectives. For C1, it retains the strategy of generating high-dimensional aspects from well-behaved lower-dimensional ones, while increasing the diversity of searched aspects by replacing dimensions through a mutation operation. This search method therefore has a chance of escaping local optima. For C2, it uses a staged comparison strategy: the absolute density of the query point is compared across aspects of the same dimensionality, and the optimal aspects of each dimensionality are then compared with a normalized score. In this way, it retains the absolute density information of all data, avoids biased results, and improves computational efficiency. The main contributions of this paper are summarized as follows:

  • We propose a novel genetic algorithm-based method, named OSIER, to increase the diversity of searched aspects and generate more representative aspects efficiently.

  • We use a simplified density estimation function in a multi-dimensional space and analyze an improved outlyingness measure guided by prior knowledge.

  • We calculate the impact of each dimension on the outlying aspect by generating a maximum interval hyperplane, which can be used to guide the direction of evolution in genetic algorithms. Besides, we use a new comparison strategy preserving the absolute density of all data to avoid biased results.

  • We demonstrate OSIER on multiple real-world and synthetic datasets. The experimental results show that OSIER has higher search efficiency and stability, which can be applied to some extreme cases.

The rest of the paper is organized as follows. The related work is reviewed in Sect. 2. Section 3 presents the details of OSIER. The experiments and results are provided in Sect. 4, and Sect. 5 shows the conclusion and future work.

2 Related Work

2.1 Outlying Aspect Mining

Assume that there exists a d-dimensional space \(D=\{D_{1},D_{2},...,D_{d}\}\) and a set of data points \(X \in \mathbb {R}^d\). For a point \(X_{i} \in X\), its feature under the full space is represented as \( \{X_{i}.D_{1},X_{i}.D_{2},...,X_{i}.D_{d}\} \).

We call a combination of multiple dimensions an aspect (a single dimension can also be considered an aspect), and the space composed of these dimensions is a subspace of the full space D. For an aspect \(\mathcal {S}=\{D_{i_{1}},D_{i_{2}},...,D_{i_{|S|}}\}\), we can define an outlyingness scoring function \(\rho _{\mathcal {S}}(q)\), which measures the outlyingness of query point q in the aspect \(\mathcal {S}\). Based on the above definitions, we can formalize the Outlying Aspect Mining problem as follows:

Definition 1 (Outlying Aspect Mining)

Given a set X of n instances in a d-dimensional space D and a query point \(q \in X\), outlying aspect mining identifies the non-empty aspect \(\mathcal {S} \subseteq D\) in which the query point q's outlying degree \(\rho _{\mathcal {S}}(q)\) is larger than in any other aspect.
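To make this formalization concrete, the following minimal sketch enumerates every non-empty aspect and returns the one maximizing \(\rho _{\mathcal {S}}(q)\). The score callback and the interfaces are hypothetical stand-ins, not any published implementation; the point is the \(O(2^d)\) cost that motivates smarter search strategies.

```python
from itertools import combinations

import numpy as np

def brute_force_oam(X, q, score):
    """Exhaustive outlying aspect mining over all non-empty subspaces.

    X: (n, d) data matrix; q: (d,) query point;
    score(X_sub, q_sub): outlyingness of q in the projected subspace.
    Enumerates O(2^d) aspects -- feasible only for very small d.
    """
    d = X.shape[1]
    best_aspect, best_score = None, -np.inf
    for k in range(1, d + 1):
        for aspect in combinations(range(d), k):
            cols = list(aspect)
            s = score(X[:, cols], q[cols])
            if s > best_score:
                best_aspect, best_score = aspect, s
    return best_aspect, best_score
```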

Existing OAM methods fall mainly into two categories: feature selection methods [4, 9] and score-and-search methods [5, 10, 17]. Feature selection methods transform the OAM problem into a classical feature selection problem for classification. More specifically, the two classes are defined as the query point (positive class) and the rest of the data (negative class). Consequently, there is a data imbalance problem when the model is trained. Besides, the interpretability of these methods is poor. Score-and-search methods have therefore been more widely and deeply studied than feature selection methods.

The framework of score-and-search methods is to search each candidate aspect, calculate the outlyingness of the query point in it, and select the aspect with the highest outlyingness as the optimal result. Zhang et al. [17] proposed a metric based on the idea of kNN, called the outlying degree. Duan et al. [5] applied kernel density estimation to multidimensional space by using a product of univariate Gaussian kernels. Meanwhile, they used a boundary-pruning-based search strategy.

Vinh et al. [10] considered that the use of density ranking would lose important information about the degree of absolute deviation. Thus they designed a standard scoring function and proposed the concept of dimensionality unbiasedness for outlying aspect mining measures.

Definition 2 (Dimensionality Unbiasedness)

If a density scoring function \(\rho _{\mathcal {S}}(\cdot )\) satisfies the formula:

$$\begin{aligned} \frac{1}{n}\sum _{X_{i} \in X}\rho _{\mathcal {S}}(X_{i})=\mathrm {const.}\quad \text {w.r.t.}\ |\mathcal {S}| \end{aligned}$$
(1)

the function can be used to directly compare the outlyingness of query points across spaces of different dimensionality.

Dimensionality unbiasedness provides a desirable property for designing density scoring functions, avoiding bias due to different dimensionalities and making OAM more interpretable. Meanwhile, to prevent the combinatorial explosion in the number of high-dimensional aspects, a beam search method was proposed that keeps the number of searched subspaces within a specific range via heuristic pruning. Wells et al. [15] analyzed the shortcomings of kernel density search and proposed sGrid density estimation instead, considerably speeding up the computation. Samariya et al. [12] generate hyperspheres by random sampling to evaluate the outlyingness of query points, sidestepping the expensive density computation in high-dimensional space.
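Returning to Eq. 1, a quick illustration of why Z-score normalization is dimensionality-unbiased (a sketch, not code from any of the cited papers): normalizing the densities of all points in an aspect forces their mean to zero for every aspect, whatever \(|\mathcal {S}|\) is.

```python
import numpy as np

def z_normalize(densities):
    """Z-score normalize per-point densities within one aspect.
    The returned scores always have mean 0, so
    (1/n) * sum_i Z(rho_S(X_i)) is constant w.r.t. |S| (Eq. 1)."""
    return (densities - densities.mean()) / densities.std()

rho = np.random.rand(1000)                       # stand-in densities for one aspect
print(np.isclose(z_normalize(rho).mean(), 0.0))  # True, regardless of |S|
```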

It is worth noting that outlier detection and outlying aspect mining are different. Outlier detection aims to detect anomalous data that are exceptional with respect to the majority of objects in a database. It can be applied in various fields, such as disease detection [6], social media monitoring [19] and network intrusion supervision [1]. However, it is difficult for outlier detection to provide a reasonable and intuitive explanation for the identified objects. The OAM task was proposed for discovering aspects in which the query instance exhibits the most outlying characteristics.

2.2 Genetic Algorithm

In genetic algorithms, each individual consists of a gene string that represents a feasible solution to the problem. Fitness is the metric used to evaluate an individual, and the fitness function is usually derived from the objective function. The selection operation selects parents based on fitness and passes their genes to the next generation of individuals. Selected parents undergo a crossover operation with a certain probability to produce the next individuals. After many generations, the genetic algorithm terminates once a defined threshold or number of iterations is reached and returns a high-quality solution. Genetic algorithms have been used to solve a large variety of problems efficiently, including classification [3], credit risk assessment [8] and time-series analysis [11].
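The loop just described can be sketched as follows. This is a generic skeleton under assumed interfaces: fitness, crossover, and mutate are placeholders, and the population handling and rates are illustrative defaults, not values used by OSIER.

```python
import random

def genetic_search(init_pop, fitness, crossover, mutate,
                   generations=100, mutation_rate=0.05):
    """Generic GA loop: rank by fitness, keep the top half as parents,
    recombine pairs, and mutate children with a small probability."""
    population = list(init_pop)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]   # selection
        children = []
        while len(children) < len(population):
            p1, p2 = random.sample(parents, 2)
            child = crossover(p1, p2)                  # recombination
            if random.random() < mutation_rate:
                child = mutate(child)                  # mutation
            children.append(child)
        population = children
    return max(population, key=fitness)
```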

Zhu et al. [18] used genetic algorithms for the outlier detection problem. They used cell-based segmentation techniques, which incur a high outlyingness computation cost in high-dimensional space. Zhang et al. [16] devised a method that does not depend on the upper and lower bound closure properties. Similar to [17], they chose distance as the outlyingness measure, which is used to guide the evolution.

The genetic algorithm is an efficient, simple, and robust optimization technique for intelligent global search. We exploit these characteristics to discover outlying aspects efficiently.

3 Design of OSIER

In this section, we discuss the details of OSIER. It takes a query point q together with a dataset X of n points \(\{X_{1},...,X_{n}\}\) as input, \(X_{i} \in \mathbb {R}^d \), and outputs an aspect in which the given point has the highest outlyingness.

3.1 Outlying Scoring Function

In the choice of scoring function, we use a simplified version of the multidimensional density estimation function:

$$\begin{aligned} \rho _{\mathcal {S}}(q)=\frac{1}{nh^{|\mathcal {S}|}}\sum _{i=1}^{n}K(\frac{\Vert q-X_{i} \Vert _p}{h}) \end{aligned}$$
(2)

where p denotes the order of the norm. In the absence of any prior knowledge, the most general choice is the Euclidean norm (\(p=2\)), the Gaussian kernel as the kernel function \(K(\cdot )\), and a bandwidth h computed by Silverman's rule of thumb [14].
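A direct sketch of Eq. 2 under these defaults (Gaussian kernel, \(p=2\), Silverman bandwidth). This is our reading of the formula, not the authors' released code; in particular, averaging the per-dimension standard deviations into a single bandwidth is a simplifying assumption.

```python
import numpy as np

def silverman_bandwidth(X_sub):
    """Multivariate Silverman rule: h = (4/(d+2))^(1/(d+4)) * n^(-1/(d+4)) * sigma,
    with sigma taken here as the mean per-dimension standard deviation."""
    n, d = X_sub.shape
    sigma = X_sub.std(axis=0).mean()
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4)) * sigma

def kde_score(X_sub, q_sub, p=2):
    """Eq. 2: rho_S(q) = 1/(n h^|S|) * sum_i K(||q - X_i||_p / h),
    with a radial Gaussian kernel K."""
    n, d = X_sub.shape
    h = silverman_bandwidth(X_sub)
    dist = np.linalg.norm(X_sub - q_sub, ord=p, axis=1)
    kernel = np.exp(-0.5 * (dist / h) ** 2) / (2.0 * np.pi) ** (d / 2.0)
    return kernel.sum() / (n * h ** d)
```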

This default density estimation parameterization can be improved with prior knowledge. One type of prior knowledge derived from the data description is the bound of each dimension. Suppose the dataset is restricted in a dimension to a range of values; it is not reasonable to assign density to points outside that range. For example, age is a non-negative number, so it is unreasonable for the density estimation function to place probability mass in the region where age is negative. We use a reflection strategy to solve this problem. If the data has a minimum value boundary b in dimension \(D_{i}\), for query point \(q=\{q.D_{1},...,q.D_{i},...,q.D_{|\mathcal {S}|}\}\), we set the symmetry point \(q_{sym}=\{q.D_{1},...,2b-q.D_{i},...,q.D_{|\mathcal {S}|}\}\). The optimized scoring function is:

$$\begin{aligned} \rho ^{\prime }_{\mathcal {S}}(q)=\rho _{\mathcal {S}}(q)+\rho _{\mathcal {S}}(q_{sym}) \end{aligned}$$
(3)

For the case where there are boundaries on both sides, we only consider the first reflection point, because an appropriate bandwidth ensures that the density contribution after multiple reflections is zero or negligibly close to zero.
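A sketch of the reflection correction in Eq. 3, reusing the hypothetical kde_score above (bound_dim and b denote the bounded dimension's index and its lower bound):

```python
def reflected_score(X_sub, q_sub, bound_dim, b):
    """Eq. 3: rho'_S(q) = rho_S(q) + rho_S(q_sym), where q_sym mirrors q
    across the lower bound b in dimension bound_dim, so that probability
    mass that would leak past the bound is folded back."""
    q_sym = q_sub.copy()
    q_sym[bound_dim] = 2 * b - q_sym[bound_dim]
    return kde_score(X_sub, q_sub) + kde_score(X_sub, q_sym)
```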

In addition, if a dataset is composed of multiple datasets, resulting in data far from normally distributed, Silverman's rule of thumb [14] will yield a poor density estimate. In that case, we prefer an improved Sheather algorithm [2], which achieves better results when the data are distributed across multiple dense regions.

3.2 Dimensions Correlation Analysis

Before formally calculating the query point's outlyingness, each dimension of the dataset needs to be analyzed. Different dimensions contribute differently to the generation of outlying aspects, so analyzing single dimensions helps guide the subsequent search process.

To make the outlying aspect results more credible, we perform a deeper analysis of the density estimation method. For a query point, if its projection on a dimension lies in a remote region with minimal probability density, that dimension can have a large impact on aspect generation. Such a dimension is called a trivial outlying dimension, defined as follows:

Definition 3 (Trivial Outlying Dimension)

Given a query point q, an outlyingness scoring function \(\rho (\cdot )\) and a threshold \(\epsilon \), a dimension \(D_{i}\) is called trivial outlying dimension if \(\rho _{D_{i}}(q) \le \epsilon \).

An intuitive fact is that when a trivial outlying dimension is coupled with another dimension, the generated aspect may still have a good outlyingness score for the query point. Therefore, trivial outlying dimensions should be pre-processed to reduce their impact on the search results; otherwise, aspects that could exhibit superior outlyingness are crowded out by the many combinations of a trivial outlying dimension with other dimensions.

Table 1. Shooting statistics of six active players of the Los Angeles Lakers

Example 1

Table 1 shows the shooting statistics of the Los Angeles Lakers. We employ OAMiner [5] to find the outlying aspects of LeBron James. The result shows that the top-5 outlying aspects are \(\{3PM\}\), \(\{3PM,3PA\}\), \(\{3PM,FGM\}\), \(\{3PM,FTM\}\) and \(\{3PM,3PA,FGM\}\). When we ignore the effect of feature 3PM, the outlying aspect \(\{FG\%,FT\%,2P\%\}\) is revealed, which offers more hidden information.

Afterwards, we need to evaluate the correlation between dimensions and analyze the contribution of each dimension to the query point's outlyingness, represented as a set of weights \(\mathbf{{w}}=\{w_{1},w_{2},...,w_{d}\}\). To calculate w, we construct a hyperplane in the full space that separates the query point from the other points as far as possible. Since the query point and the rest of the sample points form two extremely imbalanced classes, we choose the one-class SVM method and set the query point as the origin. Solving for this maximum geometric margin hyperplane is essentially a convex optimization problem:

$$\begin{aligned} \begin{aligned} \min&\frac{1}{2}{||\textbf{w}||}^{2}-\tau +\frac{1}{\upsilon n}\sum \limits _{i=1}^{n} \xi _{i} \\ s.t.&\ {\mathbf{{w}}}^{T}X_{i} \ge \tau - \xi _{i},\quad \xi _{i} \ge 0 \end{aligned} \end{aligned}$$
(4)

where \(X_{i}\) denotes spatial vectors of the i-th point, \(\tau \) denotes the hyperplane bias, \(\xi _{i}\) denotes the relaxation variable of the i-th point and \(\upsilon \) denotes a trade-off parameter. We solve the quadratic programming problem by Lagrange Multiplier Method:

$$\begin{aligned} \mathcal {L}(\mathbf{{w}},\tau ,\xi ,\alpha ,\beta )= \frac{1}{2}{||\mathbf{{w}}||}^{2}-\tau +\frac{1}{\upsilon n}\sum \limits _{i=1}^n \xi _{i}-\sum \limits _{i=1}^{n}\alpha _{i}(\mathbf{{w}}^{T}X_{i}-\tau +\xi _{i})-\sum \limits _{i=1}^{n}\beta _{i}\xi _{i} \end{aligned}$$
(5)

To obtain a specific form for solving the dual problem, let the partial derivatives of \(\mathcal {L}(\mathbf{{w}},\tau , \xi ,\alpha ,\beta )\) with respect to \(\mathbf{{w}}\), \(\tau \), and \(\xi \) equal zero. We can obtain the following three conditions:

$$\begin{aligned} \mathbf{{w}}-\sum \limits _{i=1}^n \alpha _{i}X_{i}=0 \end{aligned}$$
(6)
$$\begin{aligned} \sum \limits _{i=1}^n \alpha _{i}-1=0 \end{aligned}$$
(7)
$$\begin{aligned} 0 \le \alpha _{i} \le \frac{1}{\upsilon n} \end{aligned}$$
(8)

Equation 6 can be solved for \(\textbf{w}\), where \(X=\{X_{1},X_{2},...,X_{n}\}\) is known, and the optimal solution for \(\alpha \) can be obtained by substituting Eqs. (6)–(8) into Eq. (5):

$$\begin{aligned} \begin{aligned} \min&\ \frac{1}{2}\sum \limits _{i=1}^{n}\sum \limits _{j=1}^{n}\alpha _{i}{X_{i}}^{T}X_{j}\alpha _{j} \\ s.t.&\ \sum \limits _{i=1}^n \alpha _{i}=1,\quad 0 \le \alpha _{i} \le \frac{1}{\upsilon n} \end{aligned} \end{aligned}$$
(9)
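In practice, the weight vector \(\textbf{w}\) can be recovered with an off-the-shelf linear one-class SVM rather than a hand-rolled QP solver. A sketch with scikit-learn follows; this is our substitution, not the authors' stated implementation. The parameter nu plays the role of \(\upsilon \), and centering the data on q is our reading of "set the query point as the origin".

```python
import numpy as np
from sklearn.svm import OneClassSVM

def dimension_weights(X, q, nu=0.1):
    """Solve the one-class SVM of Eq. 4 with a linear kernel and return
    per-dimension weights |w_i| normalized to sum to 1."""
    Xc = X - q                                   # query point becomes the origin
    svm = OneClassSVM(kernel="linear", nu=nu).fit(Xc)
    w = np.abs(svm.coef_.ravel())                # coef_ exists for linear kernels
    return w / w.sum()
```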
Fig. 2. Significance of the weights generated by the hyperplane.

Intuitively, the more perpendicular this hyperplane is to a dimension, the greater that dimension's contribution to the separation. As Fig. 2(a) shows, the red triangle indicates the query point, and the red dashed line indicates the generated hyperplane. The hyperplane corresponds to a weight vector \(\mathbf{{w}}=\left[ 3,1 \right] \), indicating that the dimension corresponding to the x-axis has a greater influence on the query point becoming an outlier. Figure 2(b) and Fig. 2(c) show the density distributions of the data projected on the x-axis and y-axis, respectively, which also justify this analysis.

Algorithm 1. The search strategy of OSIER.

3.3 Outlying Aspect Generation

The strategy for searching candidate aspects is the core problem of computing outlying aspects in a high-dimensional dataset. When the dimensionality is large enough, traversing every aspect brings an unbearable computational cost, which grows exponentially with the number of dimensions. Taking OAMiner [5] as an example, it takes over 24 h on a dataset with 30 dimensions and 10,000 points, which is impracticable for many real-world high-dimensional datasets. OSIER uses a genetic algorithm, with recombination, mutation, and selection operations, to search for representative aspects efficiently. The search strategy is given in Algorithm 1.

Procedure Recombination\((\cdot )\) generates new individuals in Step 4 by recombining parts of the structure of multiple parent individuals. If an aspect performs well in the parental generation, the dimensions that make up this aspect are more likely to participate in the generation of new individuals. Recombination can discover most of the high-scoring aspects through this heuristic search strategy.

Procedure Mutation\((\cdot )\) in Step 5 can handle extreme cases (e.g. Fig. 1(b) and Fig. 1(c)). OSIER uses sampling to select mutation dimensions, where the probability of each dimension participating in the mutation is calculated as:

$$\begin{aligned} pro(D_{i})=\frac{W_{D_{i}}}{\sum \limits _{D_{j} \in H} W_{D_{j}}} \end{aligned}$$
(10)

where H is the set of dimensions not involved in the recombination, and \(W_{D_{i}}\) indicates the weight of dimension \(D_{i}\) on the outlyingness in the full space, as described above. A dimension with a larger weight has a larger chance of being selected for the next generation to reproduce with modification. OSIER uses bit-wise mutation, which randomly replaces one of the dimensions that make up an individual; a sketch of this step is given below.
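The sketch follows Eq. 10, reusing the weight vector from the hypothetical dimension_weights above. Two simplifications are ours: H is taken to be the dimensions outside the current aspect, and the replaced position is picked uniformly.

```python
import numpy as np

def mutate_aspect(aspect, weights, d, rng=None):
    """Bit-wise mutation: replace one random dimension of the aspect with a
    dimension drawn from H (here, dimensions outside the aspect) with
    probability proportional to its full-space weight (Eq. 10)."""
    rng = rng or np.random.default_rng()
    H = [i for i in range(d) if i not in aspect]
    p = weights[H] / weights[H].sum()       # pro(D_i) from Eq. 10
    new_dim = int(rng.choice(H, p=p))
    out = list(aspect)
    out[rng.integers(len(out))] = new_dim   # uniform pick of the replaced slot
    return tuple(sorted(out))
```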

Moreover, to ensure a fair comparison between aspects of different dimensionality, the outlyingness score needs to be normalized after calculation. A well-known normalization method in the OAM task is the Z-score [10]:

$$\begin{aligned} Z(\rho _{\mathcal {S}}(q)) \triangleq \frac{\rho _{\mathcal {S}}(q)-\mu _{\rho _{\mathcal {S}}}}{\sigma _{\rho _{\mathcal {S}}}} \end{aligned}$$
(11)

where \(\mu _{\rho _{\mathcal {S}}}\) is the mean of the density of all points in the aspect \(\mathcal {S}\), and \(\sigma _{\rho _{\mathcal {S}}}\) is the standard deviation. The score obtained by this transformation satisfies the dimensionality unbiasedness requirement of Eq. 1. However, the computational cost of normalizing every searched aspect is still large, so we adopt a staged comparison: when generating the aspects of each dimensionality, the optimal aspect is obtained by comparing the original density scores (Steps 6–18); after obtaining the optimal aspect of each dimensionality, the outlyingness scores of these aspects are normalized (Step 19) and compared (Step 20), as sketched below. Compared to OAMiner [5] and Density Z-score [10], the overall search complexity is reduced from \(O(n^{2}d \cdot Wd)\) to \(O(nd \cdot Wd+n^{2}d \cdot d)\), where n and d are the size and dimensionality of the dataset, and W is the average search width per dimensionality.
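A sketch of the staged comparison, reusing the hypothetical kde_score above. It reflects our reading of Steps 6–20: lower density means more outlying, so stage 1 keeps the minimum-density aspect per dimensionality and stage 2 Z-score normalizes only those winners.

```python
import numpy as np

def staged_compare(candidates_by_dim, X, q):
    """Stage 1: per dimensionality k, keep the aspect where q's raw density
    is lowest. Stage 2: Z-score normalize only those winners (Eq. 11) and
    return the aspect with the lowest normalized score."""
    winners = {}
    for k, aspects in candidates_by_dim.items():          # stage 1: raw densities
        winners[k] = min(
            aspects, key=lambda a: kde_score(X[:, list(a)], q[list(a)]))
    best, best_z = None, np.inf
    for a in winners.values():                            # stage 2: Z-scores
        cols = list(a)
        rho = np.array([kde_score(X[:, cols], x[cols]) for x in X])
        z = (kde_score(X[:, cols], q[cols]) - rho.mean()) / rho.std()
        if z < best_z:
            best, best_z = a, z
    return best
```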

Table 2. Characteristics of the datasets

4 Experiments and Result Discussion

4.1 Experimental Setting

Datasets. We use four real-world datasets from the UCI machine learning repository. We also use the synthetic datasets provided by Keller et al. [7]. Table 2 shows the characteristics of these datasets. All attribute values are real numbers. For ease of subsequent description, we index the attributes of each dataset with incremental subscripts.

Baselines. Four OAM methods are selected as baselines to demonstrate the efficiency and stability of OSIER: kernel density rank (OAMiner) [5], Z-score normalized kernel density (ZKDE) [10], sGrid [15] and SINNE [12]. Table 3 briefly summarizes the outlyingness calculation used by each method, where \(\psi \) denotes the size of the random subsample, t denotes the number of ensemble models, and Individual Complexity denotes the complexity of computing the outlyingness of a query point in one aspect. Three additional notes: (1) Although OAMiner introduces a bound-based pruning method, the size of its search space is uncertain and usually much larger than that of the other methods; the exact number depends on the true distribution of the data. (2) SINNE is a sampling-based method: given a query point and an aspect, its results vary between runs. Its heuristic search approach can likewise make the searched aspects volatile. (3) The stability of OSIER depends on the mutation rate, which is usually small. By accepting, with some probability, an aspect worse than the current one, OSIER can escape local optima. In the absence of extreme cases, the searched aspects are stable for the same dataset.

Table 3. Summary of the characteristics of the baselines

Experimental Setup. We use the parameters mentioned in Sect. 3.1. For the aforementioned baselines, we apply their default parameters. For the scoring function, OAMiner and ZKDE use the Gaussian kernel, and we set \(\psi \) to 8 and t to 100 in SINNE. For the search strategy, the epsilon neighborhood range of OAMiner is set to 1, and the search width W and the maximum dimensionality of searched aspects \(d_{max}\) in beam search are set to 120 and 3, respectively.

Experiments were conducted on a PC with four Intel Xeon E5-2698 CPUs, four GeForce RTX 2080 Ti GPUs and 512 GB memory, running Ubuntu 20.04. The algorithms were implemented in Java and compiled with Java 13.

Fig. 3. Visualization results on four real-world datasets.

4.2 Effectiveness

We use four real-world datasets from different domains to demonstrate the effectiveness of OSIER on the OAM problem. Since there is no ground-truth standard for measuring the quality of the found aspects, we visualize the query points in the found aspects, as shown in Fig. 3. For better visualization, we choose query points whose aspects have three dimensions. The red triangles indicate the query points, which exhibit good outlyingness in each obtained subspace. The visualization results show that OSIER can obtain well-behaved aspects.

Table 4. Overall performance comparison with baselines

We also compare our method with the baseline methods on 10, 20, 30, 40, 50, 75, and 100-dimensional synthetic datasets. The original datasets provide 19–136 outliers together with the ground truth of their outlying aspects. For the fairness of the experimental criteria, we augment the number of outliers. For example, suppose there is an outlier \(X_{1}\) and a normal point \(X_{2}\) in a d-dimensional dataset, and the outlying aspect of \(X_{1}\) is \(\{D_{1},D_{2}\}\). By replacing the dimensions that are not in the outlying aspect of \(X_{1}\), we can generate a new outlier \(X_{3}=\{X_{1}.D_{1},X_{1}.D_{2},X_{2}.D_{3},...X_{2}.D_{d}\}\), as sketched below. Since OAMiner is committed to finding aspects ranked first, it may find more than one aspect; we consider the correct aspect found if the result contains the ground truth.
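The augmentation step can be sketched as follows (our reading of the construction; the dimension indices are illustrative):

```python
import numpy as np

def augment_outlier(x_out, x_norm, outlying_dims):
    """Build a new outlier (e.g. X3 in the text): keep x_out's values on its
    ground-truth outlying dimensions and copy x_norm everywhere else."""
    x_new = np.array(x_norm, copy=True)
    cols = list(outlying_dims)
    x_new[cols] = x_out[cols]
    return x_new
```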

The results are shown in Table 4. The experiments show that OSIER achieves state-of-the-art performance on all datasets of different dimensionalities. There is little difference in the effectiveness of the methods in low-dimensional space; as the number of dimensions rises, the accuracy advantage of OSIER grows more pronounced. This trend shows that the search strategy plays a greater role in high-dimensional space. Besides, the heuristic rule pruning strategy (OSIER and ZKDE) can find more representative aspects than bound-based pruning (OAMiner). ZKDE performs better than sGrid and SINNE, which indicates that using only partial data makes the searched aspects unstable.

Figure 4 shows the efficiency test on the synthetic datasets with varying numbers of dimensions d and data sizes n. The basic OAMiner method [5] is chosen as the baseline, which reduces the impact of differing outlyingness functions. We select several query points whose result subspaces have no more than 3 dimensions and report the average running time. The experiments show that our method searches much faster; moreover, its execution time rises more slowly than the baseline's as the dimensionality and size of the dataset grow.

4.3 Parameter Analysis

For ease of understanding, we select two-dimensional slices of the synthetic datasets and visualize the results with contour lines.

Fig. 4. Efficiency test w.r.t. the number of dimensions d and data size n.

The choice of norm comes into play when \(d \ge 2\). The commonly used norms are \(p=1\) (Manhattan distance), \(p=2\) (Euclidean norm), and \(p=\infty \) (maximum norm). As shown in Fig. 5(a), the value of p has only a tiny effect on the density distribution in a dense region, and as the number of data points increases, the choice of p matters even less. We recommend the 2-norm for its stronger symmetry.

Figure 5(b) shows that the bandwidth cannot be fixed; a reasonable bandwidth should depend on the distribution of the data. A small bandwidth yields relatively isolated density estimates (e.g. \(N=100\), \(h=0.1\)), while a large bandwidth yields an over-dispersed density distribution that makes differences hard to see. Silverman's rule of thumb [14] is a variance-based bandwidth selection method; it gives a value of h close to 0.25 when \(N=100\), and \(h=0.25\) is visibly more representative than the other settings in the visualization results. The bandwidth essentially rescales the kernel density estimation function in each dimension to obtain better experimental results.
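For reference, a worked instance of the univariate rule (the sample standard deviation \(\hat{\sigma }\approx 0.6\) is an assumed value chosen to match the reported \(h \approx 0.25\), not a statistic from the paper):

$$\begin{aligned} h = 1.06\,\hat{\sigma }\,N^{-1/5}, \qquad N=100,\ \hat{\sigma }\approx 0.6 \;\Rightarrow \; h \approx 1.06 \times 0.6 \times 100^{-1/5} \approx 0.25 \end{aligned}$$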

Fig. 5. Influence of different norm p and bandwidth h.

5 Discussion and Conclusion

In this paper, we study the outlying aspect mining problem and propose OSIER, which addresses the shortcomings of existing methods effectively and provides more interpretable and credible results. We analyze the application of kernel density estimation methods to outlying aspect mining and design an adaptive scoring function. In addition, we improve the commonly used aspect search strategy: we introduce the idea of a genetic algorithm, obtaining the fitness of individuals during genetic inheritance by analyzing the correlation among dimensions. Also, the mutation operation in the genetic algorithm can handle some extreme cases during the search, thus avoiding getting trapped in a local optimum. Experimental results on several real and synthetic datasets demonstrate the effectiveness of the proposed method for outlying aspect mining.

Our future work will focus on applying OAM to hybrid or time-series data. We plan to design a rational outlying aspect structure for interpretable results.