
1 Introduction

Case-Based Reasoning (CBR) is an important reasoning method in artificial intelligence: solutions to problems solved in the past are stored, together with their results, in a case library, and when a new case arrives, similar cases are retrieved from the case base and their solutions are adapted to the current situation [1, 2]. In a CBR system, the efficiency of case retrieval determines the efficiency of the entire system, and the quality of the results depends on the retrieval as well [3]. As the case library of a running system grows, retrieval efficiency gradually decreases; this phenomenon is known as the swamping problem [4]. In earlier approaches, every retrieval compared the query against every case in the case base, i.e., each query was a full traversal, so once the case base grows to a certain size, efficiency drops. In view of this, Huan-tong [5, 6] applied clustering algorithms to case-base maintenance. Zheng [7] proposed a K-means-based clustering algorithm. Changzheng [8] proposed a feature-weighted C-means clustering algorithm (WF-C-means) and a clustering-based index for case retrieval; because WF-C-means adjusts the weights of all attributes in its dissimilarity definition, and retrieval adopts the same definition, the cases retrieved as similar to a new case are precise and objective. Li [9] proposed an improved k-means case retrieval algorithm to reduce clustering errors caused by noise, evaluating the capability of cases through user feedback and using case capability as the rule for selecting sample cases.

This paper proposes a multi-dimensional case retrieval optimization algorithm, DRR. First, every multi-dimensional case point in the case base is mapped to a two-dimensional space by a dimensionality reduction computation, the two-dimensional points are clustered, and an R-tree index is built over the two-dimensional clusters. A current fault is reduced to its two-dimensional representation in the same way; a lookup in the R-tree index quickly locates the two-dimensional cluster containing the current fault, and the KNN algorithm is then applied to the multi-dimensional cases within that cluster to find the multi-dimensional case closest to the current fault.

2 Multidimensional Case Retrieval Optimization: The DRR Algorithm

2.1 Case Dimension Reduction

In a case-based reasoning system, let the case library CS = (c1, c2, c3,…, cn) be a nonempty finite set composed of n cases, with ci ∈ CS (1 ≤ i ≤ n).

A case in CS is usually represented by a feature vector. Let a case c = (f1, f2,…, fi,…, fm) be a nonempty finite set, where fi (1 ≤ i ≤ m) is a feature item of c; a feature item describes one attribute of the case.

Definition 1

∀a, b ∈ CS, the distance between the two points in the case space is the Euclidean distance Rab:

$$ {\text{R}}_{\text{ab}} = \sqrt {\sum {\left( {a_{i} - b_{i} } \right)^{ 2} } } \left( {1 \le {\text{i}} \le {\text{m}}} \right) $$

Definition 2

∀a, b ∈ CS, the angle between the two case vectors in the case space is θab:

$$ {{\theta}_{\text{ab}}} = {\cos}^{-1} \frac{{\sum {a_{i} * b_{i}} }}{\| a \|*\| b \|}(1 \le {\text{i}} \le {\text{m}})$$

Assume an m-dimensional global reference point O (O1,O2,O3,…,Om) and a global reference vector \( {\vec{\text{N}}} \) (N1,N2,N3,…,Nm). According to Definitions 1 and 2, we calculate the relative distance Rco between the case point c and O in the case space (abbreviated Rc) and the angle θco between the case point c and the reference vector \( {\vec{\text{N}}} \) (abbreviated θc); the m-dimensional case point c can then be expressed as the two-dimensional point C(Rc, θc), as shown in Fig. 88.1.

Fig. 88.1

Case spatial point dimensionality reduction
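The dimensionality reduction of Definitions 1 and 2 can be sketched as follows. This is a minimal illustration: the function name `reduce_case` is ours, and angles are returned in degrees to match the parameter settings of Sect. 3.1.

```python
import math

def reduce_case(c, O, N):
    """Map an m-dimensional case c to its 2-D representation (Rc, theta_c).

    Rc is the Euclidean distance from c to the global reference point O
    (Definition 1); theta_c is the angle between the vector O->c and the
    global reference vector N (Definition 2; with O at the origin, as in
    the experiments, O->c is just c). Angles are in degrees.
    """
    v = [ci - oi for ci, oi in zip(c, O)]
    r_c = math.sqrt(sum(vi * vi for vi in v))
    n_norm = math.sqrt(sum(ni * ni for ni in N))
    if r_c == 0.0 or n_norm == 0.0:
        return 0.0, 0.0  # degenerate case: c coincides with O
    cos_t = sum(vi * ni for vi, ni in zip(v, N)) / (r_c * n_norm)
    theta_c = math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))
    return r_c, theta_c
```

For example, with O = (0,…, 0) and \( {\vec{\text{N}}} \) = (1, 0,…, 0) as in Sect. 3.1, the case (3, 4, 0, 0) maps to Rc = 5 and θc ≈ 53.13°.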

In the m-dimensional case space there will be points with the same relative distance to the global reference point O (O1,O2,O3,…,Om) and the same angle to the global reference vector \( {\vec{\text{N}}} \) (N1,N2,N3,…,Nm); such points map to a single two-dimensional point, which yields a two-dimensional clustering of the cases. We define:

Definition 3

Let D(Rd, θd) be a point in the two-dimensional space S. It represents a set of case points (d1,d2,…,di,…,dn), di ∈ CS, each of which has the two-dimensional representation D(Rd, θd).

Definition 4

Given a target case e (e1,e2,e3,…,em) whose representation in the two-dimensional space is E(Re, θe), and ∀a (a1,a2,…,ai,…,am), a ∈ CS, whose representation in the two-dimensional space is A(Ra, θa): if E(Re, θe) and A(Ra, θa) are closest in the two-dimensional space S, then the case point a is the most similar point to the target case e in the case space CS.

Proof

According to Fig. 88.2, in the m-dimensional case space CS the reference point O (O1,O2,O3,…,Om), the fault case point e (e1,e2,e3,…,em) and a similar case point a (a1,a2,…,ai,…,am) define a plane and form a triangle △OAE. Here Re is the relative distance between e and O (the length of side Oe), Ra is the relative distance between a and O (the length of side Oa), and the angle △θ is obtained by subtracting the angle between e and \( {\vec{\text{N}}} \) from the angle between a and \( {\vec{\text{N}}} \): △θ = |θa − θe|.

Fig. 88.2

The case points in the m-dimensional case base CS

To prove that the case point a is the most similar to the fault case e, we calculate the minimum Euclidean distance Rea between the case points e and a in the space CS (the length of side ea of the triangle).

By the law of cosines, in the m-dimensional case space CS:

$$ \begin{array}{llll} {\text{R}}_{\text{ea}} = \sqrt {\mathop {\text{R}}\nolimits_{e}^{ 2} + \mathop {\text{R}}\nolimits_{a}^{ 2} - 2{\text{R}}_{a} {\text{R}}_{e} \cos \Updelta {{\uptheta}}} \\ \Rightarrow {\text{R}}_{\text{ea}} = \sqrt {\mathop {\text{R}}\nolimits_{e}^{ 2} + \mathop {\text{R}}\nolimits_{a}^{ 2} - 2{\text{R}}_{a} {\text{R}}_{e} + 2{\text{R}}_{a} {\text{R}}_{e} - 2{\text{R}}_{a} {\text{R}}_{e} \cos \Updelta {{\uptheta}}} \\ \Rightarrow {\text{R}}_{\text{ea}} = \sqrt {\left( {{\text{R}}_{a} - {\text{R}}_{e} } \right)^{2} + 2{\text{R}}_{a} {\text{R}}_{e} \left( {1 - \cos \Updelta {{\uptheta}}} \right)} \\ \Rightarrow \lim_{\begin{subarray}{l} {\text{R}}_{a} \to {\text{R}}_{e} \\ \Updelta {{\uptheta}} \to 0 \end{subarray} } \sqrt {\left( {{\text{R}}_{a} - {\text{R}}_{e} } \right)^{2} + 2{\text{R}}_{a} {\text{R}}_{e} \left( {1 - \cos \Updelta {{\uptheta}}} \right)} = 0 \\ \end{array} $$

From this derivation: if Re of the current fault e approaches Ra of the point a in the case library, and the angle between e and a approaches 0 (θe approaches θa), i.e., the two-dimensional representations E(Re, θe) and A(Ra, θa) are infinitely close in S, then the limit of Rea approaches 0; that is, e and a have the highest degree of similarity in the case base CS.
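The law-of-cosines identity used above can be checked numerically. The sketch below uses hypothetical sample points with O at the origin (as in Sect. 3.1) and the exact angle between the two case vectors; in the reduced space, △θ = |θa − θe| is in general only a lower bound on that angle, which helps explain the redundant candidates filtered in Sect. 2.4.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def euclid(a, b):
    """Euclidean distance between two points (Definition 1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical case points, with the reference point O at the origin.
e = [2.0, 1.0, 3.0]
a = [1.0, 2.0, 2.0]

R_e, R_a = norm(e), norm(a)
# Exact angle at O between the vectors e and a (Definition 2).
d_theta = math.acos(sum(x * y for x, y in zip(e, a)) / (R_e * R_a))
# The law of cosines reproduces the direct Euclidean distance R_ea.
R_ea = math.sqrt(R_e ** 2 + R_a ** 2 - 2 * R_a * R_e * math.cos(d_theta))
```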

2.2 Indexing R-Tree

Definition 5

Let D(Rd, θd) be a point in the two-dimensional space S. The minimum bounding rectangle (MBR) of the point D is Md, the rectangular range whose two diagonal corners are (Rd − △r, θd − △θ) and (Rd + △r, θd + △θ).

The point D(Rd, θd) is represented in the R-tree index as a leaf node D′(ID, Md), where ID is the identifier of the case point and Md is the minimum bounding rectangle of the point D; an R-tree index is built over all points in the space.
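Definition 5 and the leaf-node representation can be sketched as below (a minimal sketch; the function names are ours):

```python
def mbr(point, dr, dtheta):
    """Minimum bounding rectangle Md of a 2-D point D(Rd, theta_d), returned
    as its two diagonal corners (Rd - dr, theta_d - dtheta) and
    (Rd + dr, theta_d + dtheta), per Definition 5."""
    r, t = point
    return ((r - dr, t - dtheta), (r + dr, t + dtheta))

def leaf_entry(case_id, point, dr, dtheta):
    """Leaf-node representation D'(ID, Md) of the point in the R-tree."""
    return (case_id, mbr(point, dr, dtheta))
```

With △r = △θ = 1 as in Sect. 3.1, the point (5, 30) gets the MBR with corners (4, 29) and (6, 31).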

2.3 Case Initial Search

Definition 6

For the current fault e (e1,e2,e3,…,em), expressed in the two-dimensional space S as E(Re, θe), the query rectangle Mse of the point E has the two diagonal corners (Re − △sr, θe − △sθ) and (Re + △sr, θe + △sθ), where △sr and △sθ define the query range of the point E; the larger the query rectangle, the higher the accuracy of the query. Let D be a case point set in the two-dimensional space S, represented in the R-tree by Md. If Md and Mse intersect, then the case point set D is a set of points similar to the point E in the two-dimensional space S; querying the intermediate nodes in this way yields the result set R (R1,R2,R3,…,Rn).

Assume a node of the R-tree index is T. For a new lookup of the current fault e (e1,e2,e3,…,em), the algorithm for finding the case with the highest similarity to e is described as follows:

  1. Calculate the relative distance Re between e and the global reference point O (O1,O2,O3,…,Om), and the angle θe between the case e and the global reference vector \( {\vec{\text{N}}} \) (N1,N2,N3,…,Nm); the mapping of e in the two-dimensional space S is then E(Re, θe).

  2. Calculate the query rectangle Mse of E(Re, θe) in the R-tree index.

  3. If T is not a leaf node, check whether Mse intersects Mt; if they intersect, recursively check the sub-nodes of T against E, and if they are disjoint, abandon that branch.

  4. If T is a leaf node, determine for every record in T whether it intersects Mse. In this way all records in the R-tree intersecting E are found by following branches down the tree, without traversing every record in the tree.

  5. Collect all records intersecting Mse into the intermediate result set R (R1,R2,R3,…,Rn).
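The steps above can be sketched as a recursive search over a hand-built two-level tree. This is a minimal illustration (node names, IDs and rectangle values are hypothetical); a real R-tree also needs insertion and node-splitting logic, which is omitted here.

```python
def intersects(m1, m2):
    """True if two rectangles, each given by two diagonal corners, overlap."""
    (r1lo, t1lo), (r1hi, t1hi) = m1
    (r2lo, t2lo), (r2hi, t2hi) = m2
    return r1lo <= r2hi and r2lo <= r1hi and t1lo <= t2hi and t2lo <= t1hi

class Node:
    def __init__(self, mbr, children=None, records=None):
        self.mbr = mbr            # bounding rectangle of everything below
        self.children = children  # inner node: list of child Nodes
        self.records = records    # leaf node: list of (case_id, Md) entries

def search(node, mse, out):
    """Collect the IDs of all leaf records whose MBR intersects the query
    rectangle mse, pruning branches whose MBR is disjoint from mse."""
    if not intersects(node.mbr, mse):
        return                                # step (3): prune this branch
    if node.children is not None:
        for child in node.children:           # step (3): recurse into subtree
            search(child, mse, out)
    else:
        for case_id, md in node.records:      # step (4): check leaf records
            if intersects(md, mse):
                out.append(case_id)           # step (5): intermediate result

# Hand-built example tree over 2-D points (hypothetical values).
leaf1 = Node(((0, 0), (5, 40)),
             records=[("c1", ((1, 10), (3, 12))), ("c2", ((4, 30), (5, 32)))])
leaf2 = Node(((6, 0), (10, 40)), records=[("c3", ((7, 5), (9, 7)))])
root = Node(((0, 0), (10, 40)), children=[leaf1, leaf2])

result = []
search(root, ((3.5, 29), (6, 33)), result)   # query rectangle Mse
# result now holds the IDs whose MBR intersects Mse
```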

2.4 Case Filter

The intermediate result set R is filtered by calculating the similarity between e (e1,e2,e3,…,em) and every case in the intermediate result set, and the case point with the highest similarity is returned.

The intermediate result set R may contain cases such as those in Fig. 88.3:

Fig. 88.3

Intermediate result set of case retrieval

Here the case c satisfies Re − △sr ≤ Rc ≤ Re + △sr and θe − △sθ ≤ θc ≤ θe + △sθ, so c is a case point in the intermediate result set R, yet c is not similar to the fault case e. Dimensionality reduction shrinks the differences between case points in the two-dimensional space, so the intermediate result set will contain redundant case points like c. The intermediate result set therefore needs to be filtered: the similarity between e (e1,e2,e3,…,em) and every case in the intermediate result set is calculated, and the case point with the highest similarity is selected.

In this paper the K-Nearest Neighbor (KNN) algorithm is used to calculate the similarity. It treats the feature vectors of cases as points in a high-dimensional space, searches the problem space for matches to the current fault point, and returns the cases whose similarity exceeds the threshold to the user. The general process is as follows:

Input: the fault case e to be retrieved. Output: the case objects similar to e. The result set R contains k cases, each described by m attributes, i.e., xi = {xi1,xi2,…,xij,…,xim}, i = 1,2,…,k; j = 1,2,…,m. The Euclidean distance d(xi,e) between the case xi and the current fault e is calculated:

$$ {\text{d}}\left( {{\text{x}}_{\text{i}} , {\text{e}}} \right) = \sqrt {\sum {\left( {x_{ij} - e_{j} } \right)^{2} } } (1 \le {\text{j}} \le {\text{m}}) $$

On this basis, the similarity between an existing case xi and the current fault case e is:

$$ {\text{SIM}}\left( {{\text{x}}_{\text{i}} ,{\text{e}}} \right) = 1 - {\text{d}}\left( {{\text{x}}_{\text{i}} ,{\text{e}}} \right) $$
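The filtering step can be sketched as below. This is a minimal sketch: `filter_cases` is our name, and attribute values are assumed normalized so that SIM = 1 − d stays in a meaningful range for the 0.9 threshold of Sect. 3.1.

```python
import math

def similarity(x, e):
    """SIM(x, e) = 1 - d(x, e), with d the Euclidean distance; attribute
    values are assumed normalized so that the similarity is meaningful."""
    d = math.sqrt(sum((xj - ej) ** 2 for xj, ej in zip(x, e)))
    return 1.0 - d

def filter_cases(R, e, threshold=0.9):
    """Score every case in the intermediate result set R against the fault e
    and keep those above the similarity threshold, best first."""
    scored = sorted(((similarity(x, e), x) for x in R),
                    key=lambda p: p[0], reverse=True)
    return [sx for sx in scored if sx[0] >= threshold]
```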

2.5 Case Studies and R-Tree Index Correction

For a new case e (e1,e2,e3,…,em), if there exists a case c (c1,c2,c3,…,cm) in the case base CS with ci = ei (1 ≤ i ≤ m), then the new case already exists in the case library, and no case learning or index correction is required.

Otherwise, the new case must be learned:

  1. Insert the new case e into the case base CS.

  2. Calculate the relative distance Re between the new case e and the global reference point O, and the angle θe between the case e and the global reference vector \( {\vec{\text{N}}} \); the m-dimensional case point e can then be expressed as the two-dimensional point E(Re, θe).

  3. Determine whether the new point E(Re, θe) already exists in the two-dimensional space S. If it does, the cluster already exists and the new case only adds its identifier ID to that cluster; if no such point exists in the space, a new cluster point is created and the R-tree index must be re-adjusted.
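The retention procedure above can be sketched as follows (a minimal sketch with our own names; `reduce_fn` stands for the dimensionality reduction of Sect. 2.1, and the R-tree re-adjustment is reduced to bookkeeping of a cluster map):

```python
def learn_case(case_base, clusters, e, reduce_fn):
    """Sketch of the retention step.

    case_base: list of cases; clusters: dict mapping a 2-D point (R, theta)
    to the indices of the cases it represents; reduce_fn: the dimensionality
    reduction of Sect. 2.1, mapping a case to its 2-D point.
    Returns False when the case already exists, True when it was learned.
    """
    if any(len(c) == len(e) and all(ci == ei for ci, ei in zip(c, e))
           for c in case_base):
        return False                      # case already in CS: nothing to do
    case_base.append(e)                   # step (1): insert into CS
    point = reduce_fn(e)                  # step (2): 2-D representation
    if point in clusters:                 # step (3): cluster already exists,
        clusters[point].append(len(case_base) - 1)  # just register the ID
    else:                                 # new cluster point: the R-tree
        clusters[point] = [len(case_base) - 1]      # index must be re-adjusted
    return True
```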

3 Experiment

We conducted a number of experiments to verify the validity of the DRR algorithm, comparing DRR with the traditional retrieval method and with the rough-set-based K-means algorithm [8] on data from a fault diagnosis system. The test case sets contained 30,000, 50,000, 80,000, 100,000, 200,000, 300,000 and 500,000 cases. The algorithms were run in Eclipse with a maximum JVM heap of 512 MB.

3.1 Parameter Settings

The case space dimension is m = 10, the reference point O (O1,O2,O3,…,Om) is 0 (0,0,0,…,0), the reference vector \( {\vec{\text{N}}} \) is (1,0,0,…,0), the bounding rectangle of a two-dimensional point uses △r = 1 and △θ = 1, the case similarity threshold is 0.9, and the query rectangle of the current fault uses △sr = 60 and △sθ = 30.

3.2 Experimental Results and Analysis

3.2.1 Precision Experiments

A standard result set was given for each test (i.e., the cases in the case base whose similarity to the query case is greater than 0.9); the closer the query results are to the standard result set, the higher the accuracy of the query. The experimental results are shown in Table 88.1. As can be seen from Table 88.1, the K-means algorithm is sensitive to noise in the data: when the query data set reaches 50,000 cases, noise causes clustering deviations that affect the accuracy. Noise has almost no effect on the query results of the DRR algorithm, whose result sets show higher accuracy and stability than those of the K-means algorithm.

Table 88.1 Precision experiments

3.2.2 Query Efficiency Experiments

We compared the query efficiency of the three algorithms on the case data sets; the results are shown in Table 88.2 and Fig. 88.4.

Table 88.2 Comparison table of the efficiency of several algorithms
Fig. 88.4

Histogram of the query results

The traditional retrieval method reads all the data into memory for every query, so as the case data grows, the query time grows with it, and once the data reaches a certain size the method runs out of memory. The K-means and DRR algorithms read the case base data and build indexes, and every query is index-based, so they are comparatively fast. However, K-means indexes the data of the entire case base, so it too runs out of memory once the data reaches a certain size. The DRR algorithm clusters the records of the case base in two dimensions by means of dimensionality reduction and builds the R-tree index over the two-dimensional clustering results, so in memory the R-tree index is much smaller than the index of the K-means algorithm. Even on very large data sets the DRR algorithm can still answer queries: it does not run out of memory, it avoids the degradation of K-means search efficiency caused by too many sample points, and it improves the retrieval efficiency of case libraries with large amounts of data.

4 Conclusion

This paper proposes a multi-dimensional case retrieval optimization algorithm, DRR. Through dimensionality reduction, two-dimensional clustering and two-stage retrieval, the algorithm not only speeds up retrieval, but also makes the two-dimensional clustering independent of the business domain, avoiding the classification errors caused by manual sorting; it further avoids the degradation of search efficiency caused by too many sample points, and improves the retrieval efficiency of case libraries with large amounts of data. As the next step, the algorithm will be further improved by considering the influence of feature-item weights on the clustering of cases, in order to improve the accuracy and efficiency of case retrieval.