
1 Introduction

Case-Based Reasoning (CBR) is an important reasoning method in artificial intelligence: solutions to problems solved in the past are stored, together with their results, in a case library, and when a new case arrives, similar cases are retrieved from the case base and their solutions are adapted to the current situation [1, 2]. In a CBR system, the efficiency of case retrieval determines the efficiency of the entire system, and the quality of the results depends on the retrieval as well [3]. As the case library of a running system grows, retrieval efficiency gradually decreases; this phenomenon is known as the swamping problem [4]. In earlier approaches, every retrieval compared the query against every case in the case base, i.e., each query was a full traversal, so once the case base grows to a certain size, efficiency drops. In view of this, Huan-tong [5, 6] applied clustering algorithms to case-base maintenance. Zheng [7] proposed a K-means-based clustering algorithm. Changzheng [8] proposed a feature-weighted C-means clustering algorithm (WF-C-means) and a clustering-based index for case retrieval; because WF-C-means adjusts the weights of all attributes in its dissimilarity definition, and retrieval adopts the same definition, the cases retrieved as similar to a new case are precise and objective. Li [9] proposed an improved k-means case retrieval algorithm to reduce clustering errors caused by noise, evaluating the capability of cases through user feedback and using case capability as the rule for selecting sample cases.

This paper proposes a multi-dimensional case retrieval optimization algorithm, DRR. First, every multi-dimensional case point in the case base is mapped to a two-dimensional space by a dimensionality reduction computation, the two-dimensional points are clustered, and an R-tree index is built over the two-dimensional clusters. A current fault is reduced to its two-dimensional representation in the same way; a lookup in the R-tree index quickly locates the two-dimensional cluster containing the current fault, and the KNN algorithm is then applied to the multi-dimensional cases within that cluster to find the multi-dimensional case closest to the current fault.

2 Multidimensional Case Retrieval Optimization: The DRR Algorithm

2.1 Case Dimension Reduction

In a case-based reasoning system, let the case library CS = (c1, c2, c3,…, cn) be a nonempty finite set composed of n cases, with ci ∈ CS (1 ≤ i ≤ n).

A case in CS is usually represented by a feature vector. Let a case c = (f1, f2,…, fi,…, fm) be a nonempty finite set, where fi (1 ≤ i ≤ m) is a feature item of c; a feature item describes one attribute of the case.

Definition 1

∀a, b ∈ CS, the distance between the two points in the case space is the Euclidean distance Rab:

$$ {\text{R}}_{\text{ab}} = \sqrt {\sum {\left( {a_{i} - b_{i} } \right)^{ 2} } } \left( {1 \le {\text{i}} \le {\text{m}}} \right) $$

Definition 2

∀a, b ∈ CS, the angle between the two case vectors in the case space is θab:

$$ {{\theta}_{\text{ab}}} = {\cos}^{-1} \frac{{\sum {a_{i} * b_{i}} }}{\| a \|*\| b \|}(1 \le {\text{i}} \le {\text{m}})$$

Assume an m-dimensional global reference point O (O1,O2,O3,…,Om) and a global reference vector \( {\vec{\text{N}}} \) (N1,N2,N3,…,Nm). According to Definitions 1 and 2, we calculate the relative distance Rco between the case point c and O in the case space (abbreviated Rc) and the angle θco between the case point c and the reference vector \( {\vec{\text{N}}} \) (abbreviated θc); the m-dimensional case point c can then be expressed as the two-dimensional point C(Rc, θc), as shown in Fig. 88.1.

Fig. 88.1

Case spatial point dimensionality reduction
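The dimensionality reduction of Definitions 1 and 2 can be sketched as follows. This is a minimal illustration: the function name `reduce_case` is ours, and angles are returned in degrees to match the parameter settings of Sect. 3.1.

```python
import math

def reduce_case(c, O, N):
    """Map an m-dimensional case c to its 2-D representation (Rc, theta_c).

    Rc is the Euclidean distance from c to the global reference point O
    (Definition 1); theta_c is the angle between the vector O->c and the
    global reference vector N (Definition 2; with O at the origin, as in
    the experiments, O->c is just c). Angles are in degrees.
    """
    v = [ci - oi for ci, oi in zip(c, O)]
    r_c = math.sqrt(sum(vi * vi for vi in v))
    n_norm = math.sqrt(sum(ni * ni for ni in N))
    if r_c == 0.0 or n_norm == 0.0:
        return 0.0, 0.0  # degenerate case: c coincides with O
    cos_t = sum(vi * ni for vi, ni in zip(v, N)) / (r_c * n_norm)
    theta_c = math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))
    return r_c, theta_c
```

For example, with O = (0,…, 0) and \( {\vec{\text{N}}} \) = (1, 0,…, 0) as in Sect. 3.1, the case (3, 4, 0, 0) maps to Rc = 5 and θc ≈ 53.13°.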

In the m-dimensional case space there will be points with the same relative distance to the global reference point O (O1,O2,O3,…,Om) and the same angle to the global reference vector \( {\vec{\text{N}}} \) (N1,N2,N3,…,Nm); such points map to a single two-dimensional point, which yields a two-dimensional clustering of the cases. We define:

Definition 3

Let D(Rd, θd) be a point in the two-dimensional space S. It represents a set of case points (d1,d2,…,di,…,dn), di ∈ CS, each of which has the two-dimensional representation D(Rd, θd).

Definition 4

Given a target case e (e1,e2,e3,…,em) whose representation in the two-dimensional space is E(Re, θe), and ∀a (a1,a2,…,ai,…,am), a ∈ CS, whose representation in the two-dimensional space is A(Ra, θa): if E(Re, θe) and A(Ra, θa) are closest in the two-dimensional space S, then the case point a is the most similar point to the target case e in the case space CS.

Proof

According to Fig. 88.2, in the m-dimensional case space CS the reference point O (O1,O2,O3,…,Om), the fault case point e (e1,e2,e3,…,em) and a similar case point a (a1,a2,…,ai,…,am) define a plane and form a triangle △OAE. Here Re is the relative distance between e and O (the length of side Oe), Ra is the relative distance between a and O (the length of side Oa), and the angle △θ is obtained by subtracting the angle between e and \( {\vec{\text{N}}} \) from the angle between a and \( {\vec{\text{N}}} \): △θ = |θa − θe|.

Fig. 88.2

The case points in the m-dimensional case base CS

To prove that the case point a is the most similar to the fault case e, we calculate the minimum Euclidean distance Rea between the case points e and a in the space CS (the length of side ea of the triangle).

By the law of cosines, in the m-dimensional case space CS:

$$ \begin{array}{llll} {\text{R}}_{\text{ea}} = \sqrt {\mathop {\text{R}}\nolimits_{e}^{ 2} + \mathop {\text{R}}\nolimits_{a}^{ 2} - 2{\text{R}}_{a} {\text{R}}_{e} \cos \Updelta {{\uptheta}}} \\ \Rightarrow {\text{R}}_{\text{ea}} = \sqrt {\mathop {\text{R}}\nolimits_{e}^{ 2} + \mathop {\text{R}}\nolimits_{a}^{ 2} - 2{\text{R}}_{a} {\text{R}}_{e} + 2{\text{R}}_{a} {\text{R}}_{e} - 2{\text{R}}_{a} {\text{R}}_{e} \cos \Updelta {{\uptheta}}} \\ \Rightarrow {\text{R}}_{\text{ea}} = \sqrt {\left( {{\text{R}}_{a} - {\text{R}}_{e} } \right)^{2} + 2{\text{R}}_{a} {\text{R}}_{e} \left( {1 - \cos \Updelta {{\uptheta}}} \right)} \\ \Rightarrow \lim_{\begin{subarray}{l} {\text{R}}_{a} \to {\text{R}}_{e} \\ \Updelta {{\uptheta}} \to 0 \end{subarray} } \sqrt {\left( {{\text{R}}_{a} - {\text{R}}_{e} } \right)^{2} + 2{\text{R}}_{a} {\text{R}}_{e} \left( {1 - \cos \Updelta {{\uptheta}}} \right)} = 0 \\ \end{array} $$

From this derivation: if Re of the current fault e approaches Ra of the point a in the case library, and the angle between e and a approaches 0 (θe approaches θa), i.e., the two-dimensional representations E(Re, θe) and A(Ra, θa) are infinitely close in S, then the limit of Rea approaches 0; that is, e and a have the highest degree of similarity in the case base CS.
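The law-of-cosines identity used above can be checked numerically. The sketch below uses hypothetical sample points with O at the origin (as in Sect. 3.1) and the exact angle between the two case vectors; in the reduced space, △θ = |θa − θe| is in general only a lower bound on that angle, which helps explain the redundant candidates filtered in Sect. 2.4.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def euclid(a, b):
    """Euclidean distance between two points (Definition 1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical case points, with the reference point O at the origin.
e = [2.0, 1.0, 3.0]
a = [1.0, 2.0, 2.0]

R_e, R_a = norm(e), norm(a)
# Exact angle at O between the vectors e and a (Definition 2).
d_theta = math.acos(sum(x * y for x, y in zip(e, a)) / (R_e * R_a))
# The law of cosines reproduces the direct Euclidean distance R_ea.
R_ea = math.sqrt(R_e ** 2 + R_a ** 2 - 2 * R_a * R_e * math.cos(d_theta))
```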

2.2 Indexing R-Tree

Definition 5

Let D(Rd, θd) be a point in the two-dimensional space S. The minimum bounding rectangle (MBR) of the point D is Md, the rectangular range whose two diagonal corners are (Rd − △r, θd − △θ) and (Rd + △r, θd + △θ).

The point D(Rd, θd) is represented in the R-tree index as a leaf node D′(ID, Md), where ID is the identifier of the case point and Md is the minimum bounding rectangle of the point D; an R-tree index is built over all points in the space.
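Definition 5 and the leaf-node representation can be sketched as below (a minimal sketch; the function names are ours):

```python
def mbr(point, dr, dtheta):
    """Minimum bounding rectangle Md of a 2-D point D(Rd, theta_d), returned
    as its two diagonal corners (Rd - dr, theta_d - dtheta) and
    (Rd + dr, theta_d + dtheta), per Definition 5."""
    r, t = point
    return ((r - dr, t - dtheta), (r + dr, t + dtheta))

def leaf_entry(case_id, point, dr, dtheta):
    """Leaf-node representation D'(ID, Md) of the point in the R-tree."""
    return (case_id, mbr(point, dr, dtheta))
```

With △r = △θ = 1 as in Sect. 3.1, the point (5, 30) gets the MBR with corners (4, 29) and (6, 31).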

2.3 Case Initial Search

Definition 6

For the current fault e (e1,e2,e3,…,em), expressed in the two-dimensional space S as E(Re, θe), the query rectangle Mse of the point E has the two diagonal corners (Re − △sr, θe − △sθ) and (Re + △sr, θe + △sθ), where △sr and △sθ define the query range of the point E; the larger the query rectangle, the higher the accuracy of the query. Let D be a case point set in the two-dimensional space S, represented in the R-tree by Md. If Md and Mse intersect, then the case point set D is a set of points similar to the point E in the two-dimensional space S; querying the intermediate nodes in this way yields the result set R (R1,R2,R3,…,Rn).

Assume a node of the R-tree index is T. For a new lookup of the current fault e (e1,e2,e3,…,em), the algorithm for finding the case with the highest similarity to e is described as follows:

  1. Calculate the relative distance Re between e and the global reference point O (O1,O2,O3,…,Om), and the angle θe between the case e and the global reference vector \( {\vec{\text{N}}} \) (N1,N2,N3,…,Nm); the mapping of e in the two-dimensional space S is then E(Re, θe).

  2. Calculate the query rectangle Mse of E(Re, θe) in the R-tree index.

  3. If T is not a leaf node, check whether Mse intersects Mt; if they intersect, recursively check the sub-nodes of T against E, and if they are disjoint, abandon that branch.

  4. If T is a leaf node, determine for every record in T whether it intersects Mse. In this way all records in the R-tree intersecting E are found by following branches down the tree, without traversing every record in the tree.

  5. Collect all records intersecting Mse into the intermediate result set R (R1,R2,R3,…,Rn).
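The steps above can be sketched as a recursive search over a hand-built two-level tree. This is a minimal illustration (node names, IDs and rectangle values are hypothetical); a real R-tree also needs insertion and node-splitting logic, which is omitted here.

```python
def intersects(m1, m2):
    """True if two rectangles, each given by two diagonal corners, overlap."""
    (r1lo, t1lo), (r1hi, t1hi) = m1
    (r2lo, t2lo), (r2hi, t2hi) = m2
    return r1lo <= r2hi and r2lo <= r1hi and t1lo <= t2hi and t2lo <= t1hi

class Node:
    def __init__(self, mbr, children=None, records=None):
        self.mbr = mbr            # bounding rectangle of everything below
        self.children = children  # inner node: list of child Nodes
        self.records = records    # leaf node: list of (case_id, Md) entries

def search(node, mse, out):
    """Collect the IDs of all leaf records whose MBR intersects the query
    rectangle mse, pruning branches whose MBR is disjoint from mse."""
    if not intersects(node.mbr, mse):
        return                                # step (3): prune this branch
    if node.children is not None:
        for child in node.children:           # step (3): recurse into subtree
            search(child, mse, out)
    else:
        for case_id, md in node.records:      # step (4): check leaf records
            if intersects(md, mse):
                out.append(case_id)           # step (5): intermediate result

# Hand-built example tree over 2-D points (hypothetical values).
leaf1 = Node(((0, 0), (5, 40)),
             records=[("c1", ((1, 10), (3, 12))), ("c2", ((4, 30), (5, 32)))])
leaf2 = Node(((6, 0), (10, 40)), records=[("c3", ((7, 5), (9, 7)))])
root = Node(((0, 0), (10, 40)), children=[leaf1, leaf2])

result = []
search(root, ((3.5, 29), (6, 33)), result)   # query rectangle Mse
# result now holds the IDs whose MBR intersects Mse
```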

2.4 Case Filter

The intermediate result set R is filtered by calculating the similarity between e (e1,e2,e3,…,em) and every case in the intermediate result set, and the case point with the highest similarity is returned.

The intermediate result set R may contain cases such as those in Fig. 88.3:

Fig. 88.3

Intermediate result set of case retrieval

Here the case c satisfies Re − △sr ≤ Rc ≤ Re + △sr and θe − △sθ ≤ θc ≤ θe + △sθ, so c is a case point in the intermediate result set R, yet c is not similar to the fault case e. Dimensionality reduction shrinks the differences between case points in the two-dimensional space, so the intermediate result set will contain redundant case points like c. The intermediate result set therefore needs to be filtered: the similarity between e (e1,e2,e3,…,em) and every case in the intermediate result set is calculated, and the case point with the highest similarity is selected.

In this paper the K-Nearest Neighbor (KNN) algorithm is used to calculate the similarity. It treats the feature vectors of cases as points in a high-dimensional space, searches the problem space for matches to the current fault point, and returns the cases whose similarity exceeds the threshold to the user. The general process is as follows:

Input: the fault case e to be retrieved. Output: the case objects similar to e. The result set R contains k cases, each described by m attributes, i.e., xi = {xi1,xi2,…,xij,…,xim}, i = 1,2,…,k; j = 1,2,…,m. The Euclidean distance d(xi,e) between the case xi and the current fault e is calculated:

$$ {\text{d}}\left( {{\text{x}}_{\text{i}} , {\text{e}}} \right) = \sqrt {\sum {\left( {x_{ij} - e_{j} } \right)^{2} } } (1 \le {\text{j}} \le {\text{m}}) $$

On this basis, the similarity between an existing case xi and the current fault case e is:

$$ {\text{SIM}}\left( {{\text{x}}_{\text{i}} ,{\text{e}}} \right) = 1 - {\text{d}}\left( {{\text{x}}_{\text{i}} ,{\text{e}}} \right) $$
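The filtering step can be sketched as below. This is a minimal sketch: `filter_cases` is our name, and attribute values are assumed normalized so that SIM = 1 − d stays in a meaningful range for the 0.9 threshold of Sect. 3.1.

```python
import math

def similarity(x, e):
    """SIM(x, e) = 1 - d(x, e), with d the Euclidean distance; attribute
    values are assumed normalized so that the similarity is meaningful."""
    d = math.sqrt(sum((xj - ej) ** 2 for xj, ej in zip(x, e)))
    return 1.0 - d

def filter_cases(R, e, threshold=0.9):
    """Score every case in the intermediate result set R against the fault e
    and keep those above the similarity threshold, best first."""
    scored = sorted(((similarity(x, e), x) for x in R),
                    key=lambda p: p[0], reverse=True)
    return [sx for sx in scored if sx[0] >= threshold]
```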

2.5 Case Studies and R-Tree Index Correction

For a new case e (e1,e2,e3,…,em), if there exists a case c (c1,c2,c3,…,cm) in the case base CS with ci = ei (1 ≤ i ≤ m), then the new case already exists in the case library, and no case learning or index correction is required.

Otherwise, the new case must be learned:

  1. Insert the new case e into the case base CS.

  2. Calculate the relative distance Re between the new case e and the global reference point O, and the angle θe between the case e and the global reference vector \( {\vec{\text{N}}} \); the m-dimensional case point e can then be expressed as the two-dimensional point E(Re, θe).

  3. Determine whether the new point E(Re, θe) already exists in the two-dimensional space S. If it does, the cluster already exists and the new case only adds its identifier ID to that cluster; if no such point exists in the space, a new cluster point is created and the R-tree index must be re-adjusted.
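The retention procedure above can be sketched as follows (a minimal sketch with our own names; `reduce_fn` stands for the dimensionality reduction of Sect. 2.1, and the R-tree re-adjustment is reduced to bookkeeping of a cluster map):

```python
def learn_case(case_base, clusters, e, reduce_fn):
    """Sketch of the retention step.

    case_base: list of cases; clusters: dict mapping a 2-D point (R, theta)
    to the indices of the cases it represents; reduce_fn: the dimensionality
    reduction of Sect. 2.1, mapping a case to its 2-D point.
    Returns False when the case already exists, True when it was learned.
    """
    if any(len(c) == len(e) and all(ci == ei for ci, ei in zip(c, e))
           for c in case_base):
        return False                      # case already in CS: nothing to do
    case_base.append(e)                   # step (1): insert into CS
    point = reduce_fn(e)                  # step (2): 2-D representation
    if point in clusters:                 # step (3): cluster already exists,
        clusters[point].append(len(case_base) - 1)  # just register the ID
    else:                                 # new cluster point: the R-tree
        clusters[point] = [len(case_base) - 1]      # index must be re-adjusted
    return True
```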

3 Experiment

We conducted a number of experiments to verify the validity of the DRR algorithm, comparing DRR with the traditional retrieval method and with the rough-set-based K-means algorithm [8] on data from a fault diagnosis system. The test case sets contained 30,000, 50,000, 80,000, 100,000, 200,000, 300,000 and 500,000 cases. The algorithms were run in Eclipse with a maximum JVM heap of 512 MB.

3.1 Parameter Settings

The case space dimension is m = 10, the reference point O (O1,O2,O3,…,Om) is 0 (0,0,0,…,0), the reference vector \( {\vec{\text{N}}} \) is (1,0,0,…,0), the bounding rectangle of a two-dimensional point uses △r = 1 and △θ = 1, the case similarity threshold is 0.9, and the query rectangle of the current fault uses △sr = 60 and △sθ = 30.

3.2 Experimental Results and Analysis

3.2.1 Precision Experiments

A standard result set was given for each test (i.e., the cases in the case base whose similarity to the query case is greater than 0.9); the closer the query results are to the standard result set, the higher the accuracy of the query. The experimental results are shown in Table 88.1. As can be seen from Table 88.1, the K-means algorithm is sensitive to noise in the data: when the query data set reaches 50,000 cases, noise causes clustering deviations that affect the accuracy. Noise has almost no effect on the query results of the DRR algorithm, whose result sets show higher accuracy and stability than those of the K-means algorithm.

Table 88.1 Precision experiments

3.2.2 Query Efficiency Experiments

We compared the query efficiency of the three algorithms on the case data sets; the results are shown in Table 88.2 and Fig. 88.4.

Table 88.2 Comparison table of the efficiency of several algorithms
Fig. 88.4

Histogram of the query results

The traditional retrieval method reads all the data into memory for every query, so as the case data grows, the query time grows with it, and once the data reaches a certain size the method runs out of memory. The K-means and DRR algorithms read the case base data and build indexes, and every query is index-based, so they are comparatively fast. However, K-means indexes the data of the entire case base, so it too runs out of memory once the data reaches a certain size. The DRR algorithm clusters the records of the case base in two dimensions by means of dimensionality reduction and builds the R-tree index over the two-dimensional clustering results, so in memory the R-tree index is much smaller than the index of the K-means algorithm. Even on very large data sets the DRR algorithm can still answer queries: it does not run out of memory, it avoids the degradation of K-means search efficiency caused by too many sample points, and it improves the retrieval efficiency of case libraries with large amounts of data.

4 Conclusion

This paper proposes a multi-dimensional case retrieval optimization algorithm, DRR. Through dimensionality reduction, two-dimensional clustering and two-stage retrieval, the algorithm not only speeds up retrieval, but also makes the two-dimensional clustering independent of the business domain, avoiding the classification errors caused by manual sorting; it further avoids the degradation of search efficiency caused by too many sample points, and improves the retrieval efficiency of case libraries with large amounts of data. As the next step, the algorithm will be further improved by considering the influence of feature-item weights on the clustering of cases, in order to improve the accuracy and efficiency of case retrieval.