1 Introduction

The extraction of frequent itemsets and association rules is a basic technique widely used in data mining [1]. In a relational database, an association rule expresses the fact that “if an instance (tuple) satisfies P, then it also satisfies Q.” It is written “if P then Q” (\(P \rightarrow Q\)), where P and Q are conditions on database attributes (items). When Q is a class attribute, such rules can be used for rule-based classification. For classification with association rules, the rule form is readable and therefore useful for interpreting the reasons behind a classification.

A method has been proposed for directly discovering class association rules that satisfy flexible, user-defined conditions, without constructing frequent itemsets or requiring any manual process, using a network structure and an evolutionary computation method that accumulates results over successive generations [2]. An extension of the method that determines an attribute combination P such that Q covers a small region of a continuous variable X has also been reported [3].

Under the expression “if P then Q,” an interesting rule indicates that the set of instances satisfying P (a small subset of records taken from the whole data) exhibits a statistical characteristic Q relative to the whole data. For example, if Q is a class attribute, the ratio of class membership characterizes the subpopulation satisfying P; if Q concerns a continuous variable X, the values of X show a characteristic distribution in that subpopulation. An association rule can thus be interpreted as extracting the group of records satisfying P as a single cluster, labeled by P, with the basis for the extraction given by Q. In this study, we treat Q in “if P then Q” as an expression drawn from conventional statistical analysis and develop a method to discover that “a statistical expression Q holds in the subpopulation satisfying P.” Specifically, when Q involves two continuous variables X and Y in the database, the method reveals that X and Y are correlated in a small subpopulation defined by the attribute combination P, even though X and Y are not correlated in the entire database. We address correlations between two continuous variables, a setting that also accommodates conditions on the ratio and distribution of X and Y, and position this as a basic method linking large-scale data with conventional statistical analysis. We previously named such an attribute combination P an ItemSB (itemset with a statistically distinctive background) and proposed a method for its discovery [4]. ItemSB has the potential to become a new baseline for data analysis that bridges the gap between conventional frequent-itemset analysis and statistical analysis.
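To illustrate the idea, the following sketch builds a synthetic database (all attribute names and numbers are hypothetical, not from the paper) in which X and Y are uncorrelated in the whole data but strongly correlated in the small subgroup satisfying \(P = (A_1=1) \wedge (A_2=1)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic database: 10,000 records with binary attributes A1, A2
# and continuous variables X, Y. X and Y are independent overall,
# but a correlation is injected into the ~4% subgroup satisfying
# P = (A1 = 1) AND (A2 = 1).
n = 10_000
a1 = (rng.random(n) < 0.2).astype(int)
a2 = (rng.random(n) < 0.2).astype(int)
x = rng.normal(size=n)
y = rng.normal(size=n)

in_p = (a1 == 1) & (a2 == 1)
# Inside the subgroup, make Y track X with small noise.
y[in_p] = x[in_p] + 0.3 * rng.normal(size=in_p.sum())

r_all = np.corrcoef(x, y)[0, 1]   # correlation in the entire database
r_sub = np.corrcoef(x[in_p], y[in_p])[0, 1]  # correlation under P
print(f"whole data: R = {r_all:.2f}, subgroup P: R = {r_sub:.2f}")
```

Here P is an ItemSB: the whole-data correlation is near zero, while the subgroup correlation is strong.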

In this paper, to examine the characteristics of ItemSB, we propose a method to discover ItemSBs by focusing on the differences between two datasets, and we report the results of evaluation experiments on their characteristics. Applying conventional statistical analysis methods to an entire dataset can be difficult in big data analysis. To apply statistical methods to big data effectively, the records should first be stratified into subgroups and the analysis conducted within them. Typically, this subgrouping is defined a priori by the user, but it is desirable to determine it in an exploratory manner; for example, a user may wish to find a small group in which two variables of interest are strongly correlated. Techniques for understanding and applying the characteristics of ItemSB are expected to form the basis of big data mining methods that combine such subgrouping with statistical analysis.

2 Contrast ItemSB

In this study, ItemSB is defined through rules that satisfy conditions predetermined by the user [4]. Specifically, combinations of the attributes \(A_i\) in Table 1 that appear to have statistical characteristics regarding X and Y are determined by expressing rules of the form

$$\begin{aligned} (A_j=1) \wedge \dots \wedge (A_k=1) \rightarrow (m_X, s_X, m_Y, s_Y, R_{X, Y}), \end{aligned}$$
(1)

where \(m_X\) and \(s_X\) are the mean and standard deviation of X, \(m_Y\) and \(s_Y\) are those of Y, and \(R_{X, Y}\) is the correlation coefficient between X and Y, all computed over the instances containing the itemset.
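The consequent of rule (1) can be computed directly from the data. The sketch below (the function name and 0/1 array layout are our own illustrative choices, not from [4]) evaluates the statistics tuple over the instances containing a given itemset:

```python
import numpy as np

def itemset_stats(A, itemset, x, y):
    """Statistics tuple (m_X, s_X, m_Y, s_Y, R_XY) of Eq. (1),
    computed over the instances that contain the itemset,
    i.e. those rows with A_j = ... = A_k = 1.

    A:       2-D 0/1 array, shape (instances, attributes)
    itemset: list of attribute column indices [j, ..., k]
    x, y:    continuous variables aligned with the rows of A
    """
    mask = A[:, itemset].all(axis=1)          # rows containing the itemset
    xs, ys = x[mask], y[mask]
    return (xs.mean(), xs.std(), ys.mean(), ys.std(),
            np.corrcoef(xs, ys)[0, 1])

# Tiny worked example: itemset {A_0, A_1} matches rows 0 and 2.
A = np.array([[1, 1], [1, 0], [1, 1], [0, 1]])
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 6.0, 0.0])
m_x, s_x, m_y, s_y, r = itemset_stats(A, [0, 1], x, y)
```

For the two matching rows, xs = [1, 3] and ys = [2, 6], giving means 2 and 4, standard deviations 1 and 2, and a correlation of 1.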

We proposed GNMiner, a method for discovering ItemSBs using genetic network programming (GNP), which is characterized by a directed-graph network structure and an evolutionary computation scheme [4]. GNMiner employs an evolutionary strategy that accumulates the outcomes obtained in each generation. Unlike general evolutionary computation methods, the solution is not the best individual of the last generation; rather, numerous interesting rule candidates are represented and searched for within the individuals, and whenever an interesting rule is found, it is accumulated in a rule library. Consequently, the method has the following characteristics: the fitness setting used during evolution has little influence on performance; rule discovery can be terminated at any time because problem solving is accumulative; and the flexibility of the network structure of individuals allows a variety of rule expressions. However, it cannot guarantee discovery of all rules that satisfy the set conditions.

Table 1. Example of the database.

In data mining from dense databases, it is interesting to find differences between two datasets collected in different settings (A/B) for the same subject, or between two groups (A/B) within data collected on the same subject. When the same itemset \(I=(A_j=1) \wedge \dots \wedge (A_k=1)\) is obtained from Databases A and B, we define a contrast ItemSB as follows: \(I^A\): \(I \rightarrow (m_X^A, s_X^A, m_Y^A, s_Y^A, R_{X, Y}^A)\) in Database A satisfies the condition given by the user, whereas \(I^B\): \(I \rightarrow (m_X^B, s_X^B, m_Y^B, s_Y^B, R_{X, Y}^B)\) in Database B does not satisfy the same condition. The basic idea of contrast ItemSB discovery was introduced in [4].

In this study, an interesting feature of contrast ItemSB is defined as follows:

$$\begin{aligned} I^A: I\rightarrow (m_X^A, s_X^A, m_Y^A, s_Y^A, R_{X, Y}^A), I^B: I\rightarrow (m_X^B, s_X^B, m_Y^B, s_Y^B, R_{X, Y}^B), \end{aligned}$$
$$\begin{aligned} R_{X, Y}^A(I) \ge R_{min}, R_{X, Y}^B(I) \ge R_{min}, \end{aligned}$$
(2)

where \(R_{min}\) is a constant provided by the user in advance. In addition, conditions on support (frequency of occurrence) are added to the interestingness measure:

$$\begin{aligned} support^A(I) \ge supp_{min}, support^B(I) \ge supp_{min}, \end{aligned}$$
(3)

where \(supp_{min}\) denotes a constant provided in advance by the user. To evaluate the reproducibility and reliability of ItemSB, we used an interestingness setting different from that in [4].
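The interestingness test combining conditions (2) and (3) requires the itemset to be sufficiently frequent and sufficiently correlated in both databases. A minimal sketch (the function name is illustrative; the default thresholds \(R_{min}=0.5\) and \(supp_{min}=0.0004\) are the values used in the experiments of Sect. 3):

```python
def satisfies_interest(r_a, r_b, supp_a, supp_b,
                       r_min=0.5, supp_min=0.0004):
    """Test of Eqs. (2) and (3): itemset I is interesting only if its
    correlation coefficient and support meet the user-given thresholds
    in BOTH Database A and Database B."""
    return (r_a >= r_min and r_b >= r_min
            and supp_a >= supp_min and supp_b >= supp_min)

# An itemset correlated in A but not in B fails the test.
ok = satisfies_interest(0.62, 0.55, 0.001, 0.002)      # True
rejected = satisfies_interest(0.62, 0.40, 0.001, 0.002)  # False
```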

3 Experimental Results

The YearPredictionMSD (YPMSD) dataset [5] from the UCI Machine Learning Repository [6] was used to evaluate the proposed method.

The characteristics of the YPMSD dataset are summarized as follows:

  • predicting the release year of a song based on audio features;

  • 90 attributes T(j) (\(j=0, \dots , 89\)) without missing values; the target attribute value (Y) is the release year, ranging from 1922 to 2011;

  • 463,715 instances for training and 51,630 instances for testing.

We set \(X=T(0)\) and discretized the continuous attribute values T(j) \((j=1, \dots , 89)\) into sets of binary attributes \(A_{3j-2}\), \(A_{3j-1}\), and \(A_{3j}\) with values of 1 or 0. Discretization was performed in the same manner as in [4]. The 267 transformed attributes (\(A_1, \dots , A_{267}\)) and the continuous attributes X and Y were used. The first 100,000 of the 463,715 training instances were used as Database A, and the 51,630 testing instances were used as Database B. The detailed experimental setup, including the GNP settings, was the same as in [4]. The final solution was the set of ItemSBs, with duplicates removed, obtained after 100 independent rounds of trials.
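Because the exact discretization scheme of [4] is not restated here, the following sketch assumes one plausible choice, equal-frequency tertiles (low / middle / high), to show how one continuous attribute T(j) yields the three binary attributes \(A_{3j-2}\), \(A_{3j-1}\), and \(A_{3j}\):

```python
import numpy as np

def discretize_to_three(t):
    """Map one continuous attribute T(j) to three 0/1 attributes.
    ASSUMPTION: equal-frequency tertile binning; the paper reuses the
    discretization of [4], whose exact scheme may differ."""
    lo, hi = np.quantile(t, [1 / 3, 2 / 3])
    low = (t < lo).astype(int)               # A_{3j-2}
    mid = ((t >= lo) & (t < hi)).astype(int)  # A_{3j-1}
    high = (t >= hi).astype(int)              # A_{3j}
    return np.column_stack([low, mid, high])

# Nine evenly spaced values split into three bins of three each,
# and every record falls in exactly one bin.
cols = discretize_to_three(np.arange(9.0))
```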

Fig. 1. Results of contrast ItemSB discovery.

Fig. 2. Results of contrast ItemSB discovery (\(supp_{min}=0.015\)).

The following conditions were used for contrast ItemSB discovery:

$$\begin{aligned} R_{X, Y}^A(I) \ge 0.5, R_{X, Y}^B(I) \ge 0.5, \end{aligned}$$
(4)
$$\begin{aligned} support^A(I) \ge 0.0004, support^B(I) \ge 0.0004. \end{aligned}$$
(5)

The purpose of this experiment was to gain insight into the reproducibility of ItemSBs found between the two databases such that reliable ItemSB discovery could be performed. The fitness of a GNP individual was defined as

$$\begin{aligned} \begin{aligned} F= \sum _{I \in I_c} \{ 1000&\times (support^A(I)+support^B(I)) \\&+ 10 \times (R_{X, Y}^A(I)+R_{X, Y}^B(I)) + n_P(I) + c_{new}(I) \}, \end{aligned} \end{aligned}$$

where \(I_c\) is the set of indices of the extracted contrast ItemSBs satisfying (4) and (5) in the GNP individual, \(n_P(I)\) is the number of attributes in I (we set \(n_P(I) \le 6\)), and \(c_{new}(I)\) is a constant rewarding the novelty of I.
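The fitness F above can be sketched as follows (the dictionary layout and field names are our own illustrative choices, not the implementation of [4]):

```python
def fitness(extracted):
    """Fitness F of a GNP individual: a sum over the extracted contrast
    ItemSBs I that satisfy conditions (4) and (5). Each entry of
    `extracted` carries I's supports and correlation coefficients in
    Databases A and B, its attribute count n_P(I) and novelty c_new(I)."""
    total = 0.0
    for i in extracted:
        if not (i["r_a"] >= 0.5 and i["r_b"] >= 0.5
                and i["supp_a"] >= 0.0004 and i["supp_b"] >= 0.0004):
            continue  # I is not in I_c, so it contributes nothing
        total += (1000 * (i["supp_a"] + i["supp_b"])
                  + 10 * (i["r_a"] + i["r_b"])
                  + i["n_p"] + i["c_new"])
    return total

# One qualifying ItemSB: 1000*0.02 + 10*1.2 + 3 + 1 = 36.
ok = {"supp_a": 0.01, "supp_b": 0.01, "r_a": 0.6, "r_b": 0.6,
      "n_p": 3, "c_new": 1}
bad = dict(ok, r_b=0.4)  # fails condition (4) in Database B
```

The large weight on support relative to the correlation term biases the search toward frequent itemsets, while \(c_{new}(I)\) encourages accumulating rules not yet in the library.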

After 100 rounds of discovery, 133,475 ItemSBs were obtained. Figure 1 shows scatter plots of the support values, correlation coefficients, mean values of X and Y, and standard deviations of X and Y for these 133,475 ItemSBs. Figure 2 shows the same scatter plots for the 9,365 ItemSBs satisfying the additional condition \(supp_{min}=0.015\). As shown in Figs. 1 and 2, contrast ItemSBs tended to be absent among ItemSBs with large support values. In addition, the correlation coefficients tended to differ between Databases A and B, whereas the mean values of X and Y were reproducible. Figure 2 focuses on ItemSBs with relatively large support values; compared with Fig. 1, the range of the mean values of X and Y is narrower, resulting in a skewed distribution. This indicates that when using ItemSBs for classification or stratification by subgroup, ItemSBs with a relatively high frequency of occurrence should be preferred; however, such ItemSBs may not suffice to cover all cases. Even when ItemSBs with large correlation coefficients are found, their frequency of occurrence is low, and the correlation coefficient values may not be reproducible when the ItemSBs are examined across two datasets collected under the same conditions.

These results suggest that the frequency of occurrence (support), on which itemsets are conventionally selected, is reproducible and reliable; the same appears to hold for the mean values. However, when a more complex quantity such as the correlation coefficient is used as a statistical feature of ItemSB, evaluating its reliability during the discovery process remains an open issue.

4 Conclusions

As an extension of the concept of frequent itemsets and a new interpretation of the association-rule expression, we defined ItemSB as an itemset with statistical properties as its background and discussed a discovery method applying an evolutionary computation method, GNMiner. In this study, we examined the statistical properties of ItemSB as a basis for big data analysis and for its application to forecasting and classification, focusing on two continuous variables as the statistical background of ItemSB. The experimental results on publicly available data show that when the correlation coefficient is used as the statistical background, the reproducibility of ItemSB characteristics must be considered, even for small groups sharing the same combination of items. In the future, we aim to establish desirable ItemSB discovery conditions in collaboration with researchers who have expertise in the data to which ItemSB is applied, and to improve the reproducibility and reliability of the ItemSB discovery method.