1 Introduction

Data discretization is a data preprocessing method in data mining and knowledge discovery that transforms quantitative data into qualitative data by partitioning continuous domains [35]. For data mining and machine learning, discretizing continuous attributes can effectively reduce the granularity of an information system, improving the performance and learning accuracy of data mining and machine learning algorithms and enhancing their classification, clustering, and noise-resistance abilities. In addition, many machine learning and data mining algorithms can only handle discrete attributes, for example, C4.5/C5.0 decision trees [26], association rules [32, 33], Naive Bayes [34], and rough sets [31]. In essence, data discretization is a data reduction mechanism: continuous data is grouped into discrete intervals while the correspondence between each discrete value and its interval is preserved. Therefore, data discretization can effectively hide defects in the original data and has attracted widespread attention [11].

Actual datasets often contain a large number of attributes, which can form concept hierarchies with a clear partial order structure. Dividing the attribute values based on the related concepts in the concept hierarchy yields attribute values with multi-scale characteristics and different granularity representations of the attribute value set. Since every data subset in a given scale representation of a dataset is divided according to the attribute value set of a concept, each subset has a specific and clear data meaning. Traditional algorithms such as CAIM [18], CACC [27], and MDLP [10] consider only the midpoints of adjacent attribute values as the candidate cut point set, so the resulting data division is insufficient. We introduce multi-scale theory into the discretization process to divide the attribute values reasonably and obtain a set of candidate cut points. The candidate set is sorted, and information entropy is then applied recursively, always selecting the cut point with the smallest entropy; the MDLPC criterion decides when to refrain from further binary partitioning of a given interval. By combining multi-scale theory in this way, both the performance of the discretization algorithm and the classification accuracy of the resulting classifiers are significantly improved.

1.1 Motivations

  • A dataset usually involves concepts of differing scope and granularity, and its multi-scale characteristics can reflect the nature of the dataset from multiple perspectives and hierarchies.

    Multi-scale analysis reveals the natural scale of a research object. When studying data of a certain category, the data usually corresponds to an attribute set that forms a concept hierarchy with a clear partial order structure. Dividing the data according to this concept hierarchy yields a dataset with multi-scale characteristics, which helps decision makers make decisions from different perspectives; scale conversion can further reduce the complexity of handling problems. Recently, multi-scale theory has been applied to general datasets: hierarchical theory, concept hierarchies, and inclusion theory are used as the basis for scale division to study distribution patterns at different scale hierarchies and to find meaningful facts, such as multi-scale association rules [21] and multi-scale clustering [12].

  • Data discretization is an important data preprocessing technique. However, most traditional discretization approaches struggle to balance running time against the classification accuracy of the resulting classifiers.

    Many data mining and machine learning algorithms can only handle discrete data, whereas the original user data is often continuous. Discretizing such continuous data is therefore necessary to facilitate further processing by these algorithms. Moreover, the data becomes easier to understand and more compact, which makes data analysis faster and more accurate. Most discretization algorithms find it difficult to balance running time and classification accuracy when applied to classification algorithms, and some are only applicable to specific datasets. Therefore, it is necessary to develop an efficient and general data discretization method.

  • By incorporating multi-scale theory, a more reasonable candidate cut point set can be obtained through a principled partition of the data scale.

    The exploration of things, phenomena, or processes varies with the choice of scale; as a result, the inner nature of an object may be reflected comprehensively, partially, or even incorrectly. Datasets also tend to exhibit this multi-scale nature. If we follow the essential characteristics of a research object and divide the corresponding data reasonably according to its different scale characteristics, we can obtain more valuable information. We therefore introduce multi-scale theory and give a specific multi-scale partition strategy to divide the data and compute candidate cut points at different granularities, which greatly reduces the size of the candidate set and the computational overhead. In addition, classification results are derived from a large number of known condition attributes and decision attributes, so the larger the amount of data, the higher the prediction accuracy. However, most discretization methods consider only the attribute values that have already been observed, which limits the candidate cut points to a finite set of known values. Multi-scale partition yields cut points at different hierarchies; using these points as test data makes the actual classification more reasonable.

1.2 Contributions

Compared with a large number of existing discretization methods, the main contributions of MSE are summarized as follows:

  • The domain of each attribute is hierarchically divided by introducing multi-scale theory, and a set of candidate cut points with different granularities is obtained.

  • Information entropy is applied to the obtained candidate cut point set; the cut point with the minimum entropy is recursively selected and judged by the MDLPC criterion to generate the final discrete intervals.

  • A data discretization algorithm based on multi-scale and information entropy, called MSE, is proposed.

  • We conduct extensive experiments showing that MSE improves both the execution efficiency of discretization and the classification accuracy of the resulting classifiers.

1.3 Organization

The rest of this paper is organized as follows. Section 2 reviews previous work related to this study. Section 3 describes basic concepts pertaining to data discretization and multi-scale theory. A data discretization algorithm based on multi-scale theory and information entropy (i.e., MSE) is then presented in Section 4. Sections 5 and 6 detail the experimental settings and comparison results, respectively. We conclude and outline future research directions in Section 7.

2 Related work

In the field of data mining and machine learning, discretization of continuous attributes can not only effectively reduce time and space overhead but also enhance the learning accuracy and noise resistance of algorithms. The discretization algorithms that have received the most attention, together with multi-scale theory, are summarized as follows:

  • Discretization algorithms based on class-attribute interdependence. Kurgan et al. proposed the classic CAIM (class-attribute interdependence maximization) algorithm, a global, static, top-down, supervised discretization algorithm [18]. They emphasized that CAIM generates a minimal number of discrete intervals and does not require the user to predefine the number of intervals. However, CAIM has three drawbacks. First, the importance of attributes is not fully considered during discretization. Second, the inconsistency rate of the decision table is ignored. Finally, adopting the caim value as the discretization discriminant is unreasonable. These drawbacks often result in information loss and degrade the accuracy of machine learning. To address these issues, Cano et al. presented ur-CAIM, which extends the CAIM criterion to account for class-attribute interdependence, redundancy, and uncertainty [5]. The algorithm is therefore superior to CAIM, especially on unbalanced datasets, generating fewer intervals and better discretization schemes at lower computational overhead. In the same year, Cano et al. presented LAIM (label-attribute interdependence maximization), which is inspired by the CAIM discretization heuristic for single-label classification [4]; LAIM makes it possible to process multi-label datasets. Tsai et al. proposed CACC (class-attribute contingency coefficient), a static, global, incremental, supervised, top-down discretization algorithm [27]. They developed a novel heuristic objective function that takes into account the class distribution information of all samples. CACC avoids overfitting, produces better discretization results, and improves the classification prediction accuracy of machine learning. However, CACC is time-consuming, which reduces its appeal for real-world problems. Xiaolong Liu et al. proposed an improved algorithm based on CACC, which selects cut points using the CACC criterion and adds a constraint on the data inconsistency rate to reduce the loss of data information [20].

  • Discretization algorithms based on rough set theory. Hong Shi et al. proposed a novel algorithm that implements global discretization through a consistency measure, overcoming the inconsistency rate introduced by the local MDLPC criterion [25]. Cheng et al. proposed an improved continuous attribute discretization algorithm based on rough sets from the perspective of decision tables and information entropy [38], in which the concepts of 'conditional attribute weight' and 'equivalence class projection' are defined. Unnecessary candidate cut points are quickly eliminated by judging the importance of conditional attributes to the decision table and comparing conditional attribute values with equivalence class projections, which significantly improves efficiency. Jiang et al. proposed a supervised multivariate discretization method (SMDNS) that uses the interdependence between class information and condition attributes to improve classification performance [15]. Cao et al. proposed a continuous attribute discretization algorithm combining a binary ant colony and rough sets [6]. The algorithm constructs a binary ant colony network over the candidate breakpoint space of multi-dimensional continuous attributes, and a fitness evaluation function based on the approximate classification accuracy of the rough set is established to find the globally optimal discretized breakpoint set.

  • Discretization algorithms based on clustering. Min et al. proposed a global discretization and attribute reduction algorithm based on clustering and rough set theory [22], in which the k-means clustering algorithm is adopted after comparing different discretization methods. To overcome the deficiencies of k-means, the F statistic of analysis of variance and the support strength of conditional attributes are introduced to control the effectiveness of discretization, and a reasonable number of clusters is obtained from a correlation index so that the premise of rough set theory is satisfied. Attributes are then reduced using rough set theory and decision rules are derived. Jifu Zhang et al. first selected candidate initial fuzzy clustering centers using sample density values to overcome the sensitivity to noisy data [35]; the algorithm parameters are then dynamically adjusted, based on the compatibility of the decision table, to achieve the best discretization of spectral characteristic lines.

  • Discretization algorithms based on entropy. Recently, discretization methods based on information entropy have been widely studied. Fayyad et al. proposed a discretization algorithm based on entropy and the minimum description length principle (MDLP) [10]. The algorithm selects breakpoints that form boundaries between classes and uses the MDLPC criterion to determine the appropriate number of discrete intervals. However, it is a local discretization method and easily introduces inconsistency. Much research has addressed this problem; for example, Wen et al. carried out a comprehensive analysis of local and global information based on information entropy [28], where, in the local discretization phase, k strong cut points are selected for each attribute to minimize the conditional entropy.

  • Discretization algorithms based on Chi2. Chi2-based algorithms are typical supervised, global, bottom-up discretization algorithms based on statistical independence. Kerber proposed the pioneering ChiMerge method in this family [17]. Continuous attribute values are first sorted in ascending order, each distinct value initially forms its own interval, and all adjacent interval pairs are tested. The chi-square statistic of each adjacent pair determines whether the pair is merged, that is, the adjacent pair with the minimum chi-square value is merged iteratively. A chi-square threshold (significance level α) is set manually, and the iteration terminates once the values of all adjacent interval pairs exceed the given threshold. However, the calculation of the inconsistency rate reduces the credibility of the original data and causes some classification errors. A series of studies have been proposed to cope with this problem. Changlei Zhao et al. proposed a new data reduction method, RS-D (Rough Sets-Discretization), which performs attribute reduction and rule reduction on the discrete data using the Rectified Chi2 algorithm combined with rough set theory [23]. Yu et al. considered that the theoretical basis for determining the importance of a node using the value of the difference between the critical value D divided by 2V was insufficient and that accuracy cannot be guaranteed [24], and therefore proposed a novel discretization method (a.k.a. Rectified Chi). Rectified Chi uses (2kv)/2k as an important part of the value of Eij and finally achieves the desired discretization result, improving the learning accuracy of the classifier.

  • Discretization algorithms based on genetic algorithms. Jing Zhang et al. proposed a multi-attribute discretization algorithm based on genetic algorithms and variable-precision rough sets [36]. It establishes the fitness evaluation function of the genetic algorithm from the approximate classification accuracy of the variable-precision rough set and uses the genetic algorithm to find the optimal breakpoint subset in the candidate breakpoint set of multi-dimensional continuous attributes. The algorithm achieves better fault tolerance and noise resistance in data classification.

  • Scale theory and data mining. Multi-scale theory has received close attention in the data mining field. However, research on multi-scale data mining is still in its infancy and lacks universal theories and methods; with the deepening application of big data, such research becomes ever more urgent. Mengmeng Liu et al. studied universal multi-scale data mining from theoretical and methodological perspectives [21], introducing the point-domain Kriging method and the area-domain Kriging method to accomplish scale-down and scale-up mining, respectively. Chao Li et al. proposed MSARSUA, a multi-scale association rule scale-up algorithm that introduces a similarity calculation method based on inclusion degree and a Gaussian pyramid scale-up theory [19]. Introducing multi-scale theory not only effectively reduces the problem size and improves processing efficiency but also helps decision makers make decisions from different perspectives. In 2019, Ye Zhang et al. proposed a data scale partition method for multi-scale data mining based on a discretization method using probability density estimation [37]. This method extends the types of scale data and effectively reduces the scale effect caused by scale deduction in multi-scale data mining.

In summary, when initially selecting the candidate cut point set, most supervised discretization algorithms ignore more valuable information that may exist in the dataset, which makes the final discretization results insufficient. Therefore, we obtain candidate cut points with different granularities by incorporating multi-scale theory into the division of the initial data; using these points as test data makes the actual classification more reasonable.

3 Background

To facilitate the presentation of MSE, we summarize the notation used throughout this paper in Table 1.

Table 1 Symbol and annotation

3.1 Decision table

Discretization aims to divide the domain of the problem based on the known condition attributes and decision attributes of the decision table while ensuring that the decision table retains a high classification ability. A decision table is a table-like tool suitable for situations that contain multiple, interrelated conditions and multiple decision-making schemes. It expresses complex logic accurately and concisely by associating multiple conditions with the actions to be performed once those conditions are met, and it can clearly relate multiple independent conditions to the corresponding actions.

Definition 1

(Decision table). A decision table DT is defined as the 5-tuple [28]:

$$ \mathrm{DT = (U, A, D, \lbrace V_{a}|a\in A \cup D \rbrace, \lbrace I_{a}|a\in A \cup D \rbrace)} $$
(1)

where

  1. U is a nonempty finite set of objects called the universe;

  2. A is a nonempty finite set of conditional attributes;

  3. D is a nonempty finite set of decision attributes;

  4. Va is the set of values for each a ∈ A ∪ D; and

  5. \(I_{a}:U \rightarrow V_{a}\) is an information function for each a ∈ A ∪ D.

Example 1

Table 2 illustrates a decision table DT, where U = {x1, x2, x3, x4, x5, x6}, A = {a1, a2}, and D = {d}.

Table 2 Example of a decision table DT
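To make Definition 1 and Example 1 concrete, the following minimal Python sketch holds the decision table of Example 1 in memory. The numeric and class values are made-up placeholders, since the contents of Table 2 are not reproduced here; only the structure (U, A, D, {Va}, {Ia}) follows the definition.

```python
# A minimal in-memory rendering of DT = (U, A, D, {V_a}, {I_a}) from Example 1.
# NOTE: the attribute and decision values below are hypothetical placeholders.
U = ["x1", "x2", "x3", "x4", "x5", "x6"]      # universe of objects
A = ["a1", "a2"]                              # conditional attributes
D = ["d"]                                     # decision attribute
I = {                                         # information functions I_a : U -> V_a
    "a1": {"x1": 0.8, "x2": 1.2, "x3": 0.8, "x4": 2.1, "x5": 1.9, "x6": 0.6},
    "a2": {"x1": 1.0, "x2": 3.0, "x3": 2.0, "x4": 1.0, "x5": 3.0, "x6": 2.0},
    "d":  {"x1": "yes", "x2": "no", "x3": "yes", "x4": "no", "x5": "no", "x6": "yes"},
}
V = {a: set(I[a].values()) for a in A + D}    # value sets V_a
print(V["a1"], V["d"])
```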

3.2 Information entropy

Information entropy is a commonly used criterion for selecting cut points in supervised discretization algorithms. It usually indicates the degree of disorder of a system; in this study, it indicates the purity of the divided dataset. A smaller entropy value means greater data purity, that is, more usable discretized data, and vice versa. Information entropy and its related definitions are as follows:

Definition 2

(Information entropy). The information entropy of T is defined as [28]:

$$ \mathrm{Ent(T)=-\sum\nolimits_{i=1}^{n}p(X_{i})Log(p(X_{i}))} $$
(2)

where \( p(X_{i})=\frac {| X_{i}|}{|U|} \), and Xi is the distribution of decision attributes which have been divided according to the cut point T.
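As a small illustration of Definition 2 (our own sketch, not the authors' released code), the snippet below computes Ent(T) from the empirical distribution of decision-attribute values in a subset; the function name and the list-based interface are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of decision-attribute values, Eq. (2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

# A pure subset has entropy 0; a 50/50 split over two classes has entropy 1.
print(entropy(["yes", "yes", "yes"]))       # 0.0
print(entropy(["yes", "no", "yes", "no"]))  # 1.0
```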

4 Data discretization based on multi-scale and information entropy

In this section, we give relevant definitions of the multi-scale theory, and elaborate on the idea of MSE.

4.1 Multi-scale partition

4.1.1 Related multi-scale definitions

Definition 3

(Scale). Scale refers to the measurement unit of the research object, and is a standard for measuring research objects [37].

Broadly speaking, scale can be regarded as the unit or measurement tool of a research object. We apply scale theory to user data to help discretize it effectively. 'Scale measurement' includes two aspects: range measurement (Fig. 1a) and granularity measurement (Fig. 1b). The range scale measures the size of an object, while the granularity scale concerns the smallest measurement unit of a study object within a scale range. In this study, the granularity scale measurement method is adopted to divide the attribute values.

Fig. 1 Scale Measurement

Definition 4

(Concept hierarchy). The concept hierarchy H is a partial order relation set (H,≺), where H represents a finite concept set, and ≺ reflects a partial order relation between two adjacent concepts contained in H [21].

The attribute set of data in a certain category can form a concept hierarchy with a clear partial order structure: each attribute hi (i = 1,...,n) can be regarded as a concept of the finite concept set H = {h1,...,hi,...,hn}. Based on domain knowledge, there is a partial order relation among attributes, which corresponds to the partial order relation of concepts in the finite concept set. An instantiated attribute hi ∈ H usually corresponds to a group of specific attribute values, denoted Vhi = {v1, v2,...,vmi} (where vj represents a specific discrete value or a continuous interval). That is, the high-level semantic abstraction of this group of attribute values forms the attribute (concept) hi; we say that the attribute value vj ∈ Vhi semantically belongs to hi and record it as vj ≺ hi. In practical applications, the attribute set of the regional category can form a concept hierarchy (Hlocation,≺) = {village ≺ city ≺ province ≺ country}, and the attribute set of the time category can form a concept hierarchy (Htime,≺) = {day ≺ month ≺ year}.

According to the definition of concept hierarchy, we can perform a multi-scale partition of the attribute values. The points used to divide the attribute values are called the candidate cut point set, defined below.

Definition 5

(Candidate cut point set). Given a decision table DT, the candidate cut point set of an attribute a ∈ A is:

$$ \mathrm{{C_{i}^{a}}=d_{0}+\frac{(d_{n}-d_{0})i}{O^{j}}} $$
(3)

Where

  1. \( {C_{i}^{a}}(a\in A, 1 \le i \le |V_{a}|) \) is the i-th candidate cut point of attribute a;

  2. d0 is the minimum value of attribute a;

  3. dn is the maximum value of attribute a;

  4. O^j is the granularity of the hierarchy;

  5. O is the order of the tree, which determines the base of the granularity at each scale. It defaults to 4, a discretizer parameter recommended by earlier studies; this default is also verified by the subsequent experiments in this study;

  6. j is the layer number of the tree, which determines the exponent of the granularity division. The value of j depends on the logarithm of the number of distinct attribute values CountA to base O, that is, j = LogO(CountA/2). This choice of j ensures that the divided data is kept within a certain range and prevents overfitting.

Example 2

Suppose an attribute value ranges from 0 to 100 and contains a total of 90 distinct values. According to formula (3), d0 is 0, dn is 100, O is 4, and j is 2. The division process for this attribute is shown in Fig. 2.

Fig. 2 An example for generating candidate cut point set

The candidate cut point set is: [ 0, 6.25, 12.5, 18.75, 25, 31.25, 37.5, 43.75, 50, 56.25, 62.5, 68.75, 75, 81.25, 87.5, 93.75, 100 ].
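The sketch below is a minimal illustration of formula (3) that reproduces the cut point set of Example 2; the function name, the flooring of j, and the list-based interface are our own assumptions rather than details taken from the paper's implementation.

```python
import math

def candidate_cut_points(values, order=4):
    """Multi-scale candidate cut points of one continuous attribute, formula (3)."""
    d0, dn = min(values), max(values)
    count_a = len(set(values))                       # number of distinct values
    j = max(1, int(math.log(count_a / 2, order)))    # j = floor(log_O(Count_A / 2))
    cells = order ** j                               # O^j equal sub-ranges
    return [d0 + (dn - d0) * i / cells for i in range(cells + 1)]

# 90 distinct values spread over [0, 100] give j = 2, hence 4^2 = 16 sub-ranges
# and the 17 cut points 0, 6.25, 12.5, ..., 100 listed above.
vals = [i * 100 / 89 for i in range(90)]
print(candidate_cut_points(vals))
```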

4.1.2 Multi-scale data partitioning

We apply multi-scale theory to study the optimal scale selection for continuous data. Because the attributes that make up a dataset differ in their value ranges and numbers of values, we need to determine the best scale partition for each attribute. Different scale selections affect the final conclusion, and an irrational scale partition may even lead to wrong conclusions. Therefore, we hierarchically partition the data from coarse to fine according to the concept hierarchy and then determine the best partition scale by adjusting the scale granularity.

When scale partition is performed, dividing into more hierarchies yields a finer partition, but the amount of data to process also increases accordingly. During scale partition we find that the better candidate cut points often correspond to a certain scale hierarchy; at that hierarchy the discretization algorithm obtains the best trade-off between time and classification accuracy, so the corresponding hierarchy is the best scale we are looking for. The granularity scale, i.e., the smallest measurement unit of the research object within a scale range, is our main concern. Our proposed method MSE is similar to the equal-width discretization method; the biggest difference is that MSE partitions at different granularities rather than the single granularity used by equal-width discretization. The specific partition idea of MSE is described as follows:

The partition is based on a tree structure: each interval maps a range of continuous values to a node of the tree, and the root of the tree is the value range of the continuous attribute. As we divide down the hierarchy, increasingly fine intervals are formed, and the number of intervals determines the degree of accuracy. Figure 3 shows a general multi-scale interval representation based on a q-order tree, and Fig. 4 gives an instance of a four-order tree.

Fig. 3 Schematic representation of multi-scale interval of q-order tree

Fig. 4 An instance of four-order tree

In the tree, a discrete interval is represented by a two-tuple (m,n) giving the scale and the node number, respectively. That is, node (m,n) indicates that the interval is the n-th node of the m-th hierarchy (scale m).

The coarsest scale is m = 0, where there is only one interval, expressed as node (0,1) (the attribute domain). The number of intervals (i.e., the number of nodes) of a q-order tree on scale m is q^m. At each layer, we adopt the equal-width method to complete the partition. Let [a,b] be the domain of the continuous attribute x; then the discretization interval represented by node (m,n) is [a + (b − a)(n − 1)/q^m, a + (b − a)n/q^m]. The larger the values of q and m, the higher the classification accuracy, but also the larger the amount of calculation. From Definition 5, the data partition hierarchy is determined by the number of distinct attribute values: the larger the number of attribute values, the greater the number of hierarchies. Based on this interval partition, the obtained interval boundaries are used as the candidate cut set to be further refined in the next step.
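A small sketch of the q-order tree interval scheme just described, with hypothetical function and parameter names: node (m, n) covers the n-th of the q^m equal-width sub-intervals of the attribute domain [a, b].

```python
def node_interval(a, b, q, m, n):
    """Interval represented by node (m, n) of a q-order tree over the domain [a, b]."""
    width = (b - a) / q ** m
    return (a + width * (n - 1), a + width * n)

# On the domain [0, 100], node (2, 3) of a four-order tree is the third of the
# 4^2 = 16 intervals on scale 2, i.e. [12.5, 18.75].
print(node_interval(0, 100, q=4, m=2, n=3))   # (12.5, 18.75)
```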

4.2 Cut set detection based on information entropy and MDLPC criterion

In order to judge which points are the best after the candidate cut set is generated, information entropy is introduced. From the definition of information entropy introduced in Section 3.2, we can know that if the data in the dataset has good consistency, the corresponding information entropy value will be small. Therefore, our goal is to find a cut point with a small information entropy. The specific idea of cut set detection based on information entropy is as follows.

First, each candidate cut point is used to divide the attribute value set Va of an attribute A into two parts, and the class information entropy corresponding to each cut point is calculated according to Definition 6. The cut point with the minimum entropy value is then selected as the candidate best cut point. However, whether this cut point can be retained as a final discretization boundary is judged by the MDLPC criterion (the reasons and definitions are given below). If a cut point is retained, the attribute value set is divided into two subsets by this cut point, and the above process is performed recursively on each subset until no remaining candidate cut point satisfies the MDLPC criterion.

Definition 6

(Class information entropy). For an example set Va, an attribute A, and a cut value T: let Va1 ⊆ Va be the subset of examples in Va with A-values ≤ T, and Va2 = Va − Va1. The class information entropy of the partition induced by T, Ent(A,T;Va), is defined as [10]:

$$ \mathrm{Ent(A, T; V_{a})=\frac{|V_{a1}|}{|V_{a}|}Ent(V_{a1})+\frac{|V_{a2}|}{|V_{a}|}Ent(V_{a2})} $$
(4)

The discretized cut point of the attribute A is determined by selecting the cut point TA with the minimal Ent(A,T;Va).

In most supervised discretization algorithms, such as CAIM and CACC, the number of class labels is used as the maximum number of intervals for the continuous data when determining the final discrete intervals. However, this division scheme is not flexible enough and is unsuitable for data with various distributions. An ideal discretization algorithm must consider enough intervals to prevent the loss of data information while avoiding the overfitting caused by too many intervals. We therefore use the evaluation standard proposed by Fayyad and Irani, the MDLPC criterion [10]. The MDLPC criterion compares the minimum description length without division to the minimum description length after division at the best cut point to determine whether that cut point should be kept as a final discretization boundary. The data should be divided when the former value is greater than the latter; otherwise the division is discarded. The MDLPC criterion is given in Definition 7 below.

Definition 7

(MDLPC criterion). The MDLPC criterion is an evaluation standard for attribute selection metrics; it determines whether a cut point is accepted as a final cut point. For a set Va of N examples, a cut point T is accepted as a final discretization cut point iff [10]:

$$ \mathrm{Gain(A,T;V_{a})>\frac{Log_{2}(N-1)}{N} + \frac{\Delta(A,T;V_{a})}{N}} $$
(5)

where Δ(A,T;Va) = Log2(3^k − 2) − [k·Ent(Va) − k1·Ent(Va1) − k2·Ent(Va2)] and Gain(A,T;Va) = Ent(Va) − Ent(A,T;Va); here k, k1, and k2 denote the numbers of class labels present in Va, Va1, and Va2, respectively. The cut point is accepted when the condition is met and rejected otherwise.

According to the MDLPC criterion, the final number of intervals can be obtained objectively without causing overfitting or information loss.
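The sketch below combines Definitions 6 and 7: it splits the examples at a cut value T (A-values ≤ T versus > T), computes the class information entropy of the split, and applies the MDLPC acceptance test. The function names and the plain-list interface are our own; entropy() is the Definition 2 helper sketched in Section 3.2.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def class_information_entropy(values, labels, cut):
    """Ent(A, T; V_a) of Definition 6 for the split 'value <= cut' vs 'value > cut'."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    return (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)

def mdlpc_accepts(values, labels, cut):
    """Definition 7: keep the cut only if the information gain exceeds the MDL cost."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    if not left or not right:
        return False
    n = len(labels)
    gain = entropy(labels) - class_information_entropy(values, labels, cut)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (math.log2(n - 1) + delta) / n

# Two well-separated classes: the cut at 5 is accepted.
print(mdlpc_accepts([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"], cut=5))  # True
```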

4.3 Discretization algorithm based on multi-scale and information entropy

4.3.1 Algorithm description

Step 1: Select candidate cut points. The continuous attribute values in the decision table are sorted in ascending order. We label the minimum value dmin and the maximum value dmax, and initialize the discretization scheme D[dmin, dmax] and the candidate cut point set C[dmin, dmax]. The range represented by C[dmin, dmax] is mapped to the root node of the O-order tree (for the initial value of O, see Definition 5) and divided layer by layer according to Definition 5. The boundary values corresponding to each node of the tree are calculated by the formula \({C_{i}^{a}}= d_{0} + \frac {(d_{n}-d_{0})i}{O^{j}}\) (a and i denote the attribute and the index of the cut point, respectively) until the number of division layers j is reached, yielding a complete O-order tree. During this process, the candidate cut point set C is updated by adding all newly generated boundary values \({C_{i}^{a}}\) to C (see lines 1 to 9 in Algorithm 1).

Step 2: Select an optimal cut point. The class information entropy corresponding to each candidate cut point is calculated based on Definition 6, and the cut point with the smallest entropy value is selected as the best cut point (see lines 12 to 13 of Algorithm 1).

Step 3: Determine the final discrete interval set. The MDLPC criterion is used to judge whether the best cut point from Step 2 can be retained as a final discretization boundary. If it can, it is added to the discretization scheme D[dmin, dmax], i.e., it is fixed as a final cut point that divides the data, and the procedure returns to Step 2 on each of the resulting data blocks. Otherwise, the cut point is discarded, the next cut point is selected, and the procedure returns to Step 2. The algorithm terminates when all candidate cut points have been judged (see lines 14 to 19 in Algorithm 1).

Algorithm 1 (figure f)
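Putting Steps 1 to 3 together, the following Python sketch (a simplified reading of Algorithm 1 under our own naming, not the authors' code) discretizes one continuous attribute: it builds the multi-scale candidate cuts of Definition 5, recursively picks the candidate with the smallest class information entropy, and keeps it only if the MDLPC test of Definition 7 accepts it. One simplification: a branch stops as soon as its best candidate is rejected, rather than testing every remaining candidate.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def candidate_cuts(values, order=4):
    d0, dn = min(values), max(values)
    j = max(1, int(math.log(len(set(values)) / 2, order)))
    return [d0 + (dn - d0) * i / order ** j for i in range(order ** j + 1)]

def split_labels(values, labels, cut):
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    return left, right

def mdlpc_accepts(values, labels, cut):
    left, right = split_labels(values, labels, cut)
    if not left or not right:
        return False
    n = len(labels)
    gain = entropy(labels) - (len(left) * entropy(left) + len(right) * entropy(right)) / n
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (math.log2(n - 1) + delta) / n

def mse_discretize(values, labels, order=4):
    """Return the accepted cut points for one continuous attribute."""
    accepted = []

    def recurse(vals, labs, candidates):
        usable = [c for c in candidates if min(vals) < c < max(vals)]
        if not usable:
            return
        # Step 2: candidate with the smallest class information entropy (Eq. 4).
        best = min(usable, key=lambda c: sum(
            len(part) * entropy(part) for part in split_labels(vals, labs, c)) / len(labs))
        # Step 3: keep it only if the MDLPC criterion accepts the split.
        if not mdlpc_accepts(vals, labs, best):
            return
        accepted.append(best)
        pairs = list(zip(vals, labs))
        left = [(x, y) for x, y in pairs if x <= best]
        right = [(x, y) for x, y in pairs if x > best]
        recurse([x for x, _ in left], [y for _, y in left], usable)
        recurse([x for x, _ in right], [y for _, y in right], usable)

    recurse(list(values), list(labels), candidate_cuts(values, order))  # Step 1
    return sorted(accepted)

# Tiny smoke test: two well-separated classes yield a single cut between them.
print(mse_discretize([1, 2, 3, 4, 40, 41, 42, 43], ["a"] * 4 + ["b"] * 4))
```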

4.3.2 Time complexity analysis

In this section, the time complexity of the MSE algorithm for a single attribute is analyzed. The time complexity of the MSE algorithm is mainly determined by the calculation of all possible candidate cut points and the information entropy of each cut point. The specific steps involved are as follows:

Line 3 searches all attribute values to find the maximum and minimum values of the attribute. Suppose an attribute contains N objects; the time complexity of this step is then O(N).

Line 8 calculates all candidate cut points using formula (3). In the worst case, the time complexity of this step is O(m), where m is the number of distinct values of the attribute.

Lines 10 to 19 of the algorithm determine the final discrete intervals. First, lines 12 to 13 calculate the entropy value of each candidate cut point. From the algorithm description above, the complexity of one entropy calculation is O(N), so the time complexity of this step is O(mN). The MDLPC criterion is then used to test these cut points; from Definition 7, the time complexity of the detection process is O(N), so the overall time complexity of lines 10 to 19 is O(mN) + O(N).

Therefore, for a dataset containing K attributes, the time complexity of the algorithm is O(K) × O(N + m + mN + N) = O(2NK + mK + mNK) = O(mNK).

5 Experimental setup

In this section, some existing classic algorithms (see Section 5.2 for details) and classifiers (see Section 5.3 for details) are selected to evaluate the performance of our proposed algorithm MSE. The performance differences among them are examined and analyzed on ten UCI datasets (see Section 5.4 for details).

5.1 Experimental setup

We implement MSE and its comparison algorithms using Python 3.6.2 on a computing node equipped with the Windows 10 operating system, an Intel Core i5-7500 CPU, and 4 GB of Samsung DDR4 memory.

5.2 Discretization algorithms for comparison

The following typical discretization algorithms are chosen to effectively evaluate our algorithm MSE.

  1. MSE: Multi-scale Data and Information Entropy (the algorithm proposed in this paper)

  2. EW [29]: Equal Width

  3. EF [29]: Equal Frequency

  4. KMeans [8]: Clustering-based

  5. MDLP [10]: Minimum Description Length Principle

  6. CAIM [18]: Class-Attribute Interdependence Maximization

  7. CACC [27]: Class-Attribute Contingency Coefficient

  8. UrCAIM [5]: Improved CAIM Discretization

  9. TSD [28]: A Two-stage Discretization

5.3 Classifiers for comparison

To avoid bias toward particular classifiers, five classifiers from different families are used to evaluate classification performance, which strengthens the experimental study. The classifiers are:

  1. CART [2]: A typical binary decision tree, considered one of the top 10 DM algorithms [30].

  2. Naive Bayes [16]: Another of the top 10 DM algorithms [30]. Its aim is to construct a rule that assigns future objects to a class, assuming independence of attributes when the probabilities are established.

  3. RandomForest [3]: An algorithm that integrates multiple trees through the idea of ensemble learning, and is among the most accurate of current algorithms.

  4. SVM [7]: A supervised learning method widely used in statistical classification and regression analysis; it is also considered one of the top 10 DM algorithms [30].

  5. OneR [14]: A very simple classification method that can quickly build a model for classification prediction. Its basic idea is to classify using the single most important feature of the dataset.

5.4 Experimental dataset

We choose 10 datasets from the University of California Irvine (UCI) Machine Learning Repository (http://archive.ics.uci.edu/ml) [1] to evaluate our algorithm. These datasets are representative because they differ in complexity, number of classes, number of attributes, number of instances, etc., and they are often used to evaluate discretization performance [11].

  1. Abalone Data (abalone)

  2. Glass Identification Database (glass)

  3. Johns Hopkins University Ionosphere Database (ionosphere)

  4. Iris Plants Dataset (iris)

  5. Optical Recognition of Handwritten Digits (optdigits)

  6. Pen-Based Recognition of Handwritten Digits Dataset (pendigits)

  7. Statlog (Landsat Satellite) Dataset (satellite)

  8. Statlog (Shuttle) Dataset (shuttle)

  9. Waveform Database Generator (Version 1) Dataset (waveform)

  10. Wine Quality (winequality)

The main characteristics of these datasets are summarized in Table 3, and the corresponding probability density functions are shown in Fig. 5 to express the distributions of the attribute values of each dataset. In addition, the covariance matrix (Fig. 6) is used to show the actual relations among the attributes of each dataset. However, when there are too many conditional attributes it is impossible to display all attribute information at the same time in the visualization. Therefore, for datasets with many conditional attributes (ionosphere, optdigits, satellite, shuttle, waveform, winequality), we plot only the conditional attributes most correlated with the decision attribute, as measured by the chi-square statistic.

Table 3 The summary of 10 UCI experimental datasets
Fig. 5 The probability density functions of datasets

Fig. 6 The covariance matrix of datasets

6 Experimental analysis

In the experimental evaluation, the ten UCI datasets were used to evaluate the performance of MSE. Note that the best results are marked in bold in all tables of experimental results below.

In the subsequent experimental results, we add a final row or column (i.e., Rank) to compare the grades of the nine algorithms. Each Rank value is the mean grade of the corresponding discretization algorithm over the 10 datasets; that is, for each dataset, we assign a grade to each algorithm according to its performance, with the best-performing algorithm receiving grade 1, and so on. To further evaluate the significance of differences in algorithm performance, we use the Friedman test and the Holm post-hoc test [13, 9]. When the Friedman statistic exceeds a certain threshold, there is a statistical difference among the Ranks of the discretization methods. The Friedman statistic follows an F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom, where k is the number of algorithms and N is the number of datasets. When the Rank difference between two algorithms exceeds the critical difference obtained from the post-hoc test, the algorithm with the smaller Rank value has a significant performance advantage.
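As a hedged illustration of this rank-based comparison (the accuracy matrix below is a made-up placeholder, not data from our tables), the sketch computes the average Ranks, the Friedman chi-square statistic, and its F-distributed (Iman-Davenport) correction with k − 1 and (k − 1)(N − 1) degrees of freedom.

```python
import numpy as np
from scipy.stats import rankdata, f as f_dist

# Rows = datasets (N), columns = discretization algorithms (k); placeholder values.
acc = np.array([
    [0.92, 0.90, 0.88, 0.91],
    [0.85, 0.83, 0.84, 0.86],
    [0.78, 0.75, 0.74, 0.77],
    [0.96, 0.95, 0.94, 0.95],
    [0.70, 0.69, 0.71, 0.72],
])
N, k = acc.shape
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, acc)   # rank 1 = best
avg_rank = ranks.mean(axis=0)                                     # the "Rank" row

# Friedman chi-square and its F correction (Demsar-style comparison).
chi2_f = 12 * N / (k * (k + 1)) * (np.sum(avg_rank ** 2) - k * (k + 1) ** 2 / 4)
f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)
p = 1 - f_dist.cdf(f_f, k - 1, (k - 1) * (N - 1))
print("average ranks:", avg_rank, "F =", round(f_f, 3), "p =", round(p, 3))
```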

6.1 Impact of parameters on MSE

The parameters that affect the performance of the MSE algorithm are mainly the order and the number of layers of the data partition. In this group of experiments, we verify the impact of these two parameters on the performance of MSE in terms of running time, number of intervals, and classification accuracy.

To make the experimental results more convincing, experiments were performed with CART and RandomForest. CART is highly robust and tolerant of outliers, so it is not affected by abnormal parameters in our experiments. RandomForest, compared with other classification algorithms, has a strong ability to resist overfitting and can balance errors on unbalanced datasets.

The effects of different partition orders on the performance of CART and RandomForest are shown in Tables 4 and 5. The classification accuracy is best when the order is 4. From Tables 6 and 7 we can also see that, when the other parameters are fixed, the running time and the number of intervals do not differ much across orders on most datasets. This trend is consistent with quantile theory in statistics: the quartile, as a form of quantile, plays an important role in statistics and can effectively help us identify the characteristics of the data by (1) intuitively identifying outliers in the dataset and (2) judging the degree of dispersion and bias of the dataset.

Table 4 Impacts of different partition orders on the classification accuracy of CART
Table 5 Impacts of different partition orders on the classification accuracy of RandomForest
Table 6 Impacts of different partition orders on the runtime of MSE
Table 7 Impacts of different partition orders on the interval number of MSE

Tables 8 and 9 show the effect of different partition hierarchies on the performance of CART and RandomForest. The experimental results clearly show that the classification accuracy reaches its optimum when the number of hierarchies is j = logO(CountA/2). The results in Table 10 show that the execution time increases as the number of partition hierarchies increases, and the results in Table 11 show that the number of intervals also increases with the number of partition hierarchies until a specific value is reached. The sign '-' in the tables indicates that the running time is too long because too many candidate cut points are generated by the attribute value division. The reason for this trend is that more candidate cut points provide a better opportunity to choose the best cut point, so classification accuracy improves; once the number of partition hierarchies reaches a certain value, the classification accuracy no longer improves significantly because the number of selected cut points already satisfies the selection criteria.

Table 8 Impacts of different partition hierarchies on the classification accuracy of CART
Table 9 Impacts of different partition hierarchies on the classification accuracy of RandomForest
Table 10 Impacts of different partition hierarchies on the runtime of MSE
Table 11 Impacts of different partition hierarchies on the interval number of MSE

6.2 Discretization efficiency

In the comparative experiments, the interval parameter of the unsupervised discretization algorithms is set to 4, following other literature [28]; the analysis results are shown in Table 12. For the supervised discretization algorithms, the parameter settings follow the recommended configurations.

Table 12 Discretization efficiency of the nine algorithms

Table 12 shows that the two classic unsupervised discretization algorithms, equal width and equal frequency, run faster than the other algorithms because unsupervised algorithms do not involve heuristic knowledge and avoid extra judgement time. It can also be seen from Table 12 that our MSE algorithm outperforms most other supervised discretization algorithms in execution time; that is, there are significant differences between MSE and the other algorithms, and it performs significantly better than the time-consuming CACC algorithm, which is also verified by the Friedman test (the Friedman statistic 6.43 exceeds the threshold of 3.1) and the Bonferroni-Dunn test (the critical difference is 3.34). This is because MSE effectively controls the size of the candidate cut point set through scale division, reducing the time needed to select the optimal cut point. It is worth noting that the CACC and CAIM discretization algorithms behave poorly on the abalone dataset, mainly because the dataset is unevenly distributed and its large number of class labels affects the runtime. We also find that the UrCAIM algorithm performs better on certain datasets, because UrCAIM improves the efficiency of CAIM by improving the discretization criterion. TSD, however, consumes a lot of time due to its two-stage execution strategy.

Across the datasets, for those with fewer than 1000 instances per attribute, such as glass, ionosphere, and iris, the algorithms run faster because the amount of data to process is relatively small. However, comparing the running times of the discretization algorithms on these three datasets, the ionosphere dataset clearly takes longer than the other two. This is because the ionosphere dataset has 33 attributes to discretize, far more than the other two datasets, so the number of algorithm iterations increases and the running time increases accordingly, especially for the CACC algorithm. For datasets with around 5000 instances per attribute, such as abalone, optdigits, satellite, waveform, and winequality, most algorithms take relatively long on abalone and waveform because these two datasets contain more unique attribute values. For pendigits and shuttle, although the data volume is large (more than 10,000 instances per attribute), the values are all integers and the data distribution is relatively concentrated, so the running times on these two datasets are relatively stable.

6.3 Number of discretized intervals

Table 13 reports the number of intervals generated by each discretization algorithm. In this group of tests, the Friedman statistic for Table 13 is 5.52, greater than the threshold of 3.1, indicating a statistical difference among the Ranks of these discretization methods. The subsequent Bonferroni-Dunn test (critical difference 3.34) shows that MSE has no significant advantage over the other algorithms, since the Rank differences between MSE and the other algorithms in Table 13 do not exceed this critical value.

Table 13 Number of discretized interval generated by the nine algorithms

The numbers of intervals of the two unsupervised algorithms (equal width and equal frequency) and of the KMeans algorithm are given directly by the user, with strong arbitrariness, and are therefore fixed. Among the supervised discretization algorithms, MSE and MDLP produce more intervals than the other two supervised algorithms, mainly because these two algorithms derive the intervals from the MDLPC criterion rather than from a fixed value given in advance. In this way, continuous data can be divided more fully, which reduces the information loss caused by inconsistent data. Overall, the CACC and TSD algorithms perform better in terms of the number of intervals: the number of intervals generated by CACC is always kept within the number of class labels, and TSD discretizes the data in two stages, so the result of local discretization is further reduced during global discretization, yielding fewer discrete intervals.

6.4 Impacts on classification accuracy

We evaluate the impact of the MSE discretization algorithm on classification accuracy by applying MSE to five classic classification algorithms that are widely used to verify the classification accuracy of discretization algorithms. The datasets used in the experiments are partitioned using the 10-fold cross-validation (10-fcv) procedure. The parameters used for the discretizers and classifiers are those recommended by their respective authors, and we assume these parameters are optimal; the specific parameters are listed in Table 14. From the Rank values, the Friedman and Bonferroni-Dunn tests show that the MSE algorithm has the best overall performance, since it ranks first for four of the classification algorithms, whereas the TSD algorithm performs worst among the remaining algorithms. Below, we elaborate on the impact of these discretization methods on the accuracy of each classifier.

Table 14 Parameters of the discretizers and classifiers
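For readers who want to reproduce this protocol, the following scikit-learn sketch scores a discretize-then-classify pipeline with 10-fold cross-validation on iris, one of the UCI datasets listed in Section 5.4. The equal-width KBinsDiscretizer is only a stand-in for a discretizer (plugging in MSE would require wrapping it as a transformer), and DecisionTreeClassifier is sklearn's CART-style tree; the bin count of 4 mirrors the parameter setting used for the unsupervised baselines.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(
    KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform"),  # stand-in discretizer
    DecisionTreeClassifier(random_state=0),                            # CART-style tree
)
scores = cross_val_score(pipe, X, y, cv=10)   # 10-fcv accuracy per fold
print(f"mean accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```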

All experimental results show that the two unsupervised discretization methods, equal width and equal frequency, yield poor classification accuracy because dividing the data equally without considering the decision attributes causes a large data inconsistency rate. The KMeans discretization algorithm classifies better than the two unsupervised algorithms because it clusters close data into one category when dividing the dataset; however, its classification performance is still imperfect because it lacks guiding knowledge, as it does not consider the decision attributes.

The supervised discretization algorithms are applied to the binary decision tree algorithm CART to evaluate the classification accuracy of our proposed algorithm. Table 15 shows the comparison of the discretization algorithms on different datasets. Our proposed MSE algorithm achieves good classification accuracy, especially on six of the datasets, because we divide the dataset into appropriate scales to obtain candidate cut point sets with different manifestations; these cut points better reflect the essential characteristics of the research objects and thereby improve classification accuracy.

Table 15 Impacts on classification accuracy of CART

Tables 16, 17, and 18 show the impacts of the discretization algorithms on three other commonly used classification algorithms. The experimental results show that MSE significantly improves classification accuracy, especially for Naive Bayes (Table 17) and support vector machines (Table 18). Compared with other types of discretization algorithms, MSE is better suited to datasets with many class labels, large data volumes, and uneven distributions: for this type of dataset, the candidate cut point set selected by MSE covers a wider range of data and therefore finds more valuable cut points. In other cases, MSE also maintains relatively stable classification accuracy compared with the other discretization algorithms.

Table 16 Impacts on classification accuracy of RandomForest
Table 17 Impacts on classification accuracy of Naive Bayes
Table 18 Impacts on classification accuracy of SVM

Table 19 shows the classification accuracy with the 1R (OneR) classification algorithm, the simplest of the classifiers: 1R constructs rules based on a single feature of the dataset, i.e., 1-rules. As can be seen from Table 19, CAIM obtains the best performance, because the CAIM algorithm finds the largest number of points for data partitioning, which matches the behaviour of the 1R algorithm. In contrast, MSE selects cut points based on information entropy and the MDLPC criterion, which may result in more division points and relatively smaller data subsets, so its classification accuracy here is slightly unsatisfactory. This does not invalidate MSE, however, because the 1R algorithm is only suitable for cases that focus on a single attribute.

Table 19 Impacts on classification accuracy of OneR

In general, CAIM, CACC, and UrCAIM are discretization algorithms based on class-attribute interdependence, and the latter two are improvements of CAIM; that is, they maximize the class-attribute interdependence and calculate the best intervals according to their own criteria. CAIM and UrCAIM have similar Ranks; for the ionosphere dataset, UrCAIM performs best with two classification algorithms and CAIM performs best with one. The ionosphere dataset contains many conditional attributes and only two class labels, and discretization algorithms based on class-attribute interdependence perform better on datasets with this type of feature distribution.

MSE, IEM, and TSD all use information entropy when selecting cut points. We find that IEM and MSE have similar Ranks and perform well on most datasets, because by adopting the MDLPC criterion they obtain an appropriate number of discrete intervals that better matches the data distribution. In addition, the classification accuracy on the abalone and winequality datasets is relatively low. The abalone dataset contains up to 28 class labels, making it difficult to distinguish the category to which an attribute value belongs during discretization; even so, the MSE algorithm achieves higher classification accuracy than the other discretization algorithms because it divides the candidate set more widely. For the winequality dataset, the low classification accuracy is mainly due to its large differences in attribute values.

7 Conclusion and future work

In this study, we developed a supervised, top-down, static discretization algorithm called MSE, which aims to balance running time and classification accuracy. The algorithm performs a reasonable multi-scale partition of the dataset and can generate a minimal candidate cut point set for a given continuous attribute. The MDLPC criterion is used to evaluate each best candidate cut point, making the selection of cut points more objective and reasonable. We verified the performance of the proposed algorithm through extensive experiments comparing five classic classification algorithms and nine discretization algorithms on 10 UCI datasets. The evidence shows that MSE achieves higher execution efficiency than the other supervised discretization algorithms and better classification accuracy for the five classification algorithms.

In future work we will consider the importance of attributes as well as the relationships between them: according to attribute importance, unnecessary attributes can be eliminated by assigning a reasonable weight to each attribute, further improving the running time of the algorithm. Moreover, the recent evolution of information technology has brought sudden growth in data size, and such big data is not only large but also complex-structured. Therefore, adapting the discretization process to distributed and parallel computing in cluster environments is another direction for future work.