1 Introduction

Data discretization is a data preprocessing method in data mining and knowledge discovery that transforms quantitative data into qualitative data by partitioning continuous domains [35]. For data mining and machine learning, discretizing continuous attributes can effectively reduce the granularity of an information system, improving the performance and learning accuracy of data mining and machine learning algorithms and enhancing their classification, clustering, and noise-resistance abilities. In addition, many machine learning and data mining algorithms can only handle discrete attributes, for example, C4.5/C5.0 decision trees [26], association rules [32, 33], Naive Bayes [34], and rough sets [31]. In essence, data discretization is a data reduction mechanism: continuous data is grouped into discrete intervals while the correspondence between each discrete value and its interval is preserved. Therefore, data discretization can effectively hide defects in the original data and has attracted widespread attention [11].

Actual datasets often contain a large number of attributes, which can form concept hierarchies with a clear partial order structure. Dividing the attribute values based on the related concepts in the concept hierarchy yields attribute values with multi-scale characteristics and different granularity representations of the attribute value set. Since every data subset in a given scale representation of a dataset is divided according to the attribute value set of a concept, each subset has a specific and clear data meaning. Traditional algorithms such as CAIM [18], CACC [27], and MDLP [10] consider only the midpoints of adjacent attribute values as the candidate cut point set, so the resulting data division is insufficient. We introduce multi-scale theory into the discretization process to divide the attribute values reasonably and obtain a set of candidate cut points. The candidate set is sorted, and information entropy is then applied recursively, always selecting the cut point with the smallest entropy; the MDLPC criterion decides when to refrain from further binary partitioning of a given interval. By combining multi-scale theory in this way, both the performance of the discretization algorithm and the classification accuracy of the resulting classifiers are significantly improved.

1.1 Motivations

  • A dataset usually involves concepts of differing scope and granularity, and its multi-scale characteristics can reflect the nature of the dataset from multiple perspectives and hierarchies.

    Multi-scale analysis reveals the natural scale of a research object. When studying data of a certain category, the data usually corresponds to an attribute set that forms a concept hierarchy with a clear partial order structure. Dividing the data according to this concept hierarchy yields a dataset with multi-scale characteristics, which helps decision makers make decisions from different perspectives; scale conversion can further reduce the complexity of handling problems. Recently, multi-scale theory has been applied to general datasets: hierarchical theory, concept hierarchies, and inclusion theory are used as the basis for scale division to study distribution patterns at different scale hierarchies and to find meaningful facts, such as multi-scale association rules [21] and multi-scale clustering [12].

  • Data discretization is an important data preprocessing technique. However, most traditional discretization approaches struggle to balance running time against the classification accuracy of the resulting classifiers.

    Many data mining and machine learning algorithms can only handle discrete data, whereas the original user data is often continuous. Discretizing such continuous data is therefore necessary to facilitate further processing by these algorithms. Moreover, the data becomes easier to understand and more compact, which makes data analysis faster and more accurate. Most discretization algorithms find it difficult to balance running time and classification accuracy when applied to classification algorithms, and some are only applicable to specific datasets. Therefore, it is necessary to develop an efficient and general data discretization method.

  • By incorporating multi-scale theory, a more reasonable candidate cut point set can be obtained through a principled partition of the data scale.

    The exploration of things, phenomena, or processes varies with the choice of scale; as a result, the inner nature of an object may be reflected comprehensively, partially, or even incorrectly. Datasets also tend to exhibit this multi-scale nature. If we follow the essential characteristics of a research object and divide the corresponding data reasonably according to its different scale characteristics, we can obtain more valuable information. We therefore introduce multi-scale theory and give a specific multi-scale partition strategy to divide the data and compute candidate cut points at different granularities, which greatly reduces the size of the candidate set and the computational overhead. In addition, classification results are derived from a large number of known condition attributes and decision attributes, so the larger the amount of data, the higher the prediction accuracy. However, most discretization methods consider only the attribute values that have already been observed, which limits the candidate cut points to a finite set of known values. Multi-scale partition yields cut points at different hierarchies; using these points as test data makes the actual classification more reasonable.

1.2 Contributions

Compared with a large number of existing discretization methods, the main contributions of MSE are summarized as follows:

  • The domain of each attribute is hierarchically divided by introducing multi-scale theory, and a set of candidate cut points with different granularities is obtained.

  • Information entropy is applied to the obtained candidate cut point set; the cut point with the minimum entropy is recursively selected and judged by the MDLPC criterion to generate the final discrete intervals.

  • A data discretization algorithm based on multi-scale and information entropy, called MSE, is proposed.

  • We conduct extensive experiments showing that MSE improves both the execution efficiency of discretization and the classification accuracy of the resulting classifiers.

1.3 Organization

The rest of this paper is organized as follows. Section 2 reviews previous work related to this study. Section 3 describes basic concepts pertaining to data discretization and multi-scale theory. A data discretization algorithm based on multi-scale theory and information entropy (i.e., MSE) is then presented in Section 4. Sections 5 and 6 detail the experimental settings and comparison results, respectively. We conclude and outline future research directions in Section 7.

2 Related work

In the field of data mining and machine learning, discretization of continuous attributes can not only effectively reduce time and space overhead but also enhance the learning accuracy and noise resistance of algorithms. The discretization algorithms that have received the most attention, together with multi-scale theory, are summarized as follows:

  • Discretization algorithms based on class-attribute interdependence. Kurgan et al. proposed the classic CAIM (class-attribute interdependence maximization) algorithm, a global, static, top-down, supervised discretization algorithm [18]. They emphasized that CAIM generates a minimal number of discrete intervals and does not require the user to predefine the number of intervals. However, CAIM has three drawbacks. First, the importance of attributes is not fully considered during discretization. Second, the inconsistency rate of the decision table is ignored. Finally, adopting the caim value as the discretization discriminant is unreasonable. These drawbacks often result in information loss and degrade the accuracy of machine learning. To address these issues, Cano et al. presented ur-CAIM, which extends the CAIM criterion to account for class-attribute interdependence, redundancy, and uncertainty [5]. The algorithm is therefore superior to CAIM, especially on unbalanced datasets, generating fewer intervals and better discretization schemes at lower computational overhead. In the same year, Cano et al. presented LAIM (label-attribute interdependence maximization), which is inspired by the CAIM discretization heuristic for single-label classification [4]; LAIM makes it possible to process multi-label datasets. Tsai et al. proposed CACC (class-attribute contingency coefficient), a static, global, incremental, supervised, top-down discretization algorithm [27]. They developed a novel heuristic objective function that takes into account the class distribution information of all samples. CACC avoids overfitting, produces better discretization results, and improves the classification prediction accuracy of machine learning. However, CACC is time-consuming, which reduces its appeal for real-world problems. Xiaolong Liu et al. proposed an improved algorithm based on CACC, which selects cut points using the CACC criterion and adds a constraint on the data inconsistency rate to reduce the loss of data information [20].

  • Discretization algorithms based on rough set theory. Hong Shi et al. proposed a novel algorithm that implements global discretization through a consistency measure, overcoming the inconsistency rate introduced by the local MDLPC criterion [25]. Cheng et al. proposed an improved continuous attribute discretization algorithm based on rough sets from the perspective of decision tables and information entropy [38], in which the concepts of 'conditional attribute weight' and 'equivalence class projection' are defined. Unnecessary candidate cut points are quickly eliminated by judging the importance of conditional attributes to the decision table and comparing conditional attribute values with equivalence class projections, which significantly improves efficiency. Jiang et al. proposed a supervised multivariate discretization method (SMDNS) that uses the interdependence between class information and condition attributes to improve classification performance [15]. Cao et al. proposed a continuous attribute discretization algorithm combining a binary ant colony and rough sets [6]. The algorithm constructs a binary ant colony network over the candidate breakpoint space of multi-dimensional continuous attributes, and a fitness evaluation function based on the approximate classification accuracy of the rough set is established to find the globally optimal discretized breakpoint set.

  • Discretization algorithms based on clustering. Min et al. proposed a global discretization and attribute reduction algorithm based on clustering and rough set theory [22], in which the k-means clustering algorithm is adopted after comparing different discretization methods. To overcome the deficiencies of k-means, the F statistic of analysis of variance and the support strength of conditional attributes are introduced to control the effectiveness of discretization, and a reasonable number of clusters is obtained from a correlation index so that the premise of rough set theory is satisfied. Attributes are then reduced using rough set theory and decision rules are derived. Jifu Zhang et al. first selected candidate initial fuzzy clustering centers using sample density values to overcome the sensitivity to noisy data [35]; the algorithm parameters are then dynamically adjusted, based on the compatibility of the decision table, to achieve the best discretization of spectral characteristic lines.

  • Discretization algorithms based on entropy. Recently, discretization methods based on information entropy have been widely studied. Fayyad et al. proposed a discretization algorithm based on entropy and the minimum description length principle (MDLP) [10]. The algorithm selects breakpoints that form boundaries between classes and uses the MDLPC criterion to determine the appropriate number of discrete intervals. However, it is a local discretization method and easily introduces inconsistency. Much research has addressed this problem; for example, Wen et al. carried out a comprehensive analysis of local and global information based on information entropy [28], where, in the local discretization phase, k strong cut points are selected for each attribute to minimize the conditional entropy.

  • Discretization algorithms based on Chi2. Chi2-based algorithms are typical supervised, global, bottom-up discretization algorithms based on statistical independence. Kerber proposed the pioneering ChiMerge method in this family [17]. Continuous attribute values are first sorted in ascending order, each distinct value initially forms its own interval, and all adjacent interval pairs are tested. The chi-square statistic of each adjacent pair determines whether the pair is merged, that is, the adjacent pair with the minimum chi-square value is merged iteratively. A chi-square threshold (significance level α) is set manually, and the iteration terminates once the values of all adjacent interval pairs exceed the given threshold. However, the calculation of the inconsistency rate reduces the credibility of the original data and causes some classification errors. A series of studies have been proposed to cope with this problem. Changlei Zhao et al. proposed a new data reduction method, RS-D (Rough Sets-Discretization), which performs attribute reduction and rule reduction on the discrete data using the Rectified Chi2 algorithm combined with rough set theory [23]. Yu et al. considered that the theoretical basis for determining the importance of a node using the value of the difference between the critical value D divided by 2V was insufficient and that accuracy cannot be guaranteed [24], and therefore proposed a novel discretization method (a.k.a. Rectified Chi). Rectified Chi uses (2kv)/2k as an important part of the value of Eij and finally achieves the desired discretization result, improving the learning accuracy of the classifier.

  • Discretization algorithms based on genetic algorithms. Jing Zhang et al. proposed a multi-attribute discretization algorithm based on genetic algorithms and variable-precision rough sets [36]. It establishes the fitness evaluation function of the genetic algorithm from the approximate classification accuracy of the variable-precision rough set and uses the genetic algorithm to find the optimal breakpoint subset in the candidate breakpoint set of multi-dimensional continuous attributes. The algorithm achieves better fault tolerance and noise resistance in data classification.

  • Scale theory and data mining. Multi-scale theory has received close attention in the data mining field. However, research on multi-scale data mining is still in its infancy and lacks universal theories and methods; with the deepening application of big data, such research becomes ever more urgent. Mengmeng Liu et al. studied universal multi-scale data mining from theoretical and methodological perspectives [21], introducing the point-domain Kriging method and the area-domain Kriging method to accomplish scale-down and scale-up mining, respectively. Chao Li et al. proposed MSARSUA, a multi-scale association rule scale-up algorithm that introduces a similarity calculation method based on inclusion degree and a Gaussian pyramid scale-up theory [19]. Introducing multi-scale theory not only effectively reduces the problem size and improves processing efficiency but also helps decision makers make decisions from different perspectives. In 2019, Ye Zhang et al. proposed a data scale partition method for multi-scale data mining based on a discretization method using probability density estimation [37]. This method extends the types of scale data and effectively reduces the scale effect caused by scale deduction in multi-scale data mining.

In summary, when initially selecting the candidate cut point set, most supervised discretization algorithms ignore more valuable information that may exist in the dataset, which makes the final discretization results insufficient. Therefore, we obtain candidate cut points with different granularities by incorporating multi-scale theory into the division of the initial data; using these points as test data makes the actual classification more reasonable.

3 Background

To facilitate the presentation of MSE, we summarize the notation used throughout this paper in Table 1.

Table 1 Symbol and annotation

3.1 Decision table

Discretization aims to divide the domain of the problem based on the known condition attributes and decision attributes of the decision table while ensuring that the decision table retains a high classification ability. A decision table is a table-like tool suitable for situations that contain multiple, interrelated conditions and multiple decision-making schemes. It expresses complex logic accurately and concisely by associating multiple conditions with the actions to be performed once those conditions are met, and it can clearly relate multiple independent conditions to the corresponding actions.

Definition 1

(Decision table). A decision table DT is defined as the 5-tuple [28]:

$$ \mathrm{DT = (U, A, D, \lbrace V_{a}|a\in A \cup D \rbrace, \lbrace I_{a}|a\in A \cup D \rbrace)} $$
(1)

where

  1. U is a nonempty finite set of objects called the universe;

  2. A is a nonempty finite set of conditional attributes;

  3. D is a nonempty finite set of decision attributes;

  4. Va is the set of values for each a ∈ A ∪ D; and

  5. \(I_{a}:U \rightarrow V_{a}\) is an information function for each a ∈ A ∪ D.

Example 1

Table 2 illustrates a decision table DT, where U = {x1, x2, x3, x4, x5, x6}, A = {a1, a2}, and D = {d}.

Table 2 Example of a decision table DT
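To make Definition 1 and Example 1 concrete, the following minimal Python sketch holds the decision table of Example 1 in memory. The numeric and class values are made-up placeholders, since the contents of Table 2 are not reproduced here; only the structure (U, A, D, {Va}, {Ia}) follows the definition.

```python
# A minimal in-memory rendering of DT = (U, A, D, {V_a}, {I_a}) from Example 1.
# NOTE: the attribute and decision values below are hypothetical placeholders.
U = ["x1", "x2", "x3", "x4", "x5", "x6"]      # universe of objects
A = ["a1", "a2"]                              # conditional attributes
D = ["d"]                                     # decision attribute
I = {                                         # information functions I_a : U -> V_a
    "a1": {"x1": 0.8, "x2": 1.2, "x3": 0.8, "x4": 2.1, "x5": 1.9, "x6": 0.6},
    "a2": {"x1": 1.0, "x2": 3.0, "x3": 2.0, "x4": 1.0, "x5": 3.0, "x6": 2.0},
    "d":  {"x1": "yes", "x2": "no", "x3": "yes", "x4": "no", "x5": "no", "x6": "yes"},
}
V = {a: set(I[a].values()) for a in A + D}    # value sets V_a
print(V["a1"], V["d"])
```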

3.2 Information entropy

Information entropy is a commonly used criterion for selecting cut points in supervised discretization algorithms. It usually indicates the degree of disorder of a system; in this study, it indicates the purity of the divided dataset. A smaller entropy value means greater data purity, that is, more usable discretized data, and vice versa. Information entropy and its related definitions are as follows:

Definition 2

(Information entropy). The information entropy of T is defined as [28]:

$$ \mathrm{Ent(T)=-\sum\nolimits_{i=1}^{n}p(X_{i})Log(p(X_{i}))} $$
(2)

where \( p(X_{i})=\frac {| X_{i}|}{|U|} \), and Xi is the distribution of decision attributes which have been divided according to the cut point T.
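As a small illustration of Definition 2 (our own sketch, not the authors' released code), the snippet below computes Ent(T) from the empirical distribution of decision-attribute values in a subset; the function name and the list-based interface are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of decision-attribute values, Eq. (2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

# A pure subset has entropy 0; a 50/50 split over two classes has entropy 1.
print(entropy(["yes", "yes", "yes"]))       # 0.0
print(entropy(["yes", "no", "yes", "no"]))  # 1.0
```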

4 Data discretization based on multi-scale and information entropy

In this section, we give relevant definitions of the multi-scale theory, and elaborate on the idea of MSE.

4.1 Multi-scale partition

4.1.1 Related multi-scale definitions

Definition 3

(Scale). Scale refers to the measurement unit of the research object, and is a standard for measuring research objects [37].

Broadly speaking, scale can be regarded as the unit or measurement tool of a research object. We apply scale theory to user data to help discretize it effectively. 'Scale measurement' includes two aspects: range measurement (Fig. 1a) and granularity measurement (Fig. 1b). The range scale measures the size of an object, while the granularity scale concerns the smallest measurement unit of a study object within a scale range. In this study, the granularity scale measurement method is adopted to divide the attribute values.

Fig. 1 Scale Measurement

Definition 4

(Concept hierarchy). The concept hierarchy H is a partial order relation set (H,≺), where H represents a finite concept set, and ≺ reflects a partial order relation between two adjacent concepts contained in H [21].

The attribute set of data in a certain category can form a concept hierarchy with a clear partial order structure: each attribute hi (i = 1,...,n) can be regarded as a concept of the finite concept set H = {h1,...,hi,...,hn}. Based on domain knowledge, there is a partial order relation among attributes, which corresponds to the partial order relation of concepts in the finite concept set. An instantiated attribute hi ∈ H usually corresponds to a group of specific attribute values, denoted Vhi = {v1, v2,...,vmi} (where vj represents a specific discrete value or a continuous interval). That is, the high-level semantic abstraction of this group of attribute values forms the attribute (concept) hi; we say that the attribute value vj ∈ Vhi semantically belongs to hi and record it as vj ≺ hi. In practical applications, the attribute set of the regional category can form a concept hierarchy (Hlocation,≺) = {village ≺ city ≺ province ≺ country}, and the attribute set of the time category can form a concept hierarchy (Htime,≺) = {day ≺ month ≺ year}.

According to the definition of concept hierarchy, we can perform a multi-scale partition of the attribute values. The points used to divide the attribute values are called the candidate cut point set, defined below.

Definition 5

(Candidate cut point set). Given a decision table DT, the candidate cut point set of an attribute a ∈ A is:

$$ \mathrm{{C_{i}^{a}}=d_{0}+\frac{(d_{n}-d_{0})i}{O^{j}}} $$
(3)

Where

  1. \( {C_{i}^{a}}(a\in A, 1 \le i \le |V_{a}|) \) is the i-th candidate cut point of attribute a;

  2. d0 is the minimum value of attribute a;

  3. dn is the maximum value of attribute a;

  4. O^j is the granularity of the hierarchy;

  5. O is the order of the tree, which determines the base of the granularity at each scale. It defaults to 4, a discretizer parameter recommended by earlier studies; this default is also verified by the subsequent experiments in this study;

  6. j is the layer number of the tree, which determines the exponent of the granularity division. The value of j depends on the logarithm of the number of distinct attribute values CountA to base O, that is, j = LogO(CountA/2). This choice of j ensures that the divided data is kept within a certain range and prevents overfitting.

Example 2

Suppose an attribute value ranges from 0 to 100 and contains a total of 90 distinct values. According to formula (3), d0 is 0, dn is 100, O is 4, and j is 2. The division process for this attribute is shown in Fig. 2.

Fig. 2 An example for generating candidate cut point set

The candidate cut point set is: [ 0, 6.25, 12.5, 18.75, 25, 31.25, 37.5, 43.75, 50, 56.25, 62.5, 68.75, 75, 81.25, 87.5, 93.75, 100 ].
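The sketch below is a minimal illustration of formula (3) that reproduces the cut point set of Example 2; the function name, the flooring of j, and the list-based interface are our own assumptions rather than details taken from the paper's implementation.

```python
import math

def candidate_cut_points(values, order=4):
    """Multi-scale candidate cut points of one continuous attribute, formula (3)."""
    d0, dn = min(values), max(values)
    count_a = len(set(values))                       # number of distinct values
    j = max(1, int(math.log(count_a / 2, order)))    # j = floor(log_O(Count_A / 2))
    cells = order ** j                               # O^j equal sub-ranges
    return [d0 + (dn - d0) * i / cells for i in range(cells + 1)]

# 90 distinct values spread over [0, 100] give j = 2, hence 4^2 = 16 sub-ranges
# and the 17 cut points 0, 6.25, 12.5, ..., 100 listed above.
vals = [i * 100 / 89 for i in range(90)]
print(candidate_cut_points(vals))
```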

4.1.2 Multi-scale data partitioning

We apply multi-scale theory to study the optimal scale selection for continuous data. Because the attributes that make up a dataset differ in their value ranges and numbers of values, we need to determine the best scale partition for each attribute. Different scale selections affect the final conclusion, and an irrational scale partition may even lead to wrong conclusions. Therefore, we hierarchically partition the data from coarse to fine according to the concept hierarchy and then determine the best partition scale by adjusting the scale granularity.

When scale partition is performed, dividing into more hierarchies yields a finer partition, but the amount of data to process also increases accordingly. During scale partition we find that the better candidate cut points often correspond to a certain scale hierarchy; at that hierarchy the discretization algorithm obtains the best trade-off between time and classification accuracy, so the corresponding hierarchy is the best scale we are looking for. The granularity scale, i.e., the smallest measurement unit of the research object within a scale range, is our main concern. Our proposed method MSE is similar to the equal-width discretization method; the biggest difference is that MSE partitions at different granularities rather than the single granularity used by equal-width discretization. The specific partition idea of MSE is described as follows:

The partition is based on a tree structure: each interval maps a range of continuous values to a node of the tree, and the root of the tree is the value range of the continuous attribute. As we divide down the hierarchy, increasingly fine intervals are formed, and the number of intervals determines the degree of accuracy. Figure 3 shows a general multi-scale interval representation based on a q-order tree, and Fig. 4 gives an instance of a four-order tree.

Fig. 3 Schematic representation of multi-scale interval of q-order tree

Fig. 4 An instance of four-order tree

In the tree, a discrete interval is represented by a two-tuple (m,n) giving the scale and the node number, respectively. That is, node (m,n) indicates that the interval is the n-th node of the m-th hierarchy (scale m).

The coarsest scale is m = 0, where there is only one interval, expressed as node (0,1) (the attribute domain). The number of intervals (i.e., the number of nodes) of a q-order tree on scale m is q^m. At each layer, we adopt the equal-width method to complete the partition. Let [a,b] be the domain of the continuous attribute x; then the discretization interval represented by node (m,n) is [a + (b − a)(n − 1)/q^m, a + (b − a)n/q^m]. The larger the values of q and m, the higher the classification accuracy, but also the larger the amount of calculation. From Definition 5, the data partition hierarchy is determined by the number of distinct attribute values: the larger the number of attribute values, the greater the number of hierarchies. Based on this interval partition, the obtained interval boundaries are used as the candidate cut set to be further refined in the next step.
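A small sketch of the q-order tree interval scheme just described, with hypothetical function and parameter names: node (m, n) covers the n-th of the q^m equal-width sub-intervals of the attribute domain [a, b].

```python
def node_interval(a, b, q, m, n):
    """Interval represented by node (m, n) of a q-order tree over the domain [a, b]."""
    width = (b - a) / q ** m
    return (a + width * (n - 1), a + width * n)

# On the domain [0, 100], node (2, 3) of a four-order tree is the third of the
# 4^2 = 16 intervals on scale 2, i.e. [12.5, 18.75].
print(node_interval(0, 100, q=4, m=2, n=3))   # (12.5, 18.75)
```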

4.2 Cut set detection based on information entropy and MDLPC criterion

In order to judge which points are the best after the candidate cut set is generated, information entropy is introduced. From the definition of information entropy introduced in Section 3.2, we can know that if the data in the dataset has good consistency, the corresponding information entropy value will be small. Therefore, our goal is to find a cut point with a small information entropy. The specific idea of cut set detection based on information entropy is as follows.

First, each candidate cut point is used to divide the attribute value set Va of an attribute A into two parts, and the class information entropy corresponding to each cut point is calculated according to Definition 6. The cut point with the minimum entropy value is then selected as the candidate best cut point. However, whether this cut point can be retained as a final discretization boundary is judged by the MDLPC criterion (the reasons and definitions are given below). If a cut point is retained, the attribute value set is divided into two subsets by this cut point, and the above process is performed recursively on each subset until no remaining candidate cut point satisfies the MDLPC criterion.

Definition 6

(Class information entropy). For an example set Va, an attribute A, and a cut value T: let Va1 ⊆ Va be the subset of examples in Va with A-values ≤ T, and Va2 = Va − Va1. The class information entropy of the partition induced by T, Ent(A,T;Va), is defined as [10]:

$$ \mathrm{Ent(A, T; V_{a})=\frac{|V_{a1}|}{|V_{a}|}Ent(V_{a1})+\frac{|V_{a2}|}{|V_{a}|}Ent(V_{a2})} $$
(4)

The discretized cut point of the attribute A is determined by selecting the cut point TA with the minimal Ent(A,T;Va).

In most supervised discretization algorithms, such as CAIM and CACC, the number of class labels is used as the maximum number of intervals for the continuous data when determining the final discrete intervals. However, this division scheme is not flexible enough and is unsuitable for data with various distributions. An ideal discretization algorithm must consider enough intervals to prevent the loss of data information while avoiding the overfitting caused by too many intervals. We therefore use the evaluation standard proposed by Fayyad and Irani, the MDLPC criterion [10]. The MDLPC criterion compares the minimum description length without division to the minimum description length after division at the best cut point to determine whether that cut point should be kept as a final discretization boundary. The data should be divided when the former value is greater than the latter; otherwise the division is discarded. The MDLPC criterion is given in Definition 7 below.

Definition 7

(MDLPC criterion). The MDLPC criterion is an evaluation standard for attribute selection metrics; it determines whether a cut point is accepted as a final cut point. For a set Va of N examples, a cut point T is accepted as a final discretization cut point iff [10]:

$$ \mathrm{Gain(A,T;V_{a})>\frac{Log_{2}(N-1)}{N} + \frac{\Delta(A,T;V_{a})}{N}} $$
(5)

where Δ(A,T;Va) = Log2(3^k − 2) − [k·Ent(Va) − k1·Ent(Va1) − k2·Ent(Va2)] and Gain(A,T;Va) = Ent(Va) − Ent(A,T;Va); here k, k1, and k2 denote the numbers of class labels present in Va, Va1, and Va2, respectively. The cut point is accepted when the condition is met and rejected otherwise.

According to the MDLPC criterion, the final number of intervals can be obtained objectively without causing overfitting or information loss.
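The sketch below combines Definitions 6 and 7: it splits the examples at a cut value T (A-values ≤ T versus > T), computes the class information entropy of the split, and applies the MDLPC acceptance test. The function names and the plain-list interface are our own; entropy() is the Definition 2 helper sketched in Section 3.2.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def class_information_entropy(values, labels, cut):
    """Ent(A, T; V_a) of Definition 6 for the split 'value <= cut' vs 'value > cut'."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    return (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)

def mdlpc_accepts(values, labels, cut):
    """Definition 7: keep the cut only if the information gain exceeds the MDL cost."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    if not left or not right:
        return False
    n = len(labels)
    gain = entropy(labels) - class_information_entropy(values, labels, cut)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (math.log2(n - 1) + delta) / n

# Two well-separated classes: the cut at 5 is accepted.
print(mdlpc_accepts([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"], cut=5))  # True
```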

4.3 Discretization algorithm based on multi-scale and information entropy

4.3.1 Algorithm description

Step 1: Select candidate cut points. The continuous attribute values in the decision table are sorted in ascending order. We label the minimum value dmin and the maximum value dmax, and initialize the discretization scheme D[dmin, dmax] and the candidate cut point set C[dmin, dmax]. The range represented by C[dmin, dmax] is mapped to the root node of the O-order tree (for the initial value of O, see Definition 5) and divided layer by layer according to Definition 5. The boundary values corresponding to each node of the tree are calculated by the formula \({C_{i}^{a}}= d_{0} + \frac {(d_{n}-d_{0})i}{O^{j}}\) (a and i denote the attribute and the index of the cut point, respectively) until the number of division layers j is reached, yielding a complete O-order tree. During this process, the candidate cut point set C is updated by adding all newly generated boundary values \({C_{i}^{a}}\) to C (see lines 1 to 9 in Algorithm 1).

Step 2: Select an optimal cut point. The class information entropy corresponding to each candidate cut point is calculated based on Definition 6, and the cut point with the smallest entropy value is selected as the best cut point (see lines 12 to 13 of Algorithm 1).

Step 3: Determine the final discrete interval set. The MDLPC criterion is used to judge whether the best cut point from Step 2 can be retained as a final discretization boundary. If it can, it is added to the discretization scheme D[dmin, dmax], i.e., it is fixed as a final cut point that divides the data, and the procedure returns to Step 2 on each of the resulting data blocks. Otherwise, the cut point is discarded, the next cut point is selected, and the procedure returns to Step 2. The algorithm terminates when all candidate cut points have been judged (see lines 14 to 19 in Algorithm 1).

Algorithm 1 (figure f)
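Putting Steps 1 to 3 together, the following Python sketch (a simplified reading of Algorithm 1 under our own naming, not the authors' code) discretizes one continuous attribute: it builds the multi-scale candidate cuts of Definition 5, recursively picks the candidate with the smallest class information entropy, and keeps it only if the MDLPC test of Definition 7 accepts it. One simplification: a branch stops as soon as its best candidate is rejected, rather than testing every remaining candidate.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def candidate_cuts(values, order=4):
    d0, dn = min(values), max(values)
    j = max(1, int(math.log(len(set(values)) / 2, order)))
    return [d0 + (dn - d0) * i / order ** j for i in range(order ** j + 1)]

def split_labels(values, labels, cut):
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    return left, right

def mdlpc_accepts(values, labels, cut):
    left, right = split_labels(values, labels, cut)
    if not left or not right:
        return False
    n = len(labels)
    gain = entropy(labels) - (len(left) * entropy(left) + len(right) * entropy(right)) / n
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (math.log2(n - 1) + delta) / n

def mse_discretize(values, labels, order=4):
    """Return the accepted cut points for one continuous attribute."""
    accepted = []

    def recurse(vals, labs, candidates):
        usable = [c for c in candidates if min(vals) < c < max(vals)]
        if not usable:
            return
        # Step 2: candidate with the smallest class information entropy (Eq. 4).
        best = min(usable, key=lambda c: sum(
            len(part) * entropy(part) for part in split_labels(vals, labs, c)) / len(labs))
        # Step 3: keep it only if the MDLPC criterion accepts the split.
        if not mdlpc_accepts(vals, labs, best):
            return
        accepted.append(best)
        pairs = list(zip(vals, labs))
        left = [(x, y) for x, y in pairs if x <= best]
        right = [(x, y) for x, y in pairs if x > best]
        recurse([x for x, _ in left], [y for _, y in left], usable)
        recurse([x for x, _ in right], [y for _, y in right], usable)

    recurse(list(values), list(labels), candidate_cuts(values, order))  # Step 1
    return sorted(accepted)

# Tiny smoke test: two well-separated classes yield a single cut between them.
print(mse_discretize([1, 2, 3, 4, 40, 41, 42, 43], ["a"] * 4 + ["b"] * 4))
```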

4.3.2 Time complexity analysis

In this section, the time complexity of the MSE algorithm for a single attribute is analyzed. The time complexity of the MSE algorithm is mainly determined by the calculation of all possible candidate cut points and the information entropy of each cut point. The specific steps involved are as follows:

Line 3 searches all attribute values to find the maximum and minimum values of the attribute. Suppose an attribute contains N objects; the time complexity of this step is then O(N).

Line 8 calculates all candidate cut points using formula (3). In the worst case, the time complexity of this step is O(m), where m is the number of distinct values of the attribute.

Lines 10 to 19 of the algorithm determine the final discrete intervals. First, lines 12 to 13 calculate the entropy value of each candidate cut point. From the algorithm description above, the complexity of one entropy calculation is O(N), so the time complexity of this step is O(mN). The MDLPC criterion is then used to test these cut points; from Definition 7, the time complexity of the detection process is O(N), so the overall time complexity of lines 10 to 19 is O(mN) + O(N).

Therefore, for a dataset containing K attributes, the time complexity of the algorithm is O(K) × O(N + m + mN + N) = O(2NK + mK + mNK) = O(mNK).

5 Experimental setup

In this section, some existing classic algorithms (see Section 5.2 for details) and classifiers (see Section 5.3 for details) are selected to evaluate the performance of our proposed algorithm MSE. The performance differences among them are examined and analyzed on ten UCI datasets (see Section 5.4 for details).

5.1 Experimental setup

We implement MSE and its comparison algorithms using Python 3.6.2 on a computing node equipped with the Windows 10 operating system, an Intel Core i5-7500 CPU, and 4 GB of Samsung DDR4 memory.

5.2 Discretization algorithms for comparison

The following typical discretization algorithms are chosen to effectively evaluate our algorithm MSE.

  1. MSE: Multi-scale Data and Information Entropy (the algorithm proposed in this paper)

  2. EW [29]: Equal Width

  3. EF [29]: Equal Frequency

  4. KMeans [8]: Clustering-based

  5. MDLP [10]: Minimum Description Length Principle

  6. CAIM [18]: Class-Attribute Interdependence Maximization

  7. CACC [27]: Class-Attribute Contingency Coefficient

  8. UrCAIM [5]: Improved CAIM Discretization

  9. TSD [28]: A Two-stage Discretization

5.3 Classifiers for comparison

To avoid bias toward particular classifiers, five classifiers from different families are used to evaluate classification performance, which strengthens the experimental study. The classifiers are:

  1. CART [2]: A typical binary decision tree, considered one of the top 10 DM algorithms [30].

  2. Naive Bayes [16]: Another of the top 10 DM algorithms [30]. Its aim is to construct a rule that assigns future objects to a class, assuming independence of attributes when the probabilities are established.

  3. RandomForest [3]: An algorithm that integrates multiple trees through the idea of ensemble learning, and is among the most accurate of current algorithms.

  4. SVM [7]: A supervised learning method widely used in statistical classification and regression analysis; it is also considered one of the top 10 DM algorithms [30].

  5. OneR [14]: A very simple classification method that can quickly build a model for classification prediction. Its basic idea is to classify using the single most important feature of the dataset.

5.4 Experimental dataset

We choose 10 datasets from the University of California Irvine (UCI) Machine Learning Repository (http://archive.ics.uci.edu/ml) [1] to evaluate our algorithm. These datasets are representative because they differ in complexity, number of classes, number of attributes, number of instances, etc., and they are often used to evaluate discretization performance [11].

  1. Abalone Data (abalone)

  2. Glass Identification Database (glass)

  3. Johns Hopkins University Ionosphere Database (ionosphere)

  4. Iris Plants Dataset (iris)

  5. Optical Recognition of Handwritten Digits (optdigits)

  6. Pen-Based Recognition of Handwritten Digits Dataset (pendigits)

  7. Statlog (Landsat Satellite) Dataset (satellite)

  8. Statlog (Shuttle) Dataset (shuttle)

  9. Waveform Database Generator (Version 1) Dataset (waveform)

  10. Wine Quality (winequality)

The main characteristics of these datasets are summarized in Table 3, and the corresponding probability density functions are shown in Fig. 5 to express the distributions of the attribute values of each dataset. In addition, the covariance matrix (Fig. 6) is used to show the actual relations among the attributes of each dataset. However, when there are too many conditional attributes it is impossible to display all attribute information at the same time in the visualization. Therefore, for datasets with many conditional attributes (ionosphere, optdigits, satellite, shuttle, waveform, winequality), we plot only the conditional attributes most correlated with the decision attribute, as measured by the chi-square statistic.

Table 3 The summary of 10 UCI experimental datasets
Fig. 5 The probability density functions of datasets

Fig. 6 The covariance matrix of datasets

6 Experimental analysis

In the experimental evaluation, the ten UCI datasets were used to evaluate the performance of MSE. Note that the best results are marked in bold in all tables of experimental results below.

In the subsequent experimental results, we add a final row or column (i.e., Rank) to compare the grades of the nine algorithms. Each Rank value is the mean grade of the corresponding discretization algorithm over the 10 datasets; that is, for each dataset, we assign a grade to each algorithm according to its performance, with the best-performing algorithm receiving grade 1, and so on. To further evaluate the significance of differences in algorithm performance, we use the Friedman test and the Holm post-hoc test [13, 9]. When the Friedman statistic exceeds a certain threshold, there is a statistical difference among the Ranks of the discretization methods. The Friedman statistic follows an F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom, where k is the number of algorithms and N is the number of datasets. When the Rank difference between two algorithms exceeds the critical difference obtained from the post-hoc test, the algorithm with the smaller Rank value has a significant performance advantage.
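As a hedged illustration of this rank-based comparison (the accuracy matrix below is a made-up placeholder, not data from our tables), the sketch computes the average Ranks, the Friedman chi-square statistic, and its F-distributed (Iman-Davenport) correction with k − 1 and (k − 1)(N − 1) degrees of freedom.

```python
import numpy as np
from scipy.stats import rankdata, f as f_dist

# Rows = datasets (N), columns = discretization algorithms (k); placeholder values.
acc = np.array([
    [0.92, 0.90, 0.88, 0.91],
    [0.85, 0.83, 0.84, 0.86],
    [0.78, 0.75, 0.74, 0.77],
    [0.96, 0.95, 0.94, 0.95],
    [0.70, 0.69, 0.71, 0.72],
])
N, k = acc.shape
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, acc)   # rank 1 = best
avg_rank = ranks.mean(axis=0)                                     # the "Rank" row

# Friedman chi-square and its F correction (Demsar-style comparison).
chi2_f = 12 * N / (k * (k + 1)) * (np.sum(avg_rank ** 2) - k * (k + 1) ** 2 / 4)
f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)
p = 1 - f_dist.cdf(f_f, k - 1, (k - 1) * (N - 1))
print("average ranks:", avg_rank, "F =", round(f_f, 3), "p =", round(p, 3))
```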

6.1 Impact of parameters on MSE

The parameters that affect the performance of the MSE algorithm are mainly the order and the number of layers of the data partition. In this group of experiments, we verify the impact of these two parameters on the performance of MSE in terms of running time, number of intervals, and classification accuracy.

To make the experimental results more convincing, experiments were performed with CART and RandomForest. CART is highly robust and tolerant of outliers, so it is not affected by abnormal parameters in our experiments. RandomForest, compared with other classification algorithms, has a strong ability to resist overfitting and can balance errors on unbalanced datasets.

The effects of different partition orders on the performance of CART and RandomForest are shown in Tables 4 and 5. The classification accuracy is best when the order is 4. From Tables 6 and 7 we can also see that, when the other parameters are fixed, the running time and the number of intervals do not differ much across orders on most datasets. This trend is consistent with quantile theory in statistics: the quartile, as a form of quantile, plays an important role in statistics and can effectively help us identify the characteristics of the data by (1) intuitively identifying outliers in the dataset and (2) judging the degree of dispersion and bias of the dataset.

Table 4 Impacts of different partition orders on the classification accuracy of CART
Table 5 Impacts of different partition orders on the classification accuracy of RandomForest
Table 6 Impacts of different partition orders on the runtime of MSE
Table 7 Impacts of different partition orders on the interval number of MSE

Tables 8 and 9 show the effect of different partition hierarchies on the performance of CART and RandomForest. The experimental results clearly show that the classification accuracy reaches its optimum when the number of hierarchies is j = logO(CountA/2). The results in Table 10 show that the execution time increases as the number of partition hierarchies increases, and the results in Table 11 show that the number of intervals also increases with the number of partition hierarchies until a specific value is reached. The sign '-' in the tables indicates that the running time is too long because too many candidate cut points are generated by the attribute value division. The reason for this trend is that more candidate cut points provide a better opportunity to choose the best cut point, so classification accuracy improves; once the number of partition hierarchies reaches a certain value, the classification accuracy no longer improves significantly because the number of selected cut points already satisfies the selection criteria.

Table 8 Impacts of different partition hierarchies on the classification accuracy of CART
Table 9 Impacts of different partition hierarchies on the classification accuracy of RandomForest
Table 10 Impacts of different partition hierarchies on the runtime of MSE
Table 11 Impacts of different partition hierarchies on the interval number of MSE

6.2 Discretization efficiency

In the comparative experiments, the interval parameter of the unsupervised discretization algorithms is set to 4, following other literature [28]; the analysis results are shown in Table 12. For the supervised discretization algorithms, the parameter settings follow the recommended configurations.

Table 12 Discretization efficiency of the nine algorithms

Table 12 shows that the two classic unsupervised discretization algorithms, equal width and equal frequency, run faster than the other algorithms because unsupervised algorithms do not involve heuristic knowledge and avoid extra judgement time. It can also be seen from Table 12 that our MSE algorithm outperforms most other supervised discretization algorithms in execution time; that is, there are significant differences between MSE and the other algorithms, and it performs significantly better than the time-consuming CACC algorithm, which is also verified by the Friedman test (the Friedman statistic 6.43 exceeds the threshold of 3.1) and the Bonferroni-Dunn test (the critical difference is 3.34). This is because MSE effectively controls the size of the candidate cut point set through scale division, reducing the time needed to select the optimal cut point. It is worth noting that the CACC and CAIM discretization algorithms behave poorly on the abalone dataset, mainly because the dataset is unevenly distributed and its large number of class labels affects the runtime. We also find that the UrCAIM algorithm performs better on certain datasets, because UrCAIM improves the efficiency of CAIM by improving the discretization criterion. TSD, however, consumes a lot of time due to its two-stage execution strategy.

Across the datasets, for those with fewer than 1000 instances per attribute, such as glass, ionosphere, and iris, the algorithms run faster because the amount of data to process is relatively small. However, comparing the running times of the discretization algorithms on these three datasets, the ionosphere dataset clearly takes longer than the other two. This is because the ionosphere dataset has 33 attributes to discretize, far more than the other two datasets, so the number of algorithm iterations increases and the running time increases accordingly, especially for the CACC algorithm. For datasets with around 5000 instances per attribute, such as abalone, optdigits, satellite, waveform, and winequality, most algorithms take relatively long on abalone and waveform because these two datasets contain more unique attribute values. For pendigits and shuttle, although the data volume is large (more than 10,000 instances per attribute), the values are all integers and the data distribution is relatively concentrated, so the running times on these two datasets are relatively stable.

6.3 Number of discretized intervals

Table 13 reports the number of intervals generated by each discretization algorithm. In this group of tests, the Friedman statistic for Table 13 is 5.52, greater than the threshold of 3.1, indicating a statistical difference among the Ranks of these discretization methods. The subsequent Bonferroni-Dunn test (critical difference 3.34) shows that MSE has no significant advantage over the other algorithms, since the Rank differences between MSE and the other algorithms in Table 13 do not exceed this critical value.

Table 13 Number of discretized interval generated by the nine algorithms

The numbers of intervals of the two unsupervised algorithms (equal width and equal frequency) and of the KMeans algorithm are given directly by the user, with strong arbitrariness, and are therefore fixed. Among the supervised discretization algorithms, MSE and MDLP produce more intervals than the other two supervised algorithms, mainly because these two algorithms derive the intervals from the MDLPC criterion rather than from a fixed value given in advance. In this way, continuous data can be divided more fully, which reduces the information loss caused by inconsistent data. Overall, the CACC and TSD algorithms perform better in terms of the number of intervals: the number of intervals generated by CACC is always kept within the number of class labels, and TSD discretizes the data in two stages, so the result of local discretization is further reduced during global discretization, yielding fewer discrete intervals.

6.4 Impacts on classification accuracy

We evaluate the impact of the MSE discretization algorithm on classification accuracy by applying MSE to five classic classification algorithms that are widely used to verify the classification accuracy of discretization algorithms. The datasets used in the experiments are partitioned using the 10-fold cross-validation (10-fcv) procedure. The parameters used for the discretizers and classifiers are those recommended by their respective authors, and we assume these parameters are optimal; the specific parameters are listed in Table 14. From the Rank values, the Friedman and Bonferroni-Dunn tests show that the MSE algorithm has the best overall performance, since it ranks first for four of the classification algorithms, whereas the TSD algorithm performs worst among the remaining algorithms. Below, we elaborate on the impact of these discretization methods on the accuracy of each classifier.

Table 14 Parameters of the discretizers and classifiers
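For readers who want to reproduce this protocol, the following scikit-learn sketch scores a discretize-then-classify pipeline with 10-fold cross-validation on iris, one of the UCI datasets listed in Section 5.4. The equal-width KBinsDiscretizer is only a stand-in for a discretizer (plugging in MSE would require wrapping it as a transformer), and DecisionTreeClassifier is sklearn's CART-style tree; the bin count of 4 mirrors the parameter setting used for the unsupervised baselines.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(
    KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform"),  # stand-in discretizer
    DecisionTreeClassifier(random_state=0),                            # CART-style tree
)
scores = cross_val_score(pipe, X, y, cv=10)   # 10-fcv accuracy per fold
print(f"mean accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```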

All experimental results show that the two unsupervised discretization methods, equal width and equal frequency, yield poor classification accuracy because dividing the data equally without considering the decision attributes causes a large data inconsistency rate. The KMeans discretization algorithm classifies better than the two unsupervised algorithms because it clusters close data into one category when dividing the dataset; however, its classification performance is still imperfect because it lacks guiding knowledge, as it does not consider the decision attributes.

The supervised discretization algorithms are applied to the binary decision tree algorithm CART to evaluate the classification accuracy of our proposed algorithm. Table 15 shows the comparison of the discretization algorithms on different datasets. Our proposed MSE algorithm achieves good classification accuracy, especially on six of the datasets, because we divide the dataset into appropriate scales to obtain candidate cut point sets with different manifestations; these cut points better reflect the essential characteristics of the research objects and thereby improve classification accuracy.

Table 15 Impacts on classification accuracy of CART

Tables 16, 17, and 18 show the impacts of the discretization algorithms on three other commonly used classification algorithms. The experimental results show that MSE significantly improves classification accuracy, especially for Naive Bayes (Table 17) and support vector machines (Table 18). Compared with other types of discretization algorithms, MSE is better suited to datasets with many class labels, large data volumes, and uneven distributions: for this type of dataset, the candidate cut point set selected by MSE covers a wider range of data and therefore finds more valuable cut points. In other cases, MSE also maintains relatively stable classification accuracy compared with the other discretization algorithms.

Table 16 Impacts on classification accuracy of RandomForest
Table 17 Impacts on classification accuracy of Naive Bayes
Table 18 Impacts on classification accuracy of SVM

Table 19 shows the classification accuracy with the 1R (OneR) classification algorithm, the simplest of the classifiers: 1R constructs rules based on a single feature of the dataset, i.e., 1-rules. As can be seen from Table 19, CAIM obtains the best performance, because the CAIM algorithm finds the largest number of points for data partitioning, which matches the behaviour of the 1R algorithm. In contrast, MSE selects cut points based on information entropy and the MDLPC criterion, which may result in more division points and relatively smaller data subsets, so its classification accuracy here is slightly unsatisfactory. This does not invalidate MSE, however, because the 1R algorithm is only suitable for cases that focus on a single attribute.

Table 19 Impacts on classification accuracy of OneR

In general, CAIM, CACC, and UrCAIM are discretization algorithms based on class-attribute interdependence, and the latter two are improvements of CAIM; that is, they maximize the class-attribute interdependence and calculate the best intervals according to their own criteria. CAIM and UrCAIM have similar Ranks; for the ionosphere dataset, UrCAIM performs best with two classification algorithms and CAIM performs best with one. The ionosphere dataset contains many conditional attributes and only two class labels, and discretization algorithms based on class-attribute interdependence perform better on datasets with this type of feature distribution.

MSE, IEM, and TSD all use information entropy when selecting cut points. We find that IEM and MSE have similar Ranks and perform well on most datasets, because by adopting the MDLPC criterion they obtain an appropriate number of discrete intervals that better matches the data distribution. In addition, the classification accuracy on the abalone and winequality datasets is relatively low. The abalone dataset contains up to 28 class labels, making it difficult to distinguish the category to which an attribute value belongs during discretization; even so, the MSE algorithm achieves higher classification accuracy than the other discretization algorithms because it divides the candidate set more widely. For the winequality dataset, the low classification accuracy is mainly due to its large differences in attribute values.

7 Conclusion and future work

In this study, we developed a supervised, top-down, static discretization algorithm called MSE, which aims to balance running time and classification accuracy. The algorithm performs a reasonable multi-scale partition of the dataset and can generate a minimal candidate cut point set for a given continuous attribute. The MDLPC criterion is used to evaluate each best candidate cut point, making the selection of cut points more objective and reasonable. We verified the performance of the proposed algorithm through extensive experiments comparing five classic classification algorithms and nine discretization algorithms on 10 UCI datasets. The evidence shows that MSE achieves higher execution efficiency than the other supervised discretization algorithms and better classification accuracy for the five classification algorithms.

In future work we will consider the importance of attributes as well as the relationships between them: according to attribute importance, unnecessary attributes can be eliminated by assigning a reasonable weight to each attribute, further improving the running time of the algorithm. Moreover, the recent evolution of information technology has brought sudden growth in data size, and such big data is not only large but also complex-structured. Therefore, adapting the discretization process to distributed and parallel computing in cluster environments is another direction for future work.