Evaluation is one of the key steps in big data analytics; it determines the merit of a data analysis with respect to the experimental objectives. It usually involves a trade-off comparison of multiple criteria that may conflict with each other, or complex interpretations of the problems in nature. This chapter presents several evaluation models from recent studies in data science. Section 9.1 reviews three evaluation formations for well-known methodologies. Section 9.1.1 describes decision-making support for the evaluation of clustering algorithms based on multiple criteria decision making (MCDM) [1]. Section 9.1.2 is about the evaluation of classification algorithms using MCDM and rank correlation [2]. Section 9.1.3 discusses public blockchain evaluation using entropy and the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) [3]. Section 9.2 outlines two evaluation methods for software. Section 9.2.1 is about classifier evaluation for software defect prediction [4], while Sect. 9.2.2 is about an ensemble of software defect predictors built with an AHP-based evaluation method [5]. Section 9.3 describes four evaluation methods for sociology and economics. Section 9.3.1 is about delivery efficiency and supplier performance evaluation in China’s E-retailing industry [6]. Section 9.3.2 is about credit risk evaluation with a kernel-based affine subspace nearest points learning method [7]. Section 9.3.3 is a dynamic assessment method for urban eco-environmental quality evaluation [8], while Sect. 9.3.4 is an empirical study of classification algorithm evaluation for financial risk prediction [9].

1 Reviews of Evaluation Formations

1.1 Decision-Making Support for the Evaluation of Clustering Algorithms Based on MCDM

In many disciplines, the evaluation of algorithms for processing massive data is a challenging research issue. Different evaluation procedures can produce different or even conflicting assessments of the same algorithms, and this phenomenon has not been fully investigated. The motivation of this section is to propose a solution scheme for the evaluation of clustering algorithms that reconciles such different or even conflicting evaluation results. This section develops a model, called decision-making support for the evaluation of clustering algorithms (DMSECA), which evaluates clustering algorithms by merging expert wisdom in order to reconcile differences in their evaluation performance for information fusion during a complex decision-making process.

1.1.1 Clustering Algorithms

Clustering is a popular unsupervised learning technique. It aims to divide a large data set into smaller groups so that objects in the same cluster are highly similar, whereas objects in different clusters are highly distinct [10]. Clustering algorithms group patterns based on similarity criteria, where groups are sets of similar patterns [11,12,13]. Clustering algorithms are widely applied in many research fields, such as genomics, image segmentation, document retrieval, sociology, bioinformatics, psychology, business intelligence, and financial analysis [14].

Clustering algorithms usually fall into four classes: partitioning methods, hierarchical methods, density-based methods, and model-based methods [15]. Several classic clustering algorithms have been proposed and reported, such as the k-means (KM) algorithm [16], the k-medoids algorithm [17], expectation maximization (EM) [18], and frequent pattern-based clustering [15]. In this section, the six most influential clustering algorithms are selected for the empirical study: the KM algorithm, the EM algorithm, filtered clustering (FC), the farthest-first (FF) algorithm, make-density-based clustering (MD), and hierarchical clustering (HC). All of these clustering algorithms can be implemented in WEKA [19].

The KM algorithm, a partitioning method, takes an input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high and the intercluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity [15].

The EM algorithm, which can be considered an extension of the KM algorithm, is an iterative method for finding the maximum likelihood or maximum a posteriori estimates of parameters in statistical models, where the model depends on unobserved latent variables [20]. The KM algorithm assigns each object to exactly one cluster.

In the EM algorithm, each object is assigned to each cluster according to a weight representing its probability of membership. In other words, there are no strict boundaries between the clusters. Thus, new means can be computed based on the weighted measures [18].

The FC implementation applied in this work is provided by WEKA [19]. Like the clusterer itself, the structure of the filter is based exclusively on the training data, and test instances are processed by the filter without changing their structure.

The FF algorithm is a fast, greedy, and simple approximation algorithm for the k-center problem [17]. The first point is selected as a cluster center, and the second center is greedily chosen as the point farthest from the first. Each remaining center is determined by greedily selecting the point farthest from the set of already chosen centers, and each remaining point is then assigned to the cluster whose center is closest [16, 21].

The MD algorithm is a density-based method. The general idea is to continue growing the given cluster as long as the density (the number of objects or data points) in the neighborhood exceeds some threshold. That is, for each data point within a given cluster, the neighborhood of a given radius must contain a minimum number of points [15]. The HC algorithm is a method of cluster analysis that seeks to build a hierarchy of clusters, which can create a hierarchical decomposition of the given data sets [16, 22].

1.1.2 MCDM Methods

MCDM methods, which were developed in the 1970s, form a complete set of decision analysis techniques that have evolved into an important research field of operations research [23, 24]. The International Society on MCDM defines MCDM as the study of methods and procedures concerning multiple conflicting criteria that can be formally incorporated into the management planning process [24]. In an MCDM problem, the evaluation criteria are assumed to be independent [25, 26]. MCDM methods aim to assist decision-makers (DMs) in identifying an optimal solution from a number of alternatives by synthesizing objective measurements and value judgments [27, 28]. In this section, four classic MCDM methods are introduced: the weighted sum method (WSM), grey relational analysis (GRA), TOPSIS, and PROMETHEE II.

1.1.2.1 WSM

WSM [29] is a well-known MCDM method for evaluating a finite set of alternatives in terms of a finite set of decision criteria when all the data are expressed in the same unit [30, 31]. The benefit-to-cost-ratio and benefit-minus-cost approaches [32] can be applied to problems involving both benefit and cost criteria. In this section, the cost criteria are first transformed into benefit criteria. There is also a nominal-the-better (NB) case, in which values closer to a target value are better. The calculation steps of WSM are as follows. First, assume n criteria, including benefit criteria and cost criteria, and m alternatives. The cost criteria are converted to benefit criteria in the following standardization process.

  1. 1.

    The larger-the-better (LB): a larger objective value is better, that is, the benefit criteria, and it can be standardized as

    $$ {x}_{ij}^{\prime }=\frac{x_{ij}-\underset{i}{\min }{x}_{ij}}{\underset{i}{\max }{x}_{ij}-\underset{i}{\min }{x}_{ij}} $$
    (9.1)
  2. 2.

    The smaller-the-better (SB): the smaller objective value is better, that is, the cost criteria, and it can be standardized as

    $$ {x}_{ij}^{\prime }=\frac{\underset{i}{\max }{x}_{ij}-{x}_{ij}}{\underset{i}{\max }{x}_{ij}-\underset{i}{\min }{x}_{ij}} $$
    (9.2)
  3. 3.

    The nominal-the-better (NB): the closer to the objective value is better, and it can be standardized as

    $$ {x}_{ij}^{\prime }=1-\frac{\left|{x}_{ij}-{x}_{ob}\right|}{\max \left\{\underset{i}{\max }{x}_{ij}-{x}_{ob};{x}_{ob}-\underset{i}{\min }{x}_{ij}\right\}} $$
    (9.3)

Finally, the total benefit of all the alternatives can be calculated as

$$ {A}_i=\sum \limits_{j=1}^n{w}_j{x}_{ij}^{\prime },\kern1em 1\le i\le m,1\le j\le n $$
(9.4)

The larger WSM value indicates the better alternative.
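
To make the WSM steps concrete, the following is a minimal sketch in Python with NumPy; the language and library are assumed choices (the chapter prescribes no implementation), and the decision matrix, weights, and criterion types are hypothetical.

```python
import numpy as np

def wsm_scores(X, weights, criteria):
    """Weighted sum model: standardization (Eqs. 9.1-9.2) plus total benefit (Eq. 9.4).

    X        : (m alternatives) x (n criteria) raw decision matrix
    weights  : n criterion weights summing to 1
    criteria : 'benefit' or 'cost' flag for each criterion
    """
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    Z = np.empty_like(X)
    for j, kind in enumerate(criteria):
        if kind == 'benefit':                      # larger-the-better (9.1)
            Z[:, j] = (X[:, j] - lo[j]) / (hi[j] - lo[j])
        else:                                      # smaller-the-better (9.2)
            Z[:, j] = (hi[j] - X[:, j]) / (hi[j] - lo[j])
    return Z @ np.asarray(weights)                 # total benefit A_i (9.4)

# toy example: three algorithms scored on purity (benefit) and entropy (cost)
scores = wsm_scores([[0.9, 0.4], [0.7, 0.2], [0.8, 0.6]],
                    weights=[0.6, 0.4], criteria=['benefit', 'cost'])
print(scores.argsort()[::-1])                      # indices ranked from best to worst
```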

1.1.2.2 GRA

GRA is a basic MCDM method that combines quantitative computation with qualitative analysis for system analysis. Based on grey space theory, it can address inaccurate and incomplete information. GRA has been widely applied in modeling, prediction, systems analysis, data processing, and decision-making [33]. Its principle is to analyze the similarity relationship between a reference series and the alternative series. The detailed steps are as follows.

Assume that the initial matrix is R:

$$ \mathrm{R}=\left[\begin{array}{cccc}{x}_{11}& {x}_{12}& \cdots & {x}_{1n}\\ {}{x}_{21}& {x}_{22}& \cdots & {x}_{2n}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{x}_{m1}& {x}_{m2}& \cdots & {x}_{mn}\end{array}\right]\left(1\le i\le m,1\le j\le n\right) $$
(9.5)
  1. 1.

    Standardize the initial matrix:

    $$ {\mathrm{R}}^{\prime }=\left[\begin{array}{cccc}{x}_{11}^{\prime }& {x}_{12}^{\prime }& \cdots & {x}_{1n}^{\prime}\\ {}{x}_{21}^{\prime }& {x}_{22}^{\prime }& \cdots & {x}_{2n}^{\prime}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{x}_{m1}^{\prime }& {x}_{m2}^{\prime }& \cdots & {x}_{mn}^{\prime}\end{array}\right]\left(1\le i\le m,1\le j\le n\right) $$
    (9.6)
  2. 2.

    Generate the reference sequence \( {x}_0^{\prime } \):

    $$ {x}_0^{\prime }=\left({x}_0^{\prime }(1),{x}_0^{\prime }(2),\dots, {x}_0^{\prime }(n)\right) $$
    (9.7)

    where \( {x}_0^{\prime }(j) \) is the largest standardized value of the jth factor.

  3. 3.

    Calculate the differences Δ 0i(j) between the reference series and alternative series:

    $$ \begin{array}{r@{\;}l}\Delta_{0i}(j)&=|x_0^{\prime}(j)-x_{ij}^{\prime}|,\\ \Delta &=\left[\begin{array}{llll}{\Delta}_{01}(1)& {\Delta}_{01}(2)& \cdots & {\Delta}_{01}(n)\\ {}{\Delta}_{02}(1)& {\Delta}_{02}(2)& \cdots & {\Delta}_{02}(n)\\ {}\vdots & \vdots & \vdots & \vdots \\ {}{\Delta}_{0m}(1)& {\Delta}_{0m}(2)& \cdots & {\Delta}_{0m}(n)\end{array}\right]\left(1\le i\le m,1\le j\le n\right)\end{array} $$
    (9.8)
  4. 4.

    Calculate the grey coefficient r 0i(j):

    $$ {r}_{0i}(j)=\frac{\underset{i}{\min}\underset{j}{\min }{\Delta}_{0i}(j)+\delta \underset{i}{\max}\underset{j}{\max }{\Delta}_{0i}(j)}{\Delta_{0i}(j)+\delta \underset{i}{\max}\underset{j}{\max }{\Delta}_{0i}(j)} $$
    (9.9)
  5. 5.

    Calculate the value of grey relational degree b i:

    $$ {b}_i=\frac{1}{n}\sum \limits_{j=1}^n{r}_{0i}(j) $$
    (9.10)
  6. 6.

    Finally, standardize the value of grey relational degree β i:

    $$ {\beta}_i=\frac{b_i}{\sum \limits_{i=1}^m{b}_i} $$
    (9.11)
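
The GRA steps above translate into the short NumPy sketch below, assuming all criteria have already been converted to benefit type and using the customary distinguishing coefficient δ = 0.5; the input matrix is hypothetical.

```python
import numpy as np

def gra_grades(X, delta=0.5):
    """Grey relational analysis (Eqs. 9.6-9.11) for benefit-type criteria."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    Z = (X - lo) / (hi - lo)                 # standardized matrix R' (9.6)
    ref = Z.max(axis=0)                      # reference sequence x'_0 (9.7)
    diff = np.abs(ref - Z)                   # differences Delta_0i(j) (9.8)
    r = (diff.min() + delta * diff.max()) / (diff + delta * diff.max())   # (9.9)
    b = r.mean(axis=1)                       # grey relational degree b_i (9.10)
    return b / b.sum()                       # standardized degree beta_i (9.11)

beta = gra_grades([[0.9, 0.4], [0.7, 0.8], [0.8, 0.6]])
print(beta.argsort()[::-1])                  # a larger beta indicates a better alternative
```
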
1.1.2.3 TOPSIS

TOPSIS is one of the classic MCDM methods to rank alternatives over multicriteria. The principle is that the chosen alternative should have the shortest distance from the positive ideal solution (PIS) and the farthest distance from the negative ideal solution (NIS) [34]. TOPSIS can find the best alternative by minimizing the distance to the PIS and maximizing the distance to the NIS [35]. The alternatives can be ranked by their relative closeness to the ideal solution. The calculation steps are as follows [36]:

  1. 1.

    The decision matrix A is standardized:

    $$ {a}_{ij}=\frac{x_{ij}}{\sqrt{\sum \limits_{i=1}^m{\left({x}_{ij}\right)}^2}}\left(1\le i\le m,1\le j\le n\right) $$
    (9.12)
  2. 2.

    The weighted standardized decision matrix is computed:

    $$ {\displaystyle \begin{array}{l}D=\left({a}_{ij}\cdot {w}_j\right)\kern1em \left(1\le i\le m,1\le j\le n\right)\\ {}\sum \limits_{j=1}^n{w}_j=1\end{array}} $$
    (9.13)
  3. 3.

    The PIS V* and the NIS V− are calculated:

    $$ {\displaystyle \begin{array}{l}{V}^{\ast }=\left\{{v}_1^{\ast },{v}_2^{\ast },\dots, {v}_n^{\ast}\right\}=\left\{\left(\underset{i}{\max }{v}_{ij}\mid j\in J\right),\left(\underset{i}{\min }{v}_{ij}\mid j\in {J}^{\prime}\right)\right\}\\ {}{V}^{-}=\left\{{v}_1^{-},{v}_2^{-},\dots, {v}_n^{-}\right\}=\left\{\left(\underset{i}{\min }{v}_{ij}\mid j\in J\right),\left(\underset{i}{\max }{v}_{ij}\mid j\in {J}^{\prime}\right)\right\}\end{array}} $$
    (9.14)
  4. 4.

    The distances of each alternative from PIS and NIS are determined:

    $$ {\displaystyle \begin{array}{l}{S}_i^{+}=\sqrt{\sum \limits_{j=1}^n{\left({v}_{ij}-{v}_j^{\ast}\right)}^2}\kern1em \left(1\le i\le m,1\le j\le n\right)\\ {}{S}_i^{-}=\sqrt{\sum \limits_{j=1}^n{\left({v}_{ij}-{v}_j^{-}\right)}^2}\kern1em \left(1\le i\le m,1\le j\le n\right)\end{array}} $$
    (9.15)
  5. 5.

    The relative closeness to the ideal solution is obtained:

    $$ {Y}_i=\frac{S_i^{-}}{S_i^{+}+{S}_i^{-}}\left(1\le i\le m\right) $$
    (9.16)
  6. 6.

    The preference order is ranked.

The larger relative closeness indicates the better alternative.
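
A compact NumPy sketch of Eqs. (9.12)–(9.16) follows; it is an illustrative reading of the steps rather than the authors' implementation, and the decision matrix, weights, and benefit/cost flags are hypothetical.

```python
import numpy as np

def topsis(X, weights, benefit):
    """TOPSIS relative closeness (Eqs. 9.12-9.16).

    benefit : True where a larger criterion value is better, False for cost criteria.
    """
    X = np.asarray(X, dtype=float)
    A = X / np.sqrt((X ** 2).sum(axis=0))            # vector normalization (9.12)
    D = A * np.asarray(weights)                      # weighted matrix (9.13)
    pis = np.where(benefit, D.max(axis=0), D.min(axis=0))   # PIS V* (9.14)
    nis = np.where(benefit, D.min(axis=0), D.max(axis=0))   # NIS V-
    s_plus = np.sqrt(((D - pis) ** 2).sum(axis=1))   # distance to PIS (9.15)
    s_minus = np.sqrt(((D - nis) ** 2).sum(axis=1))  # distance to NIS
    return s_minus / (s_plus + s_minus)              # relative closeness Y_i (9.16)

Y = topsis([[0.9, 0.4], [0.7, 0.2], [0.8, 0.6]],
           weights=[0.6, 0.4], benefit=[True, False])
print(Y.argsort()[::-1])                             # larger closeness ranks first
```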

1.1.3 PROMETHEE II

PROMETHEE II, proposed by Brans in 1982, uses pairwise comparisons and valued outranking relations to select the best alternative [37]. PROMETHEE II can support DMs in reaching an agreement on feasible alternatives over multiple criteria from different perspectives [38, 39]. In the PROMETHEE II method, the positive outranking flow expresses how much an alternative outranks all the other alternatives, whereas the negative outranking flow expresses how much it is outranked by them. Based on the positive and negative outranking flows, the final alternative is selected according to the net outranking flow. The steps are as follows:

  1. 1.

    Normalize the decision matrix R:

    $$ {R}_{ij}=\frac{x_{ij}-\underset{i}{\min }{x}_{ij}}{\underset{i}{\max }{x}_{ij}-\underset{i}{\min }{x}_{ij}}\left(1\le i\le n,1\le j\le m\right) $$
    (9.17)
  2. 2.

    Define the aggregated preference indices. Let a, b ∈ A and

    $$ \left\{\begin{array}{c}\pi \left(a,b\right)=\sum \limits_{j=1}^k{p}_j\left(a,b\right){w}_j\\ {}\pi \left(b,a\right)=\sum \limits_{j=1}^k{p}_j\left(b,a\right){w}_j\end{array}\right. $$
    (9.18)

    where A is a finite set of alternatives {a1, a2, …, an}, k is the number of criteria such that 1 ≤ k ≤ m, w j is the weight of criterion j, and \( \sum \limits_{j=1}^k{w}_j=1 \). π(a, b) represents the degree to which a is preferred to b over all the criteria, and π(b, a) the degree to which b is preferred to a. p j(a, b) and p j(b, a) are the preference functions of the alternatives a and b for criterion j.

  3. 3.

    Calculate π(a, b) and π(b, a) for each pair of alternatives

    In general, there are six types of preference function. DMs must select one type of preference function and the corresponding parameter value for each criterion [40, 41].

  4. 4.

    Determine the positive outranking flow and negative outranking flow. The positive outranking flow is determined by

    $$ {\phi}^{+}(a)=\frac{1}{n-1}\sum \limits_{x\in A}\pi \left(a,x\right) $$
    (9.19)

    and the negative outranking flow is determined by

    $$ {\phi}^{-}(a)=\frac{1}{n-1}\sum \limits_{x\in A}\pi \left(x,a\right) $$
    (9.20)
  5. 5.

    Calculate the net outranking flow:

    $$ \phi (a)={\phi}^{+}(a)-{\phi}^{-}(a) $$
    (9.21)
  6. 6.

    Determine the ranking according to the net outranking flow.

A larger ϕ(a) indicates a more appropriate alternative.
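
The sketch below illustrates Eqs. (9.17)–(9.21), assuming benefit-type criteria and the simplest ("usual") preference function, i.e., p_j(a, b) = 1 whenever a strictly beats b on criterion j and 0 otherwise; the data and weights are hypothetical.

```python
import numpy as np

def promethee_ii(X, weights):
    """Net outranking flows (Eqs. 9.17-9.21) with the 'usual' preference function."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    R = (X - lo) / (hi - lo)                        # normalized matrix (9.17)
    n = R.shape[0]
    pi = np.zeros((n, n))                           # aggregated preference indices (9.18)
    for a in range(n):
        for b in range(n):
            if a != b:
                pref = (R[a] > R[b]).astype(float)  # 1 if a beats b on criterion j
                pi[a, b] = np.dot(pref, weights)
    phi_plus = pi.sum(axis=1) / (n - 1)             # positive flow (9.19)
    phi_minus = pi.sum(axis=0) / (n - 1)            # negative flow (9.20)
    return phi_plus - phi_minus                     # net flow phi(a) (9.21)

phi = promethee_ii([[0.9, 0.4], [0.7, 0.8], [0.8, 0.6]], weights=[0.6, 0.4])
print(phi.argsort()[::-1])                          # a larger net flow ranks first
```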

1.1.4 Performance Measures

External measures for evaluating clustering results are more effective than internal and relative measures. Accordingly, nine external clustering measures are selected for evaluation in this study: entropy, purity, micro-average precision (MAP), Rand index (RI), adjusted Rand index (ARI), F-measure (FM), Fowlkes–Mallows index (FMI), Jaccard coefficient (JC), and Mirkin metric (MM). Among them, entropy and purity are widely applied as external measures in the fields of data mining and machine learning [42, 43]. The nine external measures are computed on a machine with an Intel Core i5-3210M CPU @ 2.50 GHz and 8 GB of memory. Before the external measures are introduced, the contingency table is described.

1.1.5 The Contingency Table

Given a data set D with n objects, suppose we have a partition P = {P 1, P 2, …, P k} produced by some clustering method, where \( {\cup}_{i=1}^k{P}_i=D \) and P i ∩ P j = ϕ for 1 ≤ i ≠ j ≤ k. According to the preassigned class labels, we can create another partition C = {C 1, C 2, …, C k}, where \( {\cup}_{i=1}^k{C}_i=D \) and C i ∩ C j = ϕ for 1 ≤ i ≠ j ≤ k. Let n ij denote the number of objects in cluster P i with the label of class C j. Then, the data information between the two partitions can be displayed in the form of a contingency table, as shown in Table 9.1.

Table 9.1 Contingency table

The following paragraphs define the nine external measures.

  1. 1.

    Entropy. The measure of entropy, which originated in the information-retrieval community, can measure the variance of a probability distribution. If all clusters consist of objects with only a single class label, the entropy is zero, and as the class labels of objects in a cluster become more varied, the entropy increases. The measure of entropy is calculated as

    $$ E=-\sum \limits_i\frac{n_i}{n}\left(\sum \limits_j\frac{n_{ij}}{n_i}\log \frac{n_{ij}}{n_i}\right) $$
    (9.22)
  2. 2.

    Purity. The measure of purity pays close attention to the representative class (the class with majority objects within each cluster). Purity is similar to entropy. It is calculated as

    $$ P=\sum \limits_i\frac{n_i}{n}\left(\underset{j}{\mathit{\max}}\frac{n_{ij}}{n_i}\right) $$
    (9.23)

    A higher purity value usually represents more effective clustering.

  3. 3.

    F-Measure. The F-measure (FM) is a harmonic mean of precision and recall. It is commonly considered as clustering accuracy. The calculation of FM is inspired by the information-retrieval metric as follows:

    $$ {\displaystyle \begin{array}{cc}F- \text{measure}& =\frac{2\times \text{precision}\times \text{recall}}{\text{precision}+ \text{recall}}\\ {} \text{precision}& =\frac{n_{ij}}{n_j}, \text{recall}=\frac{n_{ij}}{n_i}\end{array}} $$
    (9.24)

    A higher value of FM generally indicates more accurate clustering.

  4. 4.

    Micro-average Precision. The MAP is usually applied in the information-retrieval community. It can obtain a clustering result by assigning all data objects in a given cluster to the most dominant class label and then evaluating the following quantities for each class:

    1. (a)

      α(Cj): the number of objects correctly assigned to class Cj.

    2. (b)

      β(Cj): the number of objects incorrectly assigned to class Cj.

      The MAP measure is computed as follows:

      $$ \mathrm{MAP}=\frac{\sum_j\alpha \left({C}_j\right)}{\sum_j\alpha \left({C}_j\right)+\beta \left({C}_j\right)} $$
      (9.25)

      A higher MAP value indicates more accurate clustering.

  5. 5.

    Mirkin Metric. The measure of Mirkin metric (MM) assumes the null value for identical clusters and a positive value, otherwise. It corresponds to the Hamming distance between the binary vector representations of each partition [44]. The measure of MM is computed as

    $$ M=\sum \limits_i{n}_{i.}^2+\sum \limits_j{n}_{.j}^2-2\sum \limits_i\sum \limits_j{n}_{ij}^2 $$
    (9.26)

    A lower value of MM implies more accurate clustering. In addition, given a data set, assume a partition C is a clustering structure of a data set and P is a partition by some clustering method. We refer to a pair of points from the dataset as follows:

    1. (a)

      SS: if both points belong to the same cluster of the clustering structure C and to the same group of the partition P

    2. (b)

      SD: if the points belong to the same clusters of C and to different groups of P

    3. (c)

      DS: if the points belong to different clusters of C and to the same groups of P

    4. (d)

      DD: if the points belong to different clusters of C and to different groups of P

      Assume that a, b, c, and d are the numbers of SS, SD, DS, and DD pairs, respectively, and that M = a + b + c + d, which is the maximum number of pairs in the data set. The following indicators for measuring the degree of similarity between C and P can then be defined.

  6. 6.

    Rand Index. The RI is a measure of the similarity between two data clusters in statistics and data clustering [45]. RI is computed as follows:

    $$ R=\frac{\left(a+d\right)}{M} $$
    (9.27)

    A higher value of RI indicates a more accurate result of clustering.

  7. 7.

    Jaccard Coefficient. The JC, also known as the Jaccard similarity coefficient (originally named the “coefficient de commutate” by Paul Jaccard), is a statistic applied to compare the similarity and diversity of sample sets [46]. JC is computed as follows:

    $$ J=\frac{a}{\left(a+b+c\right)} $$
    (9.28)

    A higher value of JC indicates a more accurate result of clustering.

  8. 8.

    Fowlkes and Mallows Index. The Fowlkes and Mallows index (FMI) was proposed by Fowlkes and Mallows [47] as an alternative for the RI. The measure of FMI is computed as follows:

    $$ \mathrm{FMI}=\sqrt{\frac{a}{a+b}\cdot \frac{a}{a+c}} $$
    (9.29)

    A higher value of FMI indicates more accurate clustering.

  9. 9.

    Adjusted Rand Index. The adjusted Rand index (ARI) is the corrected-for-chance version of the measure of RI. It ranges from −1 to 1 and expresses the level of concordance between two bipartitions [48]. A value of ARI closest to 1 indicates almost perfect concordance between the two compared bipartitions, whereas a value near −1 indicates almost complete discordance [49]. The measure of ARI is computed as:

    $$ \mathrm{ARI}=\frac{a-\frac{\left(a+b\right)\left(a+c\right)}{M}}{\frac{\left(a+b\right)+\left(a+c\right)}{2}-\frac{\left(a+b\right)\left(a+c\right)}{M}} $$
    (9.30)

    A higher value of ARI indicates more accurate clustering.
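
For the pair-counting measures above (RI, JC, FMI, and ARI), the sketch below counts the SS, SD, DS, and DD pairs directly from two label vectors and evaluates Eqs. (9.27)–(9.30); the label vectors are hypothetical, and the remaining measures (entropy, purity, FM, MAP, MM) would instead be computed from the contingency table.

```python
import math
from itertools import combinations

def pair_counts(pred, true):
    """Count SS (a), SD (b), DS (c), and DD (d) pairs between partitions P and C."""
    a = b = c = d = 0
    for i, j in combinations(range(len(true)), 2):
        same_p, same_c = pred[i] == pred[j], true[i] == true[j]
        if same_c and same_p:       a += 1     # SS
        elif same_c and not same_p: b += 1     # SD
        elif not same_c and same_p: c += 1     # DS
        else:                       d += 1     # DD
    return a, b, c, d

def pair_measures(pred, true):
    a, b, c, d = pair_counts(pred, true)
    M = a + b + c + d                                    # maximum number of pairs
    ri = (a + d) / M                                     # Rand index (9.27)
    jc = a / (a + b + c)                                 # Jaccard coefficient (9.28)
    fmi = math.sqrt(a / (a + b) * a / (a + c))           # Fowlkes-Mallows index (9.29)
    expected = (a + b) * (a + c) / M
    ari = (a - expected) / (((a + b) + (a + c)) / 2 - expected)   # adjusted Rand index (9.30)
    return dict(RI=ri, JC=jc, FMI=fmi, ARI=ari)

print(pair_measures([0, 0, 1, 1, 1], [0, 0, 1, 1, 0]))
```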

1.1.6 Index Weights

In this work, the index weights for the four MCDM methods are calculated by AHP. The AHP method, proposed by Saaty [50], is a widely used tool for modeling unstructured problems by synthesizing subjective and objective information in many disciplines, such as politics, economics, biology, sociology, management science, and the life sciences [51,52,53]. It can elicit a corresponding priority vector from pair-by-pair comparison values [54] obtained from the scores of experts on an appropriate scale. AHP has some problems; for example, the priority vector derived from the eigenvalue method can violate a condition of order preservation proposed by Costa and Vansnick [55]. Nevertheless, AHP remains a classic and important approach, especially in the fields of operations research and management science [56]. AHP has the following steps:

  1. 1.

    Establish a hierarchical structure: a complex problem can be established in such a structure, including the goal level, criteria level, and alternative level [57].

  2. 2.

    Determine the pairwise comparison matrix: once the hierarchy is structured, the prioritization procedure starts for determining the relative importance of the criteria (index weights) within each level [5]. The pairwise comparison values are obtained from the scores of experts on a 1–9 scale.

  3. 3.

    Calculate index weights: the index weights are usually calculated by the eigenvector method proposed by Saaty [50].

  4. 4.

    Test consistency: the value of 0.1 is generally considered the acceptable upper limit of the consistency ratio (CR). If the CR exceeds this value, the procedure must be repeated to improve consistency.
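
A minimal sketch of the eigenvector method and the consistency test is given below; the pairwise comparison matrix is hypothetical, and the random index table (an assumption of this sketch) covers only matrix sizes 3 to 9.

```python
import numpy as np

def ahp_weights(pairwise):
    """Index weights and consistency ratio from an AHP pairwise comparison matrix."""
    A = np.asarray(pairwise, dtype=float)
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)                    # principal eigenvalue lambda_max
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                                # priority vector (eigenvector method)
    ci = (eigvals[k].real - n) / (n - 1)           # consistency index
    ri = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}[n]
    return w, ci / ri                              # weights and consistency ratio

# three criteria compared on Saaty's 1-9 scale (hypothetical expert judgments)
w, cr = ahp_weights([[1, 3, 5],
                     [1 / 3, 1, 2],
                     [1 / 5, 1 / 2, 1]])
print(w, cr < 0.1)                                 # a CR below 0.1 is acceptable
```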

1.1.7 The Proposed Model

Clustering results can vary according to the evaluation method. Rankings can conflict even when abundant data are processed, and a large knowledge gap can exist between the evaluation results [58] due to the anticipation, experience, and expertise of all individual participants. The decision-making process is extremely complex. This makes it difficult to make accurate and effective decisions [41]. The proposed DMSECA model consists of three steps. They are as follows.

The first step usually involves modeling by clustering algorithms, which can be accomplished using one or more procedures selected from the categories of hierarchical, density-based, partitioning, and model-based methods. In this section, we apply the six most influential clustering algorithms, including EM, the FF algorithm, FC, HC, MD, and KM, for task modeling by using WEKA 3.7 on 20 UCI data sets, including a total of 18,310 instances and 313 attributes. Each of these clustering algorithms belongs to one of the four categories of clustering algorithms mentioned previously. Hence, all categories are represented.

In the second step, four commonly used MCDM methods (TOPSIS, WSM, GRA, and PROMETHEE II) are applied to rank the performance of the clustering algorithms over the 20 UCI data sets, using the nine external measures computed in the first step as input. These methods are suitable for the given data sets; unsuitable methods were not selected. For example, we did not select VIKOR because its denominator would be zero for the given data sets. The index weights are determined by AHP based on the eigenvalue method, with three experts from the field of MCDM consulted as the DMs to provide the pairwise comparison values. Each MCDM method is randomly assigned to five UCI data sets. Applying more than one MCDM method to analyze and evaluate the performance of the clustering algorithms is essential.

Finally, in the third step, we propose a decision-making support model to reconcile the individual differences or even conflicts in the evaluation performance of the clustering algorithms among the 20 UCI data sets. The proposed model can generate a list of algorithm priorities to select the most appropriate clustering algorithm for secondary mining and knowledge discovery. The detailed steps of the decision-making support model, based on the 80-20 rule, are described as follows.

  • Step 1. Mark two sets of alternatives in a lower position and an upper position, respectively.

    It is well known that the 80-20 rule states that, in most situations, 80% of the results originate from 20% of the activity. The rule is credited to Vilfredo Pareto, who observed that in most countries about 80% of the wealth is controlled by 20% of the people. The implication is that it is better to be in the top 20% than in the bottom 80%. The 80-20 rule can therefore be applied to focus the analysis on the most important positions of the rankings relative to the number of observations, reflecting this predictable imbalance. In this research, based on the expert wisdom originating from that highly leveraged 20%, the set of alternatives is classified into two categories: the top 1/5 of the alternatives is marked as the upper position, which represents the more satisfactory rankings in the opinion of all individual participants involved in the algorithm evaluation process, and the bottom 1/5 is marked as the lower position, which represents the more dissatisfactory rankings. The last element marked in the upper position is calculated as follows:

    $$ x=\frac{n\ast 1}{5} $$
    (9.31)

    where n is the number of alternatives. For instance, if n = 7, then x = 7 × 1/5 = 1.4 ≈ 2. Hence, the second position bounds the upper block of the ranking: the first and second positions are the alternatives in the upper position, which are considered the collective group opinion of the most appropriate and satisfactory alternatives.

    Similarly, the element marked in the lower position is calculated as

    $$ x=\frac{n\ast 4}{5} $$
    (9.32)

    where n is the number of alternatives. For instance, if n = 7, then x = 7 × 4/5 = 5.6 ≈ 6. Thus, the sixth position bounds the lower block of the ranking: the sixth and seventh positions are the alternatives in the lower position, which are considered collectively the worst and most dissatisfactory alternatives.

  • Step 2. Grade the sets of alternatives in the lower and upper positions, respectively.

    A score is assigned to each position of the set of alternatives in the lower position and upper position, respectively.

    The score in the lower position can be calculated by assigning a value of 1 to the first position, 2 to the second position, …, and x to the last position. The scores of each alternative in the lower position are then totaled and denoted d i.

    Similarly, the score in the upper position can be calculated by assigning a value of 1 to the last position, 2 to the penultimate position, …, and x to the first position. The scores of each alternative in the upper position are then totaled and denoted b i.

  • Step 3. Generate the priority of each alternative.

    The priority of each alternative fi, which represents the most satisfactory rankings from the opinions of all individual participants, can be determined as

    $$ {f}_i={b}_i-{d}_i $$
    (9.33)

    where a higher value of f i implies a higher priority.
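
The 80-20 marking scheme can be sketched as follows; this is an illustrative reading of the model rather than the authors' code, the rankings are hypothetical, and positions are rounded up, consistent with the n = 7 example above.

```python
import math

def dmseca_priorities(rankings):
    """Merge several best-to-worst rankings with the 80-20 scheme (Eqs. 9.31-9.33)."""
    algorithms = list(rankings[0])
    n = len(algorithms)
    x_up = math.ceil(n * 1 / 5)                    # last upper position (9.31)
    x_low = math.ceil(n * 4 / 5)                   # first lower position (9.32)
    b = dict.fromkeys(algorithms, 0)               # upper-position scores b_i
    d = dict.fromkeys(algorithms, 0)               # lower-position scores d_i
    for ranking in rankings:
        for pos, alg in enumerate(ranking[:x_up], start=1):
            b[alg] += x_up - pos + 1               # x_up points for 1st place, ..., 1 point
        for pos, alg in enumerate(ranking[x_low - 1:], start=1):
            d[alg] += pos                          # 1 point at position x_low, increasing
    return {alg: b[alg] - d[alg] for alg in algorithms}   # priority f_i = b_i - d_i (9.33)

ranks = [['EM', 'KM', 'FC', 'HC', 'MD', 'FF'],
         ['KM', 'EM', 'MD', 'FC', 'HC', 'FF'],
         ['EM', 'MD', 'KM', 'HC', 'FC', 'FF']]
print(sorted(dmseca_priorities(ranks).items(), key=lambda kv: -kv[1]))
```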

1.2 Evaluation of Classification Algorithms Using MCDM and Rank Correlation

This subsection combines MCDM methods with Spearman’s rank correlation coefficient to rank classification algorithms. This approach first uses several MCDM methods to rank classification algorithms and then applies Spearman’s rank correlation coefficient to resolve differences among the MCDM methods. Five MCDM methods, namely TOPSIS, ELECTRE III, grey relational analysis, VIKOR, and PROMETHEE II, are implemented in this research.

1.2.1 Two MCDM Methods

In addition to GRA, TOPSIS, and PROMETHEE II methods, here two more MCDM methods are outlined as below.

1.2.1.1 ELimination and Choice Expressing REality (ELECTRE)

ELECTRE stands for ELimination Et Choix Traduisant la REalite (ELimination and Choice Expressing the REality) and was first proposed by Roy [59] to choose the best alternative from a collection of alternatives. Over the last four decades, a family of ELECTRE methods has been developed, including ELECTRE I, ELECTRE II, ELECTRE III, ELECTRE IV, ELECTRE IS, and ELECTRE TRI.

There are two main steps in ELECTRE methods: the first is the construction of one or several outranking relations; the second is an exploitation procedure that identifies the best compromise alternative based on the outranking relations obtained in the first step [60]. ELECTRE III is chosen in this section because it is appropriate for the sorting problem. The procedure can be summarized as follows [59, 61, 62]:

  • Step 1. Define a concordance and discordance index set for each pair of alternatives

    $$ {A}_j\ \mathrm{and}\ {A}_k,\kern0.5em j,k=1,\dots, m;j\ne k $$
  • Step 2. Add all the indices of an alternative to get its global concordance index C ki.

  • Step 3. Define an outranking credibility degree σ s(A i, A k) by combining the discordance indices and the global concordance index.

  • Step 4. Define two outranking relations using descending and ascending distillation. Descending distillation selects the best alternative first and the worst alternative last. Ascending distillation selects the worst alternative first and the best alternative last.

  • Step 5. Alternatives are ranked based on ascending and descending distillation processes.

1.2.1.2 VlseKriterijumska Optimizacija I Kompromisno Resenje (VIKOR)

VIKOR was proposed by Opricovic [63] and Opricovic and Tzeng [64] for multicriteria optimization of complex systems. The multicriteria ranking index, which is based on the particular measure of closeness to the ideal alternative, is introduced to rank alternatives in the presence of conflicting criteria. This section uses the following VIKOR algorithm provided by Opricovic and Tzeng in the experiment:

  • Step 1. Determine the best \( {f}_i^{\ast } \) and the worst \( {f}_i^{-} \) values of all criterion functions, i = 1, 2, ⋯, n.

    $$ {\displaystyle \begin{array}{c}{f}_i^{\ast }=\left\{\begin{array}{c}{\max}_j{f}_{ij}, \text{for\ benefit\ criteria}\ \\ {}{\min}_j{f}_{ij}, \text{for\ cost\ criteria}\ \end{array}\right\},j=1,2,\dots, J,\\ {}{f}_i^{-}=\left\{\begin{array}{c}{\min}_j{f}_{ij}, \text{for\ benefit\ criteria}\ \\ {}{\max}_j{f}_{ij}, \text{for\ cost\ criteria}\ \end{array}\right\},j=1,2,\dots, J,\end{array}} $$

    where J is the number of alternatives, n is the number of criteria, and f ij is the rating of ith criterion function for alternative aj.

  • Step 2. Compute the values S j and R j; j = 1, 2, ⋯, J, by the relations

    $$ {\displaystyle \begin{array}{c}{S}_j={\sum}_{i=1}^n{w}_i\left({f}_i^{\ast }-{f}_{ij}\right)/\left({f}_i^{\ast }-{f}_i^{-}\right)\\ {}{R}_j={\max}_i\left[{w}_i\left({f}_i^{\ast }-{f}_{ij}\right)/\left({f}_i^{\ast }-{f}_i^{-}\right)\right]\end{array}} $$

    where w i is the weight of the ith criterion; S j and R j are used to formulate the ranking measure.

  • Step 3. Compute the values Qj; j = 1, 2, ⋯, J, by the relations

    $$ {\displaystyle \begin{array}{cc}{Q}_j& =v\left({S}_j-{S}^{\ast}\right)/\left({S}^{-}-{S}^{\ast}\right)+\left(1-v\right)\left({R}_j-{R}^{\ast}\right)/\left({R}^{-}-{R}^{\ast}\right)\\ {}{S}^{\ast }& ={\min}_j{S}_j,\kern0.5em {S}^{-}={\max}_j{S}_j\\ {}{R}^{\ast }& ={\min}_j{R}_j,\kern0.5em {R}^{-}={\max}_j{R}_j\end{array}} $$

    where the solution obtained by S is with a maximum group utility, the solution obtained by R is with a minimum individual regret of the opponent, and v is the weight of the strategy of the majority of criteria. The value of v is set to 0.5 in the experiment.

  • Step 4. Rank the alternatives in decreasing order. There are three ranking lists: S; R, and Q.

  • Step 5. Propose the alternative a′, which is ranked the best by Q, as a compromise solution if the following two conditions are satisfied:

    (a) Q(a″) − Q(a′) ≥ 1/(J − 1); (b) alternative a′ is also ranked the best by S and/or R.

    If only condition (b) is not satisfied, alternatives a′ and a″ are proposed as compromise solutions, where a″ is ranked second by Q. If condition (a) is not satisfied, alternatives a′, a″, …, a(M) are proposed as compromise solutions, where a(M) is ranked Mth by Q and M is determined by the relation Q(a(M)) − Q(a′) < 1/(J − 1) for maximum M.
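
The sketch below computes the S, R, and Q values of the VIKOR procedure for a small hypothetical rating matrix, with v = 0.5 as in the experiment; the compromise conditions of Step 5 are not encoded.

```python
import numpy as np

def vikor(F, weights, benefit, v=0.5):
    """VIKOR S, R, and Q values for an (n criteria) x (J alternatives) rating matrix F."""
    F = np.asarray(F, dtype=float)                  # f_ij: i-th criterion, j-th alternative
    w = np.asarray(weights)[:, None]
    best = np.where(benefit, F.max(axis=1), F.min(axis=1))[:, None]    # f_i*
    worst = np.where(benefit, F.min(axis=1), F.max(axis=1))[:, None]   # f_i^-
    norm = w * (best - F) / (best - worst)          # w_i (f_i* - f_ij) / (f_i* - f_i^-)
    S, R = norm.sum(axis=0), norm.max(axis=0)
    Q = (v * (S - S.min()) / (S.max() - S.min())
         + (1 - v) * (R - R.min()) / (R.max() - R.min()))
    return S, R, Q

S, R, Q = vikor([[0.9, 0.7, 0.8],                   # criterion 1 (benefit)
                 [0.4, 0.2, 0.6]],                  # criterion 2 (cost)
                weights=[0.6, 0.4], benefit=[True, False])
print(Q.argsort())                                  # the smallest Q is the compromise candidate
```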

1.2.2 Spearman’s Rank Correlation Coefficient

Spearman’s rank correlation coefficient measures the similarity between two sets of rankings. The basic idea of the proposed approach is to assign a weight to each MCDM method according to the similarities between the ranking it generates and the rankings produced by the other MCDM methods. A large value of Spearman’s rank correlation coefficient indicates good agreement between an MCDM method and the other MCDM methods.

The proposed approach is designed to handle conflicting MCDM rankings through three steps. In the first step, a selection of MCDM methods is applied to rank classification algorithms. If there are strong disagreements among MCDM methods, the different ranking scores generated by MCDM methods are used as inputs for the second step.

The second step utilizes Spearman’s rank correlation coefficient to find the weights for each MCDM method. Spearman’s rank correlation coefficient between the kth and ith MCDM methods is calculated by the following equation:

$$ {\rho}_{ki}=1-\frac{6\sum {d}_i^2}{n\left({n}^2-1\right)} $$
(9.34)

where n is the number of alternatives and d i is the difference between the ranks assigned to the ith alternative by the two MCDM methods. Based on the values of ρ ki, the average similarity between the kth MCDM method and the other MCDM methods can be calculated as

$$ {\rho}_k=\frac{1}{q-1}{\sum}_{i=1,i\ne k}^q{\rho}_{ki},k=1,2,\dots, q, $$
(9.35)

where q is the number of MCDM methods. The larger the ρ k value, the more important the MCDM method is. Normalized ρ k values can then be used as weights for the MCDM methods in the secondary ranking.

The third step uses the weights obtained from the second step to get secondary rankings of classifiers. Each MCDM method is applied to re-rank classification algorithms using ranking scores produced by MCDM methods in the first step and the weights obtained in the second step.

  • The detailed experimental study of this method can be found in [2].
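
A short sketch of the weighting step, i.e., deriving MCDM-method weights from pairwise Spearman correlations (Eqs. 9.34 and 9.35), is given below; the rank table of classifiers is hypothetical.

```python
import numpy as np

def mcdm_weights(rank_table):
    """Weights for q MCDM methods from pairwise Spearman correlations (Eqs. 9.34-9.35).

    rank_table : (q methods) x (n algorithms) array of ranks, 1 = best.
    """
    ranks = np.asarray(rank_table, dtype=float)
    q, n = ranks.shape
    rho = np.zeros((q, q))
    for k in range(q):
        for i in range(q):
            diff = ranks[k] - ranks[i]
            rho[k, i] = 1 - 6 * (diff ** 2).sum() / (n * (n ** 2 - 1))   # (9.34)
    avg = (rho.sum(axis=1) - 1) / (q - 1)   # average similarity rho_k (9.35), excluding rho_kk
    return avg / avg.sum()                  # normalized weights for the secondary ranking

# hypothetical ranks of four classifiers produced by three MCDM methods
print(mcdm_weights([[1, 2, 3, 4],
                    [2, 1, 3, 4],
                    [1, 3, 2, 4]]))
```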

1.3 Public Blockchain Evaluation Using Entropy and TOPSIS

This subsection aims to make a comprehensive evaluation of public blockchains from multiple dimensions. Three first-level indicators and eleven second-level indicators are designed to evaluate public blockchains. The Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) is used to rank public blockchains, and the entropy method is used to determine the weight of each dimension. Since Bitcoin has an absolute advantage, a let-the-first-out (LFO) strategy is proposed to reduce the criteria of the positive ideal solution and make a more reasonable evaluation.

1.3.1 Proposed Evaluation Model

1.3.1.1 Evaluation Indicator

With increasing performance requirements, more and more blockchains are being designed with new technology. Technology is therefore an important indicator for evaluating public blockchains, but technology is not everything. Popularity is also a key factor in measuring a platform or system, and blockchains are no exception. For example, the second Global Public Blockchain Technology Assessment Index showed that Bitcoin ranked 17th, yet Bitcoin is still one of the most popular blockchains.

Therefore, two indicators are designed to measure the popularity of public blockchains. One is recognition, which is the degree of acceptance of a public blockchain by developers and others: the greater the acceptance, the better the blockchain. The other is activity, which measures how active developers and others are. When developers stop maintaining and improving a blockchain, or people stop talking about it, the blockchain is no longer popular. Developers and other people could be considered separately, but they are grouped under the same indicators in this section because they concern the same topic. Figure 9.1 shows the first-level indicators and their second-level indicators.

Fig. 9.1
figure 1

The evaluation indicators for public blockchains evaluation

1.3.1.2 Technology

The basic technology (I11) and the applicability (I12) are the first and the second second-level indicators of technology respectively. These two indicators are quantified by the expert scoring method. Since CCID has established a technology assessment index for public blockchains, this section will reference its scoring results for the two indicators. The basic technology mainly examines the realization function, basic performance, safety and degree of centralization of public blockchains. The applicability focuses on the application scenarios, the number of wallets, the ease of use, and the development support on the chain.

The TPS (I13), i.e., transactions per second, is the most important performance indicator of public blockchain networks. The TPS of Bitcoin and Ethereum is 7 and 20, respectively, while the TPS of VISA is 2000. A blockchain’s TPS depends on its consensus algorithm, and the PoW consensus algorithm keeps the TPS of Bitcoin and Ethereum small.

In November 2017, Ethereum launched a virtual pet cat game called CryptoKitties. Beginning on December 3, 2017, pending transactions on Ethereum skyrocketed: CryptoKitties accounted for more than 10% of the activity on Ethereum, resulting in serious congestion of the Ethereum network. The gas fee, also called the transaction fee, must be paid to the miners to run a particular transaction or contract, and it rises as the Ethereum network becomes congested. As can be seen in Fig. 9.2, the gas fee increased rapidly after December 3, 2017. Congestion appeared again in the Ethereum network after June 30, 2018, because of FCoin's GPM listing principles. These high transaction costs reflect the congestion of the Ethereum network. Since people now pay most attention to the TPS, the TPS is separated from I11 and treated as the third second-level indicator of technology.

Fig. 9.2
figure 2

The transaction fee of Ethereum network

However, even though the TPS needs to be upgraded to solve the congestion problem, an arbitrarily large TPS is not meaningful. For example, if 2000 TPS is enough to handle the daily transactions, there is little practical difference between 5000 TPS and a million TPS. Therefore, the hyperbolic tangent function is introduced to reduce the marginal benefit of increased TPS:

$$ y=\frac{e^x-{e}^{-x}}{e^x+{e}^{-x}},x=\frac{\text{TPS}}{\alpha } $$
(9.36)

where α is a scale factor and set to 2000 in this section.
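
A two-line illustration of Eq. (9.36), with a few example TPS values, shows the diminishing returns beyond the scale factor:

```python
import math

def tps_score(tps, alpha=2000):
    """Diminishing-returns transform of TPS (Eq. 9.36)."""
    return math.tanh(tps / alpha)       # tanh(x) = (e^x - e^-x) / (e^x + e^-x)

for tps in (7, 20, 2000, 5000, 1_000_000):
    print(tps, round(tps_score(tps), 4))   # 5000 TPS and 1,000,000 TPS score almost the same
```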

1.3.1.3 Recognition

The market capitalization (I21) is the first second-level indicator of recognition. The market capitalization of a company is the result of the transaction price of the company’s stock in the securities market multiplied by the total share capital, reflecting the company’s asset value, profitability value, and growth value. Similarly, the market capitalization of a public blockchain is the result of the transaction price of the public blockchain’s coin in the cryptocurrency market multiplied by the total number of coins. It reflects the blockchain’s use value and growth value. Once a blockchain is not recognized and no longer used, its value will be zero.

The fork count (I22), the total commits (I23), and the star count (I24) on GitHub are the second, third, and fourth second-level indicators of recognition, respectively. A basic technical feature of a blockchain is the shared ledger, which requires the participation and cooperation of many parties. Because of the openness and transparency of open source, open-sourcing a blockchain not only quickly earns the recognition and trust of partners but also quickly gathers outstanding talent for continuous development. The forks on GitHub represent the number of people who recognize or want to contribute to the blockchain; the total commits on GitHub represent the improvements made to the blockchain; the stars on GitHub represent the number of developers who like the blockchain.

The number of followers on Twitter (I25) is the fifth second-level indicator of recognition. Twitter is one of the most famous online news and social networking services. Public blockchains usually have Twitter accounts for posting news to the public, and the followers of a public blockchain’s Twitter account represent the people who care about and recognize the public blockchain.

1.3.1.4 Activity

The Google search heat in the previous month (I31) is the first second-level indicator of activity. In the search market, Google handles around 90% of searches worldwide. The popularity of search terms over time and across various regions of the world can be compared in Google Trends. The Google search heat of a public blockchain is the sum of its name’s search heat and its short name’s search heat.

The number of commits in GitHub in the previous month (I32) is the second second-level indicator of activity. It reflects the improvements of blockchains in the previous month.

The turnover rate in the previous month (I33) is the third second-level indicator of activity. The turnover rate is the frequency of coins traded in the market in a certain period of time. The higher the turnover rate, the more active the transactions of cryptocurrency and the more popular the public blockchain. Generally, a high turnover rate means good liquidity of the cryptocurrency.

1.3.1.5 Evaluation Process

The choice of indicator weights is an important step in TOPSIS. The entropy method is an objective way to calculate weights based on the objective information of the indicators [65]. An indicator with a small entropy value is informative and therefore receives a large weight [66]. The entropy is calculated as follows:

$$ {e}_j=-\frac{1}{\ln n}\sum \limits_{i=1}^n{p}_{ij}\ln {p}_{ij},\kern0.5em {p}_{ij}=\frac{x_{ij}}{\sum_{i=1}^n{x}_{ij}} $$
(9.37)

where x ij is the jth normalized indicator value of the ith public blockchain. Then the degree of divergence (d j) and the weight (w j) can be calculated as follows:

$$ {d}_j=1-{e}_j $$
(9.38)
$$ {w}_j=\frac{d_j}{\sum_{j=1}^m{d}_j} $$
(9.39)

The TOPSIS ranks public blockchains according to their relative proximities calculated by the distance from the positive ideal solution and the distance from the negative ideal solution [67]. The steps for the TOPSIS are described below. The first step is to normalize the indicator matrix:

$$ {r}_{ij}=\frac{x_{ij}}{\sqrt{\sum_{i=1}^n{x}_{ij}^2}} $$
(9.40)

With the weights obtained by the entropy method, the weighted normalization matrix is calculated as follows:

$$ v=r\cdot \mathit{\operatorname{diag}}(w) $$
(9.41)

where diag(w) is a diagonal matrix whose diagonal elements are the weights w. Then the positive ideal solution (A+) and the negative ideal solution (A−) can be obtained:

$$ {A}^{+}=\left\{\left(\underset{i}{\max }{v}_{ij}\mid j\in {J}_1\right),\left(\underset{i}{\min }{v}_{ij}\mid j\in {J}_2\right)\mid i=1,2,\dots, n\right\}=\left\{{v}_1^{+},{v}_2^{+},\dots, {v}_j^{+},\dots, {v}_m^{+}\right\} $$
(9.42)
$$ {A}^{-}=\left\{\left(\underset{i}{\min }{v}_{ij}\mid j\in {J}_1\right),\left(\underset{i}{\max }{v}_{ij}\mid j\in {J}_2\right)\mid i=1,2,\dots, n\right\}=\left\{{v}_1^{-},{v}_2^{-},\dots, {v}_j^{-},\dots, {v}_m^{-}\right\} $$
(9.43)

where J 1 and J 2 are the sets of benefit and cost indicators, respectively. The distance of each public blockchain from A+ and A− can be calculated as follows:

$$ {\displaystyle \begin{array}{c}{S}_i^{+}=\sqrt{\sum \limits_{j=1}^m{\left({v}_{ij}-{v}_j^{+}\right)}^2},i=1,2,\dots, n\end{array}} $$
(9.44)
$$ {\displaystyle \begin{array}{c}{S}_i^{-}=\sqrt{\sum \limits_{j=1}^m{\left({v}_{ij}-{v}_j^{-}\right)}^2},i=1,2,\dots, n\end{array}} $$
(9.45)

The relative proximity of each public blockchain to the ideal solution can be calculated as follows:

$$ {\displaystyle \begin{array}{c}{C}_i^{\ast }=\frac{S_i^{-}}{S_i^{+}+{S}_i^{-}},i=1,2,\dots, n\end{array}} $$
(9.46)

Lastly, the public blockchains can be ranked by their relative proximities.

The relative proximities are based on the positive ideal solution and the negative ideal solution. If the relative proximity of the first place is much larger than that of the second place, then some indicator values of the first place are much larger than those of the second place. In this case, even if the second place is much better than the third place, its advantage becomes very small under the absolute advantage of the first place. Since the positive ideal solution cannot be achieved by the other items, it is better to reduce the criteria of the positive ideal solution. Therefore, a let-the-first-out (LFO) strategy is proposed to obtain a more reasonable evaluation. In the LFO strategy, if the relative proximity of the first place is much larger than that of the second place, the position of the first place is retained and the other items are re-evaluated without it.
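
Since the source does not fix a numerical cut-off for "much larger", the sketch below treats the gap threshold as an assumed parameter and uses a toy proximity function in place of the full entropy-weighted TOPSIS; the names and indicator values are hypothetical.

```python
import numpy as np

def relative_proximity(X):
    """Toy stand-in for the entropy-weighted TOPSIS proximity C*_i (Eq. 9.46)."""
    X = np.asarray(X, dtype=float)
    return (X / np.sqrt((X ** 2).sum(axis=0))).mean(axis=1)

def lfo_ranking(names, X, gap=0.3):
    """Let-the-first-out (LFO): if the leader's proximity exceeds the runner-up's by
    more than `gap`, freeze the leader's position and re-evaluate the remaining items."""
    names, X = list(names), np.asarray(X, dtype=float)
    ranking = []
    while len(names) > 1:
        C = relative_proximity(X)
        order = np.argsort(C)[::-1]
        if C[order[0]] - C[order[1]] <= gap:
            return ranking + [names[i] for i in order]   # no dominant leader left
        ranking.append(names[order[0]])                  # retain the dominant first place
        keep = [i for i in range(len(names)) if i != order[0]]
        names, X = [names[i] for i in keep], X[keep]
    return ranking + names

print(lfo_ranking(['Bitcoin', 'Ethereum', 'EOS'],
                  [[100, 80, 90], [20, 60, 70], [15, 55, 65]]))
```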

  • The data analysis can be found in [3].

2 Evaluation Methods for Software

2.1 Classifier Evaluation for Software Defect Prediction

This subsection integrates traditional feature selection methods and multi-criteria decision making (MCDM) methods to improve the accuracy and reliability of defect prediction models and evaluate the performances of software defect detection models.

2.1.1 Research Methodology

Results of empirical studies on software defect prediction models do not always converge. Myrtveit et al. [68] analyzed a number of empirical software engineering studies and identified three factors that may contribute to the divergence: reliance on a single sample dataset, the choice of accuracy indicators, and the cross-validation procedure. They concluded that a crucial step in software defect prediction is the design of the research procedure.

The inputs are four public-domain software defect datasets provided by the NASA IV&V Facility Metrics Data Program (MDP) repository. Feature selection and classification are conducted in four steps. First, feature selection is conducted using traditional techniques. Features are then ranked using the proposed feature selection method. The third step employs MCDM methods to evaluate feature selection techniques and choose the better performed techniques. In the last step, the selected features are used in the classification to predict software defects. The performances of classifiers are also evaluated using MCDM methods and a recommendation of classifiers for software defect prediction is made based on their accuracy and reliability.

Multiple criteria decision making (MCDM) aims at solving decision problems with multiple objectives and often conflicting constraints [40, 68, 69]. Five MCDM methods, i.e., DEA (BCC model), ELECTRE, PROMETHEE, TOPSIS, and VIKOR, are used in the experimental study to evaluate the algorithms.

For feature selection algorithms, output components include seven attributes:

  • LOC_COMMENTS (The number of lines of comments in a module),

  • HALSTEAD_PROG_TIME (The Halstead programming time metric of a module),

  • MAINTENANCE_SEVERITY (Maintenance Severity),

  • NODE_COUNT (Number of nodes found in a given module),

  • NUM_OPERATORS (The number of operators contained in a module),

  • NUM_UNIQUE_OPERATORS (The number of unique operators contained in a module),

  • PERCENT_COMMENTS (Percentage of the code that is comments).

All other attributes are input components. For classification algorithms, the input component is the false positive rate, and the output components include the area under the receiver operating characteristic curve (AUC), precision, F-measure, and true positive rate.

2.1.2 Experimental Study

2.1.2.1 Data Sources

The data used in this study are modified public-domain software defect datasets provided by the NASA IV&V Facility Metrics Data Program (MDP) repository [70]. The structures of the datasets are summarized in Table 9.2.

Table 9.2 Dataset structures

CM is from a science instrument written in C with approximately 20 kilo-source lines of code (KLOC). KC is about the collection, processing, and delivery of satellite metadata and is written in Java with 18 KLOC. PC is flight software from an earth-orbiting satellite written in C with 26 KLOC. UC is a dynamic simulator for attitude control systems. Forty common attributes are selected for each dataset.

2.1.2.2 Discussion of Results

Table 9.3 summarizes the feature weights for each dataset. Features that are highly ranked in one or two datasets may have low rankings in other datasets, such as attributes 4, 9, and 27. This indicates that the performance of feature selection techniques varies across datasets. It also shows the need for an evaluation of feature selection techniques.

Table 9.3 Feature weights for the four datasets

The five MCDM methods are applied to evaluate the 11 feature selection techniques.

Tables 9.4, 9.5, 9.6, and 9.7 summarize the evaluation results of the nine classifiers on the four datasets. The rankings of classifiers vary with different datasets. Even within a dataset, different MCDM methods may produce divergent rankings for the same classifier. For example, RIPPER was ranked the second-best classifier by ELECTRE and the worst classifier by DEA for the CM dataset. In general, FLR outperforms the other classifiers. It was ranked the best classifier by at least two MCDM methods for every dataset. SMO achieves good performance on PC and UC, which are larger than CM and KC. The performance of the other classifiers on the four software defect datasets is rather mixed.

Table 9.4 MCDM evaluation of classifiers for CM dataset
Table 9.5 MCDM evaluation of classifiers for KC dataset
Table 9.6 MCDM evaluation of classifiers for PC dataset
Table 9.7 MCDM evaluation of classifiers for UC dataset

2.2 Ensemble of Software Defect Predictors: An AHP-Based Evaluation Method

This subsection evaluates the quality of ensemble methods for software defect prediction with the analytic hierarchy process (AHP). The AHP is a multicriteria decision-making approach that helps decision makers structure a decision problem based on pairwise comparisons and experts’ judgments. Three popular ensemble methods (bagging, boosting, and stacking) are compared with 12 well-known classification methods using 13 performance measures over 10 public-domain datasets from the NASA Metrics Data Program (MDP) repository [70]. The classification results are then analyzed using the AHP to determine the best classifier for the software defect prediction task.

2.2.1 Ensemble Methods

Ensemble learning algorithms construct a set of classifiers and then combine the results of these classifiers using some mechanisms to classify new data records [71]. Experimental results have shown that ensembles are often more accurate and robust to the effects of noisy data, and achieve lower average error rate than any of the constituent classifiers [15, 72,73,74,75].

How to construct good ensembles of classifiers is one of the most active research areas in machine learning, and many methods for constructing ensembles have been proposed in the past two decades [76]. Dietterich [71] divides these methods into five groups: Bayesian voting, manipulating the training examples, manipulating the input features, manipulating the output targets, and injecting randomness. Several comparative studies have been conducted to examine the effectiveness and performance of ensemble methods. Results of these studies indicate that bagging and boosting are very useful in improving the accuracy of certain classifiers [77], and their performances vary with added classification noise. To investigate the capabilities of ensemble methods in software defect prediction, this study concentrates on three popular ensemble methods (i.e. bagging, boosting, and stacking) and compares their performances on public-domain software defect datasets.

2.2.1.1 Bagging

Bagging combines multiple outputs of a learning algorithm by taking a plurality vote to get an aggregated single prediction [78]. The multiple outputs of a learning algorithm are generated by randomly sampling with replacement of the original training dataset and applying the predictor to the sample. Many experimental results show that bagging can improve accuracy substantially. The vital element in whether bagging will improve accuracy is the instability of the predictor [78]. For an unstable predictor, a small change in the training dataset may cause large changes in predictions [79]. For a stable predictor, however, bagging may slightly degrade the performance [78].

Researchers have performed large empirical studies to investigate the capabilities of ensemble methods. For instance, Bauer and Kohavi [77] compared bagging and boosting algorithms with a decision tree inducer and a Naïve Bayes inducer. They concluded that bagging reduces the variance of unstable methods and leads to significant reductions in mean-squared errors. Dietterich [72] studied three ensemble methods (bagging, boosting, and randomization) using the decision tree algorithm C4.5 and pointed out that bagging is much better than boosting when there is substantial classification noise.

In this subsection, bagging is generated by averaging probability estimates [16].

2.2.1.2 Boosting

Similar to bagging, boosting method also combines the different decisions of a learning algorithm to produce an aggregated prediction [80]. In boosting, however, weights of training instances change in each iteration to force learning algorithms to put more emphasis on instances that were predicted incorrectly previously and less emphasis on instances that were predicted correctly previously. Boosting often achieves more accurate results than bagging and other ensemble methods. However, boosting may overfit the data and its performance deteriorates with classification noise.

This study evaluates a widely used boosting method, AdaBoost algorithm, in the experiment. AdaBoost is the abbreviation for adaptive boosting algorithm because it adjusts adaptively to the errors returned by classifiers from previous iterations [73, 81]. The algorithm assigns equal weight to each training instance at the beginning. It then builds a classifier by applying the learning algorithm to the training data. Weights of misclassified instances are increased, while weights of correctly classified instances are decreased. Thus, the new classifier concentrates more on incorrectly classified instances in each iteration.

2.2.1.3 Stacking

Stacked generalization, often abbreviated as stacking, is a scheme for minimizing the generalization error rate of one or more learning algorithms [82]. Unlike bagging and boosting, stacking can be applied to combine different types of learning algorithms. Each base learner, also called a “level-0” model, generates a class value for each instance. The predictions of the level-0 models are then fed into the level-1 model, which combines them to form a final prediction [16].

Another ensemble method used in the experiment is voting, which is a simple average of the probability estimates of multiple classifiers, as provided by WEKA [16].
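
A minimal sketch of stacking and voting with scikit-learn is given below; the chosen level-0 learners and the synthetic dataset are illustrative assumptions.

```python
# Minimal sketch of stacking (level-0 learners combined by a level-1
# learner) and of soft voting (averaging probability estimates).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
level0 = [("tree", DecisionTreeClassifier(max_depth=3)),
          ("nb", GaussianNB()),
          ("lr", LogisticRegression(max_iter=1000))]

stack = StackingClassifier(estimators=level0,
                           final_estimator=LogisticRegression(max_iter=1000))
vote = VotingClassifier(estimators=level0, voting="soft")  # averages probabilities

print("stacking:", cross_val_score(stack, X, y, cv=5).mean())
print("voting  :", cross_val_score(vote, X, y, cv=5).mean())
```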

2.2.2 Selected Classification Models

As a powerful tool with numerous applications, classification methods have been studied extensively in several fields, such as machine learning, statistics, and data mining [83]. Previous studies have shown that an ideal ensemble should consist of accurate and diverse classifiers [84]. Therefore, this study selects 12 classifiers to build ensembles. They represent five categories of classifiers (i.e., trees, functions, Bayesian classifiers, lazy classifiers, and rules) and were implemented in WEKA.

For the trees category, we chose classification and regression tree (CART), Naïve Bayes tree, and C4.5. The functions category includes linear logistic regression, radial basis function (RBF) network, sequential minimal optimization (SMO), and neural networks. Bayesian classifiers include Bayesian network and Naïve Bayes. K-nearest-neighbor was chosen to represent lazy classifiers. For the rules category, decision table and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) rule induction were selected.

Classification and regression tree (CART) can predict both continuous and categorical dependent attributes by building regression trees and discrete classes, respectively [85]. Naïve Bayes tree is an algorithm that combines the Naïve Bayes induction algorithm and decision trees to increase the scalability and interpretability of Naïve Bayes classifiers [86]. C4.5 is a decision tree algorithm that constructs decision trees in a top–down recursive divide-and-conquer manner [87].

Linear logistic regression models the probability of occurrence of an event as a linear function of a set of predictor variables [88]. A neural network is a collection of artificial neurons that learns relationships between inputs and outputs by adjusting its weights. An RBF network [89] is an artificial neural network that uses radial basis functions as activation functions; the centers and widths of the hidden units are derived using k-means, and the outputs obtained from the hidden layer are combined using logistic regression [16]. SMO is a sequential minimal optimization algorithm for training support vector machines (SVM) [90, 91].

Bayesian network and Naïve Bayes both model probabilistic relationships between the predictor variables and the class variable. While the Naïve Bayes classifier [92] estimates the class-conditional probability based on Bayes' theorem and can only represent simple distributions, a Bayesian network is a probabilistic graphical model that can represent conditional independencies between variables [93].

K-nearest-neighbor [94] classifies a given data instance based on learning by analogy; that is, it assigns an instance to the class of its closest training examples in the feature space.

Decision table selects the best-performing attribute subsets using best-first search and uses cross-validation for evaluation [95]. RIPPER [96] is a sequential covering algorithm that extracts classification rules directly from the training data without generating a decision tree first.

Stacking and voting each combine all classifiers to generate one prediction. Since bagging and boosting are designed to combine multiple outputs of a single learning algorithm, they are applied to each of the 12 classifiers, producing a total of 26 aggregated outputs.

2.2.3 The Analytic Hierarchy Process (AHP)

The analytic hierarchy process is a multicriteria decision-making approach that helps decision makers structure a decision problem based on pairwise comparisons and experts' judgments [97, 98]. Saaty [99] summarizes four major steps for the AHP. In the first step, decision makers define the problem and decompose it into a three-level hierarchy of interrelated decision elements (the goal of the decision, the criteria or factors that contribute to the solution, and the alternatives associated with the problem through the criteria) [100]. The middle level of criteria may be expanded to include subcriteria levels. After the hierarchy is established, the decision makers compare the criteria two by two using a fundamental scale in the second step. In the third step, these human judgments are converted into a matrix of relative priorities of decision elements at each level using the eigenvalue method. The fourth step calculates the composite, or global, priorities for each decision alternative to determine their ratings.

The AHP has been applied to diverse decision problems, such as economics and planning, policies and allocation of resources, conflict resolution, arms control, material handling and purchasing, manpower selection and performance measurement, project selection, marketing, portfolio selection, model selection, politics, and the environment [101]. Over the last 20 years, the AHP has been studied extensively and various variants of the AHP have been proposed [102,103,104,105].

In this study, the decision problem is to select the best ensemble method for the task of software defect prediction. The first step of the AHP is to decompose the problem into a decision hierarchy. As shown in Fig. 9.3, the goal is to select an ensemble method that is superior to other ensemble methods over public-domain software defect datasets through the comparison of a set of performance measurements. The criteria are performance measures for classifiers, such as overall accuracy, F-measure, area under ROC (AUC), precision, recall, and Kappa statistic. The decision alternatives are ensembles and individual classification methods, such as AdaBoost, bagging, stacking, C4.5, SMO, and Naïve Bayes. Individual classifiers are included as the decision alternatives for the purpose of comparisons.

Fig. 9.3 An AHP hierarchy for the ensemble selection problem

In step 2, the input data for the hierarchy, which is a scale of numbers that indicates the preference of decision makers about the relative importance of the criteria, are collected. Saaty [97] provides a fundamental scale for this purpose, which has been validated theoretically and practically. The scale ranges from 1 to 9 with increasing importance. Numbers 1, 3, 5, 7, and 9 represent equal, moderate, strong, very strong, and extreme importance, respectively, while 2, 4, 6, and 8 indicate intermediate values. This study uses 13 measures to assess the capability of ensembles and individual classifiers. Previous works have shown that the AUC is the most informative and objective measurement of predictive accuracy [106] and is an extremely important measure in software defect prediction. Therefore, it is assigned a value of 9. The F-measure, mean absolute error, and overall accuracy are very important measures, but less important than the AUC. The true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), false negative rate (FNR), precision, recall, and Kappa statistic are strongly important classification measures that are less important than the F-measure, mean absolute error, and overall accuracy. Training and test time refer to the time needed to train and test a classification algorithm or ensemble method, respectively. They are useful measures in real-time software defect identification. Since this study is not aimed at real-time software defect identification, they are included to measure the efficiency of ensemble methods and are given the lowest importance.

The third step of the AHP computes the principal eigenvector of the matrix to estimate the relative weights (or priorities) of the criteria. The estimated priorities are obtained through a two-step process: (1) raise the matrix to large powers (square); (2) sum and normalize each row. This process is repeated until the difference between the sums of each row in two consecutive rounds is smaller than a prescribed value. After obtaining the priority vector of the criteria level, the AHP method moves to the lowest level in the hierarchy, which consists of ensemble methods and classification algorithms in this experiment. The pairwise comparisons at this level compare learning algorithms with respect to each performance measure in the level immediately above. The matrices of comparisons of the learning algorithms with respect to the criteria and their priorities are analyzed and summarized in Sect. 9.2.1.2. The ratings for the learning algorithms are produced by aggregating the relative priorities of decision elements [107].
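
The priority computation just described can be sketched in a few lines: the comparison matrix is repeatedly squared and its normalized row sums are taken until they stabilize. The 3 × 3 matrix below contains hypothetical judgments on the fundamental scale, not the judgments used in this study.

```python
# Minimal sketch of the AHP priority computation described above.
import numpy as np

def ahp_priorities(M, tol=1e-6, max_iter=100):
    prev = None
    for _ in range(max_iter):
        M = M @ M                         # raise the matrix to a large power
        v = M.sum(axis=1)
        v = v / v.sum()                   # normalized row sums
        if prev is not None and np.abs(v - prev).max() < tol:
            break
        prev = v
    return v

# Example: criteria A, B, C compared on Saaty's 1-9 scale (hypothetical values).
M = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
print(ahp_priorities(M))                  # relative weights of the three criteria
```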

  • The data analysis can be found in [5]

3 Evaluation Methods for Sociology and Economics

3.1 Delivery Efficiency and Supplier Performance Evaluation in China’s E-Retailing Industry

This subsection focuses on overall and sub-process supply chain efficiency evaluation using a network slacks-based measure model and an undesirable directional distance model. Based on a case analysis of a leading Chinese B2C firm W, a two-stage supply chain structure covering procurement-stock and inventory-sale management is constructed.

In Chinese B2C e-commerce websites, two typical operation models are widely adopted based on different strategic positioning. One is the third-party platform model, which provides an e-commerce platform, technical support, advertising and marketing services for franchises. The leading B2C e-commerce platforms in China are Taobao.com and Tmall.com, whose business revenue stems mainly from commissions and service fees. The other is called the self-operated model, in which the firm runs its own logistics system for transferring and distributing goods. Examples include companies such as Jingdong, Dangdang, Amazon, Yihaodian and Suning; their profit is the difference between sales revenue and purchasing costs. According to a research report from iResearch, a leading internet consulting company and online media firm in China, platform-model companies like Tmall account for most of the B2C e-commerce market share, as shown in Fig. 9.4.

Fig. 9.4 Market share of major Chinese B2C e-commerce players in 2013

However, with the ongoing rapid growth of e-commerce in virtual markets, logistics has become the largest bottleneck to e-commerce's continued development. Most e-commerce players adopt the third party logistics (3PL) model in their initial development because of its advantage in reducing operations costs and capital investment. Because 3PL is contractual, out-sourced logistics concentrating on regional operations, its drawbacks gradually emerge as the business expands. For example, lost packages and theft are common when using 3PL, and overstocking during holidays and promotion days is frequently reported due to the insufficient shipping capacity of 3PL. 3PL services are offered to both suppliers and customers, while self-operated logistics are often built by B2C websites to improve service quality and “last mile delivery” efficiency through control of every section of the supply chain, from warehouse to consumer. As a result, a hybrid form of logistics combining 3PL and self-managed logistics is currently a popular topic of study.

From an e-retail supply chain perspective, suppliers and vendors play a crucial role in a company's success, whatever business it is in. The quality and richness of the merchandise provided by suppliers determine the popularity of goods, which in turn affects inventory turnover and sales. Based on this, the e-retail supply chain process can generally be divided into two stages: procurement-stock management and inventory-sale control. The first sub-stage, procurement-stock management, represents “the first mile delivery” efficiency of e-retail. The second sub-process, stock-sale control, describes supplier performance through the conversion of inventory into sales revenue, as shown in Fig. 9.5. It should be noted that the overall supply chain efficiency is measured without considering internal link activities or intermediate variables.

Fig. 9.5 E-commerce procurement-inventory-sale supply chain structure

3.1.1 Case, Research Problem and Data

W firm, one of China’s leading B2C e-commerce firms, is chosen as our research case. The reasons are given as follows:

Firstly, W firm has established a nationwide supply chain network and has an industry-leading supply chain management system in the Chinese B2C e-commerce sector.

Secondly, W firm has the ability to realize full online operations based on its open supply chain platform, which aims to serve traditional enterprises that would like to tap into the e-commerce sector but lack online operating ability. It is similar to the third party platform model with regard to covering an integrated online operations service, improving suppliers' supply chain efficiency and reducing operations costs through system integration, cloud-based marketing, promotion tools, logistics, warehousing and information services.

Thirdly, in terms of “the last mile delivery”, the suppliers who choose the “shop in shop” model sell their merchandise via third party logistics (3PL) while running their business operations independently. In contrast, the suppliers choosing the third party platform model only need to provide their merchandise to W firm's platform, while online operations-related activities are executed by W firm. The e-retail supply chain for W firm is described in Fig. 9.6.

Fig. 9.6 E-retail supply chain for W firm

In conclusion, the operations models of the suppliers in W firm can be clearly divided into the third party platform model and the self-operated model, which are the two predominant e-business models in China. The third party platform model and the self-operated model offer different “last mile delivery” choices for e-commerce players. Thus, this case can be used to analyze the following questions:

  1. What causes overall e-retail supply chain inefficiency: “the first mile delivery” or “the last mile delivery”?

  2. How do the self-operated model and the third party platform model affect supply chain efficiency, respectively?

  3. What is the way forward for product category and geographic expansion for major Chinese B2C e-commerce players?

  4. Which is better for e-retail supply chains: self-logistics, 3PL or the hybrid model?

Accordingly, data on more than 2400 suppliers, covering purchasing cost, lead time, inventory, sales, delivery and returned goods, were collected from W firm. After excluding incomplete records, 1229 suppliers of the “shop in shop” model and 899 suppliers of the third party platform model were obtained. Nine major product categories are included in this data set, and the research methods are described in detail below.

3.1.2 Research Methodology

3.1.2.1 Network Slacks-Based Measure of Efficiency (NSBM)

Suppose there are n DMUs (j = 1, 2, …, n), each consisting of K divisions (k = 1, 2, …, K), in a supply chain. m_k and r_k represent the number of inputs and outputs of Division k, respectively. The set of links leading from Division h to Division k is defined as L(k, h). Accordingly, the production possibility set \( \left({x}^k,{y}^k,{z}^{k,h}\right) \) under the assumption of variable returns-to-scale (VRS) production is defined by

$$ {\displaystyle \begin{array}{l}{x}^k\ge {\sum}_{j=1}^n{x}_j^k{\lambda}_j^k,\kern1em k=1,2,\cdots,K\\ {}{y}^k\le {\sum}_{j=1}^n{y}_j^k{\lambda}_j^k,\kern1em k=1,2,\cdots,K\\ {}{z}^{k,h}={\sum}_{j=1}^n{z}_j^{k,h}{\lambda}_j^k,\kern1em \forall k,h\ \left(\text{as\ outputs\ from}\ k\ \text{and\ inputs\ to}\ h\right),\\ {}{\sum}_{j=1}^n{\lambda}_j^k=1,\ \forall k,\kern1em {\lambda}_j^k\ge 0,\ \forall j,k\end{array}} $$

where, \( {\lambda}^k\in {R}_{+}^n \) is the intensity vector corresponding to Division k (k = 1, 2, …, n).

For the evaluated DMUo (o = 1, 2, …, n), in the case where linking activities are determined freely while keeping continuity between inputs and outputs, the non-oriented overall efficiency can be represented as:

$$ {\rho}^{\ast }={\min}_{\lambda^k,{s}^{k-},{s}^{k+}}\frac{\sum_{k=1}^K{w}^k\left[1-\frac{1}{m_k}\left({\sum}_{i=1}^{m_k}\frac{s_i^{k-}}{x_{io}^k}\right)\right]}{\sum_{k=1}^K{w}^k\left[1-\frac{1}{r_k}\left({\sum}_{r=1}^{r_k}\frac{s_r^{k+}}{y_{ro}^k}\right)\right]} $$
(9.47)
$$ {\displaystyle \begin{array}{c}\mathrm{s}.\mathrm{t}.\left\{\begin{array}{l}{x}_o^k={X}^k{\lambda}^k+{s}^{k-}\\ {}{y}_o^k={Y}^k{\lambda}^k-{s}^{k+}\\ {}{\sum}_{j=1}^n{\lambda}_j^k=1\\ {}{X}^k=\left({x}_1^k,{x}_2^k,\cdots,{x}_n^k\right)\in {R}^{m_k\times n}\\ {}{Y}^k=\left({y}_1^k,{y}_2^k,\cdots,{y}_n^k\right)\in {R}^{r_k\times n}\\ {}{z}^{k,h}{\lambda}^h={z}^{k,h}{\lambda}^k,\kern0.5em \left(\forall k,h\right)\\ {}{z}^{k,h}=\left({z}_1^{k,h},{z}_2^{k,h},\cdots,{z}_n^{k,h}\right)\in {R}^{t_{k,h}\times n}\\ {}{\lambda}^k\ge 0,\ {s}^{k-}\ge 0,\ {s}^{k+}\ge 0,\ \forall k\end{array}\right.\end{array}} $$
(9.48)

where \( {\sum}_{k=1}^K{w}^k=1,\ {w}^k\ge 0\ \left(\forall k\right) \), and \( {w}^k \) is the relative weight of Division k defined by the decision makers. The non-oriented division efficiency score can be calculated as follows:

$$ {\displaystyle \begin{array}{c}{\rho}_k=\frac{1-\frac{1}{m_k}\left({\sum}_{i=1}^{m_k}\frac{s_i^{k-\ast }}{x_{io}^k}\right)}{1-\frac{1}{r_k}\left({\sum}_{r=1}^{r_k}\frac{s_r^{k+\ast }}{y_{ro}^k}\right)},\kern1em k=1,2,\cdots,K\end{array}} $$
(9.49)

\( {s}^{k-\ast } \) and \( {s}^{k+\ast } \) are the optimal input excesses and output shortfalls from Eq. (9.47).
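
Once these optimal slacks are available, the division efficiency score in Eq. (9.49) is simple arithmetic; the sketch below evaluates it for one hypothetical division (all figures are made up).

```python
# Numerical sketch of the division efficiency score in Eq. (9.49),
# evaluated exactly as written, with made-up data for one division.
import numpy as np

x = np.array([100.0, 50.0])        # observed inputs of the evaluated DMU
y = np.array([80.0, 40.0, 20.0])   # observed outputs of the evaluated DMU
s_minus = np.array([10.0, 5.0])    # optimal input excesses  s^{k-*}
s_plus = np.array([4.0, 2.0, 1.0]) # optimal output shortfalls s^{k+*}

rho = (1 - (s_minus / x).mean()) / (1 - (s_plus / y).mean())
print("division efficiency:", round(rho, 4))
```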

3.1.2.2 Undesirable Output Directional Distance Function Model

It is important for a retail supply chain to effectively manage inventory and avoid returned purchases. It is therefore reasonable to extend the network slack-based measure (NSBM) to incorporate undesirable outputs so that it can give a comprehensive and accurate evaluation on delivery efficiency and supplier performance in a given e-retail supply chain.

The usual technical efficiency measurement is based on input and output distance functions, which cannot simultaneously contract undesirable (bad) outputs and inputs and expand desirable (good) outputs. The directional distance function is a generalized form of the radial model that allows us to explicitly increase the desirable outputs and simultaneously decrease the undesirable outputs and inputs. To see this, let good outputs be denoted by \( y\in {R}_{+}^M \), bad or undesirable outputs by \( b\in {R}_{+}^J \), and inputs by \( x\in {R}_{+}^N \). Suppose there are K DMUs (k = 1, 2, …, K) in an e-retail supply chain. Each DMU uses inputs \( {x}^k=\left({x}_1^k,{x}_2^k,\cdots, {x}_N^k\right)\in {R}_{+}^N \) to jointly produce desirable (good) outputs \( {y}^k=\left({y}_1^k,{y}_2^k,\cdots, {y}_M^k\right)\in {R}_{+}^M \) and undesirable (bad) outputs \( {b}^k=\left({b}_1^k,{b}_2^k,\cdots, {b}_J^k\right)\in {R}_{+}^J \). For a specific DMU0, a more generalized form of the directional distance function is given by Chambers et al. [85] as follows:

$$ {\displaystyle \begin{array}{c}\theta =\mathit{\min}\frac{1-\frac{1}{m}{\sum}_{i=1}^m{w}_i\alpha {g}_{xi0}{x}_{i0}}{1+{o}_d\frac{1}{s_d}{\sum}_{d=1}^{s_d}{w}_d\beta {g}_{yd0}{y}_{d0}-{o}_u\frac{1}{s_u}{\sum}_{u=1}^{s_u}{w}_u\gamma {g}_{yu0}{y}_{u0}}\end{array}} $$
(9.50)
$$ {\displaystyle \begin{array}{c}\ \mathrm{s}.\mathrm{t}.\left\{\begin{array}{c} X\lambda +\alpha {g}^x\le {x}_0\\ {}{Y}^d\lambda -\beta {g}_y^d\ge {y}_0^d\\ {}{Y}^u\lambda +\gamma {g}_y^u\le {y}_0^u\end{array}\right.\end{array}} $$
(9.51)

with \( {\sum}_{i=1}^m{w}_i=m \), \( {\sum}_{d=1}^{s_d}{w}_d={s}_d \), \( {\sum}_{u=1}^{s_u}{w}_u={s}_u \), and o u + o d = 1, where m, s d, and s u denote the number of inputs, desirable (good) outputs and undesirable (bad) outputs, respectively. x 0 and y 0 are the inputs and outputs of the evaluated DMU0. w i, w d, and w u denote the weights of the inputs, desirable (good) outputs and undesirable (bad) outputs defined by the decision makers. g x and g y represent the direction vectors of the inputs and outputs defined by the decision makers. o u and o d refer to the overall weights of the undesirable (bad) and desirable (good) outputs defined by the decision makers.

Note that α, β, and γ represent the contraction rate for the input items, the expansion rate for the desirable output items, and the contraction rate for the undesirable output items, respectively (cf. Eqs. 9.50 and 9.51), and they are not necessarily equal. In other words, the model allows different proportional contraction and expansion rates for inputs, undesirable outputs and desirable outputs.

Performance assessed by the directional distance model can be flexibly adapted to different analysis purposes. For example, if the direction is chosen by setting g = (−gx, gy, −gb) = (−xk, yk, −bk), the efficiency score represents the percentage by which good outputs, bad outputs and inputs need to be improved [78]. If instead the direction is set to g = (−gx, gy, −gb) = (−1, 1, −1), the solution value can be interpreted as the net improvement in performance in the case of feasible expansion in good outputs and feasible contraction in bad outputs and inputs [107].

Here we choose the measurement based on the observed data, namely g = (−gx, gy, −gb) = (−xk, yk, −bk), because we would like to observe the potential proportional change in good outputs, bad outputs and inputs.

3.1.3 Variables Description

3.1.3.1 Input-Output Variables Description in the First Sub-process

Since DEA is a non-parametric method for converting multiple inputs into multiple outputs, choosing a suitable input-output variable combination is crucial for DEA efficiency evaluation. Thus, in order to give an accurate efficiency measurement, it is necessary to give a reasonable input-output variable description based on the e-retail supply chain network structure. Unlike in traditional retail, data mining techniques make demand forecasts possible, so an e-commerce supply chain starts with procurement management based on demand forecasting. Purchasing plays an important role in saving cost and making profit. The way orders are scheduled and the resultant lead time directly determine the performance of downstream activities and inventory levels. As a result, order-related input and output variables such as the selection of the right supplier, product variety, purchasing cost, average arrival rate, and on-time delivery rate are considered in the first sub-process of the e-retail supply chain.

The number of brands and stock keeping units (SKU) describe the variety and richness of the products in e-retail [108, 109]. Higher variety leads to an increase in consumers' utility, which in turn affects inventory turnover and finally results in an increase in gross margin [110]. Additionally, the number of dealers determines the size of the suppliers, and purchasing cost denotes the total financial inputs. Therefore, the number of brands, the number of dealers, the minimal stock keeping unit (SKU) and purchasing cost can be considered as the initial inputs of procurement-delivery management.

Furthermore, gross margin is associated with stockout costs. In practice, stockouts will lead consumers to switch retailers on subsequent shopping trips due to a poor shopping experience [111]. As a result, higher stockouts mean higher lost profits. Hence, an important task of procurement managers is to reduce stockout SKUs and shorten stockout days. Accordingly, the variables of stockout SKUs and stockout days are considered as undesirable outputs in the first sub-process of e-retail supply chain performance measurement.

It is crucial that purchasing management is not treated as something stand-alone, but is closely linked with the measurement of overall supply chain performance. Thus, average arrival rate and on-time delivery rate are used to measure procurement-delivery efficiency. They are the outputs of the first sub-process and the inputs of the second sub-process of the e-retail supply chain. The detailed input-output variables are described in Table 9.8.

Table 9.8 Input and output variables description in the procurement sub-process
3.1.3.2 Input-Output Variables Description in the Second Sub-process

Efficient procurement-stock performance can accelerate inventory turnover and promote sales. It is easier for e-commerce players to turn their capital into inventory than to turn their inventory into money. According to statistics from Slywotzky [112], 95% of the time in a commodity production and sales process is spent on storage, loading and transportation. Hence, inventory turnover plays a crucial role in supply chain efficiency measurement. Generally speaking, shorter turnover times mean greater capacity to turn stock into revenue, and accelerating inventory turnover means an increase in the liquidity of capital. Based on this, the average days to turn over inventory is considered as one of the outputs in the second sub-process of the e-retail supply chain. It should be noted that average days to turn over inventory refers to the number of days it takes to sell all on-hand inventory, and can be calculated by the following formula:

$$ \mathrm{days}\ \mathrm{to}\ \mathrm{turnover}\ \mathrm{inventory}=365/\mathrm{inventory}\ \mathrm{turnover} $$
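
As a small worked example of this formula (with hypothetical figures), a supplier whose inventory turns over 7.3 times a year needs about 50 days to turn over its inventory:

```python
# Small worked example of the formula above (figures are hypothetical):
# inventory turnover is commonly computed as cost of goods sold divided
# by the average inventory value over the period.
cost_of_goods_sold = 7_300_000.0   # hypothetical annual COGS
average_inventory = 1_000_000.0    # hypothetical average inventory value

inventory_turnover = cost_of_goods_sold / average_inventory   # 7.3 turns per year
days_to_turnover_inventory = 365 / inventory_turnover          # = 50 days
print(days_to_turnover_inventory)
```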

A change in inventory is a response to a change in sales, while dynamic sales are key for inventory turnover. In practice, dynamic sale days are often used to illustrate inventory change and to judge whether the merchandise is popular or not. In general, shorter dynamic sale days mean faster inventory turnover and fewer unmarketable goods. Unmarketable goods lead to a loss of sales revenue due to an increase in stock costs. In e-retail, another loss of sales revenue can be attributed to consumer-returned goods. Therefore, together with average days to turn over inventory and sales revenue, dynamic sale days are considered as output variables, while no-sale SKUs and the amount of goods returned by users are chosen as undesirable output variables of supplier performance measurement in the second sub-process of the e-retail supply chain. The detailed input and output variables are shown in Table 9.9.

Table 9.9 Input and output variables description in the inventory-sale sub-process

3.1.4 Empirical Results

3.1.4.1 E-Retail Efficiency of “the First Mile Delivery” and “the Last Mile Delivery”

The procurement-stock sub-process of the e-retail supply chain is called “the first mile delivery” because of its effect on inventory management. It is the first section of the e-retail supply chain, and its performance directly affects subsequent inventory and sales. Therefore, we give more weight to the first stage of the e-retail supply chain than to the second stage. In the network slacks-based measure (NSBM) model, for a specific division k, the weight w1k of the procurement-stock sub-process is set to 0.6 and the weight w2k of the inventory-sale sub-process is set to 0.4. For the directional distance model with undesirable outputs, the weight wd of desirable (good) outputs is set to 0.6 and the weight wu of undesirable (bad) outputs is set to 0.4. We run the two models simultaneously using MaxDEA 6.2 software, and the results are given in Fig. 9.7.

Fig. 9.7 E-retail procurement efficiency and supplier performance

As shown in Fig. 9.7, efficiency scores of the procurement-stock stage (Node 1) are lower than those of inventory-sale stage (Node 2). We can hence conclude that it is procurement-stock conversion inefficiency that results in W firm’s overall supply chain inefficiency. The process from purchasing to putting in stock is named “first mile delivery”, which is essential to developing a healthy buyer-supplier relationship and improving inventory control level.

Specifically, the suppliers of the “shop in shop” model have higher overall supply chain efficiency in kitchen and cleaning products than others due to higher purchasing-stock efficiency in the first sub-process of the supply chain. In contrast, the suppliers of the third party platform model achieve better stock-sale performance in kitchen and cleaning products than others but have low overall supply chain efficiency due to poor purchasing-stock efficiency, as shown in Table 9.10. From this discussion, we can conclude that purchasing-stock efficiency plays the more critical role in overall supply chain efficiency. This conclusion further verifies the finding in Fig. 9.7.

Table 9.10 Overall and sub-process efficiency comparison for two different supply chain models
3.1.4.2 Product Categories Expansion and Efficiency Analysis

As China’s leading B2C e-commerce online supermarket, W firm has more advantages in fast moving consumer goods (FMCG) like food and drink, as shown in Fig. 9.8. In line with the strategic positioning of W firm, this finding reflects its core business focus on the online supermarket and the concept of “the home”. It is this strategic positioning that creates a barrier to entry for potential competitors, thus affording a competitive advantage compared with other B2C websites such as Dangdang, Suning and Redbaby. As a result, this unique positioning has allowed W firm to quickly build a loyal customer base and win a first-mover advantage.

Fig. 9.8 The distribution of overall efficient suppliers in different product categories

However, with growing orders, one-stop shopping for “the home” has become more and more important for attracting customers. Thus, W firm gradually expands its product categories from FMCG products to electronics, apparel, auto parts, maternity, and household products. In general, all major Chinese B2C e-commerce websites experience a similar product category expansion, namely starting with a narrow, vertical product line and then expanding to a broad range of categories. For example, Dangdang started with books and Jingdong with digital products. Then, with growing user and market demands, all of them have pursued all-category expansion. In other words, Chinese B2C e-commerce websites have undergone a transition from a vertical model to an integrated model.

3.1.5 Operations Model Comparison

By way of the third party platform model, the “last mile delivery” fleet serves the shops settled on the W platform while simultaneously serving merchants who sell their products on their own web pages or other market platforms. The full operations service effectively reduces “the last mile delivery” cost and has allowed W firm to achieve higher supplier performance in the second sub-process of the supply chain, as shown in Fig. 9.9. However, which model is more efficient in the first stage, known as “first mile delivery”: the self-operated model or the platform model?

Fig. 9.9 Inventory management for W firm

From an inventory management perspective, too much stock increases inventory cost, while too little stock raises the stockout rate. Thus, it is necessary for an integrated platform to make automated procurement decisions. Figure 9.9 describes inventory management for W firm. It can be seen that a purchase order is automatically issued and sent to the suppliers when inventory drops below a defined safety stock, and the order is then filled by the suppliers [113]. In this way, W firm can record the delivery time, receiving and shelving information, and process payment. Therefore, it can be seen in Fig. 9.9 that the platform model presents higher procurement-stock efficiency scores than the self-operated (shop in shop) model.

Is the platform model efficient for all product categories?

In response to this question, we compare the “last mile delivery” efficiency of different product categories for the platform model and the self-operated model, as shown in Fig. 9.10. The results show that the self-operated (shop in shop) model performs better for computer, office and digital products, food and drink, and healthy products. This is because of the high value of computer, office and digital products and the shorter shelf life of food and drink and healthy products, which determine their priority in the order of handling, picking, stockout compensation and delivery. Furthermore, on the demand side, products such as food and drink and healthy products are often bought based on the temporary needs of customers. It is therefore more suitable for these products to be delivered from regional distribution centers, and the self-operated model is more helpful for reducing these products' delivery costs. This is also the reason why Jingdong, the top Chinese self-operated B2C e-commerce website, started with 3C (Computer, Communication, and Consumer electronics) products.

Fig. 9.10 Supplier performance comparison in different product categories and operations models

From the above discussion, we can conclude that the third party platform model generally performs better than the self-operated model, due to its higher efficiency in “first mile delivery” and “last mile delivery”. However, from a product category perspective, the self-operated model achieves greater efficiency for computer, office and digital products, food and drink, and healthy products than the third party platform model, due to these products' characteristics of regional demand and delivery.

3.1.5.1 Geographic Expansion and Efficiency Evaluation on 3PL and Self-Operated Model

As e-commerce continues its rapid growth into virtually every market sector, retailers are eager to expand their presence online to capture this market share. According to a research report by iResearch, a leading organization focusing on in-depth research on China's internet industry, China's business-to-consumer (B2C) market reached CNY 666.1 billion in 2013, accounting for 36.2% of the online shopping market, and has become a formidable force. However, because B2C is an e-commerce model directly facing the customers, the “last mile delivery” is a crucial challenge for improving users' online shopping experience. Therefore, it is very important for e-commerce players to improve the “first mile delivery” (from order to warehouse) and the “last mile delivery” (from warehouse to consumer).

Starting with a large selection spanning many different product categories is a great challenge for the supply chain capacity of W firm. Although the FMCG category contributes to increasing traffic and consumer stickiness because it meets daily needs, how to pick, pack and deliver these small items is a constant struggle. For example, by 2013, W firm had about 2,000,000 SKUs, which is 100 times that of a traditional supermarket, and each order of W firm contains an average of 10 items while each order of Jingdong contains fewer than 2 items. This places stringent demands on warehouse design and on the choice of the food and drink supply chain. Most importantly, food and drink require faster inventory turnover due to their shorter shelf life. As a result, procurement-inventory-sale-delivery decisions need to be automated as much as possible.

Like most B2C e-commerce players, W firm initially adopted the 3PL delivery service model for the purpose of saving cost. But initial on-time delivery was only 90% and customer returns reached over 3% [113]. Coupled with growing orders, 3PL struggled to keep up with this growth. Therefore, a self-built logistics system became essential. In light of Amazon China's centralized distribution model, W firm controls all decisions from its headquarters and builds multiple distribution centers. A new “line-haul + regional distribution center + last mile delivery” model was adopted. It is noted that the centralized distribution model serves nationwide consumers with the same selection on one website, utilizing transshipment between warehouses to ensure the availability of products from all warehouses. In contrast, the decentralized distribution model offers different selections from local branch websites and delivers products from local distribution centers to consumers.

In terms of warehousing expansion, W firm has built five large warehousing centers covering Beijing, Shanghai, Guangzhou, Wuhan and Chengdu. By combining the self-established logistics system with the third party platform operations model, W firm has achieved a drastically enhanced customer experience and a 10% improvement in consumer satisfaction. The results in Table 9.10 verify that the third party platform model with self-operated logistics has better delivery efficiency, supplier performance and supply chain efficiency than the self-operated (shop in shop) model.

In summary, both the self-operated model and the third party platform model are more efficient in supplier performance than in purchasing-stock efficiency, as shown in Fig. 9.10 and Table 9.10. Thus, it is urgent for W firm to strengthen its “first mile delivery” efficiency, because the “first mile delivery” plays the more crucial role in supplier selection and inventory control. From an e-commerce logistics view, self-operated logistics can improve service quality and efficiency by controlling each section from warehouse to consumer, including “the last mile delivery”, and is hence more efficient in the coordination of the supply chain. But the complicated supply chain network and growing product categories make most e-retail players tend towards a hybrid form of 3PL and self-logistics.

3.2 Credit Risk Evaluation with Kernel-Based Affine Subspace Nearest Points Learning Method

This subsection presents a novel kernel-based method named the kernel affine subspace nearest point (KASNP) method for credit evaluation. KASNP extends the affine subspace nearest point (ASNP) method [114, 115] by the kernel trick. Compared with SVM, KASNP solves an unconstrained optimization problem, which avoids the convex quadratic programming process and directly computes the optimal solution from the training set. On three credit datasets, our experimental results show that KASNP is more effective and competitive.

3.2.1 Affine Subspace Nearest Point Algorithm

The idea of affine subspace nearest point algorithm is derived from the geometric SVM and its nearest-points problem. Here we first give a brief overview of the geometric interpretation and the nearest point problem of SVM in original space.

3.2.1.1 Nearest Point Problem of SVM

Given a set S, co(S) denotes the convex hull of S, and is the set of convex combinations of all elements of S:

$$ {\displaystyle \begin{array}{c} \text{co}\left(\boldsymbol{S}\right)=\left\{{\sum}_k{\alpha}_k{\boldsymbol{x}}_k|{\boldsymbol{x}}_k\in \boldsymbol{S},{\alpha}_k\ge 0,\sum \limits_k{\alpha}_k=1\right\}\end{array}} $$
(9.52)

For the linearly separable binary case, given training data (x 1, y 1), (x 2, y 2), …, (x l, y l), x i ∈ R d, y i ∈ {+1, −1}, i = 1, …, l, where y i is the class label, let S 1 = {(x i, y i) | y i = +1} and S 2 = {(x i, y i) | y i = −1}; then the convex hulls of the two sets are

$$ {\displaystyle \begin{array}{c} \text{co}\left({\boldsymbol{S}}_1\right)=\left\{{\sum}_{i:{y}_i=+1}{\alpha}_i{\boldsymbol{x}}_i|{\sum}_{i:{y}_i=+1}{\alpha}_i=1,{\alpha}_i\ge 0\right\}\end{array}} $$
(9.53)
$$ {\displaystyle \begin{array}{c} \text{co}\left({\boldsymbol{S}}_2\right)=\left\{{\sum}_{i:{y}_i=-1}{\alpha}_i{\boldsymbol{x}}_i|{\sum}_{i:{y}_i=-1}{\alpha}_i=1,{\alpha}_i\ge 0\right\}\end{array}} $$
(9.54)

As we know, the aim of the standard SVM is to find the hyperplane that separates the training data without errors and maximizes the distance (called the margin) from the closest vectors to it. In fact, from a geometric view, the optimal separating hyperplane is exactly the one that is orthogonal to and bisects the shortest line segment joining the convex hulls of the two sets, and the optimization problem of SVM is equivalent to the nearest point problem (NPP) in the convex hulls [116]. The geometric interpretation and nearest point problem of SVM can be easily understood from Fig. 9.11, and the nearest point problem can be formulated as

$$ {\displaystyle \begin{array}{c}\begin{array}{c}{\min}_{\alpha }{\left\Vert {\sum}_{i:{y}_i=+1}{\alpha}_i{\boldsymbol{x}}_i-{\sum}_{i:{y}_i=-1}{\alpha}_i{\boldsymbol{x}}_i\right\Vert}^2\\ {}\ \mathrm{s}.\mathrm{t}.{\sum}_{i:{y}_i=+1}{\alpha}_i=1,{\sum}_{i:{y}_i=-1}{\alpha}_i=1\\ {}{\alpha}_i\ge 0,i=1,\dots, l\end{array}\end{array}} $$
(9.55)
Fig. 9.11 The geometric interpretation and nearest point problem of SVM. co(S1) and co(S2) are the two smallest convex sets (convex hulls), shown with dashed lines, that contain each class; c and d are the nearest points on them

If \( {\boldsymbol{\alpha}}^{\ast }=\left({\alpha}_1^{\ast },{\alpha}_2^{\ast },\dots, {\alpha}_l^{\ast}\right) \) is the solution to the convex quadratic optimization Eq. (9.55), then the nearest points in the two convex hulls are \( \boldsymbol{c}={\sum}_{i:{y}_i=+1}{\alpha}_i^{\ast }{\boldsymbol{x}}_i \) and \( \boldsymbol{d}={\sum}_{i:{y}_i=-1}{\alpha}_i^{\ast }{\boldsymbol{x}}_i \). Constructing the decision boundary f(x) = w ⋅ x + b to be the perpendicular bisector of the line segment joining the two nearest points means that w lies along the line segment and the midpoint p of the segment satisfies f(x) = 0. w and p can be computed from c and d as w = c − d and p = (1/2)(c + d), and then b = −w ⋅ p. In the end, the classification discriminant function can be written as f(x) = sgn(w ⋅ x + b), where sgn(⋅) is the sign function.
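
The step from the nearest points to the classifier is a few lines of linear algebra; the sketch below uses two made-up nearest points c and d.

```python
# Minimal numerical sketch of turning the nearest points c and d into
# the separating hyperplane, as described above (the points are made up).
import numpy as np

c = np.array([2.0, 1.0])          # nearest point in co(S1)
d = np.array([0.0, 0.0])          # nearest point in co(S2)

w = c - d                          # normal vector of the hyperplane
p = 0.5 * (c + d)                  # midpoint of the joining segment
b = -w @ p                         # so that f(p) = w.p + b = 0

def f(x):
    return np.sign(w @ x + b)      # decision function

print(f(np.array([3.0, 2.0])), f(np.array([-1.0, -1.0])))   # 1.0 -1.0
```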

Similar to the above process of the geometric method of SVM, ASNP method [114] extends the areas searched for the nearest points from the convex hulls in SVM to affine subspaces, and constructs the decision hyperplane separating the affine subspaces with equivalent margin.

3.2.2 Affine Subspace Nearest Points (ASNP) Algorithm

Definition 9.1

(Affine subspace, Lee and Seung [117]). Given a sample set S = {x 1, …, x m}, x i ∈ R d, the affine subspace spanned by S can be written as Eq. (9.56) or Eq. (9.57):

$$ {\displaystyle \begin{array}{c}H\left(\boldsymbol{S}\right)=\left\{{\sum}_{i=1}^m{\alpha}_i{\boldsymbol{x}}_i|{\sum}_{i=1}^m{\alpha}_i=1\right\}\end{array}} $$
(9.56)
$$ {\displaystyle \begin{array}{c}H\left(\boldsymbol{S}\right)=\left\{{\boldsymbol{x}}_0+{\sum}_{i=1}^m{\alpha}_i\left({\boldsymbol{x}}_i-{\boldsymbol{x}}_0\right)\right\},{\boldsymbol{x}}_0\in H\left(\boldsymbol{S}\right)\end{array}} $$
(9.57)

For Eq. (9.56), we can get rid of the constraint \( {\sum}_{i=1}^m{\alpha}_i=1 \) by taking a point in H(S) as a new origin x 0. Therefore the equivalent of Eq. (9.56) can be written as Eq. (9.57). We can let x 0 be the average of all samples, \( {x}_0=\frac{1}{m}{\sum}_{i=1}^m{x}_i \).

In order to interpret the affine subspace geometrically, we depict it in Fig. 9.12.

Fig. 9.12 The affine subspace H(S) created by the three-sample set S. F is the space the three samples lie in. The inner area of the triangle, shown with dashed lines, is the convex hull co(S), whereas the minimum hyperplane that contains the triangle is the affine subspace H(S)

Compared with the convex hull co(S), the affine subspace contains the convex hull, but is not constrained by α i ≥ 0 (see Eq. 9.56). The convex hull only contains the interpolations of the basis vectors, whereas the affine subspace contains not only the convex hull but also the linear extrapolations.

For a binary-class problem with training sets S 1 = {x 1, x 2, …, x m} and S 2 = {x m + 1, x m + 2, …, x n}, the two affine subspaces respectively spanned by them are

$$ H\left({\boldsymbol{S}}_1\right)=\left\{{\sum}_{i=1}^m{\alpha}_i{\boldsymbol{x}}_i|{\sum}_{i=1}^m{\alpha}_i=1\right\} $$
(9.58)
$$ H\left({\boldsymbol{S}}_2\right)=\left\{{\sum}_{i=m+1}^n{\alpha}_i{\boldsymbol{x}}_i|{\sum}_{i=m+1}^n{\alpha}_i=1\right\} $$
(9.59)

Then the problem of finding the closest points in affine subspaces can be written as the following optimization problem:

$$ {\displaystyle \begin{array}{c}{\min}_{\boldsymbol{\alpha}}{\left\Vert {\sum}_{i=1}^m{\alpha}_i{\boldsymbol{x}}_i-{\sum}_{i=m+1}^n{\alpha}_i{\boldsymbol{x}}_i\right\Vert}^2\\ {}\ \mathrm{s}.\mathrm{t}.\ {\sum}_{i=1}^m{\alpha}_i=1,\kern0.5em {\sum}_{i=m+1}^n{\alpha}_i=1\end{array}} $$
(9.60)

Compared with the nearest point problem of SVM in Eq. (9.55), Eq. (9.60) is not subject to the constraint α i ≥ 0, and it can also be converted into an unconstrained optimization problem as follows.

Following the representation in Eq. (9.57), Eqs. (9.58) and (9.59) can be rewritten in unconstrained form as Eqs. (9.61) and (9.62):

$$ H\left({\boldsymbol{S}}_1\right)=\left\{{\overline{\boldsymbol{u}}}_1+{\sum}_{i=1}^m{\alpha}_i\left({\boldsymbol{x}}_i-{\overline{\boldsymbol{u}}}_1\right)\right\} $$
(9.61)
$$ H\left({\boldsymbol{S}}_2\right)=\left\{{\overline{\boldsymbol{u}}}_2+{\sum}_{i=m+1}^n{\alpha}_i\left({\boldsymbol{x}}_i-{\overline{\boldsymbol{u}}}_2\right)\right\} $$
(9.62)

where \( {\overline{\boldsymbol{u}}}_1=\frac{1}{m}{\sum}_{i=1}^m{\boldsymbol{x}}_i \) and \( {\overline{\boldsymbol{u}}}_2=\left(\frac{1}{n-m}\right){\sum}_{i=m+1}^n{\boldsymbol{x}}_i \).

So Eq. (9.60) can be rewritten as

$$ {\displaystyle \begin{array}{c}{\min}_{\alpha }{\left\Vert \left({\overline{\boldsymbol{u}}}_1+{\sum}_{i=1}^m{\alpha}_i\left({\boldsymbol{x}}_i-{\overline{\boldsymbol{u}}}_1\right)\right)-\left({\overline{\boldsymbol{u}}}_2+{\sum}_{i=m+1}^n{\alpha}_i\left({\boldsymbol{x}}_i-{\overline{\boldsymbol{u}}}_2\right)\right)\right\Vert}^2\end{array}} $$
(9.63)

where α = (α 1, α 2, …, α n)T.

Equation (9.63) is an unconstrained optimization problem that can be solved directly, and α is

$$ {\displaystyle \begin{array}{c}\alpha ={\left({\boldsymbol{A}}^{\mathrm{T}}\boldsymbol{A}\right)}^{+}{\boldsymbol{A}}^{\mathrm{T}}\left({\overline{\boldsymbol{u}}}_1-{\overline{\boldsymbol{u}}}_2\right)\end{array}} $$
(9.64)

Or

$$ {\displaystyle \begin{array}{c}\alpha ={\left({\boldsymbol{A}}^T\boldsymbol{A}+\sigma \boldsymbol{I}\right)}^{-1}{\boldsymbol{A}}^T\left({\overline{\boldsymbol{u}}}_1-{\overline{\boldsymbol{u}}}_2\right)\end{array}} $$
(9.65)

where \( \boldsymbol{A}=\left(\left({\overline{\boldsymbol{u}}}_1-{\boldsymbol{x}}_1\right),\dots, \left({\overline{\boldsymbol{u}}}_1-{\boldsymbol{x}}_m\right),\left({\boldsymbol{x}}_{m+1}-{\overline{\boldsymbol{u}}}_2\right),\dots, \left({\boldsymbol{x}}_n-{\overline{\boldsymbol{u}}}_2\right)\right) \), (A T A)+ is the pseudo-inverse of A T A, σ ≥ 0, and I is the n × n identity matrix.

Then the two nearest points in affine subspaces are

$$ {\displaystyle \begin{array}{c}\boldsymbol{c}={\overline{\boldsymbol{u}}}_1+{\sum}_{i=1}^m{\alpha}_i\left({\boldsymbol{x}}_i-{\overline{\boldsymbol{u}}}_1\right)\end{array}} $$
(9.66)
$$ {\displaystyle \begin{array}{c}\boldsymbol{d}={\overline{\boldsymbol{u}}}_2+{\sum}_{i=m+1}^n{\alpha}_i\left({\boldsymbol{x}}_i-{\overline{\boldsymbol{u}}}_2\right)\end{array}} $$
(9.67)

The midpoint of the line segment joining c and d is p = (1/2)(c + d). Similar to the nearest point problem of SVM, the decision boundary w ⋅ x + b = 0 is the perpendicular bisector of the line segment. Thus, w = c − d and b = −w ⋅ p. Correspondingly, the decision function is

$$ {\displaystyle \begin{array}{c}f\left(\boldsymbol{x}\right)=\operatorname{sgn}\left(\boldsymbol{w}\cdot \boldsymbol{x}+b\right)\\ {}=\operatorname{sgn}\left({\sum}_{i=1}^n{y}_i{\alpha}_i\left({\boldsymbol{x}}_i\cdot \boldsymbol{x}\right)-\frac{1}{2}{\sum}_{i=1}^n{\sum}_{j=1}^n{y}_i{\alpha}_i{\alpha}_j\left({\boldsymbol{x}}_i\cdot {\boldsymbol{x}}_j\right)\right)\end{array}} $$
(9.68)

From the above process, we can see that ASNP, by computing the nearest points in the affine subspaces, avoids the convex quadratic programming routine and directly obtains the solution in closed form, as in Eqs. (9.64)–(9.68).
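
A minimal numerical sketch of the linear ASNP procedure is given below; the two toy classes (two points each, so that each affine subspace is a line) and the test points are illustrative assumptions.

```python
# Minimal sketch of linear ASNP (Eqs. 9.61-9.68): alpha is obtained in
# closed form with the pseudo-inverse, then the nearest points, the
# hyperplane and the decision function follow. The toy data are made up.
import numpy as np

S1 = np.array([[2.0, 2.0], [3.0, 3.0]])     # class +1 samples
S2 = np.array([[-1.0, -2.0], [0.0, -1.0]])  # class -1 samples
m = len(S1)

u1, u2 = S1.mean(axis=0), S2.mean(axis=0)
# columns of A: (u1 - x_i) for class 1, then (x_i - u2) for class 2
A = np.hstack([u1[:, None] - S1.T, S2.T - u2[:, None]])

alpha = np.linalg.pinv(A.T @ A) @ A.T @ (u1 - u2)   # Eq. (9.64)
a1, a2 = alpha[:m], alpha[m:]

c = u1 + (S1 - u1).T @ a1                            # Eq. (9.66)
d = u2 + (S2 - u2).T @ a2                            # Eq. (9.67)
w, p = c - d, 0.5 * (c + d)
b = -w @ p

def predict(x):
    return np.sign(w @ x + b)

print(predict(np.array([2.0, 3.0])), predict(np.array([1.0, -1.0])))  # 1.0 -1.0
```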

We have introduced the linear ASNP above. In the real world, however, some data distributions are more complex and nonlinear. When the convex hulls intersect (i.e., the data are not linearly separable), the distance between the nearest points of the convex hulls is zero. Similarly, when the affine subspaces intersect, the distance in ASNP is also zero. For nonlinearly distributed data, SVM introduces the kernel trick to transform the nonlinear problem into a linear one (i.e., one in which the convex hulls do not intersect) theoretically. The kernel method has now been widely applied in classification problems and has become an effective method for nonlinear or complex data. In order to deal with nonlinear problems, we extend the ASNP algorithm to a nonlinear KASNP algorithm by the kernel trick in this section.

3.2.3 Kernel Affine Subspace Nearest Points (KASNP) Algorithm

3.2.3.1 Kernel Method and Kernel Trick

The kernel method [91, 118] is an algorithm that, by replacing the inner product with an appropriate positive definite function, implicitly performs a nonlinear mapping Φ of the input data from R d into a high-dimensional feature space H. To compute the dot products (Φ(x) ⋅ Φ(x′)), we employ a kernel representation of the form k(x, x′) = (Φ(x) ⋅ Φ(x′)), which allows us to compute the value of the dot products in H without having to actually carry out the mapping Φ.

Cover’s theorem states that if the transformation is nonlinear and the dimensionality of the feature space is high enough, then the input space may be transformed into a new feature space where the patterns are linearly separable with high probability [119]. That is, when the decision function is not a linear function of the data, the data can be mapped from the input space into a high dimensional feature space by a nonlinear transformation. In this high dimensional feature space, a generalized optimal separating hyperplane is constructed. This nonlinear transformation just can be performed in an implicit way through the kernel methods. Thus the basic principle behind kernel-based algorithms is that a nonlinear mapping is used to extend the input space into a higher-dimensional feature space. Implementing a linear algorithm in the feature space then corresponds to a nonlinear version of the algorithm in the original input space. Kernel-based classification algorithms, primarily in Support Vector Machines (SVM), have gained a great deal of popularity in machine learning fields [91, 118, 120, 121].

Common choices of kernel function are the linear kernel k(x, y) = (x ⋅ y), the polynomial kernel k(x, y) = (1 + (x ⋅ y))d, the radial basis function (RBF) kernel k(x, y) = exp(−(1/2)(‖x − y‖/σ)2), and the sigmoid kernel k(x, y) = tanh(b(x ⋅ y) − c). In this section, we adopt the linear kernel and the RBF kernel for the experiments.
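
A short sketch of how the corresponding kernel (Gram) matrices can be computed is given below; the random data and the parameter values are illustrative.

```python
# Minimal sketch of computing kernel (Gram) matrices for the linear,
# polynomial and RBF kernels listed above; the data are illustrative.
import numpy as np

X = np.random.RandomState(0).randn(5, 3)   # 5 samples, 3 features

def linear_kernel(X):
    return X @ X.T

def poly_kernel(X, degree=3):
    return (1.0 + X @ X.T) ** degree

def rbf_kernel(X, sigma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    return np.exp(-0.5 * d2 / sigma**2)

print(rbf_kernel(X).shape)   # (5, 5), symmetric, ones on the diagonal
```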

3.2.3.2 Kernel Affine Subspace Nearest Points (KASNP) Algorithm

Suppose a nonlinear mapping Φ of the input data in R d into a high-dimensional feature space H. In the space H, we construct the ASNP classifier. Similar to the linear case (see Eq. 9.63), the optimization problem for the closest points in H can be written as:

$$ {\min}_{\alpha }{\left\Vert \left({\overline{\boldsymbol{u}}}_1+{\sum}_{i=1}^m{\alpha}_i\left(\varPhi \left({\boldsymbol{x}}_i\right)-{\overline{\boldsymbol{u}}}_1\right)\right)-\left({\overline{\boldsymbol{u}}}_2+{\sum}_{i=m+1}^n{\alpha}_i\left(\varPhi \left({\boldsymbol{x}}_i\right)-{\overline{\boldsymbol{u}}}_2\right)\right)\right\Vert}^2 $$
(9.69)

where \( {\overline{\boldsymbol{u}}}_1=\frac{1}{m}{\sum}_{i=1}^m\varPhi \left({\boldsymbol{x}}_i\right) \) and \( {\overline{\boldsymbol{u}}}_2=\frac{1}{n-m}{\sum}_{i=m+1}^n\varPhi \left({\boldsymbol{x}}_i\right) \).

Let \( \boldsymbol{A}=\left({\overline{\boldsymbol{u}}}_1-\varPhi \left({\boldsymbol{x}}_1\right),\dots, {\overline{\boldsymbol{u}}}_1-\varPhi \left({\boldsymbol{x}}_m\right),\varPhi \left({\boldsymbol{x}}_{m+1}\right)-{\overline{\boldsymbol{u}}}_2,\dots, \varPhi \left({\boldsymbol{x}}_n\right)-{\overline{\boldsymbol{u}}}_2\right) \); then formula (9.69) can be written as

$$ {\displaystyle \begin{array}{c}{\min}_{\alpha }f\left(\boldsymbol{\alpha} \right)={\min}_{\alpha }{\left\Vert \left({\overline{\boldsymbol{u}}}_1-{\overline{\boldsymbol{u}}}_2\right)-\boldsymbol{A}\alpha \right\Vert}^2\end{array}} $$
(9.70)

By solving \( \frac{\partial f}{\partial \alpha }=0 \), we have

$$ {\displaystyle \begin{array}{c}{\boldsymbol{A}}^T A\alpha ={\boldsymbol{A}}^T\left({\overline{\boldsymbol{u}}}_1-{\overline{\boldsymbol{u}}}_2\right)\end{array}} $$
(9.71)

In Eq. (9.71) \( {\boldsymbol{A}}^T\boldsymbol{A}\ and\ {\boldsymbol{A}}^T\left({\overline{\boldsymbol{u}}}_1-{\overline{\boldsymbol{u}}}_2\right) \) can be cast in terms of dot products (Φ(x i) ⋅ Φ(x j)) as follows:

$$ {\boldsymbol{A}}^T\boldsymbol{A}={\left({\boldsymbol{M}}^T\boldsymbol{F}+\boldsymbol{E}\right)}^T\left({\varPhi}^T\varPhi \right)\left({\boldsymbol{M}}^T\boldsymbol{F}+\boldsymbol{E}\right) $$
(9.72)
$$ {\boldsymbol{A}}^T\left({\overline{\boldsymbol{u}}}_1-{\overline{\boldsymbol{u}}}_2\right)={\left({\boldsymbol{M}}^T\boldsymbol{F}+\boldsymbol{E}\right)}^T\left({\varPhi}^T\varPhi \right){\boldsymbol{F}}^T{\boldsymbol{m}}^T $$
(9.73)

where Φ = (Φ(x 1), …, Φ(x m), Φ(x m + 1), …, Φ(x n)),

$$ \boldsymbol{M}=\left(\begin{array}{cc}\frac{1}{m}& 0\\ {}0& \frac{1}{n-m}\end{array}\right){\left(\begin{array}{cccccc}1& \cdots & 1& 0& \cdots & 0\\ {}0& \cdots & 0& 1& \cdots & 1\end{array}\right)}_{2\times n}, $$
$$ \boldsymbol{F}={\left(\begin{array}{cccccc}1& \cdots & 1& 0& \cdots & 0\\ {}0& \cdots & 0& 1& \cdots & 1\end{array}\right)}_{2\times n},\kern1em \boldsymbol{m}=\left(\frac{1}{m},\frac{1}{n-m}\right), $$
$$ \boldsymbol{E}=\left[\begin{array}{cc}\begin{array}{ccc}-1& & \\ {}& \ddots & \\ {}& & -1\end{array}& \\ {}& \begin{array}{ccc}1& & \\ {}& \ddots & \\ {}& & 1\end{array}\end{array}\right], $$
$$ {\boldsymbol{\Phi}}^T\boldsymbol{\Phi} =\left(\begin{array}{ccc}\left(\varPhi \left({\boldsymbol{x}}_1\right)\cdot \varPhi \left({\boldsymbol{x}}_1\right)\right)& \cdots & \left(\varPhi \left({\boldsymbol{x}}_1\right)\cdot \varPhi \left({\boldsymbol{x}}_n\right)\right)\\ {}\vdots & \ddots & \vdots \\ {}\left(\varPhi \left({\boldsymbol{x}}_n\right)\cdot \varPhi \left({\boldsymbol{x}}_1\right)\right)& \cdots & \left(\varPhi \left({\boldsymbol{x}}_n\right)\cdot \varPhi \left({\boldsymbol{x}}_n\right)\right)\end{array}\right). $$

Employing kernel representations of the form k(x i, x j) = (Φ(x i) ⋅ Φ(x j)), Φ T Φ is

$$ \boldsymbol{K}={\boldsymbol{\Phi}}^T\boldsymbol{\Phi} =\left(\begin{array}{cccc}k\left({\boldsymbol{x}}_1,{\boldsymbol{x}}_1\right)& k\left({\boldsymbol{x}}_1,{\boldsymbol{x}}_2\right)& \dots & k\left({\boldsymbol{x}}_1,{\boldsymbol{x}}_n\right)\\ {}k\left({\boldsymbol{x}}_2,{\boldsymbol{x}}_1\right)& k\left({\boldsymbol{x}}_2,{\boldsymbol{x}}_2\right)& \dots & k\left({\boldsymbol{x}}_2,{\boldsymbol{x}}_n\right)\\ {}\vdots & \vdots & \ddots & \vdots \\ {}k\left({\boldsymbol{x}}_n,{\boldsymbol{x}}_1\right)& k\left({\boldsymbol{x}}_n,{\boldsymbol{x}}_2\right)& \dots & k\left({\boldsymbol{x}}_n,{\boldsymbol{x}}_n\right)\end{array}\right) $$

Equations (9.72) and (9.73) can be kernelized:

$$ {\displaystyle \begin{array}{c}{\boldsymbol{A}}^T\boldsymbol{A}={\left({\boldsymbol{M}}^T\boldsymbol{F}+\boldsymbol{E}\right)}^T\boldsymbol{K}\left({\boldsymbol{M}}^T\boldsymbol{F}+\boldsymbol{E}\right)\end{array}} $$
(9.74)
$$ {\displaystyle \begin{array}{c}{\boldsymbol{A}}^T\left({\overline{\boldsymbol{u}}}_1-{\overline{\boldsymbol{u}}}_2\right)={\left({\boldsymbol{M}}^T\boldsymbol{F}+\boldsymbol{E}\right)}^T{\boldsymbol{KF}}^T{\boldsymbol{m}}^T\end{array}} $$
(9.75)

So we can directly obtain the solution α of Eq. (9.69):

$$ {\displaystyle \begin{array}{c}\boldsymbol{\alpha} ={\left({\boldsymbol{A}}^T\boldsymbol{A}\right)}^{+}\left({\boldsymbol{A}}^T\left({\overline{\boldsymbol{u}}}_1-{\overline{\boldsymbol{u}}}_2\right)\right)\end{array}} $$
(9.76)

or

$$ {\displaystyle \begin{array}{c}\boldsymbol{\alpha} ={\left({\boldsymbol{A}}^T\boldsymbol{A}+\sigma \boldsymbol{I}\right)}^{-1}\left({\boldsymbol{A}}^T\left({\overline{\boldsymbol{u}}}_1-{\overline{\boldsymbol{u}}}_2\right)\right)\end{array}} $$
(9.77)

where (A T A)+ is the pseudo-inverse of A T A, σ ≥ 0, and I is the n × n identity matrix.

After obtaining the optimal solution α, the two nearest points c and d can be represented by α:

$$ {\displaystyle \begin{array}{l}\boldsymbol{c}={\overline{\boldsymbol{u}}}_1+{\sum}_{i=1}^m{\alpha}_i\left(\Phi \left({\boldsymbol{x}}_i\right)-{\overline{\boldsymbol{u}}}_1\right)\\ {}={\sum}_{i=1}^m\left(\frac{1}{m}\left(1-{\sum}_{j=1}^m{\alpha}_j\right)+{\alpha}_i\right)\Phi \left({\boldsymbol{x}}_i\right)\end{array}} $$
(9.78)
$$ {\displaystyle \begin{array}{l}\boldsymbol{d}={\overline{\boldsymbol{u}}}_2+{\sum}_{i=m+1}^n{\alpha}_i\left(\Phi \left({\boldsymbol{x}}_i\right)-{\overline{\boldsymbol{u}}}_2\right)\\ {}={\sum}_{i=m+1}^n\left(\frac{1}{n-m}\left(1-{\sum}_{j=m+1}^n{\alpha}_j\right)+{\alpha}_i\right)\Phi \left({\boldsymbol{x}}_i\right)\end{array}} $$
(9.79)

Then w, p and b can be written as:

$$ {\displaystyle \begin{array}{c}\boldsymbol{w}=\boldsymbol{c}-\boldsymbol{d}=\Phi {\boldsymbol{v}}_1\end{array}} $$
(9.80)
$$ {\displaystyle \begin{array}{c}\boldsymbol{p}=\frac{1}{2}\left(\boldsymbol{c}+\boldsymbol{d}\right)=\frac{1}{2}\Phi {\boldsymbol{v}}_2\end{array}} $$
(9.81)
$$ {\displaystyle \begin{array}{c}b=-\boldsymbol{w}\cdot \boldsymbol{p}=-\frac{1}{2}{v}_1^T{\Phi}^T\Phi {v}_2=-\frac{1}{2}{v}_1^T\boldsymbol{K}{v}_2\end{array}} $$
(9.82)

where

$$ {\displaystyle \begin{array}{c}{\boldsymbol{v}}_1=\left(\begin{array}{c}\frac{1}{m}\left(1-{\sum}_{i=1}^m{\alpha}_i\right)+{\alpha}_1\\ {}\vdots \\ {}\frac{1}{m}\left(1-{\sum}_{i=1}^m{\alpha}_i\right)+{\alpha}_m\\ {}\frac{-1}{n-m}\left(1-{\sum}_{i=m+1}^n{\alpha}_i\right)-{\alpha}_{m+1}\\ {}\vdots \\ {}\frac{-1}{n-m}\left(1-{\sum}_{i=m+1}^n{\alpha}_i\right)-{\alpha}_n\end{array}\right)\end{array}} $$
(9.83)
$$ {\displaystyle \begin{array}{c}{\boldsymbol{v}}_2=\left(\begin{array}{c}\frac{1}{m}\left(1-{\sum}_{i=1}^m{\alpha}_i\right)+{\alpha}_1\\ {}\vdots \\ {}\frac{1}{m}\left(1-{\sum}_{i=1}^m{\alpha}_i\right)+{\alpha}_m\\ {}\frac{1}{n-m}\left(1-{\sum}_{i=m+1}^n{\alpha}_i\right)+{\alpha}_{m+1}\\ {}\vdots \\ {}\frac{1}{n-m}\left(1-{\sum}_{i=m+1}^n{\alpha}_i\right)+{\alpha}_n\end{array}\right)\end{array}} $$
(9.84)

So the decision boundary (w ⋅ Φ(x)) + b = 0 is

$$ {\displaystyle \begin{array}{c}{\boldsymbol{v}}_1^T{\boldsymbol{k}}_x-\frac{1}{2}{\boldsymbol{v}}_1^T\boldsymbol{K}{\boldsymbol{v}}_2=0\end{array}} $$
(9.85)

where k x = Φ T Φ(x) = (k(x 1, x), k(x 2, x), …, k(x n, x))T.

The decision function f(x) =  sgn (w ⋅ Φ(x) + b) is

$$ {\displaystyle \begin{array}{c}f\left(\boldsymbol{x}\right)=\operatorname{sgn}\left(\boldsymbol{w}\cdot \Phi \left(\boldsymbol{x}\right)+b\right)=\operatorname{sgn}\left({\boldsymbol{v}}_1^T{\boldsymbol{k}}_x-\frac{1}{2}{\boldsymbol{v}}_1^T{\boldsymbol{Kv}}_2\right)\end{array}} $$
(9.86)

According to the previous descriptions, the overall process of the KASNP learning algorithm can be summarized in the following three steps (a brief illustrative sketch follows the list):

  • Step 1: Compute the optimal solution α of the nearest-points problem of KASNP from the training set:

    $$ \boldsymbol{\alpha} ={\left({\boldsymbol{A}}^T\boldsymbol{A}\right)}^{+}\left({\boldsymbol{A}}^T\left({\overline{\boldsymbol{u}}}_1-{\overline{\boldsymbol{u}}}_2\right)\right)\;\mathrm{or}\;\boldsymbol{\alpha} ={\left({\boldsymbol{A}}^T\boldsymbol{A}+\sigma \boldsymbol{I}\right)}^{-1}\left({\boldsymbol{A}}^T\left({\overline{\boldsymbol{u}}}_1-{\overline{\boldsymbol{u}}}_2\right)\right) $$
  • Step 2: Construct the decision boundary from α:

    $$ {\boldsymbol{v}}_1^T{\boldsymbol{k}}_x-\frac{1}{2}{\boldsymbol{v}}_1^T{\boldsymbol{Kv}}_2=0 $$

    Correspondingly, the decision function is

$$ f\left(\boldsymbol{x}\right)=\operatorname{sgn}\left({\boldsymbol{v}}_1^T{\boldsymbol{k}}_x-\frac{1}{2}{\boldsymbol{v}}_1^T{\boldsymbol{Kv}}_2\right) $$
  • Step 3: Test a sample y:

    $$ \text{If}\ f\left(\boldsymbol{y}\right)\ge 0,\boldsymbol{y}\in \text{the\ class\ of}\ {\boldsymbol{S}}_1; \text{otherwise}, \boldsymbol{y}\in \text{the\ class\ of}\ {\boldsymbol{S}}_2 $$

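To make the three steps concrete, here is a minimal Python sketch of the KASNP training and decision stages. It is an illustration rather than the authors' implementation: the intermediate matrix A is built in kernel-coefficient form following our reading of Eqs. (9.74)-(9.79), and the RBF kernel and the regularization constant are arbitrary illustrative choices.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kasnp_fit(X1, X2, kernel, reg=1e-8):
    """Solve the kernelized nearest-points problem; return the expansion
    coefficients v1, v2 of Eqs. (9.83)-(9.84) plus the training data and K."""
    m, n2 = len(X1), len(X2)
    n = m + n2
    X = np.vstack([X1, X2])
    K = kernel(X, X)                                    # kernel matrix K
    u1 = np.r_[np.full(m, 1.0 / m), np.zeros(n2)]       # coefficients of the class-1 mean
    u2 = np.r_[np.zeros(m), np.full(n2, 1.0 / n2)]      # coefficients of the class-2 mean
    # Columns of A written in kernel-coefficient form (signs follow the E matrix).
    C = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0
        C[:, i] = (u1 - e) if i < m else (e - u2)
    AtA = C.T @ K @ C                                   # A^T A            (Eq. 9.74)
    Atb = C.T @ K @ (u1 - u2)                           # A^T (u1 - u2)    (Eq. 9.75)
    alpha = np.linalg.solve(AtA + reg * np.eye(n), Atb) # regularized solve (Eq. 9.77)
    lam1 = (1.0 - alpha[:m].sum()) / m + alpha[:m]      # coefficients of c (Eq. 9.78)
    lam2 = (1.0 - alpha[m:].sum()) / n2 + alpha[m:]     # coefficients of d (Eq. 9.79)
    return np.r_[lam1, -lam2], np.r_[lam1, lam2], X, K  # v1 (9.83), v2 (9.84)

def kasnp_decision(x, v1, v2, X, K, kernel):
    """f(x) = sgn(v1^T k_x - 0.5 * v1^T K v2), Eq. (9.86)."""
    kx = kernel(X, np.atleast_2d(x)).ravel()
    return np.sign(v1 @ kx - 0.5 * (v1 @ K @ v2))
```

A test sample y is then assigned to the class of S 1 when kasnp_decision(y, ...) is non-negative, matching Step 3.
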
3.2.4 Two-Spiral Problem Test

2D two-spiral classification is a classical nonlinear problem and has been particularly popular for testing novel statistical pattern recognition classifiers. The problem is a difficult test case for learning algorithms [122, 123] and is known to give neural networks severe problems, but it can be successfully solved by nonlinear kernel SVMs [124, 125]. In this section, we also test our KASNP with the RBF kernel \( k\left(\boldsymbol{x},\boldsymbol{y}\right)=\exp \left(-\frac{{\left\Vert \boldsymbol{x}-\boldsymbol{y}\right\Vert}^2}{2{\sigma}^2}\right) \) on a 2D two-spiral dataset accessible from the Carnegie Mellon repository [126]. The benchmark dataset, downloaded from http://www.cgi.cs.cmu.edu/afs/cs.cmu.edu/project/vairepository/ai/areas/ai/areas/neural/bench/cmu/0.html, has two classes of spiral-shaped training data points, 97 points for each, and is illustrated in Fig. 9.13. To visualize the separating surface produced by KASNP, the nodes of a 2D grid (0.05 spacing) are tested and marked with different colors (gray and white) to show their class. Figure 9.14 shows the decision region obtained by KASNP. The parameter r of the RBF kernel for KASNP is 0.8.

Fig. 9.13

2D two-spiral dataset: “o” spiral 1, “*” spiral 2

Fig. 9.14

The separation generated by RBF kernel KASNP

In Fig. 9.14, our KASNP constructs a smooth, nonlinear, spiral-shaped separating surface for the 2D two-spiral dataset, which implies that the KASNP classification method can achieve excellent generalization on nonlinear data.
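
A rough sketch of how such a decision-region plot can be reproduced is shown below; it reuses the rbf_kernel, kasnp_fit, and kasnp_decision helpers from the earlier sketch, generates a synthetic stand-in for the CMU two-spiral benchmark (in practice the downloaded file would be used), and colors a 0.05-spaced grid by the sign of the decision function. matplotlib is assumed.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the CMU two-spiral data: 97 points per spiral.
t = np.linspace(np.pi / 2, 3.5 * np.pi, 97)
s1 = np.c_[t * np.cos(t), t * np.sin(t)] / (3.5 * np.pi)   # spiral 1
s2 = -s1                                                    # spiral 2 (mirrored)

kern = lambda X, Y: rbf_kernel(X, Y, sigma=0.8)             # illustrative kernel width
v1, v2, X, K = kasnp_fit(s1, s2, kern)

# Evaluate the decision function at the nodes of a grid with 0.05 spacing.
xs = np.arange(-1.2, 1.2, 0.05)
gx, gy = np.meshgrid(xs, xs)
grid = np.c_[gx.ravel(), gy.ravel()]
labels = np.array([kasnp_decision(p, v1, v2, X, K, kern) for p in grid])

plt.contourf(gx, gy, labels.reshape(gx.shape), levels=[-1.5, 0, 1.5], colors=["white", "0.8"])
plt.plot(s1[:, 0], s1[:, 1], "o", s2[:, 0], s2[:, 1], "*")
plt.show()
```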

3.2.5 Credit Evaluation Applications and Experiments

Credit risk evaluation is a typical classification problem of identifying "good" and "bad" creditors. In this section, we apply KASNP to credit risk evaluation. To test the efficacy of the proposed KASNP for creditor evaluation, we compare it with SVMs using linear and RBF kernels on three real-world credit datasets: an Australian credit dataset, a German credit dataset, and a major US credit dataset. The linear-kernel KASNP used for comparison is equivalent to the original ASNP method [114]; that is, ASNP is a special case of KASNP when the kernel function is linear.

3.2.5.1 Experiment Design

In our experiments, three accuracy measures are used to evaluate the classifiers: "Good" accuracy, "Bad" accuracy, and Total accuracy:

$$ \text{"Good" accuracy}=\frac{\text{number of correctly classified "Good" samples in the test set}}{\text{number of "Good" samples in the test set}} $$
$$ \text{"Bad" accuracy}=\frac{\text{number of correctly classified "Bad" samples in the test set}}{\text{number of "Bad" samples in the test set}} $$
$$ \text{Total accuracy}=\frac{\text{number of correct classifications in the test set}}{\text{number of samples in the test set}} $$

where “Good” accuracy and “Bad” accuracy measure the capacity of the classifiers to identify “Good” and “Bad” clients, respectively. In practice, to prevent credit fraud, the classification accuracy for the risky class must reach an acceptable standard without excessively degrading the accuracy for the other classes. Thus, improving “Bad” accuracy is one of the most important tasks in credit scoring [127].

In the experiments on each dataset, we randomly select p (p = 40, 60, 80, …, 180) samples from each class to train the compared classifiers and use the remaining samples for testing. We repeat each test 20 times and report the mean “Bad”, “Good”, and Total accuracies for each compared classifier. All experiments are carried out on the Matlab 7.0 platform, and the convex quadratic programming problem of SVM is solved using the Matlab optimization tools. The experimental results on the three credit datasets are given in the following subsections.
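
A minimal sketch of this evaluation protocol is given below. It assumes numpy, binary labels (1 for "Good", 0 for "Bad"), and a hypothetical make_classifier factory returning any estimator with fit/predict methods; these are illustrative choices, not the original Matlab setup.

```python
import numpy as np

def credit_accuracies(y_true, y_pred, good_label=1, bad_label=0):
    """Good, Bad, and Total accuracy as defined in the text above."""
    good = y_true == good_label
    bad = y_true == bad_label
    return ((y_pred[good] == good_label).mean(),   # "Good" accuracy
            (y_pred[bad] == bad_label).mean(),     # "Bad" accuracy
            (y_pred == y_true).mean())             # Total accuracy

def repeated_evaluation(X, y, make_classifier, p, repeats=20, seed=0):
    """Draw p training samples per class at random, test on the rest, average over repeats."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        train = np.concatenate([rng.choice(np.flatnonzero(y == c), p, replace=False)
                                for c in np.unique(y)])
        test = np.setdiff1d(np.arange(len(y)), train)
        clf = make_classifier().fit(X[train], y[train])
        scores.append(credit_accuracies(y[test], clf.predict(X[test])))
    return np.mean(scores, axis=0)   # mean ("Good", "Bad", Total) accuracy
```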

3.2.5.2 Results on Australian Credit Dataset

The Australian credit dataset from the UCI Repository of Machine Learning Databases (http://archive.ics.uci.edu/ml/) contains 690 instances of MasterCard applicants, 307 of which are classified as positive and 383 as negative. Each instance has 14 attributes, and all attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. With the number of randomly selected training samples per class varied over 40, 60, …, 180, the “Bad” accuracy, “Good” accuracy, and Total accuracy comparisons of the different methods on the Australian credit dataset are shown in Tables 9.11, 9.12, and 9.13, respectively. The parameter r of the RBF kernel is set to 50,000 for both RBF SVM and RBF KASNP, and the penalty constant C of SVM is ∞.

Table 9.11 “Bad” accuracy (%) comparisons of different methods on Australian dataset
Table 9.12 “Good” accuracy (%) comparisons of different methods on Australian dataset
Table 9.13 Total accuracy (%) comparisons of different methods on Australian dataset

In the above experimental results, for “Bad” accuracy, the nonlinear classifiers RBF SVM and RBF KASNP outperform the two linear classifiers, and RBF KASNP is better than RBF SVM. For “Good” accuracy, linear-kernel KASNP is the best of all classifiers, and its “Good” accuracy reaches 89.74–92.24% (see Table 9.12). In terms of Total accuracy, KASNP dominates the SVMs: linear KASNP reaches the highest Total accuracy when the number of training samples is p = 40, …, 120, and RBF KASNP is the best when p = 140, 160, 180 (see Table 9.13).

3.2.5.3 Results on German Credit Dataset

The German credit dataset from the UCI Repository of Machine Learning Databases (http://archive.ics.uci.edu/ml/) contains 1000 instances: 700 instances of creditworthy applicants and 300 instances whose credit should not be extended. For each instance, 24 numerical attributes describe the credit history, account balances, loan purpose, loan amount, employment status, and personal information. The accuracy comparisons of the classifiers on the German dataset are given in Tables 9.14, 9.15, and 9.16, respectively. The parameter r of the RBF kernel for SVM and KASNP is set to r = 20,000, and the penalty constant C of SVM is set to 1.

From the experimental results in Tables 9.14, 9.15, and 9.16, we can see that the proposed RBF KASNP is slightly better than the others. RBF KASNP has the five highest accuracies (when p = 40, 80, 100, 120, 160) in the “Bad” accuracy comparison and the six best results (when p = 40, 60, 80, 100, 160, 180) for “Good” client identification. For Total accuracy, RBF KASNP achieves the highest accuracy in eight of the comparisons.

Table 9.14 “Bad” accuracy (%) comparisons of different methods on German dataset
Table 9.15 “Good” accuracy (%) comparisons of different methods on German dataset
Table 9.16 Total accuracy (%) comparisons of different methods on German dataset
3.2.5.4 Results on USA Credit Dataset

The last credit card dataset used in our experiments is provided by a major U.S. bank. It contains 6000 records and 66 derived attributes. Among these 6000 records, 960 are bankruptcy accounts and 5040 are “good” status accounts [128]. The “Bad”, “Good” and total accuracy comparisons of the classifiers are shown in Tables 9.17, 9.18, and 9.19 respectively. Parameter r of RBF kernel of SVM and KASNP is r = 10,000, and the penalty constant C of SVM is C = 1.

Table 9.17 “Bad” accuracy (%) comparisons of different methods on USA dataset
Table 9.18 “Good” accuracy (%) comparisons of different methods on USA dataset
Table 9.19 Total accuracy (%) comparisons of different methods on USA dataset

Comparing the results reported in Tables 9.17, 9.18, and 9.19, we find the following: (1) RBF KASNP is superior to the other classifiers in finding “Bad” clients. As can be seen from Table 9.17, using only 80 training samples (40 per class), RBF KASNP achieves the best “Bad” classification result of 81.32%, which is at least 15% higher than the accuracies of the other approaches. (2) For identifying “Good” clients, the four approaches show no clear difference, and RBF SVM and linear KASNP each have four best results in Table 9.18. (3) Overall (see Table 9.19), the two KASNP approaches dominate the SVMs: RBF KASNP performs best when p = 40, …, 120, and linear KASNP outperforms the others when p = 140, 160, 180.

3.2.6 Discussion

From the above experimental results on the three credit datasets, we can conclude that, as a whole, the proposed KASNP is comparable with SVM for creditor classification. As we know, the capacity to find “Bad” clients is an important measure for credit risk evaluation approaches. From the “Bad” accuracy comparisons in Tables 9.11, 9.14, and 9.17, we note that the proposed KASNP with RBF kernel achieves the best performance for identifying “Bad” creditors; especially on the US dataset, KASNP clearly outperformed the other approaches. In total performance, RBF KASNP also performed better than the SVMs. Thus, the RBF KASNP classifier delivers better performance on the risky class. Moreover, for “Good” client identification, linear KASNP is a good classifier; especially on the Australian dataset, linear KASNP obtained excellent “Good” accuracies while keeping its “Bad” accuracies at an acceptable level.

3.3 A Dynamic Assessment Method for Urban Eco-Environmental Quality Evaluation

This subsection proposes an urban eco-environmental quality assessment system and applies it to a dynamic assessment of the Yangtze River Delta and the Pearl River Delta economic zones.

3.3.1 Related Works

3.3.1.1 Assessment of Urban Eco-Environmental Quality

With the rapid surge in urbanization around the world, a series of urban eco-environmental problems has emerged. In 1962, Carson described the destruction of the urban eco-environment in Silent Spring for the first time, which attracted wide attention. In 1971, the United Nations Educational, Scientific and Cultural Organization launched the ‘Man and the Biosphere’ research project, which focused on the eco-environment of human settlements and carried out urban research from the theories and views of human ecology [129]. Schneider pointed out that, in contrast with the common sense of many urban sociologists and environmentalists, the basic urban issues are not clean air and water, not endangered species or the environment, not energy, nor investment in urban housing construction and renovation, but the association structure of the human environment, that is, the city, and that it is necessary to build a harmoniously developing city to solve the problem [29]. In 1984, Yanitsky envisioned a human settlement in which economy, society, and nature develop in coordination. In 1998, Bohm studied the special urban development process of Vienna in Austria: although the population did not change significantly, the residential area, road area, and energy consumption increased significantly, while urban green space was reduced significantly. The United Nations conference on environment and development held in Rio de Janeiro, Brazil, pointed out that environmental issues will be the largest challenge of the twenty-first century. The urban eco-environmental quality problem has been an active research field for years [115, 130,131,132].

3.3.1.2 Sensitivity Analysis

Multi-attribute evaluation (MAE) is used when the available options are fixed and the number of evaluation alternatives is limited [133]. The reliability of the evaluation results is tested by sensitivity analysis. For a limited set of alternatives, two kinds of parameters determine the ranking of the alternatives: one is the relative importance among attributes, that is, the attribute weights; the other is the attribute values corresponding to each alternative.

Early studies of sensitivity analysis focused on the key attribute weights [134, 135]. Starr [136], Isaacs [137], Fishburn [138], and Evans [139] studied the maximum region within which the weights can change while the order of the alternatives remains constant. French and Insua [140] determined the potential competitors to the current optimal solution with the minimum distance method. Masuda [141] and Armacost and Hosseini [142] studied sensitivity analysis for the analytic hierarchy process (AHP). Ringuest [143] studied the distance between the set of weights closest to the original weights and the original weights for which the optimal solution remains unchanged.

3.3.1.3 Urban Eco-Environmental Quality Index System

Here, an Urban Eco-Environmental Quality Index System is proposed to assess urban eco-environmental development and quality level.

To build an Urban Eco-Environmental Quality Index System, the following principles should be followed.

  • People-oriented principle. The core of the urban eco-environment is the ‘human’, who is both the creator and the bearer of the urban eco-environment. Therefore, the assessment index system should not only cover what is closely related to people’s daily lives but also reflect both objective conditions and people’s subjective experience of the environment.

  • Comprehensiveness principle. The assessment index system must reflect all aspects of the urban eco-environment, including living conditions, the natural environment, the social environment, and infrastructure indicators.

  • Representative principle. The assessment index system should reflect the main features of urban eco-environment. Both qualitative indicators and quantitative indicators should be included.

3.3.2 Selecting Indicators

According to previous studies [144,145,146], we selected 25 comprehensive evaluation indices from four perspectives (population, natural, economic, and social ecological indicators) to establish the index system, which includes both cost-based and efficiency-based indicators [147]. The details of all indicators are shown in Table 9.20.

Table 9.20 Urban eco-environmental quality index system

These indicators are collected from the ‘China City Statistical Yearbook’ and the ‘China Statistical Yearbook for Regional Economy’. In order to increase the comparability of the indices, we convert the indicators to relative ratios, such as

$$ {\displaystyle \begin{array}{l}\mathrm{percentage}\ \mathrm{of}\ \mathrm{hospital}\ \mathrm{doctors}\ \mathrm{in}\ \mathrm{urban}\ \mathrm{population}=\frac{\mathrm{hospital}\ \mathrm{doctors}}{\mathrm{urban}\ \mathrm{population}}\times 100\%\\ {}\mathrm{percentage}\ \mathrm{of}\ \mathrm{in}\mathrm{vestment}\ \mathrm{in}\ \mathrm{science}\ \mathrm{and}\ \mathrm{education}\ \mathrm{in}\ \mathrm{fiscal}\ \mathrm{expenditure}\\ \quad =\frac{\mathrm{investment}\ \mathrm{in}\ \mathrm{science}\ \mathrm{and}\ \mathrm{education}}{\mathrm{fiscal}\ \mathrm{expenditure}}\times 100\%\end{array}} $$
3.3.2.1 Evaluation Method

The proposed evaluation method includes three steps: data preprocessing, dynamic assessment, and sensitivity analysis.

In data preprocessing, the evaluation index system is set up and the data are processed. The evaluation index system is based on ecological theory and the advice of experts; in data processing, the data are cleaned and transformed. A dynamic assessment model is then proposed to evaluate the urban eco-environmental quality, and the sensitivity of the attribute weights and values is analyzed.

Figure 9.15 shows the structure of the proposed evaluation model. In the following subsections, we will present the details of the models and methods in proposed framework.

Fig. 9.15

The evaluation framework flow chart

3.3.2.2 Multi-criteria Decision Making Method

Multi-criteria decision making (MCDM) is a decision analysis methodology that has been developed since the 1970s. MCDM is the study of methods and procedures by which concerns about multiple conflicting criteria can be formally incorporated into the management planning process and the optimal alternative can be identified from a set of alternatives. In the following subsections, the MCDM-related methods integrated in this research, namely the Entropy Method, Grey Relational Analysis (GRA), and the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), are discussed.

3.3.2.3 Entropy Method

In this research, we introduce the concept of entropy, an information-theoretic quantity also known as the average amount of information, to measure the information carried by each indicator. The index weights are calculated by the Entropy Method: according to the degree of dispersion of each index, the weights of all indicators are computed from their information entropy. The Entropy Method is highly reliable and can be easily applied in information measurement. The calculation steps are as follows (a short sketch follows the steps):

Suppose we have a decision matrix B with m alternatives and n indicators:

  1.

    In matrix B, the feature weight p ij of the ith alternative under the jth factor is:

    $$ {\displaystyle \begin{array}{c}{p}_{ij}=\frac{y_{ij}}{\sum_{i=1}^m{y}_{ij}}\left(1\le i\le m,1\le j\le n\right)\end{array}} $$
    (9.87)
  2.

    The output entropy e j of the jth factor becomes

    $$ {\displaystyle \begin{array}{c}{e}_j=-k{\sum}_{i=1}^m{p}_{ij}\;\ln {p}_{ij}\left(k=1/\ln m;1\le j\le n\right)\end{array}} $$
    (9.88)
  3.

    The variation coefficient g j of the jth factor is defined by the following equation:

    $$ {g}_j=1-{e}_j,\kern0.5em \left(1\le j\le \mathrm{n}\right) $$
    (9.89)

    Note that the larger g j is, the higher the weight should be.

  4.

    Calculate the entropy weight α j:

    $$ {\alpha}_j=\frac{g_j}{\sum_{j=1}^n{g}_j},\kern0.5em \left(1\le j\le n\right) $$
    (9.90)
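
A minimal numpy sketch of these four steps, assuming a non-negative, already standardized decision matrix Y with m alternatives (rows) and n indicators (columns):

```python
import numpy as np

def entropy_weights(Y):
    """Entropy weights alpha_j of Eqs. (9.87)-(9.90) for an m x n decision matrix Y."""
    m, n = Y.shape
    P = Y / Y.sum(axis=0, keepdims=True)                # feature weights p_ij       (9.87)
    # Zero entries are clipped so that 0 * log(0) contributes 0 to the entropy.
    e = -(P * np.log(np.clip(P, 1e-12, None))).sum(axis=0) / np.log(m)  # entropy e_j (9.88)
    g = 1.0 - e                                         # variation coefficients g_j (9.89)
    return g / g.sum()                                  # entropy weights alpha_j    (9.90)
```
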
3.3.2.4 Grey Relational Analysis Method

Grey relational analysis is a part of grey theory, which can handle imprecise and incomplete information in grey systems. GRA requires only a small sample, its calculation is simple, and its precision is quite high. Specifically, the weights are calculated as follows [148].

Suppose we have the initial matrix R

$$ \mathrm{R}=\left[\begin{array}{cccc}{x}_{11}& {x}_{12}& \cdots & {x}_{1n}\\ {}{x}_{21}& {x}_{22}& \cdots & {x}_{2n}\\ {}\vdots & \vdots & \cdots & \vdots \\ {}{x}_{m1}& {x}_{m2}& \cdots & {x}_{mn}\end{array}\right] $$
  1.

    Standardize the raw matrix R

    $$ {\displaystyle \begin{array}{c}{\mathrm{R}}^{\prime }=\left[\begin{array}{cccc}{x}_{11}^{\prime }& {x}_{12}^{\prime }& \cdots & {x}_{1n}^{\prime}\\ {}{x}_{21}^{\prime }& {x}_{22}^{\prime }& \cdots & {x}_{2n}^{\prime}\\ {}\vdots & \vdots & \cdots & \vdots \\ {}{x}_{m1}^{\prime }& {x}_{m2}^{\prime }& \cdots & {x}_{mn}^{\prime}\end{array}\right]\end{array}} $$
    (9.91)
  2.

    Generate the reference sequence \( {x}_0^{\prime } \)

    $$ {x}_0^{\prime }=\left({x}_0^{\prime }(1),{x}_0^{\prime }(2),\cdots, {x}_0^{\prime }(n)\right) $$
    (9.92)

    \( {x}_0^{\prime }(j) \) is the largest and normalized value in the jth factor.

  3.

    Calculate the difference Δ 0i(j) between the normalized sequences and the reference sequence \( {x}_0^{\prime } \)

    $$ \begin{array}{c}\Delta_{0i}(j)=|x_0^{\prime}(j)-x_{ij}^{\prime}|\\ \Delta =\left[\begin{array}{cccc}{\Delta}_{01}(1)& {\Delta}_{01}(2)& \cdots & {\Delta}_{01}(n)\\ {}{\Delta}_{02}(1)& {\Delta}_{02}(2)& \cdots & {\Delta}_{02}(n)\\ {}\vdots & \vdots & \vdots & \vdots \\ {}{\Delta}_{0m}(1)& {\Delta}_{0m}(2)& \cdots & {\Delta}_{0m}(n)\end{array}\right]\end{array} $$
    (9.93)
  4.

    Compute the grey coefficient r 0i(j):

    $$ {r}_{0i}(j)=\frac{\min_{i=1}^m{\min}_{j=1}^n{\Delta}_{0i}(j)+\delta {\max}_{i=1}^m{\max}_{j=1}^n{\Delta}_{0i}(j)}{\Delta_{0i}(j)+\delta {\max}_{i=1}^m{\max}_{j=1}^n{\Delta}_{0i}(j)} $$
    (9.94)

    where δ is the distinguishing coefficient. Its value is usually set to 0.5, which offers moderate distinguishing effects and good stability.

  5.

    Obtain the grey relational degree value: b i

    $$ {b}_i=\frac{1}{n}{\sum}_{j=1}^n{r}_{0i}(j) $$
    (9.95)
  6.

    Finally, calculate the weight of GRA: β i

    $$ {\beta}_i=\frac{b_i}{\sum_{i=1}^n{b}_i} $$
    (9.96)

In this research, we use Entropy and the GRA method to calculate the normalized weight of the indicators.
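
A compact sketch of the GRA weighting under stated assumptions (numpy, a min-max standardization in step 1 since the chapter does not fix a particular one, and larger-is-better entries); the weights are computed over the rows of the input matrix, following Eqs. (9.91)-(9.96) literally:

```python
import numpy as np

def gra_weights(X, delta=0.5):
    """Grey relational weights beta_i following Eqs. (9.91)-(9.96)."""
    Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # standardization     (9.91)
    x0 = Xn.max(axis=0)                                         # reference sequence  (9.92)
    D = np.abs(x0 - Xn)                                         # differences Delta   (9.93)
    r = (D.min() + delta * D.max()) / (D + delta * D.max())     # grey coefficients   (9.94)
    b = r.mean(axis=1)                                          # relational degrees  (9.95)
    return b / b.sum()                                          # GRA weights beta_i  (9.96)
```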

3.3.2.5 Technique for Order Preference by Similarity to Ideal Solution Method

The Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) was initially developed to rank alternatives over multiple criteria. TOPSIS finds the best alternatives by minimizing the distance to the ideal solution and maximizing the distance to the nadir or negative-ideal solution [34]. All alternatives can then be ranked according to their closeness to the ideal solution. Since its first introduction, a number of extensions and variations of TOPSIS have been developed over the years. The calculation steps are as follows (a short sketch follows the steps):

  1.

    Calculate the normalized decision matrix A. The normalized value a ij is calculated as

    $$ {a}_{ij}=\frac{x_{ij}}{\sqrt{\sum_{i=1}^m{\left({x}_{ij}\right)}^2}}\left(1\le i\le m,1\le j\le n\right) $$
    (9.97)
  2.

    Calculate the weighted normalized decision matrix

    $$ D=\left({a}_{ij}\ast {w}_j\right)\left(1\le i\le \mathrm{m},1\le \mathrm{j}\le \mathrm{n}\right) $$
    (9.98)

    where w j is the weight of the jth criterion, and \( {\sum}_{j=1}^n{w}_j=1 \).

  3.

    Calculate the ideal solution V ∗ and the negative-ideal solution V −

    $$ {\displaystyle \begin{array}{l}{V}^{\ast }=\left\{{v}_1^{\ast },{v}_2^{\ast },\cdots, {v}_n^{\ast}\right\}=\left\{\left(\underset{i}{\max }{v}_{ij}|j\in J\right),\left(\underset{i}{\min }{v}_{ij}|j\in {J}^{\prime}\right)\right\}\\ {}{V}^{-}=\left\{{v}_1^{-},{v}_2^{-},\cdots, {v}_n^{-}\right\}=\left\{\left(\underset{i}{\min }{v}_{ij}|j\in J\right),\left(\underset{i}{\max }{v}_{ij}|j\in {J}^{\prime}\right)\right\}\end{array}} $$
    (9.99)
  4.

    Calculate the separation measures, using the n-dimensional Euclidean distance

    $$ {\displaystyle \begin{array}{l}{S}_i^{+}=\sqrt{\sum_{j=1}^n{\left({v}_{ij}-{v}_j^{\ast}\right)}^2}\left(1\le i\le m\right)\\ {}{S}_i^{-}=\sqrt{\sum_{j=1}^n{\left({v}_{ij}-{v}_j^{-}\right)}^2}\left(1\le i\le m\right)\end{array}} $$
    (9.100)
  5.

    Calculate the relative closeness to the ideal solution

    $$ {\displaystyle \begin{array}{c}{Y}_i=\frac{S_i^{-}}{S_i^{+}+{S}_i^{-}}\left(1\le i\le m\right)\end{array}} $$
    (9.101)

    where Y i ∈ (0, 1). The larger Y i is, the closer the alternative is to the ideal solution.

  6.

    Rank the preference order

The larger the TOPSIS value Y i, the better the alternative.
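
A short numpy sketch of Eqs. (9.97)-(9.101), where X is the m x n decision matrix, w the weight vector summing to one, and benefit a boolean mask marking the benefit criteria J (the cost criteria J′ are its complement):

```python
import numpy as np

def topsis(X, w, benefit):
    """Relative closeness Y_i of each alternative to the ideal solution."""
    A = X / np.sqrt((X ** 2).sum(axis=0))                      # normalization          (9.97)
    D = A * w                                                  # weighted matrix        (9.98)
    v_pos = np.where(benefit, D.max(axis=0), D.min(axis=0))    # ideal solution         (9.99)
    v_neg = np.where(benefit, D.min(axis=0), D.max(axis=0))    # negative-ideal solution
    s_pos = np.sqrt(((D - v_pos) ** 2).sum(axis=1))            # separation measures    (9.100)
    s_neg = np.sqrt(((D - v_neg) ** 2).sum(axis=1))
    return s_neg / (s_pos + s_neg)                             # closeness Y_i          (9.101)
```

Sorting the alternatives by decreasing topsis(X, w, benefit) then gives the preference order of step 6.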

3.3.2.6 Dynamic Assessment Method

Dynamic assessment was introduced by Feuerstein in his 1979 work on the theory, instruments, and techniques of learning potential assessment. The root of its theory can be traced back to ‘the zone of proximal development’ by Vygotsky [149]. As data accumulate over time, a chronological sequence of plane data tables is obtained, called a ‘time series data sheet’. Comprehensive evaluation based on such time series data, whose parameter values are dynamic, is defined as the ‘dynamic comprehensive evaluation’ problem [150].

3.3.2.7 Dimension Reduction for Time Series Data

With the proposed dynamic TOPSIS model, the three-dimensional time series data are reduced to two-dimensional data using the time–weight vector described in the following subsection. The time–weight vector w = (w 1, w 2, …, w p)T represents the degree of emphasis placed on different times, according to different criteria. The ‘time–weight vector entropy’ I is given as \( I=-{\sum}_{k=1}^p{w}_k\;\ln\;{w}_k \), and the ‘time degree’ T is \( T={\sum}_{k=1}^p{w}_k\frac{p-k}{p-1} \), where p is the number of years.

The ‘time degree’ T indicates the degree to which the aggregation operator values a time interval. It can take a value between 0 and 1 to reflect the attitude of a decision maker as shown in Table 9.21. T = 0 implies that time weighted vector w becomes (0, 0, …, 1) and the element with the latest time value obtains the largest weight. T = 1 implies that time weighted vector w becomes (1, 0, …, 0) and the element with the earliest time value obtains the largest weight. T = 0.5 implies that data elements of different years have the same importance.

Table 9.21 The meaning of the time degree T

The criterion for determining the time–weight vector is, for a given ‘time degree’ T, to extract as much sample information as possible while accounting for the differences among the evaluated samples over time. The time–weight vector can be calculated by solving:

$$ \left\{\begin{array}{l}\mathit{\operatorname{MAX}}\left(-{\sum}_{k=1}^p{w}_k\ln {w}_k\right)\\ {}s.t.T={\sum}_{k=1}^p{w}_k\frac{p-k}{p-1}\\ {}{\sum}_{k=1}^p{w}_k=1,{w}_k\in \left[0,1\right],k=1,2,\cdots, p\end{array}\right. $$
(9.102)
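
Eq. (9.102) is a small entropy-maximization problem with linear constraints. The sketch below solves it with scipy.optimize, which is an illustrative solver choice rather than the method used in the original study:

```python
import numpy as np
from scipy.optimize import minimize

def time_weights(p, T):
    """Maximum-entropy time-weight vector w for a given time degree T (Eq. 9.102)."""
    coeff = (p - np.arange(1, p + 1)) / (p - 1)          # (p - k)/(p - 1) for k = 1..p
    neg_entropy = lambda w: np.sum(w * np.log(np.clip(w, 1e-12, None)))
    cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0},   # weights sum to one
            {"type": "eq", "fun": lambda w: coeff @ w - T}]   # time degree equals T
    res = minimize(neg_entropy, np.full(p, 1.0 / p), method="SLSQP",
                   bounds=[(0.0, 1.0)] * p, constraints=cons)
    return res.x
```

For example, w = time_weights(5, 0.3) emphasizes the more recent of five years (T < 0.5); the dynamic score of Eq. (9.103) below is then simply w @ Y, where Y holds the yearly TOPSIS closeness values.
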

3.3.3 Dynamic Technique for Order Preference by Similarity to Ideal Solution Evaluation Method

The dynamic TOPSIS evaluation method, based on a dynamic assessment model, is used to assess eco-environmental quality; the proposed method uses the time–weight vector to handle the three-dimensional time series data [151]. In this model, through MCDM (TOPSIS), the two-dimensional data are reduced to one-dimensional data to dynamically assess the quality of the urban eco-environment. The steps of the proposed dynamic assessment method are as follows:

  1.

    Determine the evaluation index system, according to the ecological theory.

  2.

    Data preprocessing and standardization.

  3.

    Use multi-attribute evaluation method to determine the combination weight.

  4.

    Use MCDM: TOPSIS method to assess the level of urban eco-environmental quality from 2005 to 2009.

  5.

    Create a dynamic assessment model as

    $$ {\displaystyle \begin{array}{c}Z={\alpha}_1{Y}_1+{\alpha}_2{Y}_2+\cdots {\alpha}_i{Y}_i+\cdots +{\alpha}_n{Y}_n\left(i=1,2,\cdots n\right)\end{array}} $$
    (9.103)

    where Y i is defined in Eq. (9.101) and is the relative closeness degree of the urban eco-environmental quality for each year obtained by the TOPSIS method, and α i is the time weight w i determined by Eq. (9.102).

Calculate the utility value of urban eco-environmental quality.

3.3.3.1 Dynamic Sensitivity Analysis

There are two aspects of sensitivity analysis: one is the sensitivity analysis of the attribute weights, and the other is the sensitivity analysis of the attribute values. However, previous studies on sensitivity analysis are static assessments, which do not reflect the influence of time [152].

Dynamic sensitivity analysis considers the influence of the dynamic time–weight vector on the decision-maker’s final decisions. Because of the uncertainty of the time–weight vector, the assessment results are uncertain, so a sensitivity analysis of the dynamic assessment method is necessary and critical.

Assume that the weight w k of index T k has a small fluctuation Δw k; the changed weight is defined as \( {w}_k^{\ast }={w}_k+\Delta {w}_k \), whereas the other weights remain unchanged. After normalization, we obtain

$$ {\displaystyle \begin{array}{c}{w}_k^{\prime }=\frac{w_k+\Delta {w}_k}{w_1+{w}_2+\cdots +\left({w}_k+\Delta {w}_k\right)+\cdots +{w}_n}\\ {}=\frac{w_k^{\ast }}{w_1+{w}_2+\cdots +{w}_k^{\ast }+\cdots +{w}_n},\kern0.5em k=1,2,\cdots, n\end{array}} $$
(9.104)

The stable range of the index T k is

$$ \left\{\begin{array}{l}\Delta {w}_k>-{w}_k,{y}_{ik}={y}_{tk}\\ {}-{w}_k<\Delta {w}_k<{\sum}_{j=1}^n\frac{\left({y}_{ij}-{y}_{tj}\right){w}_k}{y_{tj}-{y}_{ij}},{y}_{ik}<{y}_{tk}\\ {}\Delta {w}_k>\max \left[{\sum}_{j=1}^n\frac{\left({y}_{ij}-{y}_{tj}\right){w}_k}{y_{tj}-{y}_{ij}},-{w}_k\right],{y}_{ik}>{y}_{tk}\end{array}\right. $$
(9.105)
3.3.3.2 K-Means Clustering Algorithm

Clustering analysis divides a data set into several classes, making the data within the same class as similar as possible and the data in different classes as dissimilar as possible [10]. The higher the similarity among objects in the same cluster and the greater the differences among objects in different clusters, the better the clustering quality.

Clustering is ‘the process of dividing physical or abstract objects into similar object classes’ [15]. The steps of the K-means clustering algorithm are as follows:

  1.

    Partition the n objects into k non-empty sets.

  2.

    Select random seed values as the current cluster centers.

  3.

    Assign each object to the cluster with the nearest center.

  4.

    Recompute the cluster centers and repeat the assignment step until there are no new assignments.

In this study, we perform the K-means clustering with the WEKA software [16]; the specific process is shown in Fig. 9.16.

Fig. 9.16

K-means clustering algorithm based on WEKA flow chart
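
As an illustrative alternative to the WEKA workflow in Fig. 9.16, the clustering step can be sketched with scikit-learn; the input matrix below is a random placeholder standing in for the standardized city-by-indicator scores.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

scores = np.random.default_rng(0).random((30, 4))    # placeholder: 30 cities x 4 index groups

X = StandardScaler().fit_transform(scores)           # standardize the indicators
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)                                    # cluster membership of each city
print(km.cluster_centers_)                           # cluster centers
```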

The data for the empirical study are collected from the ‘China City Statistical Yearbook’ and the ‘China Statistical Yearbook for Regional Economy’ between 2005 and 2009 [8].

3.4 An Empirical Study of Classification Algorithm Evaluation for Financial Risk Prediction

This subsection develops an approach to evaluating classification algorithms for financial risk prediction. It constructs a performance score to measure the performance of classification algorithms and introduces MCDM methods to rank the classifiers. An empirical study is designed to assess nine classification algorithms using five performance measures over seven real-life credit risk and fraud risk datasets from six countries. For each performance measure, a performance score is calculated for each selected classification algorithm. The classification algorithms are then ranked by three MCDM methods (i.e., TOPSIS, PROMETHEE, and VIKOR) based on the performance scores.

Another problem in financial risk detection is that the knowledge gap [58] between the results that classification methods can provide and the actions taken based on them remains large. The lack of interaction between industry practitioners and academic researchers makes it hard to discover financial risks or opportunities and hence weakens the value that classification methods may bring to financial risk detection. To deal with the knowledge gap problem, this section combines the classification results, the knowledge discovery in databases (KDD) process, and the concept of chance discovery to build a knowledge-rich financial risk management process, in an attempt to increase the usefulness of classification results in financial risk prediction.

3.4.1 Evaluation Approach for Classification Algorithms

This section develops a two-step process to evaluate classification algorithms for financial risk prediction. In the first step, a performance score is created for each selected classification algorithm. The second step applies three MCDM methods (i.e., TOPSIS, PROMETHEE, and VIKOR) to rank the selected classification algorithms using the performance scores as inputs. This section describes how the performance scores are calculated and gives an overview of the three MCDM methods used in the study.

3.4.1.1 Performance Score

There are a variety of measures for classification algorithms, and these measures have been developed to evaluate very different things. Some studies have shown that a classification algorithm that achieves the best performance according to a given measure on a dataset may not be the best method under a different measure [106, 153]. In addition, characteristics of datasets, such as size, class distribution, or noise, can affect the performance of classifiers. Hence, evaluating the performance of classification algorithms using one or two measures on one or two datasets often proves to be inadequate.

Based on these two considerations, this study constructs a performance metric that assesses the quality of classifiers using a set of measures on a collection of financial risk datasets in an attempt to give a comprehensive evaluation of classification algorithms. The basic idea of this performance metric is similar to ranking methods, which use experimental results generated by a set of classification algorithms on a set of datasets to rank those algorithms [154]. Specifically, it resembles the significant wins (SW) ranking method by conducting pairwise comparisons of classifiers using tests of statistical significance.

3.4.1.1.1 Selection of Performance Measures

Accuracy and error rates are important measures of classification algorithms in financial risk prediction. This work utilizes overall accuracy, precision, true positive rate, true negative rate, and the area under the receiver operating characteristic curve (AUC) to build the performance score. The following paragraphs define and describe these measures; a short sketch follows the list.

  • Accuracy is the percentage of correctly classified instances [15]. It is one of the most widely used classification performance metrics.

    $$ \mathrm{overall}\ \mathrm{accuracy}=\frac{\mathrm{TN}+\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}+\mathrm{TN}} $$

    where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. TP and TN are defined below. FP is the number of negative (normal) instances that are misclassified as the positive (abnormal) class, and FN is the number of positive (abnormal) instances that are misclassified as the negative (normal) class.

  • Precision is the fraction of instances classified as positive (abnormal) that actually are positive instances.

    $$ \mathrm{precision}=\frac{\ \mathrm{TP}\ }{\mathrm{TP}+\mathrm{FP}} $$
  • TP (true positive) is the number of correctly classified positive or abnormal instances. The TP rate measures how well a classifier can recognize abnormal records; it is also called the sensitivity measure. In the case of financial risk detection, abnormal instances are bankrupt, fraudulent, or erroneous accounts. A classifier with a higher TP rate can help financial institutions reduce their potential credit losses more than a classifier with a lower TP rate.

    $$ \mathrm{true}\ \mathrm{positive}\ \mathrm{rate}/\mathrm{sensitivity}=\frac{\ \mathrm{TP}\ }{\mathrm{TP}+\mathrm{FN}} $$
  • TN (true negative) is the number of correctly classified negative or normal instances. TN rate measures how well a classifier can recognize normal records. It is also called specificity measure.

    $$ \mathrm{true}\ \mathrm{negative}\ \mathrm{rate}/\mathrm{specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} $$
  • ROC stands for receiver operating characteristic, which shows the tradeoff between TP rate and FP rate [15]. The area under the ROC (AUC) represents the accuracy of a classifier. The larger the area, the better the classifier.
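
A minimal sketch of these five measures with scikit-learn, assuming binary labels where 1 marks the abnormal (positive) class and y_score holds the classifier's estimated probability of that class:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def risk_measures(y_true, y_pred, y_score):
    """The five measures used to build the performance score."""
    return {
        "overall accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "TP rate": recall_score(y_true, y_pred),              # sensitivity
        "TN rate": recall_score(y_true, y_pred, pos_label=0), # specificity
        "AUC": roc_auc_score(y_true, y_score),
    }
```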

3.4.1.1.2 Calculation of the Performance Score

The performance score is generated by conducting paired t tests at a 5% significance level for each pair of classifiers. The goal of a paired statistical significance test is to evaluate whether the superior or inferior performance of one classifier over another is statistically significant. The performance score for each classifier is calculated as follows (a short sketch follows the steps):

  • Step 1: for each dataset, compare the tenfold cross-validation results of an individual performance measure for two classifiers. The null hypothesis is that the two classifiers perform the same. If the paired significance test (at the 0.05 level) indicates that one classifier is better than the other, the performance scores of the superior and inferior classifiers are set to 1 and −1, respectively. If the test indicates that the null hypothesis cannot be rejected, the performance scores of both classifiers are set to 0.

  • Step 2: repeat Step 1 for all classifier pairs for the dataset tested in Step 1. Then we get performance scores of all classifiers for the specific dataset and specific performance measure.

  • Step 3: repeat Steps 1 and 2 for other datasets included in the experiment. The sum of performance scores from all datasets is the performance score of this classifier for the current performance measure. The larger the score is, the better the classifier performs in this measure.

  • Step 4: repeat Steps 1, 2, and 3 for the other four performance measures to obtain the performance scores of all classifiers for all performance measures.
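
The pairwise scoring of Steps 1-3 can be sketched as follows for a single performance measure; cv_results is a hypothetical nested dictionary of tenfold cross-validation results, and scipy's paired t test is used as described above. Step 4 amounts to repeating the call for each of the five measures.

```python
import numpy as np
from scipy.stats import ttest_rel

def performance_scores(cv_results, alpha=0.05):
    """Significant-wins scores for one measure.

    cv_results: {classifier: {dataset: array of 10 fold results}}.
    Returns {classifier: summed performance score over all datasets}.
    """
    clfs = list(cv_results)
    datasets = list(next(iter(cv_results.values())))
    score = {c: 0 for c in clfs}
    for d in datasets:
        for i, a in enumerate(clfs):
            for b in clfs[i + 1:]:
                t, pval = ttest_rel(cv_results[a][d], cv_results[b][d])
                if pval < alpha:                     # significant difference
                    win, lose = (a, b) if t > 0 else (b, a)
                    score[win] += 1
                    score[lose] -= 1                 # ties (pval >= alpha) add 0 to both
    return score
```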

3.4.1.2 MCDM Methods

To evaluate classification algorithms, normally more than one criterion needs to be examined, such as accuracy, AUC, and misclassification rate. Thus algorithm selection can be modeled as multiple criteria decision making (MCDM) problems [155]. This subsection uses three MCDM methods, i.e., TOPSIS, PROMETHEE, and VIKOR, and explains how they can be used to rank classification algorithms.

3.4.1.2.1 Experiment

The experiment is designed to validate the proposed two-step evaluation approach using nine classification methods over seven real-life credit risk and fraud risk datasets from six countries. The first and second parts of this section give an overview of classification algorithms and financial risk datasets used in the empirical study. The third and fourth parts describe the experimental design and the evaluation results.

3.4.2 Classification Algorithms

The classification algorithms used in the experiment include eight well-known classification techniques and an ensemble method. The eight classification methods are Bayesian Network [93], Naïve Bayes [92], support vector machine (SVM) [90], linear logistic regression [156], k-nearest neighbor [94], C4.5 [87], Repeated Incremental Pruning to Produce Error Reduction (RIPPER) rule induction [96], and radial basis function (RBF) network [89]. All algorithms were implemented using Weka 3.6, a free data mining software package [16].

Bayesian Network and Naïve Bayes both model probabilistic relationships between the predictor variables and the class variable. While the Naïve Bayes classifier estimates the class-conditional probability based on Bayes' theorem and can only represent simple distributions, a Bayesian Network is a probabilistic graphical model that can represent conditional independencies between variables. The SVM classifier uses a nonlinear mapping to transform the training data into a higher dimension and searches for the optimal linear separating hyperplane, which is then used to separate data from different classes [15]. Linear logistic regression models the probability of occurrence of an event as a linear function of a set of predictor variables. k-nearest neighbor classifies a given data instance by analogy, that is, it assigns the instance to the class of the closest training examples in the feature space. C4.5 is a decision tree algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner. RIPPER is a sequential covering algorithm that extracts classification rules directly from the training data without generating a decision tree first [15]. An RBF network is an artificial neural network that uses radial basis functions as activation functions.

In addition to the eight classification techniques, an ensemble method was included in the experiment. An ensemble consists of a set of individually trained classifiers whose predictions are combined when classifying novel instances. There are two fundamental elements of an ensemble: a set of properly trained classifiers and an aggregation mechanism that organizes these classifiers into the output ensemble. This study uses the Vote algorithm in Weka to perform the ensemble method; Vote combines classifiers by averaging their probability estimates [16].
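
In scikit-learn, an analogous probability-averaging ensemble can be sketched with VotingClassifier and voting="soft"; the three member models below are illustrative stand-ins rather than the exact classifier set used in the study.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Soft voting averages the members' class-probability estimates, which is the
# aggregation rule of Weka's Vote meta-classifier described above.
ensemble = VotingClassifier(
    estimators=[("logistic", LogisticRegression(max_iter=1000)),
                ("naive_bayes", GaussianNB()),
                ("tree", DecisionTreeClassifier())],
    voting="soft",
)
# ensemble.fit(X_train, y_train); ensemble.predict_proba(X_test)
```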

3.4.3 Financial Risk Datasets

The datasets used in this study come from six countries and represent four aspects of financial risk: credit approval (credit card application), credit behavior, bankruptcy risk, and fraud risk.

3.4.3.1 German Credit Card Application Dataset (UCI MLR)

The German credit card application dataset comes from UCI machine learning databases. It contains 1000 instances with 24 predictor variables and 1 class variable (UCI). The 24 variables describe the status of existing checking account, credit history, education level, employment status, personal status, age, and so on. The class variable indicates whether an application is accepted or declined. Seventy percent of the instances are accepted applications and 30% are declined instances.

3.4.3.2 Australian Credit Card Application Dataset [87]

This dataset was provided by a large bank and concerns consumer credit card applications. It has 690 instances with 15 predictor variables plus 1 class variable. The class variable indicates whether an application is accepted or declined; 55.5% of the instances are accepted applications and 44.5% are declined.

3.4.3.3 USA Credit Cardholders’ Behavior Dataset [157]

The dataset was from a major US bank and contains 6000 credit card data with 64 predictor variables plus 1 class variable. Each instance has a class label indicating its credit status: either good or bad. Eighty-four percent of the data are good accounts and 16% are bad accounts. Good indicates good status accounts and bad indicates accounts with late payments, delinquency, or bankruptcy. The predictor variables describe account balance, purchase, payment, cash advance, interest charges, date of last payment, times of cash advance, and account open date.

3.4.3.4 China Credit Cardholders’ Behavior Dataset

This dataset was collected by a commercial bank in China and contains 5456 credit card records with 13 attributes. These attributes describe credit cardholders' daily balance, abnormal usage, limit usage rate, first-time use, payment revocation, payment suspension, transaction details, and personal information. Each record in the dataset has a class label denoting the status of the credit card account: either good or bad. There are 91.9% good accounts and 8.1% bad accounts.

3.4.3.5 Japanese Bankruptcy Dataset [158]

This set collects 37 bankrupt Japanese firms and 111 non-bankrupt Japanese firms from various sources during the post-deregulation period of 1989–1999. The final sample firms are those traded on the First Section of the Tokyo Stock Exchange whose financial data are available from the 2000 PACAP database for Japan, compiled by the Pacific-Basin Capital Market (PACAP) Research Center at the University of Rhode Island. Each case has 13 predictor variables and 1 class variable (bankrupt or non-bankrupt). The predictor variables describe the financial state and performance of the firms.

3.4.3.6 Korean Bankruptcy Dataset [159]

This dataset collects bankrupt firms in Korea from 1997 to 2003 from public sources. It consists of 65 bankrupt and 130 non-bankrupt firms that are publicly traded on the Korean Stock Exchange and whose data are available. Each case has 13 predictor variables and one class variable (bankrupt or non-bankrupt).

3.4.3.7 Insurance Dataset [160]

The data were provided by an anonymous US corporation. Each record concerns an insurance claim. The set has 18,875 instances with 103 variables. A binary class attribute indicates whether an instance is a normal or abnormal claim. There are 353 abnormal claims and 18,522 normal claims. The abnormal instances represent fraudulent or erroneous claims and were manually collected and verified.

3.4.4 Experimental Design

The calculation process of the performance score and the three MCDM methods were applied to the nine classifiers over the seven financial risk datasets. The experiment was carried out according to the following process:

  • Input: a financial risk related dataset.

  • Output: ranking of classification algorithms.

  • Step 1: understand business requirements, dataset structure and data mining task.

  • Step 2: prepare target datasets: select and transform relevant features; data cleaning; data integration. Communicate any findings during data preparation with domain experts.

  • Step 3: train and test multiple classification models in randomly sampled partitions (i.e., tenfold cross-validation) using Weka 3.6 [19].

  • Step 4: calculate the performance scores following the process discussed in section “Performance Score”.

  • Step 5: evaluate classification algorithms using TOPSIS, PROMETHEE II, and VIKOR. The performance scores for each classifier obtained from Step 4 are used as inputs to the MCDM methods. All the MCDM methods are implemented using MATLAB.

  • Step 6: generate three separate tables of the final ranking of classification algorithms provided by each MCDM method.

  • Step 7: discuss the results with domain experts. Explore potential chance(s) from data mining results. Go back to Step 1 if new business questions are raised during the process.

  • END

Measures have different importance in financial risk prediction. For example, the false negative (FN) count is the number of positive or abnormal instances that are misclassified as the normal class. Since positive instances are bankrupt, fraudulent, or erroneous accounts in financial risk detection, a classifier with a high FN rate can cause huge losses to creditors. Thus, the FN measure should have higher importance in financial risk prediction than other measures, such as the false positive measure [161]. Another important measure in financial risk prediction is AUC, because it selects optimal models independently of the class distribution and the cost associated with each class.

The weights of the performance measures used in TOPSIS, PROMETHEE, and VIKOR are defined according to these findings from previous research. In this study, the FN rate is not included because it equals one minus the TP rate; the importance of the FN rate in financial risk prediction is therefore reflected in the weight of the TP rate. The weights of the five performance measures are defined as follows: TP rate and AUC are set to 10, and the other three measures (i.e., overall accuracy, precision, and TN rate) are set to 1. The weights are then normalized so that they sum to 1.

3.4.5 Results and Discussion

The results of test set overall accuracy, precision, AUC, TP rate, and TN rate of all classifiers on the seven datasets are reported in Table 9.22. In the dataset column of Table 9.22, Australian indicates the Australian credit card application data; USA indicates the credit cardholders’ behavior data from the United States; China refers to the credit cardholders’ behavior data collected from a Chinese bank; IN indicates the insurance data; German indicates the German credit card application data; and Japan and Korea indicate the Japanese and the Korean bankruptcy data, respectively. The nine classification methods were applied to each dataset using tenfold cross-validation. For each dataset, the best result of a specific performance measure is highlighted in boldface.

Table 9.22 Classification results

When the distribution of classes is highly skewed, as in the IN dataset (1.87% abnormal instances versus 98.13% normal cases), Naïve Bayes and Bayesian Network outperform the other classifiers. Naïve Bayes has the highest TP rate (0.9065), which indicates that it captured 90.65% of the abnormal records, while Bayesian Network achieves a good TN rate (0.8291). Although SVM and RBF network obtained perfect overall accuracy (100%), they failed to identify any abnormal behavior (TP = 0 and FN = 1). For an evenly distributed dataset, such as the Australian data, all classifiers have good overall accuracy and AUC. For small datasets, such as the Japanese bankruptcy data, no classifier produces satisfactory results on AUC and TP rate; however, SVM and the ensemble obtained good AUC and TP rate for the small Korean bankruptcy dataset. For medium-sized datasets, such as the credit cardholders' behavior datasets, linear logistic regression generates the best AUC, while Naïve Bayes and SVM produce the best TP rates. No classification algorithm achieves the best results across all measures for a single dataset or has the best outcome for a single performance measure across all datasets.

Based on the classification results presented in Table 9.22, the performance scores of all classifiers are calculated following the process discussed in section “Calculation of the Performance Score”, and the results are summarized in Table 9.23. For each performance measure, the best result generated by a classification algorithm is highlighted in boldface and italic. Since the performance scores are generated by conducting paired t tests with a significance level of 5% for all classifier pairs across all datasets, a classification algorithm with the highest performance score performs significantly better than the other classifiers for that specific performance measure over the seven datasets. Similar to the classification results reported in Table 9.22, no classifier has the highest performance scores for all five measures, and classifiers with the best scores on some measures may perform poorly on others. For example, SVM achieves the best performance scores on overall accuracy and TN rate, but its scores on precision and AUC are quite low. Therefore, the MCDM methods are introduced to provide a final ranking of the classification algorithms.

Table 9.23 Performance scores of classifiers

The rankings of classifiers generated by TOPSIS, PROMETHEE II, and VIKOR are summarized in Tables 9.24, 9.25, and 9.26, respectively. The results of TOPSIS and PROMETHEE are straightforward: the higher the ranking, the better the classifier. Linear logistic regression, Bayesian Network, and the ensemble method are the top three ranked classifiers under the TOPSIS approach. The same set of classifiers is ranked as the top three by PROMETHEE II; however, the order of Bayesian Network and the ensemble is reversed.

Table 9.24 Results of the TOPSIS approach
Table 9.25 Results of the PROMETHEE II approach
Table 9.26 Results of the VIKOR approach

Since VIKOR provides compromise solutions, the ranking of classifiers needs to be determined by Step 5 of the VIKOR algorithm.

The classifier ranked first by Q cannot be proposed as the single compromise solution because the acceptable-advantage condition Q(a″) − Q(a′) ≥ 1/(J − 1) is not satisfied. Therefore, the alternatives a′, a″, and a‴ are proposed as compromise solutions, since a‴ is the highest-ranked alternative a(M) determined by the relation Q(a(M)) − Q(a′) < 1/(J − 1). That is, the rankings of linear logistic regression, Bayesian Network, and the ensemble method are close to one another according to VIKOR.

The results of Tables 9.23, 9.24, 9.25, and 9.26 indicate that TOPSIS, PROMETHEE II, and VIKOR provide similar top-ranked classification algorithms for financial risk prediction.

3.4.6 Knowledge-Rich Financial Risk Management Process

Even though classification has become a crucial tool in financial risk prediction, most studies focus on developing new algorithms or improving existing ones that can identify suspicious patterns, and have not paid enough attention to the involvement of end users and the actionability of the classification results [83]. This is mainly due to two reasons: (1) the difficulty of accessing real-life financial risk data and (2) limited access to domain experts and background information. The lack of interaction between industry practitioners and academic researchers makes it hard to discover financial risks or opportunities and hence weakens the value that classification methods may bring to financial risk detection.

In an attempt to improve the usefulness of classification results and increase the probability of identifying unusual chances in financial risk analysis, this section proposes a knowledge-rich financial risk management process (Fig. 9.17). Chance discovery (CD) is defined as “the awareness of a chance and the explanation of its significance” [162]. Ohsawa and Fukuda [162] suggested three keys to chance discovery: communicating the significance of an event; enhancing the user's awareness of an event's utility using mental imagery; and revealing the causalities of rare events using data mining methods. Figure 9.17 combines the knowledge discovery in databases (KDD) process model [113], the chance discovery process [162], and the CRISP-DM process model [163]. It emphasizes three keys to chance discovery and knowledge-rich data mining: users, communication, and data mining techniques. Users refer to domain experts and decision makers. Domain experts are knowledgeable about the field, the data collection procedures, and the meaning of the variables. With the assistance of data miners, domain experts can gain insights into financial risk data from different aspects and potentially observe new chances. To turn the identified knowledge into financial or strategic advantages, decision makers, who understand the operational and strategic goals of a company, are required to provide feedback on the importance of the potential new chances and determine what actions should be taken. Moving back and forth between steps is always required; this cyclical nature is illustrated by the outer circle of the chance discovery process in Fig. 9.17.

Fig. 9.17

Knowledge-rich financial risk management process

This study chose the insurance data as an example to examine the proposed process. The business objective of the project was to develop classification model(s) to assist human inspection of suspicious claims. After the business objective had been determined, the dataset was preprocessed for the classification task. During the preparation stage, two issues were raised by the data miners: first, several attributes have missing values for all instances in the dataset; second, the definitions of four attributes are conflicting. From the data miner's point of view, an attribute with completely missing values is useless for data mining tasks and should simply be removed. But from the domain expert's perspective, this is an unusual situation and represents a potential chance for operational improvement: every attribute stored in the database was designed to capture relevant information, and an attribute with completely missing values may indicate errors in the data collection process. After careful examination, the domain experts found the reasons for the missing values and took corrective actions.

Then the nine classifiers were applied to the insurance data using tenfold cross-validation. A classifier with a low false negative (FN) rate can minimize insurance fraud risk because the FN rate denotes the percentage of high-risk claims that are misclassified as normal claims. For this dataset, Naïve Bayes has the lowest FN rate (1 − 0.9065 = 0.0935). Because it achieves the lowest FN rate and provides classification results that can be easily understood and used by domain experts, Naïve Bayes was chosen as the decision classifier. This model can be used to predict high-risk claims, narrow down the set of suspicious records, and accelerate the claim-handling process. The classification results obtained from the data mining step can be further analyzed to provide additional insights about the data. For instance, if some general features of high- or low-risk claims can be identified from the classification results, it may help the insurance company to establish profiles for each type of claim, which may ultimately bring profits to the company.

To summarize, the empirical study demonstrates that introducing the concept of chance discovery into the KDD process can help users choose the most appropriate classifier, promote the awareness of previously unnoticed chances, and increase the usefulness of data mining results.