1 Introduction

Feature selection (also known as variable selection or attribute selection) plays an important role in machine learning and pattern recognition (Hu et al. 2010; Guyon and Elisseeff 2003). Its goal is to select the most effective features from the original feature set according to certain criteria, so as to reduce the dimension of the feature space (Sheng 2000). Through feature selection, irrelevant or redundant features are removed, thereby reducing the computational complexity, improving the estimation accuracy of the learning model and facilitating the intelligibility of the model (Amiri et al. 2011; Cakır et al. 2011).

A great number of feature selection approaches have been developed in recent years. Two key issues in constructing a feature selection approach are the search strategy and the evaluation criteria (Yao et al. 2012; Mao et al. 2007). With respect to the search strategy, global (Somol et al. 2004), heuristic (Dash and Liu 2003) and random (Oh et al. 2004) strategies have been introduced in the literature. An overall review of this issue is presented in Monirul Kabir et al. (2011). With respect to the evaluation criteria, feature selection approaches can be classified into three categories (Monirul Kabir et al. 2011): the filter, the wrapper and the hybrid approach. The wrapper approach (Hsu et al. 2002; Verikas and Bacauskiene 2002; Wang et al. 2008; Zhu et al. 2007) assesses feature subsets with the training accuracy of the learning model. The filter approach (Ke et al. 2008; Sun 2007; Fleuret 2004) assesses features with statistical properties of the training data and is independent of the learning model. In the hybrid approach (Hu et al. 2006; Hsu et al. 2011; Yang et al. 2011), features are first filtered and then determined by the wrapper model. It is often found that the hybrid approach is capable of locating a good solution, whereas a single technique often gets trapped in a premature solution.

From another point of view, feature selection approaches can also be classified into discrete and continuous ones. Discrete approaches assume that all features take values in a finite set, while continuous approaches assume that samples are described by a set of numerical variables. On the whole, existing feature selection approaches are mainly designed for classification problems with discrete or continuous attributes (Liu and Yu 2005; Dash and Liu 1997). However, real-world data usually have mixed attributes. In software cost estimation, for example, the collected data include both discrete and continuous attributes.

For mixed attributes data, existing approaches fall into two categories. One approach is to discretize the continuous attributes (Ferreira and Figueiredo 2011), but discretization brings an inevitable loss of information. Another approach is to granulate the mixed attributes (Hu et al. 2008a). In Hu et al. (2008b), the granulation approach is summarized, and the neighborhood rough set is used to handle mixed attributes. Further, the neighborhood mutual information is defined in Hua et al. (2011) to perform feature selection for high-dimensional mixed attributes data. However, the shortcoming of the granulation approach is that its scale parameter is not easy to determine, which leads to instability.

To deal with mixed attributes data, a hybrid feature selection scheme is constructed in this work. The rest of this paper is organized as follows. In Sect. 2, related works on feature selection for mixed attributes data are reviewed. In Sect. 3, a new correlation measure to be used in the filter model is defined based on mutual information, by addressing the calculation of mutual information between mixed attributes. Section 4 gives a hybrid feature selection scheme: the features are first filtered with a filter model, and then the final feature subset is determined by a wrapper model. Section 5 describes the evaluation metrics and datasets used in our study. Experiments and results are presented in Sect. 6. Finally, conclusions and future works are given in Sect. 7.

2 Related works

In this section, related works on search strategies and evaluation criteria in feature selection are discussed with respect to the processing of mixed attributes.

According to the search strategy, feature selection can be categorized into three categories: global search, random search and heuristic search (Sun et al. 2004; Muni et al. 2006). A global search strategy can find the best feature subset with respect to its evaluation criteria, provided that the number of features to be selected is known in advance. However, this number is hard to determine in advance, and the computational complexity is too high, with a search space of \(O(2^N)\) (where \(N\) is the number of original features) (Somol et al. 2004; Liu and Sun 2007). Random search strategies, such as the genetic algorithm (Ooi and Tan 2003) and ant colony optimization (Ke et al. 2008), find an approximation of the optimum solution in the whole search space. But they suffer from high uncertainty, and their parameters have a great influence on the results. Heuristic search strategies include sequential forward selection (SFS) (Guan et al. 2004), sequential backward selection (SBS) (Abe 2005) and so on. They obtain high computational efficiency at the cost of global optimality. In Peng et al. (2005), a fast feature selection scheme is designed, where the minimal redundancy and maximal relevance (mRMR) criterion is used to improve the feature selection performance. In our work, the idea of the mRMR criterion is adopted and improved to match the characteristics of mixed attributes data.

According to the evaluation criteria, feature selection can be categorized into three categories: the filter, the wrapper and the hybrid approach. The filter model achieves higher computational speed by assessing features with statistical properties of the training data, without assuming any learning model between the outputs and inputs of the data. In the framework of feature selection given by Yu and Liu (2004), the target is to maximize the correlation between the selected features and the decision variable, and to minimize the correlation among the selected features. Therefore, how to measure the correlation between features is a crucial point in a filter model. It relies on various measures of the general characteristics of the training data, such as distance, information, dependency and consistency (Liu and Motoda 1998). Among these measures, mutual information (Kwak and Choi 2002) is the most widely used, because it does not require knowledge of the sample distribution, does not need any transformation of the data, and can measure the degree of uncertainty between features in a quantified form. However, the mutual information-based correlation measure can only be defined between continuous variables or between discrete variables. Kwak and Choi (2002) proposed a method for calculating mutual information between mixed attributes, under the assumption that all samples have the same probability of occurrence. In our work, a new method for calculating mutual information between mixed attributes is proposed with this assumption removed. Then, a new correlation measure is defined based on mutual information.

The wrapper approach assesses feature subsets with the training accuracy of the learning model, and usually yields high fitting accuracy at the cost of high computational complexity. In Hsu (2004), the genetic algorithm is used to find a feature subset with the smallest classification error rate of a decision tree. In Chiang and Pell (2004), Fisher discriminant analysis is combined with the genetic algorithm to identify the pivotal variables in the failure process of chemical engineering. In Guyon et al. (2002), the importance of features is measured by the classification performance of a support vector machine, based on which a classifier is constituted. In Michalak and Kwasnicka (2006), a wrapper model is constituted based on a two-pronged correlation strategy. In Monirul Kabir et al. (2010), neural networks are used to present a wrapper feature selection algorithm. All the above models focus on classification problems, in which the target is to improve the classification accuracy. However, when the decision variable is continuous, the wrapper model has to be re-designed. To deal with mixed attributes, the case-based reasoning (CBR) approach is adopted in this paper. CBR is a well-established methodology with broad applications that is good at dealing with mixed attributes data. Its fundamental principle is that, given a new project, the most similar historical projects are selected by a similarity measure to estimate the new project.

The hybrid approach attempts to combine the advantages of the filter and wrapper approaches, whereas a single technique often gets trapped in a premature solution. Therefore, a hybrid feature selection scheme is proposed in this paper.

3 The correlation measure based on mutual information

3.1 The calculation of mutual information between mixed attributes

3.1.1 Entropy and mutual information

In information theory, entropy is a measure of the uncertainty of a random variable. Let \(X\) be a discrete random variable with a range of \(\Phi \) and probability distribution function \(p(x)=P\{X=x\}\); then the entropy of \(X\) is defined as:

$$\begin{aligned} H(X)=-\sum \limits _{x\in \Phi } {p(x)\log p(x)} \end{aligned}$$
(1)

For two discrete random variables \(X\) and \(Y\) (the range of \(Y\) being \(\Omega )\), let their joint probability density function be \(p(x,y)\), then the joint entropy of \(X\) and \(Y\) is defined as:

$$\begin{aligned} H(X,Y)=-\sum \limits _{x\in \Phi } {\sum \limits _{y\in \Omega } {p(x,y)\log p(x,y)} } \end{aligned}$$
(2)

When \(X\) is known, the conditional entropy of \(Y\) is defined as:

$$\begin{aligned} H(Y\left| X \right. )=\sum \limits _{x\in \Phi } {p(x)H(Y\left| x \right. )=-\sum \limits _{x\in \Phi } {\sum \limits _{y\in \Omega } {p(x,y)} } \log p(y\left| x \right. )} \end{aligned}$$
(3)

Therefore, the relationship between joint entropy and conditional entropy is:

$$\begin{aligned} H(X,Y)=H(X)+H(Y\left| X \right. )=H(Y)+H(X\left| Y \right. ) \end{aligned}$$
(4)

Mutual information defines the shared information between two random variables:

$$\begin{aligned} I(X,Y)=\sum \limits _{x\in \Phi } {\sum \limits _{y\in \Omega } {p(x,y)} } \log \frac{p(x,y)}{p(x)p(y)} \end{aligned}$$
(5)

in which \(p(x)=\sum \nolimits _{y\in \Omega } {p(x,y)} \), \(p(y)=\sum \nolimits _{x\in \Phi } {p(x,y)} \). The more information shared between \(X\) and \(Y\), the larger is \(I(X,Y)\). When \(I(X,Y)=0\), \(X\) and \(Y\) are independent.

By the above definition, the relationship between entropy, conditional entropy, joint entropy and mutual information is:

$$\begin{aligned} I(X,Y)=H(X)-H(X\left| Y \right. )=H(Y)-H(Y\left| X \right. )=H(X)+H(Y)-H(X,Y) \end{aligned}$$
(6)

The relationship between the above four is illustrated in Fig. 1.

Fig. 1 Relationships between entropy, mutual information, joint entropy, and conditional entropy
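For concreteness, the discrete definitions above can be computed directly from sample frequencies. The following is a minimal sketch (the function names are illustrative, not part of the paper) that estimates \(H(X)\) by Eq. (1) and \(I(X,Y)\) via the identity of Eq. (6):

```python
import numpy as np
from collections import Counter

def entropy_discrete(values):
    """Empirical entropy of a discrete variable, Eq. (1), from observed frequencies."""
    n = len(values)
    return -sum((c / n) * np.log(c / n) for c in Counter(values).values())

def mi_discrete(x, y):
    """Empirical mutual information of two discrete variables via Eq. (6):
    I(X,Y) = H(X) + H(Y) - H(X,Y)."""
    joint = list(zip(x, y))
    return entropy_discrete(x) + entropy_discrete(y) - entropy_discrete(joint)

# Example: Y identical to X gives I(X,Y) = H(X); an unrelated Z gives I(X,Z) near 0.
x = [0, 0, 1, 1, 2, 2]
print(mi_discrete(x, x), mi_discrete(x, [0, 1, 0, 1, 0, 1]))
```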

When \(X\) and \(Y\) are continuous, the entropy, conditional entropy, joint entropy and mutual information are, respectively, defined as:

$$\begin{aligned} H(X)=-\int {p(x)\log p(x)\mathrm{d}x} \end{aligned}$$
(7)
$$\begin{aligned} H(Y\left| X \right. )=-\int {\int {p(x,y)\log p(y\left| x \right. )\mathrm{d}x} } \mathrm{d}y \end{aligned}$$
(8)
$$\begin{aligned} H(X,Y)=-\int {\int {p(x,y)\log p(x,y)\mathrm{d}x} } \mathrm{d}y \end{aligned}$$
(9)
$$\begin{aligned} I(X,Y)=\int {\int {p(x,y)\log \frac{p(x,y)}{p(x)p(y)}\mathrm{d}x} } \mathrm{d}y \end{aligned}$$
(10)

in which \(p(x)=\int _y {p(x,y)\mathrm{d}y} \), \(p(y)=\int _x {p(x,y)\mathrm{d}x} \).

3.1.2 The calculation of mutual information between continuous attributes

For two discrete variables, the mutual information can be calculated with Eq. (5) after their joint distribution and marginal distributions are estimated. However, when \(X\) and \(Y\) are continuous, it is practically impossible to compute the exact integration in Eq. (10). Therefore, approximate estimators have been proposed. Existing methods include the histogram method, the kernel density method and the neighbor method (Schaffernicht et al. 2010). Schaffernicht et al. (2010) recommended the Gaussian kernel density method after comparing the above methods on different datasets. Therefore, the mutual information of continuous variables is estimated with the kernel density method in this paper.

Let \(\mathrm{\mathbf{X}}=\left\{ {\mathrm{\mathbf{x}}_1 ,\mathrm{\mathbf{x}}_2 ,\ldots ,\mathrm{\mathbf{x}}_n } \right\} \) be a dataset of \(n\) \(d\)-dimensional samples. The approximation of the density function has the following form:

$$\begin{aligned} \hat{p}(\mathrm{\mathbf{x}})=\frac{1}{n}\sum \limits _{i=1}^n {\delta (\mathrm{\mathbf{x}}-\mathrm{\mathbf{x}}_i ,h)}, \end{aligned}$$
(11)

where \(\delta (\cdot )\) is the Parzen window function and \(h\) is the window width. Parzen has proven that, with properly chosen \(\delta (\cdot )\) and \(h\), the estimation \(\hat{p}(\mathrm{\mathbf{x}})\) converges to the true density \(p(\mathrm{\mathbf{x}})\) as \(n\) tends to infinity. Usually, \(\delta (\cdot )\) is chosen as the Gaussian window:

$$\begin{aligned} \delta (\mathbf{z })=\frac{1}{(2\pi )^{d/2}h^{d}\left| \Sigma \right| ^{1/2}}\exp \left( -\frac{\mathbf{z }^{T}\Sigma ^{-1}\mathbf{z }}{2h^{2}}\right) , \end{aligned}$$
(12)

where \(\mathbf{z }=\mathbf{x }-\mathbf{x }_i \) and \(\Sigma \) is the covariance matrix of \(\mathbf{z }\). The window width is practically set as \(h=\left( {\frac{4}{d+2}}\right) ^{1/{(d+4)}}n^{-\frac{1}{d+4}}\).

The mutual information between \(X\) and \(Y\) can be estimated with Eqs. (10)–(12) and \(d=2\).
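As an illustration of the kernel density route just described, the sketch below estimates \(I(X,Y)\) for two continuous variables by building Parzen-window estimates of the joint and marginal densities (Eqs. 11 and 12, with the window width given above) and then averaging the log density ratio over the samples as a resubstitution approximation of Eq. (10). The function names, the use of the sample covariance as a stand-in for \(\Sigma\), and the resubstitution averaging are our assumptions rather than prescriptions of the paper:

```python
import numpy as np

def silverman_h(n, d):
    """Window width h = (4/(d+2))^(1/(d+4)) * n^(-1/(d+4)) from Sect. 3.1.2."""
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))

def parzen_density(query, data, h):
    """Parzen-window density estimate (Eq. 11) with a Gaussian window (Eq. 12).

    query: (m, d) points where the density is evaluated; data: (n, d) samples.
    The sample covariance is used as a stand-in for the covariance of z.
    """
    query, data = np.atleast_2d(query), np.atleast_2d(data)
    n, d = data.shape
    cov = np.cov(data, rowvar=False) if n > 1 else np.eye(d)
    cov = np.asarray(cov, dtype=float).reshape(d, d) + 1e-10 * np.eye(d)
    inv, det = np.linalg.inv(cov), np.linalg.det(cov)
    norm = (2.0 * np.pi) ** (d / 2.0) * h ** d * np.sqrt(det)
    z = query[:, None, :] - data[None, :, :]          # differences z = x - x_i
    quad = np.einsum('mnd,de,mne->mn', z, inv, z)     # z^T Sigma^{-1} z
    return np.exp(-quad / (2.0 * h ** 2)).sum(axis=1) / (n * norm)

def mi_continuous(x, y):
    """Resubstitution approximation of Eq. (10) for two continuous variables."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    n = x.size
    xy = np.column_stack([x, y])
    p_xy = parzen_density(xy, xy, silverman_h(n, 2))  # joint density with d = 2
    p_x = parzen_density(x[:, None], x[:, None], silverman_h(n, 1))
    p_y = parzen_density(y[:, None], y[:, None], silverman_h(n, 1))
    return float(np.mean(np.log(p_xy / (p_x * p_y))))
```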

3.1.3 The calculation of mutual information between mixed attributes

Let \(X\) be a continuous variable with a range of \(\Phi \), and \(Y\) be a discrete variable with a range of \(\left\{ {y_1 ,y_2 ,\ldots ,y_m } \right\} \). By Eq. (6), the mutual information is represented as follows:

$$\begin{aligned} I(X,Y)=H(X)-H(X\left| Y \right. )=H(X)-\sum \limits _{i=1}^m {p(y_i )H(X\left| {y_i } \right. )} \end{aligned}$$
(13)

In Eq. (13), we need to calculate \(H(X)\) and \(H(X\left| {y_i } \right. )\). To calculate \(H(X)\), the density function \(p(x)\) of \(X\) is estimated using Eq. (11):

$$\begin{aligned} \hat{p}(x)=\frac{1}{n}\sum \limits _{i=1}^n {\delta (x-x_i ,h_Y )}, \end{aligned}$$
(14)

where \(h_Y =\left( {\frac{4}{3}}\right) ^{1/5} n^{-\frac{1}{5}}\).

Replacing the integration with a summation over the sample points, the estimation of \(H(X)\) is obtained as:

$$\begin{aligned} \hat{H}(X)=-\sum \limits _{i=1}^n {\hat{p}(x_i )\log \hat{p}(x_i )} \end{aligned}$$
(15)

To calculate \(H(X\left| {y_i } \right. )\), we have

$$\begin{aligned} H(X\left| {y_i } \right. )=-\int \limits _x {p(x\left| {y_i } \right. )\log p(x\left| {y_i } \right. )\mathrm{d}x} \end{aligned}$$
(16)

Let \(n_i \) be the number of samples with \(Y=y_i \) and \(I_i \) be the set of indices of the samples with \(Y=y_i \); then the estimation of \(p(x\left| {y_i } \right. )\) is

$$\begin{aligned} \hat{p}(x\left| {y_i } \right. )=\frac{1}{n_i }\sum \limits _{j\in I_i } {\delta (x-x_j ,h_i )}, \end{aligned}$$
(17)

where \(h_i =\left( {\frac{4}{3}}\right) ^{1/5} n_i ^{-\frac{1}{5}}\). Replacing the integration in Eq. (16) with a summation over the sample points, the estimation of \(H(X\left| {y_i } \right. )\) is obtained as

$$\begin{aligned} \hat{H}(X\left| {y_i } \right. )=-\sum \limits _{j\in I_i } {\hat{p}(x_j \left| {y_i } \right. )\log \hat{p}(x_j \left| {y_i } \right. )} \end{aligned}$$
(18)

Let the estimation of \(p(y_i )\) be \(\hat{p}(y_i )={n_i }/n\), then the mutual information is:

$$\begin{aligned} \hat{I}(X,Y)=-\sum \limits _{j=1}^n {\hat{p}(x_j )\log \hat{p}(x_j )} + \frac{1}{n}\sum \limits _{i=1}^m {\left[ {n_i \sum \limits _{j\in I_i } {\hat{p}(x_j \left| {y_i } \right. )\log \hat{p}(x_j \left| {y_i } \right. )} } \right] } \end{aligned}$$
(19)
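A minimal sketch of the above estimator is given below. It reuses the hypothetical parzen_density and silverman_h helpers from the previous sketch, estimates \(H(X)\) as in Eqs. (14)–(15), and accumulates the class-conditional term of Eq. (19); the function names are illustrative only:

```python
import numpy as np

def mi_mixed(x, y):
    """Estimate I(X,Y) for continuous X and discrete Y via Eqs. (13)-(19)."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y).ravel()
    n = x.size
    # H(X): Parzen estimate of p(x) over all samples (Eq. 14), summed as in Eq. (15)
    p_x = parzen_density(x[:, None], x[:, None], silverman_h(n, 1))
    h_x = -np.sum(p_x * np.log(p_x))
    # Second term of Eq. (19): sum over the classes y_i of Y
    cond_term = 0.0
    for y_i in np.unique(y):
        idx = np.flatnonzero(y == y_i)        # I_i: indices with Y = y_i
        n_i = idx.size                        # n_i
        x_i = x[idx, None]
        p_cond = parzen_density(x_i, x_i, silverman_h(n_i, 1))   # Eq. (17)
        cond_term += n_i * np.sum(p_cond * np.log(p_cond))
    return h_x + cond_term / n                # Eq. (19)
```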

3.2 The correlation measure

For two random variables \(X\) and \(Y\), the correlation measure is defined as the mutual information in Ooi and Tan (2003), based on which a maximum correlation minimum redundancy algorithm is given. However, experiments show that this correlation measure tends to choose features with more values. Therefore, by normalizing the correlation measure into \([0,1]\), a new correlation measure \(C(X,Y)\) is defined as:

$$\begin{aligned} C(X,Y)=\frac{1}{2}\left[ {\frac{I(X,Y)}{H(X)}+\frac{I(X,Y)}{H(Y)}} \right] \end{aligned}$$
(20)

Obviously, the above definition is symmetric, and \(C(X,Y)\) ranges over \([0,1]\): \(C(X,Y)=1\) means that knowing either \(X\) or \(Y\) determines the other, and \(C(X,Y)=0\) means that \(X\) and \(Y\) are independent of each other.
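Given estimates of \(I(X,Y)\), \(H(X)\) and \(H(Y)\), Eq. (20) is a one-line computation. The sketch below assumes the hypothetical Parzen helpers from the previous sketches for the continuous case and empirical frequencies for the discrete case; the function names and the type flag are our own:

```python
import numpy as np

def correlation_measure(mi_xy, h_x, h_y):
    """Normalized correlation C(X,Y) of Eq. (20), given I(X,Y), H(X) and H(Y)."""
    return 0.5 * (mi_xy / h_x + mi_xy / h_y)

def entropy_of(values, continuous):
    """Entropy of one variable: Eq. (15) with the hypothetical parzen_density /
    silverman_h helpers for a continuous variable, Eq. (1) with empirical
    frequencies for a discrete one."""
    values = np.asarray(values)
    if continuous:
        x = values.astype(float).reshape(-1, 1)
        p = parzen_density(x, x, silverman_h(len(x), 1))
        return float(-np.sum(p * np.log(p)))
    _, counts = np.unique(values, return_counts=True)
    freq = counts / counts.sum()
    return float(-np.sum(freq * np.log(freq)))
```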

4 A hybrid feature selection scheme for mixed attributes data

In this section, a hybrid feature selection scheme (HFS) is proposed that takes advantage of both the filter and the wrapper model. In this scheme, \(N\) features are first selected by a filter, and then the number \(N\) is optimized by minimizing the estimation error of the CBR.

4.1 Filter feature selection for mixed attributes data

In filter feature selection based on information criteria, the primary issue is to find a feature subset that is as correlated as possible with the decision variable, while the correlation between the features in the subset is as small as possible. However, in a high-dimensional space, estimating the probability density is difficult and slow. Therefore, the mRMR algorithm introduced an evaluation criterion based on mutual information (Liu and Sun 2007) to select \(N\) features from the original feature set. The algorithm is as follows:

(1) (Initialization) Set \(F\leftarrow \) the whole feature set, \(S\leftarrow \) the empty set, \(y\leftarrow \) the decision variable.

(2) \(\forall f_i \in F\), compute \(I(f_i ,y)\).

(3) Find the feature \(f_i \) that maximizes \(I(f_i ,y)\); set \(F\leftarrow F\backslash \{f_i \}\), \(S\leftarrow \{f_i \}\).

(4) Repeat until the desired number \(N\) of features is selected:

    (a) \(\forall f_i \in F\), \(f_s \in S\), compute \(I(f_i ,f_s )\), if it is not yet available.

    (b) Choose the feature \(f_i \in F\) that maximizes \(J(f_i )=I(f_i ,y)-\frac{1}{\left| S \right| }\sum \nolimits _{s\in S} {I(f_i ,f_s )} \); set \(F\leftarrow F\backslash \{f_i \}\), \(S\leftarrow S\cup \{f_i \}\).

(5) Output the subset \(S\) containing the \(N\) selected features.

However, mRMR based on \(J(f_i )\) tends to select features with more values. Therefore, we replace \(J(f_i )=I(f_i ,y)-\frac{1}{\left| S \right| }\sum \nolimits _{s\in S} {I(f_i ,f_s )}\) in step (b) with \(J(f_i )=I(f_i ,y)-\frac{1}{\left| S \right| }\sum \nolimits _{s\in S} {C(f_i ,f_s )} \), and denote the new algorithm as C-mRMR.
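A compact sketch of the resulting C-mRMR greedy selection is given below. It is written against generic mi and corr callables (for instance, the mutual information and correlation sketches of Sect. 3); the function names and the dictionary-based interface are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def c_mrmr(features, y, mi, corr, n_select):
    """Greedy C-mRMR selection: maximize J(f) = I(f, y) - mean_{s in S} C(f, s).

    features: dict mapping feature name -> 1-D array of values
    y:        decision variable values
    mi:       callable mi(a, b) returning a mutual information estimate
    corr:     callable corr(a, b) returning the correlation measure C of Eq. (20)
    n_select: number N of features to keep
    """
    remaining = dict(features)                          # the set F
    relevance = {f: mi(v, y) for f, v in remaining.items()}
    selected = [max(relevance, key=relevance.get)]      # start from argmax I(f, y)
    remaining.pop(selected[0])
    while len(selected) < n_select and remaining:
        def score(f):
            redundancy = np.mean([corr(remaining[f], features[s]) for s in selected])
            return relevance[f] - redundancy            # J(f) of step (b)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.pop(best)
    return selected
```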

4.2 Determination of the filter’s parameter based on CBR

In the C-mRMR algorithm, the parameter \(N\) remains to be determined. In this study, it is determined by optimizing the estimation accuracy of the CBR. CBR estimates the target case from similar historical cases and usually involves three sub-problems (Li et al. 2009): the similarity measure, the number of analogies and the analogy adaptation.
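The wrapper stage can then be expressed as a simple search over candidate values of \(N\): for each \(N\), run the C-mRMR filter and score the resulting subset by the CBR estimation error on the training data. The sketch below assumes the hypothetical c_mrmr function from Sect. 4.1 and a user-supplied cbr_error callable; it is an illustration of the scheme, not the authors' code:

```python
def hybrid_feature_selection(features, y, mi, corr, cbr_error, n_max):
    """Hybrid scheme: for each candidate N, filter with C-mRMR and keep the
    subset whose CBR estimation error on the training data is smallest.

    cbr_error: callable taking a list of feature names and returning, e.g.,
               the MMRE of CBR on the training set using only those features.
    """
    best_subset, best_err = None, float('inf')
    for n in range(1, n_max + 1):
        subset = c_mrmr(features, y, mi, corr, n)   # filter stage (Sect. 4.1)
        err = cbr_error(subset)                     # wrapper stage (CBR)
        if err < best_err:
            best_subset, best_err = subset, err
    return best_subset
```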

4.2.1 Similarity measure

The similarity measure describes the level of similarity between different samples. Several similarity functions have been proposed; the measures used in this study are the Euclidean distance and the Manhattan distance, since they have been reported to give good results in software cost estimation studies (Chiu and Huang 2007).

The Euclidean distance \(d(p,{p}^{\prime })\) between two samples is computed after the continuous features have been normalized:

$$\begin{aligned} d(p,{p}^{\prime })=\sqrt{\sum \limits _{i=1}^d {w_i \mathrm{Dis}( {f_i ,{f}^{\prime }_i })} } \end{aligned}$$
(21)
$$\begin{aligned} \mathrm{Dis}\left( {f_{i} ,f^{\prime }_{i} } \right) = \left\{ {\begin{array}{*{20}c} {\left( {f_{i} - f^{\prime }_{i} } \right) ^{2} ,} &{} {f_{i} \;\text{ and }\;f^{\prime }_{i} \;\text{ are } \text{ numeric } \text{ or } \text{ ordinal }} \\ {0,} &{} {f_{i} \;\text{ and }\;f^{\prime }_{i} \;\text{ are } \text{ nominal }\;\text{ and }\;f_{i} = f^{\prime }_{i} } \\ {1,} &{} {f_{i} \;\text{ and }\;f^{\prime }_{i} \;\text{ are } \text{ nominal }\;\text{ and }\;f_{i} \ne f^{\prime }_{i} } \\ \end{array} } \right. \end{aligned}$$
(22)

The Manhattan distance is the sum of the absolute distances for each pair of features:

$$\begin{aligned} d(p,{p}^{\prime })=\sum \limits _{i=1}^d {w_i \mathrm{Dis}( {f_i ,{f}^{\prime }_i })} \end{aligned}$$
(23)
$$\begin{aligned} \mathrm{Dis}\left( {f_{i} ,f^{\prime }_{i} } \right) = \left\{ {\begin{array}{*{20}c} {\left| {f_{i} - f^{\prime }_{i} } \right| ,} &{} {f_{i} \;\text{ and }\;f^{\prime }_{i} \;\text{ are } \text{ numeric } \text{ or } \text{ ordinal }} \\ {0,} &{} {f_{i} \;\text{ and }\;f^{\prime }_{i} \;\text{ are } \text{ nominal }\;\text{ and }\;f_{i} = f^{\prime }_{i} } \\ {1,} &{} {f_{i} \;\text{ and }\;f^{\prime }_{i} \;\text{ are } \text{ nominal }\;\text{ and }\;f_{i} \ne f^{\prime }_{i} } \\ \end{array} } \right. , \end{aligned}$$
(24)

where \(p\) and \({p}^{\prime }\) denote two samples, \(f_i \) and \({f}^{\prime }_i \) denote the \(i\)th feature value of \(p\) and \({p}^{\prime }\), \(w_i \in \{0,1\}\) is the weight of the \(i\)th feature (\(w_i =1\) means the \(i\)th feature is selected and \(w_i =0\) means it is not), and \(d\) is the total number of features.
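The two distances can be implemented in one routine that dispatches on the feature type, as sketched below. Nominal values are assumed to be encoded as category codes and continuous features normalized beforehand; the function name and argument layout are illustrative assumptions:

```python
import numpy as np

def mixed_distance(p, q, is_nominal, weights, manhattan=True):
    """Distance d(p, p') for mixed attributes, Eqs. (21)-(24).

    p, q:       1-D arrays of encoded feature values for two samples
                (continuous/ordinal features assumed already normalized)
    is_nominal: boolean mask marking nominal features
    weights:    the 0/1 feature weights w_i (1 = feature selected)
    manhattan:  True for the Manhattan form (Eqs. 23-24), False for the
                Euclidean form (Eqs. 21-22)
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    dis = np.abs(p - q) if manhattan else (p - q) ** 2
    # Nominal features: Dis = 0 when the values agree, 1 when they differ
    dis = np.where(np.asarray(is_nominal), (p != q).astype(float), dis)
    total = float(np.sum(np.asarray(weights) * dis))
    return total if manhattan else np.sqrt(total)
```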

4.2.2 Number of analogies

The number of analogies refers to the number of most similar samples that are used to generate the estimation, with \(K=1\) meaning that only the closest analogy is used. In this study, \(K\in \{1,2,3,4,5\}\) is considered, since this range covers most of the suggested values (Jørgensen et al. 2003).

4.2.3 Analogy adaptation

After the analogies are selected, the final estimation for the new sample is determined by computing a certain statistic of the selected samples. The adaptation techniques used in this study are the closest analogy, the mean and the median of the closest analogies, and the inverse distance weighted mean.

The mean is the average of the costs of \(K(K>1)\) analogies. It is a classical measure of central tendency and treats all analogies as being equally influential on the cost estimates. The median is the median of the costs of \(K(K>1)\) analogies. It is another measure of central tendency and a more robust statistic when the number of analogies increases.

The inverse distance weighted mean allows more similar analogies to have more influence than less similar ones. The formula for the weighted mean is shown in (25):

$$\begin{aligned} \omega _k =\frac{1/( {\delta +d(p,p_k )})}{\sum \nolimits _{i=1}^K {1 /( {\delta +d(p,p_i )})}}, \end{aligned}$$
(25)

where \(K\) is the number of analogies, \(p_k \) represents the \(k\)th closest analogy to the new sample \(p\), \(d(p,p_k )\) is the distance between \(p_k\) and \(p\), and \(\delta \) is a small constant, set to 0.001 in our study.
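Putting the three sub-problems together, a CBR estimate for a new sample can be sketched as below, where dist can be the mixed-attribute distance sketched earlier; the function name, argument layout and string-valued adaptation switch are our assumptions:

```python
import numpy as np

def cbr_estimate(train_X, train_y, query, k, adaptation, dist, delta=0.001):
    """CBR estimation: select the k most similar training samples and adapt.

    adaptation: 'closest', 'mean', 'median' or 'iwm' (inverse distance
                weighted mean, Eq. 25); dist is a distance function such as
                the mixed_distance sketch above.
    """
    train_y = np.asarray(train_y, dtype=float)
    d = np.array([dist(query, x) for x in train_X])
    nearest = np.argsort(d)[:k]
    if adaptation == 'closest':
        return float(train_y[nearest[0]])
    if adaptation == 'mean':
        return float(np.mean(train_y[nearest]))
    if adaptation == 'median':
        return float(np.median(train_y[nearest]))
    w = 1.0 / (delta + d[nearest])            # weights of Eq. (25)
    return float(np.sum(w * train_y[nearest]) / np.sum(w))
```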

5 Evaluation criteria and data sets

5.1 Evaluation criteria

To evaluate the performance of the proposed method, it is compared with existing methods on both the feature selection results and the estimation accuracies. For the estimation accuracy, three evaluation criteria commonly used in existing studies are adopted: the mean magnitude of relative error (MMRE), the median magnitude of relative error (MdMRE) and PRED(0.25). The MMRE is defined as below:

$$\begin{aligned} \mathrm{MMRE}=\frac{1}{n}\times \sum \limits _{i=1}^n {\mathrm{MRE}_i } \end{aligned}$$
(26)
$$\begin{aligned} MRE_i =\frac{\left| {E_i -\hat{E}_i } \right| }{E_i }, \end{aligned}$$
(27)

where \(n\) denotes the number of samples, \(E_i \) denotes the actual effort of the \(i\)th sample, and \(\hat{E}_i \) denotes the estimated effort of the \(i\)th sample. A small MMRE value indicates a low level of estimation error. However, this metric is unbalanced and penalizes overestimation more than underestimation.

The MdMRE is the median of all MREs:

$$\begin{aligned} \mathrm{MdMRE}= \mathrm{median\,(MRE)} \end{aligned}$$
(28)

MdMRE is an aggregate measure which is less sensitive to extreme values. It exhibits a similar pattern to MMRE, but is more likely to select the true model, especially in the underestimation cases.

PRED(0.25) is the percentage of estimations that fall within 25% of the actual value:

$$\begin{aligned} \mathrm{PRED(0.25)}=\frac{1}{n}\times \sum \limits _{i=1}^n {I\left\{ {MRE_i \le 0.25} \right\} } \end{aligned}$$
(29)
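The three criteria are straightforward to compute from the actual and estimated efforts, as in the following sketch (the function name is illustrative):

```python
import numpy as np

def evaluation_metrics(actual, predicted):
    """MMRE, MdMRE and PRED(0.25) of Eqs. (26)-(29)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mre = np.abs(actual - predicted) / actual        # Eq. (27)
    return {
        'MMRE': float(np.mean(mre)),                 # Eq. (26)
        'MdMRE': float(np.median(mre)),              # Eq. (28)
        'PRED(0.25)': float(np.mean(mre <= 0.25)),   # Eq. (29)
    }
```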

5.2 Data sets description

To conveniently compare with other methods, two representative datasets (the Desharnais dataset and the Maxwell dataset) are used for experiments, which have been used by many recent research works, such as Li et al. (2009), Mair et al. (2000), Maxwell (2002), Sentas et al. (2005).

The Desharnais dataset contains two discrete variables ('YearEnd' and 'Language') and nine continuous variables (the remaining variables); the discrete variables can be further classified into one nominal variable and one ordinal variable. The decision variable 'Effort' is continuous. This dataset contains a total of 81 samples, and 4 of the 81 samples are excluded due to missing feature values. A more detailed description of all features is shown in Table 1.

Table 1 Feature definition in Desharnais dataset

The Maxwell dataset, with 62 samples from one of the biggest commercial banks in Finland, is a relatively new software project dataset. The features are described in Table 2. There are 3 continuous variables ('Duration', 'Size' and 'Effort') and 24 discrete variables (the remaining variables); the discrete variables can be further classified into 5 nominal variables and 19 ordinal variables. The decision variable 'Effort' is continuous. The variable 'Time' is eliminated because it has the same meaning as the variable 'Syear'. As the dataset contains only one sample each with the variables 'app' and 'har' taking the value '4', following Maxwell (2002), two new variables 'subapp' and 'subhar' are used instead of the variables 'app' and 'har', where the new variables take values in subsets of the original ones, that is, subapp = {1,2,3,5} and subhar = {1,2,3,5}.

Table 2 Feature definition in Maxwell dataset

6 Experiments

To validate the proposed HFS, the feature selection results and the estimation accuracy of HFS are compared with published works based on the above two datasets.

6.1 The results on Desharnais dataset

6.1.1 Analysis of feature selection

To conduct the feature selection, the parameters (similarity measure, number of analogies and analogy adaptation) of HFS first have to be determined. Following Mair et al. (2000), 87% of the samples are selected as the training set and the remaining 13% as the testing set. Table 3 summarizes the results for different parameter configurations: two distance measures (Euclidean distance and Manhattan distance), five \(K\) values (1, 2, 3, 4 and 5), and four adaptation techniques [closest analogy (CA), mean, inverse distance weighted mean (IWM), and median].

Table 3 Results of different parameters on Desharnais dataset

The results show that, in general, the choice of distance measure has an insignificant influence on the estimation accuracy. As for the adaptation, 'Median' is more stable than the others, and gets slightly better results than 'Mean' and 'IWM' when \(K=4\) and \(K=5\). The choice of \(K\) has some influence on the accuracies. The best configuration on the training set is 'the Manhattan distance', '\(K=4\)' and 'Median'.

Figure 2 shows a histogram of the mutual information values between each feature \(f\) and the decision variable \(y\). When using 'the Manhattan distance', '\(K=4\)' and 'Median', the selected features in turn are 'PointsAjust', 'Entities', 'Transactions', 'PointsNonAdjust' and 'Language', which are the 10th, 7th, 6th, 8th and 5th variables in Fig. 2 (Table 1). From the meaning of the variables, knowing any two of 'PointsAjust', 'PointsNonAdjust' and 'Adjustment', the third can be determined; therefore, the C-mRMR algorithm chooses 'PointsAjust' and 'PointsNonAdjust' for their greater mutual information. The duration of the project is strongly correlated with 'PointsAjust' and 'PointsNonAdjust', so the C-mRMR algorithm did not choose 'Length'. 'Transactions' and 'Entities' carry important information about the software size, so they are retained by the C-mRMR algorithm. Within the same development unit, team and manager experience are relatively fixed, so the C-mRMR algorithm did not select 'TeamExp' and 'ManagerExp'. Though the mutual information between 'Language' and the decision variable is relatively small, 'Language' is relatively independent of the other variables and is therefore retained.

Fig. 2 MI between features and decision variable

Next, HFS (using 'the Manhattan distance', '\(K=4\)' and 'Median') is compared with NMI in Hua et al. (2011), mRMR in Peng et al. (2005) and MICBR in Li et al. (2009). In Li et al. (2009), three-fold cross-validation is used to test the performance of the candidate methods, in which the Desharnais dataset is randomly divided into three different training splits and three testing splits. Following it, the 87% split (87% in the training set and 13% in the validation set) is used in this paper.

Results are shown in Table 4, where the scale parameter of NMI is taken as 0.15 as suggested in Hua et al. (2011), the number of selected features of mRMR in Peng et al. (2005) is determined by optimizing the estimation accuracy of CBR, and the result of MICBR is extracted from Li et al. (2009). The symbol ‘1’ denotes that the feature in its corresponding row is selected by the feature selection method in its corresponding column.

Table 4 Selected features in three data subsets

It can be seen from Table 4 that, although the neighborhood mutual information is able to handle mixed attributes, its scale parameter is not easy to determine, which leads to unstable results. The mRMR algorithm is more stable, but its direct use of mutual information values to measure the correlation leads to all the selected features being continuous, and the fact that all three variables 'PointsAjust', 'PointsNonAdjust' and 'Adjustment' are selected indicates that it is not good at eliminating redundancy for mixed attributes data. The MICBR in Li et al. (2009) excluded the variable 'Language', so its result is less interpretable. In comparison, the proposed HFS is more stable, better able to remove redundancy, and more interpretable.

6.1.2 Analysis of estimation accuracy

In Table 5, the estimation accuracy of HFS is compared with the above three methods, with 87% of the samples selected as the training set and the remaining 13% as the testing set. The result of HFS is extracted from Table 3, and the result of MICBR is extracted from Table 6 of Li et al. (2009) with 'the Manhattan distance', '\(K=4\)' and 'Median'.

Table 5 Comparison of estimation accuracy on the Desharnais dataset

It can be seen from Table 5 that HFS obtains better results on MMRE, PRED(0.25) and MdMRE than the other three methods.

6.2 The results on Maxwell dataset

6.2.1 Analysis of feature selection

Following Maxwell (2002) and Sentas et al. (2005), the 50 projects finished before 1992 are used as the training set, and the 12 projects finished from 1992 to 1993 are used as the testing set. Feature selection is conducted on the training set with the same configuration used for the Desharnais dataset: 'the Manhattan distance', '\(K=4\)' and 'Median'.

Figure 3 shows a histogram of the mutual information values between each feature \(f\) and the decision variable \(y\). With HFS, the selected features in turn are 'Size', 'Dba', 'T12', 'Source', 'T15' and 'T02', which are the 25th, 4th, 20th, 6th, 23rd and 10th variables in Fig. 3 (Table 2).

Fig. 3 MI between features and decision variable

The result of HFS is compared with NMI in Hua et al. (2011), mRMR in Peng et al. (2005) and MICBR in Li et al. (2009), where the scale parameter of NMI is taken as 0.1, 0.15 and 0.2, the number of selected features of mRMR is determined by optimizing the estimation accuracy of CBR, and the result of MICBR is extracted from Li et al. (2009). The symbol '1' denotes that the feature in the corresponding row is selected by the feature selection method in the corresponding column.

It can be seen from Table 6 that the scale parameter of NMI in Hua et al. (2011) is not easy to determine, which leads to unstable results. The mRMR algorithm in Peng et al. (2005) is more stable, but it selects both 'Duration' and 'Size', while the two variables are strongly correlated with each other. This again indicates that mRMR is not good at eliminating redundancy for mixed attributes data. In Li et al. (2009), 'Time' and 'Duration' are treated as numerical variables. This leads to the mutual information value between 'Time' and 'Effort' being different from that between 'Duration' and 'Effort', although 'Time' and 'Duration' have exactly the same meaning. This indicates that the results of Li et al. (2009) are less interpretable. In comparison, the proposed HFS is more stable, better able to remove redundancy, and more interpretable.

Table 6 Selected features for training set of Maxwell dataset

6.2.2 Analysis of estimation accuracy

In Table 7, the estimation accuracy of HFS is compared with the above three methods, with the 50 projects finished before 1992 selected as the training set and the 12 projects finished from 1992 to 1993 as the testing set. The result of MICBR is extracted from Table 11 of Li et al. (2009) with 'the Euclidean distance', '\(K=4\)' and 'Mean'.

Table 7 Comparison of estimation accuracy on the Maxwell dataset

It can be seen from Table 7 that HFS obtains the best MMRE and PRED(0.25) on the training set and the best MMRE and MdMRE on the testing set among the compared methods.

7 Conclusions

Feature selection plays an important role in pattern recognition and machine learning. Traditional feature selection methods are mainly designed for handling classification problems with discrete or continuous features. However, in many practical problems (such as software cost estimation), the collected data often have mixed attributes, with the decision variable being continuous. To deal with these problems, a hybrid feature selection scheme for mixed attributes data is proposed which takes advantage of both filters and wrappers.

To do feature selection, a proper correlation measure for features is essential. In this paper, we first give a method for calculating mutual information between discrete and continuous variables. Then, we use the mutual information to define a new correlation measure suitable for mixed attributes data. With this correlation measure, features are filtered with a parameter \(N\) that remains to be determined. Finally, a CBR-based wrapper model is proposed to determine the parameter \(N\). Experiments show that this method is applicable to feature selection for mixed attributes data, being more stable, more interpretable, and achieving better estimation accuracy.

However, only two software cost estimation datasets are used for experiments in this study, and future work could include applications to other datasets such as the ISBSG database.