1 Introduction

Feature selection is one of the major problems in machine learning [1,2,3]. It is a crucial challenge for several reasons. First, it improves the interpretability of the considered model and makes it possible to discover the relationship between the features and the class (target) variable. Secondly, it helps to devise models with better generalization and greater predictive power [4]. Finally, it reduces the computational cost of fitting the model.

In this paper, we focus on mutual information (MI) based feature selection. This approach has several important advantages. First, MI, unlike some classical measures (e.g. Pearson correlation), captures both linear and non-linear dependencies between random variables. Secondly, MI-based criteria do not depend on any particular model, which makes it possible to find all features associated with the class variable, not only those captured by the employed model. This is particularly important in domains where feature selection itself is the main goal of the analysis, e.g. in human genetics, where finding gene mutations influencing a disease is a crucial problem. Moreover, some advanced MI-based criteria are able to discover interactions between features as well as to take redundancy between features into account. Finally, the information-theoretic approach can be used for both classification and regression tasks, i.e. for a nominal or quantitative class variable, as well as for any type of features. In this work we focus on the classification problem, but the method can be easily extended to regression.

In recent years many algorithms based on mutual information have been proposed. A clear limitation of the existing methods is that they usually take into account only low-order interactions (up to the 3rd order). This can be a serious drawback when complex dependencies exist in the data. For example, recent studies in genetics indicate that high-order interactions between genes may contribute to many complex traits [5] and it is crucial to identify them in order to predict the trait effectively. Taylor et al. [5] give two examples of high-order interactions: a three-locus interaction that influences body weight in a cross of two chicken lines, and genetic interactions involving five or more loci that determine colony morphology in a cross of two yeast strains. We propose a novel criterion called Interaction Information Feature Selection (IIFS) that takes into account both 3-way and 4-way interactions and can be extended to higher-order terms. The basic component of our contribution is interaction information, a non-parametric measure of interaction strength derived from information theory. Our method is a generalization of the Conditional Infomax Feature Extraction (CIFE) criterion [6], whose limitation is that it considers only 3-way interaction terms. We show that our method is able to find interactions which remain undetected by standard approaches. We also prove some theoretical properties of 4-way interaction information and of the novel criterion. Moreover, we experiment with two different methods of multivariate entropy estimation: the plug-in estimator based on data discretization and the knn-based Kozachenko-Leonenko estimator [7].

The paper is structured as follows. In Sect. 2 we recall the definition of interaction information and prove some new theoretical properties of 4-way interaction information. In Sect. 3 we define the problem and review the existing methods. In Sect. 4 we present our method and discuss its theoretical properties. Sect. 5 contains the results of numerical experiments.

2 Interaction Information

First we define the basic quantities used in information theory. We consider a discrete class variable Y and features \(X_1,\ldots ,X_p\), which can be either continuous or discrete. For the sake of simplicity we write the definitions only for discrete variables. We first recall the definition of the entropy of the discrete class variable:

$$\begin{aligned} H(Y)=-\sum _{y}P(Y=y)\log P(Y=y). \end{aligned}$$
(1)

Entropy quantifies the uncertainty of observing random values of Y. If a large mass of the distribution is concentrated on one particular value of Y then the entropy is low. If all values are equally likely then H(Y) is maximal. Let \(S=(X_1,\ldots ,X_m)\) be a subset of the original feature set of size \(m=1,\ldots ,p\). The entropy of S is defined analogously to (1), with the difference that the multivariate probability is used instead of the univariate one. The conditional entropy of S given the class variable Y can be written as

$$\begin{aligned} H(S|Y)=\sum _{y}P(Y=y)H(S|Y=y). \end{aligned}$$
(2)

The joint mutual information between S and class variable Y is

$$\begin{aligned} I(S,Y)=H(S)-H(S|Y). \end{aligned}$$
(3)

This can be interpreted as the amount of uncertainty in S which is removed when Y is known, which is consistent with the intuitive meaning of mutual information as the amount of information that one variable provides about another. Moreover, the conditional mutual information between S and Y given a variable Z is defined as

$$\begin{aligned} I(S,Y|Z)=H(S|Z)-H(S|Y,Z). \end{aligned}$$
(4)
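For discrete data, the quantities (1)–(4) can be estimated by plugging in empirical frequencies. The following minimal sketch is illustrative only (it is not the implementation used in the experiments, and the function names are ours); it computes H(Y), H(S|Y) and I(S,Y) from samples using the identity \(H(S|Y)=H(S,Y)-H(Y)\) together with (3).

```python
import numpy as np
from collections import Counter

def entropy(*columns):
    """Plug-in estimate of the joint entropy of the given discrete columns, cf. (1)."""
    joint = list(zip(*columns))
    p = np.array(list(Counter(joint).values()), dtype=float) / len(joint)
    return -np.sum(p * np.log(p))

def conditional_entropy(s_columns, y):
    """H(S | Y) computed as H(S, Y) - H(Y), which is equivalent to (2)."""
    return entropy(*s_columns, y) - entropy(y)

def mutual_information(s_columns, y):
    """I(S, Y) = H(S) - H(S | Y), i.e. definition (3)."""
    return entropy(*s_columns) - conditional_entropy(s_columns, y)

# toy check: Y is a noisy copy of X1 and independent of X2
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 5000)
x2 = rng.integers(0, 2, 5000)
y = np.where(rng.random(5000) < 0.9, x1, 1 - x1)
print(mutual_information([x1], y))   # clearly positive
print(mutual_information([x2], y))   # close to zero
```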

We recall a definition of m-way interaction information (II) [8, 9]

$$\begin{aligned} II(S)=II(X_1,\ldots ,X_m)=-\sum _{T\subseteq S}(-1)^{|S|-|T|}H(T), \end{aligned}$$
(5)

which generalizes the 3-way interaction information proposed in [10]. For \(m=2\), interaction information reduces to mutual information. The definition of interaction information is identical to that of multivariate mutual information I(S) [10], except for a change of sign in the case of an odd number of variables, i.e. \(II(S)=(-1)^{|S|}I(S)\). II can be understood as the amount of information common to all variables (or sets of variables) that is not present in any subset of these variables. Interestingly, m-way interaction information can also be defined by the recursive formula

$$\begin{aligned} II(X_1,\ldots ,X_m) = II(X_1,\ldots ,X_{m-1}|X_m)-II(X_1,\ldots ,X_{m-1}), \end{aligned}$$
(6)

where \(II(X_1,\ldots ,X_{m-1}|X_m)=\sum _{x}P(X_m=x)II(X_1,\ldots ,X_{m-1}|X_m=x)\). The next formula (also known as the Möbius representation) [11,12,13,14] shows the relationship between II and the joint mutual information \(I(S,Y)\), which will be useful in the context of the proposed feature selection method

$$\begin{aligned} I(S,Y)=I((X_1,\ldots ,X_m),Y)=\sum _{k=1}^{m}\sum _{T\subseteq S:|T|=k}II(T\cup Y). \end{aligned}$$
(7)
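The identity (7) is easy to check numerically. The sketch below (a minimal illustration with hypothetical helper names, not part of the method itself) draws a random joint distribution of \((X_1,X_2,Y)\) and confirms that the joint mutual information equals the sum of interaction informations over all non-empty subsets of S joined with Y.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
p = rng.random((2, 3, 2))
p /= p.sum()                     # random joint pmf of (X1, X2, Y); Y is the last axis

def H(keep):
    """Entropy of the marginal distribution over the kept axes."""
    drop = tuple(a for a in range(p.ndim) if a not in keep)
    q = p.sum(axis=drop)
    return -np.sum(q * np.log(q))

def II(axes):
    """Interaction information of the variables on the given axes, formula (5)."""
    m = len(axes)
    return -sum((-1) ** (m - s) * H(idx)
                for s in range(1, m + 1) for idx in combinations(axes, s))

lhs = H((0, 1)) + H((2,)) - H((0, 1, 2))        # I((X1, X2), Y)
rhs = II((0, 2)) + II((1, 2)) + II((0, 1, 2))   # Moebius representation (7) for m = 2
print(lhs, rhs)                                 # identical up to floating-point error
```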

To better grasp the concept of II, let us discuss 3-way and 4-way interactions in more detail. It follows from the Möbius representation (7) that

$$\begin{aligned} II(X_1,X_2,Y)=I((X_1,X_2),Y)-I(X_1,Y)-I(X_2,Y), \end{aligned}$$
(8)

which indicates that interaction information can be interpreted as the part of the mutual information between \((X_1,X_2)\) and Y which is due solely to the interaction between \(X_1\) and \(X_2\) in predicting Y, i.e. the part of \(I((X_1, X_2), Y)\) which remains after subtracting the individual informations between Y and \(X_1\) and between Y and \(X_2\). In other words, II is obtained by removing the main effects from the term describing the overall dependence between Y and the pair \((X_1,X_2)\). Let us mention here that 3-way interaction information is a commonly used measure for detecting interactions between genes in genome-wide case-control studies [15, 16]. For the 4-way interaction we obtain from (7) and (8) that

$$\begin{aligned} II(X_1,X_2,X_3,Y) ={}&I((X_1,X_2,X_3),Y) \\ &-I((X_1,X_2),Y)-I((X_1,X_3),Y)-I((X_2,X_3),Y) \\ &+I(X_1,Y)+I(X_2,Y)+I(X_3,Y). \end{aligned}$$
(9)

Observe that both terms \(I((X_1,X_2),Y)\) and \(I((X_1,X_3),Y)\) in (9) contain \(I(X_1,Y)\) as a summand (cf. (8)), so \(I(X_1,Y)\) is subtracted twice. To account for this we add \(I(X_1,Y)\) back in the last line of (9). The remaining pairs are treated analogously. The simplest examples of 3-way and 4-way interactions are XOR problems. In XOR, \(Y=1\) when the number of input variables taking value 1 is odd. It is easy to check that the input binary variables are mutually independent and marginally independent of the class variable. For the 3-dimensional case we have \(I(X_1,Y)=I(X_2,Y)=0\) and \(II(X_1,X_2,Y)=I((X_1,X_2),Y)=H(Y)-H(Y|X_1,X_2)=H(Y)=\log (2)\). For the 4-dimensional case all terms in (9), except the first one, are zero, i.e. \(II(X_1,X_2,X_3,Y)=I((X_1,X_2,X_3),Y)=H(Y)-H(Y|X_1,X_2,X_3)=H(Y)=\log (2)\).
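These XOR claims can be verified directly with a short script. The sketch below (again an illustration with our own function names, not the implementation used later) evaluates formula (5) on exhaustive XOR truth tables, so the empirical frequencies coincide with the true probabilities.

```python
import numpy as np
from itertools import combinations, product
from collections import Counter

def entropy(columns):
    """Plug-in entropy of jointly observed discrete columns."""
    joint = list(zip(*columns))
    p = np.array(list(Counter(joint).values()), dtype=float) / len(joint)
    return -np.sum(p * np.log(p))

def interaction_information(columns):
    """m-way interaction information, formula (5); H of the empty set is zero and is skipped."""
    m = len(columns)
    return -sum((-1) ** (m - size) * entropy([columns[i] for i in idx])
                for size in range(1, m + 1)
                for idx in combinations(range(m), size))

# 2-input XOR (3-dimensional case): every input combination listed exactly once
x1, x2 = np.array(list(product([0, 1], repeat=2))).T
y = x1 ^ x2
print(interaction_information([x1, y]))          # I(X1, Y)      = 0
print(interaction_information([x1, x2, y]))      # II(X1, X2, Y) = log 2 ~ 0.693

# 3-input XOR (4-dimensional case)
z1, z2, z3 = np.array(list(product([0, 1], repeat=3))).T
w = z1 ^ z2 ^ z3
print(interaction_information([z1, z2, z3, w]))  # II(X1, X2, X3, Y) = log 2
```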

Some properties of 4-way interaction information which have not been discussed in the literature are given below. For the sake of clarity we assume that all variables are discrete and let \(p_{ijkl}=P(X_1=x_i,X_2=x_j,X_3=x_k,Y=y_l)\), where P denotes the distribution of \((X_1,X_2,X_3,Y)\). Moreover, KL(P||Q) stands for the Kullback-Leibler divergence between P and Q, defined as \(KL(P||Q)=\sum _{i,j,k,l}p_{ijkl}\log (p_{ijkl}/q_{ijkl})\).

Theorem 1

We have (i) \(II(X_1,X_2,X_3,Y)=KL(P||P_K)\), where \(P_K\) corresponds to the mass function \(p^K\) defined as

$$\begin{aligned} p^K_{ijkl}=\frac{\prod _{S:|S|=3} p_S\prod _{S:|S|=1}p_S}{\prod _{S:|S|=2} p_S}= \frac{p_{ijk}p_{ijl}p_{jkl}p_{ikl}p_ip_jp_kp_l}{p_{ij}p_{ik}p_{il}p_{jk}p_{jl}p_{kl}}. \end{aligned}$$
(10)

(ii) If \(X_1\perp X_2|W\), where W is any subset (including \(\emptyset \)) of \(\{X_3,Y\}\), then \(II(X_1,X_2,X_3,Y)=0\).

(iii) Let \(\eta =\sum _{i,j,k,l} p^K_{ijkl}\). If \(\eta \le 1\) and \(II(X_1,X_2,X_3,Y)=0\) then \(P=P_K\).

Proof

(i) follows from (5) and the definition of the Kullback-Leibler divergence. (ii) is a consequence of (10) and the assumptions. In order to prove (iii), note that \(KL(P||Q)=0\) implies \(P=Q\) not only when Q is a probability distribution but also when the total mass of Q does not exceed 1. This yields the result when applied to \(Q=P_K\).

Observe that \(P_K\) is not necessarily a probability distribution. The condition \(\eta \le 1\) is a sufficient condition ensuring that \(P=P_K\) when \(II=0\). \(P_K\) is a generalization of the Kirkwood approximation [17] to the four-dimensional case.
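Statement (i) of Theorem 1 can also be confirmed numerically: for a randomly drawn joint distribution, computing \(II(X_1,X_2,X_3,Y)\) from (5) and the generalized Kullback-Leibler divergence from P to the Kirkwood-type approximation \(P_K\) of (10) gives the same number. The sketch below is a minimal check of ours; the marginalization helpers are hypothetical names.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
p = rng.random((2, 3, 2, 2))
p /= p.sum()                                   # random joint pmf of (X1, X2, X3, Y)

def H(keep):
    """Entropy of the marginal over the kept axes."""
    drop = tuple(a for a in range(4) if a not in keep)
    q = p.sum(axis=drop)
    return -np.sum(q * np.log(q))

def marg(keep):
    """Marginal pmf over the kept axes, broadcastable against the full array."""
    drop = tuple(a for a in range(4) if a not in keep)
    return p.sum(axis=drop, keepdims=True)

# II(X1, X2, X3, Y) from the inclusion-exclusion formula (5)
ii = -sum((-1) ** (4 - s) * H(idx)
          for s in range(1, 5) for idx in combinations(range(4), s))

# Kirkwood-type approximation P_K from (10): product of 3- and 1-dimensional
# marginals divided by the product of 2-dimensional marginals
num = np.ones_like(p)
for keep in list(combinations(range(4), 3)) + list(combinations(range(4), 1)):
    num = num * marg(keep)
den = np.ones_like(p)
for keep in combinations(range(4), 2):
    den = den * marg(keep)
p_K = num / den

kl = np.sum(p * np.log(p / p_K))               # generalized KL; P_K need not sum to 1
print(ii, kl)                                  # the two values coincide (Theorem 1 (i))
```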

3 Problem Formulation and Previous Work

In this work we focus on feature selection based on mutual information (MI). MI-based feature selection is concerned with identifying a fixed-size subset \(S\subset \{1,\ldots ,p\}\) of the original feature set that maximizes the joint mutual information between S and the class variable Y. Finding an optimal feature set is usually infeasible because the search space grows exponentially with the number of features. As a result, various greedy algorithms have been developed, including forward selection, backward elimination and genetic algorithms. Today sequential forward selection is the most commonly adopted solution. Forward selection algorithms start from an empty set of features and add, in each step, the feature that jointly, i.e. together with the already selected features, achieves the maximum joint mutual information with the class. Formally, assume that S is the set of already chosen features, \(S^c\) is its complement and \(X_k\in S^c\) is a candidate feature. The score for feature \(X_k\) is

$$\begin{aligned} J(X_k)= I(S\cup X_k,Y)-I(S,Y). \end{aligned}$$
(11)

Obviously the second term in (11) does not depend on \(X_k\) and can be omitted; however, it is more convenient to use this form. In each step we add the feature that maximizes \(J(X_k)\). Criterion (11) is equivalent to

$$\begin{aligned} J(X_k)= I(X_k,Y|S), \end{aligned}$$
(12)

see [18] for the proof. We also refer to [19], who proposed a fast feature selection method based on conditional mutual information and a min-max approach. Observe that (12) indicates that we select the feature that achieves the maximum association with the class given the already chosen features. Criterion (11) (or equivalently (12)) is appealing and has attracted significant attention. However, in practice the estimation of joint mutual information is problematic even for a small set S. This makes a direct application of (11) infeasible. A rich body of work in the MI-based feature selection literature approaches this difficulty by approximating the high-dimensional joint MI with low-dimensional MI terms. These approximations may be accurate provided some additional conditions on the data distribution are satisfied. A comprehensive review of the existing methods can be found in [18]; here we review some representative methods. One of the most popular methods is Mutual Information Feature Selection (MIFS), proposed in [20]

$$\begin{aligned} J_{\text {MIFS}}(X_k)=I(X_k,Y)-\sum _{j\in S}I(X_j,X_k). \end{aligned}$$
(13)

This criterion includes the \(I(X_k,Y)\) term to ensure feature relevance, but introduces a penalty to enforce low correlations with the features already selected in S. A similar idea is used in the Minimum-Redundancy Maximum-Relevance (MRMR) criterion [21]

$$\begin{aligned} J_{\text {MRMR}}(X_k)=I(X_k,Y)-\frac{1}{|S|}\sum _{j\in S}I(X_j,X_k). \end{aligned}$$
(14)

with the difference that the second term is averaged over the features in S. Both the MIFS and MRMR criteria focus on reducing redundancy; however, they do not take interactions between features into account. Brown et al. [18] have shown that if the selected features in S are independent and class-conditionally independent given any unselected feature \(X_k\), then (11) reduces to the so-called CIFE criterion [6]

$$\begin{aligned} J_{\text {CIFE}}(X_k)=I(X_k,Y)+\sum _{j\in S}[I(X_j,X_k|Y)-I(X_j,X_k)]. \end{aligned}$$
(15)

In view of (8), the second term in (15) is equal to \(\sum _{j\in S}II(X_j,X_k,Y)\), so CIFE is able to detect 3-way interactions. Yang and Moody [22] have proposed using the Joint Mutual Information (JMI) criterion

$$\begin{aligned} J_{\text {JMI}}(X_k)=\sum _{j\in S}I((X_j,X_k),Y), \end{aligned}$$
(16)

which is equal up to a constant to

$$\begin{aligned} J_{\text {JMI}}(X_k)=|S|I(X_k,Y)+\sum _{j\in S}[I(X_j,X_k|Y)-I(X_j,X_k)]. \end{aligned}$$
(17)

JMI is similar to CIFE, with the difference that in JMI the marginal relevance term plays a more important role than the overall interaction term.
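To make the criteria of this section concrete, the sketch below casts them as score functions inside a greedy forward-selection loop, using plug-in estimates for discrete features. This is a minimal illustration rather than the implementation used in the experiments; all function names are ours, the JMI score is written in the form (17) (which differs from (16) only by a constant), and the first feature is chosen by maximal marginal relevance, as is customary.

```python
import numpy as np
from collections import Counter

def entropy(cols):
    joint = list(zip(*cols))
    p = np.array(list(Counter(joint).values()), dtype=float) / len(joint)
    return -np.sum(p * np.log(p))

def mi(a, b):            # I(A, B)
    return entropy([a]) + entropy([b]) - entropy([a, b])

def cmi(a, b, c):        # I(A, B | C)
    return entropy([a, c]) + entropy([b, c]) - entropy([a, b, c]) - entropy([c])

def j_mifs(X, y, k, S):  # criterion (13)
    return mi(X[k], y) - sum(mi(X[j], X[k]) for j in S)

def j_mrmr(X, y, k, S):  # criterion (14); S is assumed non-empty
    return mi(X[k], y) - sum(mi(X[j], X[k]) for j in S) / len(S)

def j_cife(X, y, k, S):  # criterion (15)
    return mi(X[k], y) + sum(cmi(X[j], X[k], y) - mi(X[j], X[k]) for j in S)

def j_jmi(X, y, k, S):   # criterion (17)
    return len(S) * mi(X[k], y) + sum(cmi(X[j], X[k], y) - mi(X[j], X[k]) for j in S)

def forward_selection(X, y, score, n_select):
    """Greedy forward selection over a list X of discrete feature columns."""
    candidates = list(range(len(X)))
    first = max(candidates, key=lambda k: mi(X[k], y))   # start with the most relevant feature
    S = [first]
    candidates.remove(first)
    while len(S) < n_select:
        best = max(candidates, key=lambda k: score(X, y, k, S))
        S.append(best)
        candidates.remove(best)
    return S
```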

4 Feature Selection Based on Interaction Information

In this section we describe the proposed approach, which can be seen as a generalization of CIFE. Our method considers not only 3-way interactions but also 4-way interactions.

4.1 Proposed Criterion: IIFS

Our method makes use of the Möbius representation. Recall that S is the set of already selected features, of size m, and \(X_k\) is a candidate feature. First observe that it follows from the Möbius representation (7) that

$$\begin{aligned} J(X_k)=I(S\cup X_k,Y)-I(S,Y)=\sum _{r=0}^{m}\sum _{T\subseteq S: |T|=r}II(T\cup X_k\cup Y). \end{aligned}$$
(18)

In the proposed method IIFS (Interaction Information Feature Selection) we define the score

$$\begin{aligned} J_{\text {IIFS}}(X_k)=I(X_k,Y)+\sum _{j \in S}II(X_j,X_k,Y)+\sum _{i,j\in S: i<j}II(X_i,X_j,X_k,Y), \end{aligned}$$
(19)

which is a third-order approximation of (18). The first term in (19) accounts for the marginal relevance of the candidate feature, whereas the second and third terms describe the 3-way and 4-way interactions, respectively. Note that IIFS can be seen as an extended version of CIFE, which is a second-order approximation of \(J(X_k)\), namely

$$\begin{aligned} J_{\text {IIFS}}(X_k)=J_{\text {CIFE}}(X_k)+\sum _{i,j\in S: i<j}II(X_i,X_j,X_k,Y). \end{aligned}$$
(20)

It is possible to consider higher-order terms in (18); however, this would increase the computational cost and make the estimation even more difficult. Below we state some properties of the introduced criteria.
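A minimal sketch of the IIFS score (19) is given below; it reuses the inclusion-exclusion formula (5) for the 3-way and 4-way terms and can be plugged into the forward-selection loop sketched in Sect. 3. The code is illustrative (hypothetical names, not the implementation used in Sect. 5).

```python
import numpy as np
from itertools import combinations
from collections import Counter

def entropy(cols):
    joint = list(zip(*cols))
    p = np.array(list(Counter(joint).values()), dtype=float) / len(joint)
    return -np.sum(p * np.log(p))

def interaction_information(cols):
    """m-way interaction information, formula (5)."""
    m = len(cols)
    return -sum((-1) ** (m - size) * entropy([cols[i] for i in idx])
                for size in range(1, m + 1)
                for idx in combinations(range(m), size))

def j_iifs(X, y, k, S):
    """IIFS score (19): marginal relevance plus 3-way and 4-way interaction terms."""
    score = interaction_information([X[k], y])                          # I(X_k, Y)
    score += sum(interaction_information([X[j], X[k], y]) for j in S)   # 3-way terms
    score += sum(interaction_information([X[i], X[j], X[k], y])         # 4-way terms
                 for i, j in combinations(S, 2))
    return score
```

Compared with CIFE, the only additional cost is the last sum over pairs of already selected features, which grows quadratically with |S|.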

Theorem 2

The following properties hold.

(i) Assume that \(X_k \perp Y\). Then

$$\begin{aligned} J_{\text {CIFE}}(X_k)=\sum _{j\in S} I(X_k,Y|X_j). \end{aligned}$$
(21)

(ii) Assume that \(X_k \perp Y\) and \(X_k\perp Y|X_j\) for any \(X_j\in S\). Then

$$\begin{aligned} J_{\text {IIFS}}(X_k)=\sum _{i,j\in S: i<j} I(X_k,Y|X_i,X_j). \end{aligned}$$
(22)

(iii) Assume that \(X_i\perp X_j|X_k\) and \(X_i\perp X_j|X_k,Y\), for some \(X_i,X_j\in S\). Then \(II(X_i,X_j,X_k,Y)\) does not depend on \(X_k\).

(iv) If \(|S|=2\) then \(\mathrm{argmax}_{X_k\in S^c}J_{\text {IIFS}}(X_k)=\mathrm{argmax}_{X_k\in S^c}J(X_k)\).

Proof

To prove (i) observe that property (6) implies

$$\begin{aligned} II(X_j,X_k,Y)=I(X_k,Y|X_j)-I(X_k,Y). \end{aligned}$$
(23)

Under the assumption \(X_k \perp Y\) we have \(I(X_k,Y)=0\), which, together with (23) and (15), yields (21). Let us now prove (ii). It follows from (6) that

$$\begin{aligned} II(X_i,X_j,X_k,Y)=II(X_j,X_k,Y|X_i)-II(X_j,X_k,Y) \end{aligned}$$
(24)

and

$$\begin{aligned} II(X_j,X_k,Y|X_i)=I(X_k,Y|X_j,X_i)-I(X_k,Y|X_i). \end{aligned}$$
(25)

Under the assumptions of (ii) we have \(I(X_k,Y)=0\), \(II(X_j,X_k,Y)=0\) and \(I(X_k,Y|X_i)=0\), and thus \(II(X_i,X_j,X_k,Y)=I(X_k,Y|X_j,X_i)\), which yields (22). Let us now prove (iii). Using (6) we can write

$$\begin{aligned}&II(X_i,X_j,X_k,Y)=II(X_i,X_j,Y|X_k)-II(X_i,X_j,Y) \\&=I(X_i,X_j|X_k,Y)-I(X_i,X_j|X_k)-II(X_i,X_j,Y). \end{aligned}$$
(26)

The assumptions of (iii) imply that \(I(X_i,X_j|X_k,Y)=I(X_i,X_j|X_k)=0\), which yields the assertion in view of (26). Finally, note that (iv) follows from the fact that for \(|S|=2\) Eqs. (18) and (19) are equivalent, i.e. the Möbius representation gives the exact value of \(J(X_k)\).

Let us briefly comment on the above statements. Items (i) and (ii) of Theorem 2 indicate that under additional assumptions CIFE and IIFS reduce to simpler and more intuitive forms. Using the forms given in (i) and (ii), one may easily give an example showing the advantage of IIFS over CIFE. Indeed, under assumption (ii) we have \(J_{\text {CIFE}}(X_k)=0\), and we may conclude that \(J_{\text {IIFS}}(X_k)>0\) if there exists a pair \(X_i,X_j\in S\) such that \(I(X_k,Y|X_i,X_j)>0\). In this case IIFS recognizes \(X_k\) as relevant, whereas CIFE treats it as a spurious feature. In addition, [18] has shown that if the assumptions of (iii) hold for any \(k\in S^c\), maximization of \(J_{\text {CIFE}}(X_k)\) is equivalent to maximization of \(J(X_k)\). Item (iii) confirms that indeed in this case the 4-way interaction term can be omitted.

5 Experiments

The aim of the experiments is to compare the performance of the proposed method IIFS with other popular methods discussed in Sect. 3: MIFS, MRMR, JMI and CIFE.

5.1 Artificial Data

The main advantage of experiments on artificial data is that we can directly investigate which method is able to detect particular types of interactions. We consider two simulation models, containing 3-way and 4-way interactions, respectively. To make the task more challenging, we assume in both cases that the features are continuous. To assess the quality of the methods we introduce the following measure. Let t be the set of relevant features influencing Y and \(j_1,j_2,\ldots , j_p\) be the features sequentially selected by a given method. The selection rate (SR) is defined as

$$\begin{aligned} SR = \frac{|\{j_1,\ldots ,j_{|t|}\}\cap t|}{|t|}, \end{aligned}$$
(27)

i.e. SR is the fraction of relevant features among the first |t| selected ones. For example, if we have two relevant features \(X_1,X_2\) then \(t=\{1,2\}\). When the method produces the list \(\{1,2,5,\ldots \}\) then \(SR=1\). On the other hand, if the method gives \(\{1,5,2,\ldots \}\) then \(SR=0.5\), as one spurious feature \(X_5\) is ranked higher than the relevant feature \(X_2\). In the following we describe the two simulation models.
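Before turning to the models, note that SR itself is a one-liner in code; the small helper below (our own naming, shown only for illustration) reproduces the two examples above.

```python
def selection_rate(selected, relevant):
    """Fraction of relevant features among the first |t| selected ones, Eq. (27)."""
    t = set(relevant)
    return len(set(selected[:len(t)]) & t) / len(t)

print(selection_rate([1, 2, 5, 7], {1, 2}))  # 1.0
print(selection_rate([1, 5, 2, 7], {1, 2}))  # 0.5
```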

Simulation Model 1 (3-Way Interaction Model). We consider 50 uniformly distributed features: \(X_1\sim U[0,3]\), \(X_j\sim U[0,2]\), for \(j=2,\ldots ,50\). Only the first two features \(X_1\) and \(X_2\) are relevant, i.e. the class variable Y depends only on \(X_1\) and \(X_2\); the remaining features are spurious. Table 1 shows the joint distribution of \(X_1,X_2,Y\). This model is an extension of the 2-dimensional XOR; note that \(Y=1\) when \(X_1\in A, X_2\in B\) or \(X_1\in B, X_2\in A\). It is easy to verify that for this model we have \(I(X_1,Y)>0\), \(I(X_j,Y)=0\) for \(j=2,\ldots ,50\), and \(II(X_1,X_2,Y)>0\); thus there is one main effect corresponding to \(X_1\) and one 3-way interaction.

Simulation Model 2 (4-Way Interaction Model). We consider 50 uniformly distributed features: \(X_1, X_2\sim U[0,3]\), \(X_j\sim U[0,2]\), for \(j=3,\ldots ,50\). The class variable Y depends on \(X_1,X_2,X_3\), whereas the remaining features are spurious. Table 2 shows the joint distribution of \(X_1,X_2,X_3,Y\). This model is an extension of the 3-dimensional XOR. It is easy to verify that for this model we have \(I(X_1,Y), I(X_2,Y)>0\), \(I(X_j,Y)=0\) for \(j=3,\ldots ,50\), and \(II(X_1,X_2,X_3,Y)>0\); thus there are two main effects corresponding to \(X_1\) and \(X_2\) and, moreover, one 4-way interaction.

Table 1. Simulation model 1 (3-way interaction model). Notation: \(A=[0,1]\), \(B=(1,2]\), \(C=(2,3]\) and constant p equals 1/6.
Table 2. Simulation model 2 (4-way interaction model). Notation: \(A=[0,1]\), \(B=(1,2]\), \(C=(2,3]\) and constant p equals 1/16.
Table 3. Computational times.

Figure 1 shows how the selection rate (SR) depends on the sample size n. In the case of model 1, the methods which take 3-way interactions into account (JMI, CIFE, IIFS) produce the same rankings. They successfully detect both relevant features \(X_1\) and \(X_2\). MIFS and MRMR are able to detect only one relevant feature. In the case of model 2, MIFS, MRMR, JMI and CIFE detect only the two relevant features \(X_1, X_2\) but fail to select feature \(X_3\); their selection rate converges to 2/3. As expected, only IIFS chooses all 3 relevant features, which results in \(SR=1\) for sufficiently large sample sizes. The above experiment shows that there is no significant difference between IIFS, JMI and CIFE when only 3-way interactions occur. In the case of the 4-way interaction model, IIFS is significantly superior to the other methods. Moreover, we analyse how the method of entropy estimation influences the results. We used two methods: the standard plug-in method based on data discretization with b bins (solid line) and the knn-based Kozachenko-Leonenko estimator [7] with \(k=10\) (dashed line). For a small number of bins (\(b=2\)) the knn-based method is superior to the plug-in method. For \(b=5\), the plug-in method works better than the knn-based method in the case of model 1, whereas the knn-based method is the winner for model 2.
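For completeness, we sketch the two entropy estimators in code. The plug-in estimator discretizes each coordinate into b equal-width bins and applies the empirical-frequency formula, while the Kozachenko-Leonenko estimator uses k-nearest-neighbour distances; we write it in its standard form \(\hat{H}=\psi (N)-\psi (k)+\log V_d+\frac{d}{N}\sum _{i}\log \rho _{k,i}\), where \(\rho _{k,i}\) is the distance of the i-th point to its k-th neighbour and \(V_d\) is the volume of the d-dimensional unit ball. The code is a hedged, minimal illustration, not the implementation used for the reported results.

```python
import numpy as np
from collections import Counter
from scipy.special import digamma, gammaln
from scipy.spatial import cKDTree

def entropy_plugin(X, b=5):
    """Plug-in entropy of an (n, d) sample after equal-width discretization into b bins."""
    X = np.asarray(X, dtype=float)
    binned = [np.digitize(col, np.linspace(col.min(), col.max(), b + 1)[1:-1]) for col in X.T]
    joint = list(zip(*binned))
    p = np.array(list(Counter(joint).values()), dtype=float) / len(joint)
    return -np.sum(p * np.log(p))

def entropy_kl(X, k=10):
    """Kozachenko-Leonenko knn estimator of differential entropy for an (n, d) sample."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    dist, _ = cKDTree(X).query(X, k + 1)      # k + 1 because the nearest point is the point itself
    rho = dist[:, -1]                         # distance to the k-th genuine neighbour
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)   # log volume of the unit d-ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(rho))

rng = np.random.default_rng(0)
sample = rng.standard_normal((2000, 2))
print(entropy_plugin(sample, b=5))            # entropy of the binned sample
print(entropy_kl(sample, k=10))               # estimates the differential entropy
print(np.log(2 * np.pi * np.e))               # differential entropy of N(0, I_2), ~ 2.84
```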

Fig. 1. Selection rate w.r.t. sample size n for simulation models 1 (a)–(b) and 2 (c)–(d). Parameter b corresponds to the number of bins in discretization; ‘knn’ in brackets corresponds to knn-based entropy estimation.

Fig. 2. Validation error curves for MADELON (a), GISETTE (b), MUSK (c) and BREAST (d) datasets.

5.2 Benchmark Data

For a more thorough assessment of the developed criterion we used datasets from the NIPS Feature Selection Challenge [23] (MADELON and GISETTE) and the UCI repository [24] (BREAST and MUSK). The NIPS datasets consist of training sets (2000 observations for MADELON and 6000 for GISETTE) and validation sets (600 observations for MADELON and 1000 for GISETTE), whereas for the UCI datasets we used 10-fold cross-validation to calculate error rates. We carried out the same experiment as that described in [18, Sect. 6.1]. In addition to the methods considered in [18], we investigate the performance of the proposed method IIFS. Each criterion was used to generate a ranking of the features; the top-ranked features were then used to classify the validation data. As in [18], we used the kNN method with \(k=3\) neighbours as a classifier. As an evaluation measure we considered the Balanced Error Rate, defined as

$$\begin{aligned} BER = 1 - 0.5 \cdot (\frac{TP}{TP+FN} + \frac{TN}{TN+FP}), \end{aligned}$$
(28)

where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively. The results of our experiments are presented in Fig. 2. We only present curves corresponding to the plug-in estimator, as the knn-based entropy estimator worked much worse in this case, possibly due to prior discretization of the original data. For the MADELON and MUSK datasets there is no significant improvement of IIFS over CIFE and JMI, so we may conclude that considering interactions of order higher than 3 does not improve the performance in these cases. Note that for MADELON interactions play an important role; the methods which do not take interactions into account at all (MIFS and MRMR) fail. For the GISETTE dataset the proposed criterion IIFS has the lowest error rate when the number of features varies between 20 and 100. For BREAST, IIFS is also the winner. This suggests that taking high-order interactions into account helps in these cases. Interestingly, for GISETTE and BREAST, IIFS is significantly better than CIFE, which further indicates that including the 4-way interaction term improves the performance. The computational times for IIFS are longer than for its competitors (see Table 3), which is the price for taking high-order interactions into account. Note, however, that the times for IIFS, although longer than for CIFE, are of the same order.
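For reference, the BER defined in (28) can be computed from the confusion matrix in a few lines; the helper below is a small illustration of ours.

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Balanced Error Rate for a binary problem, Eq. (28)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 1.0 - 0.5 * (tp / (tp + fn) + tn / (tn + fp))

print(balanced_error_rate([1, 1, 0, 0], [1, 0, 0, 1]))  # 0.5
```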

6 Conclusions

In this paper we presented a novel feature selection method, named IIFS. The feature selection score in IIFS, based on interaction information, is derived from the so-called Möbius representation of the joint mutual information. Our method is an extension of the CIFE criterion which additionally takes 4-way interaction terms into account. We discussed theoretical properties of 4-way interaction information (Theorem 1) as well as of the feature selection criteria CIFE and IIFS (Theorem 2). The numerical experiments on artificial datasets show that there is no significant difference between IIFS, JMI and CIFE when only interactions of order up to 3 are present. This means that estimating the absent 4-way interactions does not cause a significant deterioration of IIFS performance. When 4-way interactions do occur, IIFS is significantly superior to the other methods. Future work will include the development of methods considering higher-order interactions as well as the comparison of IIFS with such methods, for example with the novel method proposed in [25].