
1 Introduction

Conditional Independence (CI) testing is at the core of causal discovery (Sect. 1.1), but particularly challenging in many real-world scenarios (Sect. 1.2). Therefore, we propose a data-adaptive CI test for mixed discrete-continuous data (Sect. 1.3).

1.1 Conditional Independence in Causal Discovery

Causal discovery has received widespread attention as the knowledge of underlying causal structures improves decision support within many real-world scenarios [17, 46]. For example, in discrete manufacturing, causal discovery is the key to root cause analysis of failures and quality deviations, cf. [25].

Causal structures between a finite set of random variables \(\textbf{V}=\{X, Y, \dots \}\) are encoded in a Causal Graphical Model (CGM) consisting of a Directed Acyclic Graph (DAG) \(\mathcal {G}\) and the joint distribution over the variables \(\textbf{V}\), denoted by \(P_{\textbf{V}}\), cf. [38, 46]. In \(\mathcal {G}\), a directed edge \(X \rightarrow Y\) depicts a direct causal mechanism between the two respective variables X and Y, for \(X,Y \in \textbf{V}\). Causal discovery aims to recover as much of the underlying causal structure of \(\mathcal {G}\) from observational data as possible, building upon the correspondence between the causal structures of \(\mathcal {G}\) and the CI characteristics of \(P_{\textbf{V}}\) [46]. To this end, constraint-based methods, such as the well-known PC algorithm, apply CI tests to recover the causal structures, cf. [8]. For instance, if a CI test states the conditional independence of variables X and Y given a (possibly empty) set of variables \(Z \subseteq \textbf{V} \setminus \{X,Y\}\), denoted by \(X \! \mathrel {\perp \!\!\!\perp }\!Y \, \vert \, Z\), then there is no edge between X and Y. Constraint-based methods are flexible and exist in various extensions, e.g., to allow for latent variables or cycles [42, 46, 47], and are also used for causal feature selection [50]. Hence, they are popular in practice [33].

1.2 Challenges in Practice

In principle, constraint-based methods do not make any assumption on the functional form of causal mechanisms or the parameters of the joint distribution. However, they require access to a CI oracle that captures all CI characteristics, such that selecting an appropriate CI test is fundamental and challenging [17, 33]. In practice, the true statistical properties are mostly unknown, such that inadequate assumptions, e.g., of parametric CI tests, yield incorrectly learned causal structures [46]. For example, the well-known partial Pearson’s correlation-based CI test via Fisher’s Z transformation assumes that \(P_\textbf{V}\) is multivariate Gaussian [3, 27]. Hence, the underlying causal mechanisms are assumed to be linear, and conditional independence cannot be detected if the mechanisms are non-linear. Further, the omnipresence of mixed discrete-continuous data, e.g., continuous quality measurements and discrete failure messages in discrete manufacturing [20], impedes the selection of appropriate CI tests in real-world scenarios [19, 33]. In this case, parametric models that allow for mixed discrete-continuous data usually impose further restrictions, such as conditional Gaussian models assuming that discrete variables have discrete parents only [40]. Hence, for simplification in practice, continuous variables are often discretized to use standard CI tests such as Pearson’s \(\chi ^2\) test for discrete data, cf. [20, 23, 35], to the detriment of the accuracy of the learned causal structures [12, 40].

1.3 Contribution and Structure

In this work, we propose mCMIkNN, a data-adaptive CI test for mixed discrete-continuous data, and its application to causal discovery. Our contributions are:

  • We propose a kNN-based local conditional permutation scheme to derive a non-parametric CI test using a kNN-based CMI estimator as a test statistic.

  • We provide theoretical results on the CI test’s validity and power. In particular, we prove that mCMIkNN is able to control type I and type II errors.

  • We show that mCMIkNN allows for consistent estimation of causal structures when used in constraint-based causal discovery.

  • An extensive evaluation on synthetic and real-world data shows that mCMIkNN outperforms state-of-the-art competitors, particularly for low sample sizes.

The remainder of this paper is structured as follows. In Sect. 2, we examine the problem of CI testing and related work. In Sect. 3, we provide background on kNN-based CMI estimation. In Sect. 4, we introduce mCMIkNN and prove theoretical results. In Sect. 5, we empirically evaluate the accuracy of our CI test mCMIkNN compared to state-of-the-art approaches. In Sect. 6, we conclude our work.

2 Conditional Independence Testing Problem

In this section, we provide a formalization of the CI testing problem (Sect. 2.1) together with existing fundamental limits of CI testing (Sect. 2.2) before considering related work on CI testing for mixed discrete-continuous data (Sect. 2.3).

2.1 Problem Description

Let \((\mathcal {X}\times \mathcal {Y}\times \mathcal {Z},\mathcal {B},P_{XYZ})\) be a probability space defined on the metric space \(\mathcal {X}\times \mathcal {Y}\times \mathcal {Z}\) with dimensionality \(d_X + d_Y + d_Z\), equipped with the Borel \(\sigma \)-algebra \(\mathcal {B}\), and a regular joint probability measure \(P_{XYZ}\). Hence, we assume that the \(d_X\), \(d_Y\), and \(d_Z\)-dimensional random variables X, Y, and Z take values in \(\mathcal {X}\), \(\mathcal {Y}\), and \(\mathcal {Z}\) according to the marginal mixed discrete-continuous probability distributions \(P_X\), \(P_Y\), and \(P_Z\). That is, individual variables in X, Y, or Z may follow a discrete, a continuous, or a mixture distribution.

We consider the problem of testing the CI of two random vectors X and Y given a (possibly empty) random vector Z sampled according to the mixed discrete-continuous probability distribution \(P_{XYZ}\), i.e., testing the null hypothesis of CI \(H_0: X \! \mathrel {\perp \!\!\!\perp }\!Y \, \vert \, Z\) against the alternative hypothesis of dependence \(H_1: X \! \mathrel {\not \!\perp \!\!\!\perp }\!Y \, \vert \, Z\). To this end, let \((x_i,y_i,z_i)_{i=1}^n\) be n i.i.d. observations sampled from \(P_{XYZ}\), from which we aim to derive a CI test \(\Phi _n : \mathcal {X}^n \times \mathcal {Y}^n \times \mathcal {Z}^n \times [0,1] \rightarrow \{0,1\}\) that rejects \(H_0\) if \(\Phi _n=1\), given a nominal level \(\alpha \in [0,1]\).

2.2 Fundamental Limits of CI Testing

The general problem of CI testing is extensively studied, as it is a fundamental concept beyond its application in constraint-based causal discovery [11]. In this context, it is necessary to note that Shah and Peters [45] provided a no-free-lunch theorem for CI testing: given a continuously distributed conditioning set Z, it is impossible to derive a CI test that controls the type I error, for instance via a permutation scheme, and has non-trivial power without additional restrictions. However, under the restriction that the conditional distribution \(P_{X\vert Z}\) is known or can be approximated sufficiently well, conditional permutation (CP) tests can calibrate a test statistic guaranteeing a controlled type I error [4]. Further, the recent work of Kim et al. [28] shows that the difficulty of CI testing is more generally determined by the probability of observing collisions in Z.

2.3 Related Work

We consider the problem of CI testing and its application in causal discovery. In this context, constraint-based methods require CI tests that (R1) yield accurate CI decisions, and (R2) are computationally feasible as they are applied hundreds of times. Generally, CI testing for mixed discrete-continuous data can be categorized into discretization-based, parametric, and non-parametric approaches.

Discretization-Based Approaches: As CI tests for discrete variables are well-studied, continuous variables are often discretized, cf. [23, 35]. In this context, commonly used CI tests for discrete data are Pearson’s \(\chi ^2\) and likelihood ratio tests [13, 39, 46]. Although discretization simplifies the testing problem, the resulting information loss yields a decreased accuracy [12, 40], cf. (R1).

Parametric CI Testing: Postulating an underlying parametric functional model allows for a regression-based characterization of CI that can be used to construct valid CI tests. Examples are well-known likelihood ratio tests, e.g., assuming conditional Gaussianity (CG) [1, 44] or using multinomial logistic regression models [48]. Another stream of research focuses on Copula models to examine CI characteristics in mixed discrete-continuous data, where variables are assumed to be induced by latent Gaussian variables such that CI can be determined by examining the correlation matrix of the latent variables model [9, 10]. As these approaches require that the postulated parametric models hold, they may yield invalid CI decisions if assumptions are inaccurate [46], cf. (R1).

Non-Parametric CI Testing: Non-parametric CI testing faces the twofold challenge to, first, derive a test statistic from observational data without parametric assumptions, and, second, derive the p-value given that the test statistic’s distribution under \(H_0\) may be unknown. For continuous data, a wide range of methods is used for non-parametric CI testing, as reviewed by Li and Fan [32]. For example, kernel-based approaches, such as KCIT [52], test for vanishing correlations within Reproducing Kernel Hilbert Spaces (RKHS). Another example is CMIknn from Runge [43], which uses a kNN-based estimator to test for a vanishing Conditional Mutual Information (CMI) in combination with a local permutation scheme. The recent emergence of non-parametric CMI estimators for mixed discrete-continuous data provides the basis for new approaches to non-parametric CI testing. For example, the construction of adaptive histograms following the minimum description length (MDL) principle allows for estimating CMI from mixed discrete-continuous data [6, 34, 36, 51]. In this case, CMI can be estimated via discrete plug-in estimators, as the data is adaptively discretized according to the histogram with minimal MDL. Hence, the estimated test statistic follows the common \(\chi ^2\) distribution, which allows for deriving the p-value via Pearson’s \(\chi ^2\) test (aHist\(\chi ^2\)), see [36]. However, MDL approaches suffer from their worst-case computational complexity and weaknesses regarding low numbers of samples, cf. (R2). Another approach to non-parametric CMI estimation builds upon kNN methods, which are well-studied for continuous data, cf. [15, 29, 30], and have recently been applied to mixed discrete-continuous data [16, 37]. As the asymptotic distribution of kNN-based estimators is unclear, it remains to show that they can be used as a test statistic for a valid CI test. In this context, it is worth noting that permutation tests yield more robust constraint-based causal discovery than asymptotic CI tests, particularly for small sample sizes [49], cf. (R1). Following this, we combine a kNN-based CMI estimator with a kNN-based local CP scheme (similar to Runge [43], which is restricted to the continuous case), and additionally provide theoretical results on the test’s validity and power.

3 Background: KNN-Based CMI Estimation

In this section, we provide information on kNN-based CMI estimation for mixed discrete-continuous data (Sect. 3.1). Further, we introduce an algorithmic description of the estimator (Sect. 3.2) and recap theoretical results (Sect. 3.3).

3.1 Introduction to CMI Estimation

A commonly used test statistic is the Conditional Mutual Information (CMI) \(I(X;Y\vert Z)\), as it provides a general measure of variables’ CI, i.e., \(I(X;Y\vert Z)=0\) if and only if \(X \! \mathrel {\perp \!\!\!\perp }\!Y \, \vert \, Z\), see [16, 18, 43]. Generally, \(I(X;Y\vert Z)\) is defined as \(I(X;Y \vert Z ) = \int \log \left( \textstyle \frac{dP_{XY\vert Z}}{d\left( P_{X\vert Z} \times P_{Y\vert Z} \right) } \right) dP_{XYZ}\), where \(\frac{dP_{XY\vert Z}}{d\left( P_{X\vert Z} \times P_{Y\vert Z} \right) }\) is the Radon-Nikodym derivative of the joint conditional measure, \(P_{XY\vert Z}\), with respect to the product of the marginal conditional measures, \(P_{X\vert Z}\! \times \! P_{Y\vert Z}\). Note that the non-singularity of \(P_{XYZ}\) ensures the existence of a product reference measure and that the Radon-Nikodym derivative is well-defined [37, Lem. 2.1, Thm. 2.2]. Although well-defined, estimating the CMI \(I(X;Y\vert Z)\) from mixed discrete-continuous data is particularly challenging [16, 36, 37]. Generally, CMI estimation can be tackled by expressing \(I(X;Y \vert Z)\) in terms of Shannon entropies, i.e., \(I(X;Y \vert Z)=H(X,Z)+H(Y,Z)-H(X,Y,Z)-H(Z)\) with Shannon entropy H(W) for the cases \(W=XYZ,XZ,YZ,Z\), respectively, cf. [18, 36, 37]. In the continuous case, the KSG technique of Kraskov et al. [30] estimates the Shannon entropy H(W) locally for every sample \((w_i)_{i=1}^n\) with \(w_i \sim P_W\), i.e., estimating H(W) via \(\widehat{H}_n(W)=-\frac{1}{n}\sum _{i=1}^n \log \widehat{f_W}(w_i)\) by considering the k-nearest neighbors within the \(\ell _\infty \)-norm for every sample \(i=1,...,n\) to locally estimate the density \(f_W\) of \(W=XYZ,XZ,YZ,Z\), respectively, cf. [18, 36, 37]. For mixed discrete-continuous data, there is a non-zero probability that the kNN distance is zero for some samples. In this case, Gao et al. [16] extended the KSG technique by fixing the radius and using a plug-in estimator that differentiates between mixed, continuous, and discrete points. Recently, Mesner and Shalizi [37] extended this idea to derive a consistent CMI estimator in the mixed discrete-continuous case.
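Concretely, this 3H decomposition follows from the chain rule of entropy:

$$\begin{aligned} I(X;Y \vert Z) = H(X\vert Z)-H(X\vert Y,Z) = \left[ H(X,Z)-H(Z)\right] -\left[ H(X,Y,Z)-H(Y,Z)\right] , \end{aligned}$$

which rearranges to the four-entropy form above.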

3.2 Algorithm for KNN-Based CMI Estimation

Algorithm 1: kNN-based CMI estimation \(\hat{I}_n(X;Y\vert Z)\) for mixed discrete-continuous data [37].

Algorithm 1 provides an algorithmic description of the theoretically examined estimator \(\hat{I}_n(X;Y \vert Z)\) developed by Mesner and Shalizi [37]. The basic idea is to take the mean of Shannon entropies estimated locally for each sample \(i=1,...,n\), considering samples \(j\ne i\), \(j=1,...,n\), that are close to i according to the \(\ell _\infty \)-norm, i.e., under consideration of the respective sample distance \(d_{i,j}(w):=\Vert w_i-w_j\Vert _{\infty }\), \(i,j=1,...,n\), of \(w=(w_i)_{i=1}^n\) for all cases \(w=xyz,xz,yz,z\) (see Algorithm 1, line 1). In this context, fixing the kNN radius \(\rho _i\) used for the local estimation of Shannon entropies yields a consistent global estimator. Therefore, for each sample \(i=1, \dots ,n\), let \(\rho _i\) be the smallest distance between \((x_i,y_i,z_i)\) and the \(k_{\scriptscriptstyle CMI}\)-nearest sample \((x_j,y_j,z_j)\), \(j\ne i, j=1,\dots ,n\), and replace \(k_{\scriptscriptstyle CMI}\) with \(\tilde{k}_i\), the number of samples whose distance to \((x_i,y_i,z_i)\) is smaller than or equal to \(\rho _i\) (see Algorithm 1, lines 3-4). For discrete or mixed discrete-continuous samples \((x_i, y_i, z_i)_{i=1}^n\), it holds that \(\rho _i=0\), and there may be more than \(k_{\scriptscriptstyle CMI}\) samples with zero distance. In this case, adapting the number of considered samples \(\tilde{k}_i\) to all samples with zero distance prevents undercounting, which, otherwise, yields a bias of the CMI estimator, see [37]. In the case of continuous samples \((x_i, y_i, z_i)_{i=1}^n\), there are exactly \(\tilde{k}_i = k_{\scriptscriptstyle CMI}\) samples within the \(k_{\scriptscriptstyle CMI}\)-nearest distance with probability 1. The next step estimates the Shannon entropies required by the 3H-principle locally for each sample i, \(i=1,\dots ,n\). Therefore, let \(n_{xz,i}, n_{yz,i}\), and \(n_{z,i}\) be the numbers of samples within the distance \(\rho _i\) in the respective subspaces XZ, YZ, and Z (see Algorithm 1, lines 5-7). Fixing the local kNN distance \(\rho _i\), using the \(\ell _\infty \)-norm, simplifies the local estimation, as most terms of the 3H-principle cancel out, i.e., \(\xi _i:= \log \widehat{f_{XYZ}}(x_i,y_i,z_i) - \log \widehat{f_{XZ}}(x_i,z_i) - \log \widehat{f_{YZ}}(y_i,z_i) + \log \widehat{f_Z}(z_i) = \psi (\tilde{k}_i)-\psi (n_{xz,i}) - \psi (n_{yz,i}) + \psi (n_{z,i})\), with digamma function \(\psi \) (see Algorithm 1, line 8) [16, 37]. Then, the global CMI estimate \(\hat{I}_n(x;y\vert z)\) is the average of the local CMI estimates \(\xi _i\) over all samples \((x_i, y_i, z_i)_{i=1}^n\), and the positive part is returned, as CMI and MI are non-negative (see Algorithm 1, lines 10-11).
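To make these steps concrete, the following minimal Python sketch mirrors Algorithm 1 using k-d trees; it assumes 2-D sample arrays of shape (n, d), and all names (e.g., cmi_knn, k_cmi) are ours for exposition, not those of the reference implementation.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def cmi_knn(x, y, z, k_cmi=25):
    """Sketch of Algorithm 1: kNN-based CMI estimate of I(X;Y|Z)
    for mixed discrete-continuous samples, following [37]."""
    n = x.shape[0]
    xyz, xz, yz = np.hstack([x, y, z]), np.hstack([x, z]), np.hstack([y, z])
    t_xyz, t_xz, t_yz, t_z = cKDTree(xyz), cKDTree(xz), cKDTree(yz), cKDTree(z)
    # rho_i: l_inf-distance to the k_cmi-nearest neighbour (query k+1 points,
    # since the query point itself is returned at distance 0)
    dists, _ = t_xyz.query(xyz, k=k_cmi + 1, p=np.inf)
    rho = dists[:, -1]
    xi = np.empty(n)
    for i in range(n):
        # k_tilde: all samples j != i within rho_i; at discrete/mixed points
        # rho_i may be 0 and k_tilde > k_cmi, which prevents undercounting
        k_tilde = len(t_xyz.query_ball_point(xyz[i], rho[i], p=np.inf)) - 1
        n_xz = len(t_xz.query_ball_point(xz[i], rho[i], p=np.inf)) - 1
        n_yz = len(t_yz.query_ball_point(yz[i], rho[i], p=np.inf)) - 1
        n_z = len(t_z.query_ball_point(z[i], rho[i], p=np.inf)) - 1
        # local CMI estimate; the KSG volume and psi(n) terms cancel out
        xi[i] = digamma(k_tilde) - digamma(n_xz) - digamma(n_yz) + digamma(n_z)
    return max(np.mean(xi), 0.0)  # CMI is non-negative
```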

3.3 Properties of KNN-Based CMI Estimation

We recap the theoretical results for \(\hat{I}_n(X;Y\vert Z)\) proved by Mesner and Shalizi [37]. Under mild assumptions, \(\hat{I}_n(x;y\vert z)\) is asymptotically unbiased, see [37, Thm. 3.1].

Corollary 1 (Asymptotic-Unbiasedness of \(\hat{I}_n(x;y\vert z)\) [37, Thm. 3.1])

Let \((x_i,y_i,z_i)_{i=1}^n\) be i.i.d. samples from \(P_{XYZ}\). Assume

  (A1) \(P_{XY\vert Z}\) is non-singular such that \(f\equiv \frac{dP_{XY\vert Z}}{d(P_{X\vert Z} \times P_{Y\vert Z})}\) is well-defined, and, for some \(C>0\), \(f(x,y,z)<C\) for all \((x,y,z)\in \mathcal {X}\times \mathcal {Y}\times \mathcal {Z}\);

  (A2) the set \(\{(x,y,z)\in \mathcal {X}\times \mathcal {Y}\times \mathcal {Z}: P_{XYZ}((x,y,z))>0\}\) is countable and nowhere dense in \(\mathcal {X}\times \mathcal {Y}\times \mathcal {Z}\);

  (A3) \(k_{\scriptscriptstyle CMI}=k_{{\scriptscriptstyle CMI},n}\rightarrow \infty \) and \(\frac{k_{{\scriptscriptstyle CMI},n}}{n} \rightarrow 0\) as \(n\rightarrow \infty \);

then \(\mathbb {E}_{P_{XYZ}}\left[ \hat{I}_n(x;y\vert z)\right] \rightarrow I(X;Y\vert Z)\) as \(n\rightarrow \infty \).

While (A1) seems rather technical, non-singularity can be verified in practice via sufficient conditions, which is helpful for data analysis. Given non-singularity, assumption (A2) is satisfied whenever \(P_{XYZ}\) is (i) (finitely) discrete, (ii) continuous, (iii) discrete in some (countably many) dimensions and continuous in others, or (iv) a mixture of the previous cases, which covers most real-world data. For more details on the assumptions, see Appendix A.

We prove that the CMI estimator \(\hat{I}_n(X;Y\vert Z)\) described in Algorithm 1 is consistent.

Corollary 2 (Consistency of \(\hat{I}_n(x;y\vert z)\) )

Let \((x_i,y_i,z_i)_{i=1}^n\) be i.i.d. samples from \(P_{XYZ}\) and assume (A1)-(A3) of Cor. 1 hold. Then, for all \(\epsilon >0\), \(\lim _{n \rightarrow \infty }\mathbb {P}_{P_{XYZ}}\left( \left| \hat{I}_n(x ;y\vert z) - I(X;Y\vert Z)\right| > \epsilon \right) = 0\).

Proof

Recall that \(\hat{I}_n(x ;y\vert z)\) has asymptotically vanishing variance [37, Thm. 3.2], i.e., \(\displaystyle \lim _{n\rightarrow \infty } {\text {Var}}(\hat{I}_n(x ;y\vert z))=0\), and is asymptotically unbiased, see Cor. 1 or [37, Thm. 3.1]. The consistency of \(\hat{I}_n(x ;y\vert z)\) follows from Chebyshev’s inequality.    \(\square \)
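Spelled out: for any \(\epsilon >0\), choose n large enough that \(\vert \mathbb {E}_{P_{XYZ}}[\hat{I}_n(x;y\vert z)] - I(X;Y\vert Z)\vert \le \epsilon /2\) (Cor. 1); then Chebyshev’s inequality gives

$$\begin{aligned} \mathbb {P}_{P_{XYZ}}\left( \left| \hat{I}_n(x;y\vert z) - I(X;Y\vert Z)\right| > \epsilon \right) \le \mathbb {P}_{P_{XYZ}}\left( \left| \hat{I}_n(x;y\vert z) - \mathbb {E}_{P_{XYZ}}[\hat{I}_n(x;y\vert z)]\right| > \tfrac{\epsilon }{2} \right) \le \frac{4 {\text {Var}}(\hat{I}_n(x;y\vert z))}{\epsilon ^2} \rightarrow 0. \end{aligned}$$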

Therefore, the kNN-based estimator described in Algorithm 1 serves as a valid test statistic for \(H_0: X \! \mathrel {\perp \!\!\!\perp }\!Y \, \vert \, Z\) vs. \(H_1: X \! \mathrel {\not \!\perp \!\!\!\perp }\!Y \, \vert \, Z\). Note that \(\hat{I}_n(x;y\vert z)\) is biased towards zero for high-dimensional data with fixed sample size, i.e., it suffers from the curse of dimensionality, see [37, Thm. 3.3].

Corollary 3 (Dimensionality-Biasedness of \(\hat{I}_n(x;y\vert z)\) [37, Thm. 3.3])

Let \((x_i,y_i,z_i)_{i=1}^n\) be i.i.d. samples from \(P_{XYZ}\) and assume (A1)-(A3) of Cor. 1 hold. If the entropy rate of Z is nonzero, i.e., \(\displaystyle \lim _{d_Z\rightarrow \infty }\frac{1}{d_Z}H(Z)\ne 0\), then, for fixed dimensions \(d_X\) and \(d_Y\), \(\mathbb {P}_{P_{XYZ}}\left( \hat{I}_n(x;y\vert z) = 0\right) \rightarrow 1\) as \(d_Z \rightarrow \infty \).

Hence, even with asymptotic consistency, one must pay attention when estimating \(\hat{I}_n(X;Y\vert Z)\) in high-dimensional settings, particularly for low sample sizes.

4 mCMIkNN: Our Approach to Non-Parametric CI Testing

In this section, we recap the concept of Conditional Permutation (CP) schemes for CI testing (Sect. 4.1). Then, we introduce our approach for kNN-based CI testing in mixed discrete-continuous data, called mCMIkNN (Sect. 4.2). We prove that mCMIkNN is able to control type I and type II errors (Sect. 4.3). Moreover, we examine mCMIkNN-based causal discovery and prove its consistency (Sect. 4.4).

4.1 Introduction to Conditional Permutation Schemes

Using permutation schemes for non-parametric independence testing between two variables X and Y has a long history in statistics, cf. [5, 22, 31]. The basic idea is to compare an appropriate test statistic for independence calculated from the original samples \((x_i,y_i)_{i=1}^n\) against the test statistics calculated \(M_{perm}\) times from samples \((x_{\pi _m(i)},y_i)_{i=1}^n\) for permutations \(\pi _m\) of \(\{1,\dots ,n\}\), \(m=1,\dots ,M_{perm}\), i.e., where the samples of X are randomly permuted such that \(H_0: X \! \mathrel {\perp \!\!\!\perp }\!Y\) holds. In the discrete case, a permutation scheme to test for CI, i.e., for \(H_0: X \! \mathrel {\perp \!\!\!\perp }\!Y \, \vert \, Z\), can be achieved by permuting X within each realization \(Z=z\), which reduces the problem to unconditional permutation testing of \(X \! \mathrel {\perp \!\!\!\perp }\!Y\) per stratum \(Z=z\). In contrast, testing for CI in continuous or mixed discrete-continuous data is more challenging [45], as simply permuting X without considering the confounding effect of Z may yield very different marginal distributions and, hence, suffers in type I error control [4, 28]. Therefore, Conditional Permutation (CP) schemes compare a test statistic estimated from the original data \((x_i,y_i,z_i)_{i=1}^n\) with test statistics estimated from samples \((x_{\pi _m(i)},y_i,z_i)_{i=1}^n\), \(m=1, ...,M_{perm}\), permuted conditionally on Z to ensure \(H_0: X \! \mathrel {\perp \!\!\!\perp }\!Y \, \vert \, Z\). Then, the \(M_{perm}+1\) samples \((x_i,y_i,z_i)_{i=1}^n\) and \((x_{\pi _m(i)},y_i,z_i)_{i=1}^n\), \(m=1, ...,M_{perm}\), are exchangeable under \(H_0\), such that the p-value can be calculated in line with common Monte Carlo procedures [4, 28]. This requires either model assumptions on \(P_{X \vert Z}\) to simulate the conditional distribution [4], or an adaptive binning strategy for Z such that permutations can be drawn for each binned realization \(Z=z\) [28] (both focusing on the continuous case). To provide a data-adaptive approach that is valid for mixed discrete-continuous data without too restrictive assumptions, cf. (R1), and computationally feasible, cf. (R2), we propose a local CP scheme leveraging ideas of kNN-based methods, cf. Sect. 3. In particular, our local CP scheme draws samples \((x_{\pi _m(i)},y_i,z_i)_{i=1}^n\) such that (I) the marginal distributions are preserved, and (II) \(x_i\) is replaced by \(x_{\pi _m(i)}\) only locally, i.e., within the \(k_{perm}\)-nearest distance \(\sigma _i\) in the space of Z. Intuitively, the idea is similar to common conditional permutation schemes in the discrete case, where entries of the variable X are permuted for each realization \(Z=z\), but with local permutations regarding the neighborhood of \(Z=z\).
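As a toy illustration of the discrete case, the following snippet permutes X within each stratum \(Z=z\); the function name and interface are hypothetical, chosen for exposition only.

```python
import numpy as np

def discrete_cp_sample(x, z, rng):
    """Permute x only within each stratum Z = z, so the permuted
    sample satisfies H0: X indep Y | Z = z by construction."""
    x_perm = x.copy()
    for val in np.unique(z):
        idx = np.flatnonzero(z == val)      # samples falling into stratum z = val
        x_perm[idx] = x[rng.permutation(idx)]
    return x_perm

# Example: x_tilde = discrete_cp_sample(x, z, np.random.default_rng(0))
```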

4.2 Algorithm for KNN-Based CI Testing

Algorithm 2 gives an algorithmic description of our kNN-based local CP scheme for non-parametric CI testing in mixed discrete-continuous data.

Algorithm 2: mCMIkNN, our kNN-based local conditional permutation scheme for CI testing.

First, the sample CMI \(\hat{I}_n:=\hat{I}_n(x;y\vert z)\) is estimated from the original samples via Algorithm 1 with parameter \(k_{\scriptscriptstyle CMI}\) (see Algorithm 2, line 1). To obtain local conditional permutations for each sample \((x_i,y_i,z_i)_{i=1}^n\), the \(k_{perm}\)-nearest neighbor distance \(\sigma _i\) w.r.t. the \(\ell _\infty \)-norm in the subspace of Z is considered. Hence, \(\tilde{\textbf{z}}_i\) is the respective set of indices \(j\ne i\), \(j=1,...,n\), of points with distance smaller than or equal to \(\sigma _i\) in the subspace of Z (see Algorithm 2, lines 3-4). According to a Monte Carlo procedure, samples are permuted \(M_{perm}\) times (see Algorithm 2, line 6). For each \(m =1, \dots , M_{perm}\), the local conditional permutation \(\pi ^i_m\), \(i=1,\dots ,n\), is a random permutation of the index set \(\tilde{\textbf{z}}_i\) such that the global permutation \(\pi _m\) of the samples’ index set \(\{1, \dots ,n\}\) is achieved by concatenating all local permutations, i.e., \(\pi _m:=\pi ^1_m \circ ... \circ \pi ^n_m\) (see Algorithm 2, lines 7-8). In the case of discrete data, \(\tilde{\textbf{z}}_i\) contains all indices of samples j with distance \(\sigma _i=0\) to \(z_i\), i.e., the permutation scheme coincides with discrete permutation tests where permutations are drawn per realization \(Z=z_i\). In the continuous case, \(\tilde{\textbf{z}}_i\) contains exactly the indices of the \(k_{perm}\)-nearest neighbors in the space of Z, and the global permutation scheme approximates \(P_{X\vert Z=z_i}\) locally within the \(k_{perm}\)-NN distance \(\sigma _i\) of \(z_i\). Then, locally conditionally permuted samples \((x_{\pi _m(i)},y_i,z_i)\) are drawn by shuffling the values of \(x_i\) according to \(\pi _m\), and the respective CMI values \(\hat{I}_{n}^{(m)}:=\hat{I}_n\left( x^{(m)}; y \vert z\right) \) are estimated using Algorithm 1 (see Algorithm 2, line 9). Hence, by construction, the \((x_{\pi _m(i)},y_i,z_i)\) are drawn under \(H_0: X\! \! \mathrel {\perp \!\!\!\perp }\!\! Y \, \vert \, Z\), such that the p-value \(p_{perm,n}\) can be calculated according to a Monte Carlo scheme comparing the samples’ CMI value \(\hat{I}_{n}\) with the \(H_0\) CMI values \(\hat{I}_{n}^{(m)}\) (see Algorithm 2, line 11).
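The following minimal Python sketch mirrors this procedure, reusing the cmi_knn() sketch from Sect. 3.2; the order in which the local permutations are composed, as well as all names, defaults, and the Monte Carlo p-value formula, are our reading of the description above rather than the reference implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def mcmi_knn_test(x, y, z, k_cmi=25, k_perm=5, m_perm=100, rng=None):
    """Sketch of Algorithm 2: local conditional permutation p-value
    for H0: X indep Y | Z, with cmi_knn() as the test statistic."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.shape[0]
    i_hat = cmi_knn(x, y, z, k_cmi)                     # line 1: sample CMI
    tree_z = cKDTree(z)
    sigma, _ = tree_z.query(z, k=k_perm + 1, p=np.inf)  # k_perm-NN distance in Z
    nbrs = [tree_z.query_ball_point(z[i], sigma[i, -1], p=np.inf)
            for i in range(n)]                          # lines 3-4: indices within sigma_i
    count = 0
    for _ in range(m_perm):                             # line 6: Monte Carlo loop
        perm = np.arange(n)
        for i in range(n):                              # lines 7-8: compose pi^i_m
            idx = np.asarray(nbrs[i])
            local = np.arange(n)
            local[idx] = rng.permutation(idx)           # permutes the neighbourhood only
            perm = local[perm]
        i_m = cmi_knn(x[perm], y, z, k_cmi)             # line 9: H0 statistic
        count += int(i_m >= i_hat)
    return (1 + count) / (1 + m_perm)                   # line 11: Monte Carlo p-value
```

Note that a p-value of this form is lower-bounded by \(1/(1+M_{perm})\), which matches the bound on the power discussed after Thm. 2 below.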

We define the CI test mCMIkNN as \(\Phi _{perm,n}:= \mathbbm {1}\{p_{perm,n} \le \alpha \}\) for the \(p_{perm,n}\) returned by Algorithm 2 and, hence, reject \(H_0: X \! \mathrel {\perp \!\!\!\perp }\!Y \, \vert \, Z\) if \(\Phi _{perm,n}=1\). The computational complexity of mCMIkNN is determined by the kNN searches in Algorithms 1 and 2, which are implemented in \(\mathcal {O}(n \log n)\) using k-d trees. For more details on assumptions, parameters, and computational complexity, see Appendix A.

4.3 Properties of mCMIkNN

The following two theorems show that mCMIkNN is valid, i.e., is able to control type I errors, and has non-trivial power, i.e., is able to control type II errors.

Theorem 1 (Validity: Type I Error Control of \(\Phi _{perm,n}\) )

Let \((x_i,y_i,z_i)_{i=1}^n\) be i.i.d. samples from \(P_{XYZ}\), and assume (A1), (A2), and

  (A4) \(k_{perm}=k_{perm,n}\rightarrow \infty \) and \(\frac{k_{perm,n}}{n}\rightarrow 0\) as \(n\rightarrow \infty \),

hold. Then \(\Phi _{perm,n}\), with the p-value estimated according to Algorithm 2, is able to control the type I error, i.e., for any desired nominal level \(\alpha \in [0,1]\), when \(H_0\) is true, then

$$\begin{aligned} \lim _{n \rightarrow \infty } \mathbb {E}_{P_{XYZ}}[\Phi _{perm,n}] \le \alpha . \end{aligned}$$
(1)

Note that this holds true independently of the test statistic \(T_n: \mathcal {X}^n \times \mathcal {Y}^n \times \mathcal {Z}^n \rightarrow \mathbb {R}\). The idea of the proof is to bound the type I error using the total variation distance between the samples’ conditional distribution \(P^n_{X \vert Z}\) and the conditional distribution \(\widetilde{P}^n_{X \vert Z}\) approximated by the local CP scheme to simulate \(H_0\), and to show that this distance vanishes for \(n\rightarrow \infty \). For a detailed proof, see Appendix B.

Theorem 2 (Power: Type II Error Control of \(\Phi _{perm,n}\) )

Let \((x_i,y_i,z_i)_{i=1}^n\) be i.i.d. samples from \(P_{XYZ}\), and assume (A1) - (A4) hold. Then \(\Phi _{perm,n}\), with the p-value estimated according to Algorithm 2, is able to control the type II error, i.e., for any desired nominal level \(\beta \!\in \! \left[ \frac{1}{1+M_{perm}},1\right] \), when \(H_1\) is true, then

$$\begin{aligned} \lim _{n \rightarrow \infty } \mathbb {E}_{P_{XYZ}}[1-\Phi _{perm,n}]=0. \end{aligned}$$
(2)

Hence, mCMIkNN’s power is naturally bounded according to \(M_{perm}\), i.e., \(1 - \beta \le 1-\frac{1}{1+M_{perm}}\). The proof follows from the asymptotic consistency of \(\hat{I}_n(x ;y\vert z)\) and the fact that the local CP scheme allows for an asymptotically consistent approximation of \(P_{X \vert Z}\). For a detailed proof, see Appendix B. Therefore, our work is in line with the results of Shah and Peters [45] and Kim et al. [28] by demonstrating that, under the mild assumptions (A1) and (A2), which allow approximating \(P_{X\vert Z}\), one can derive a CI test that is valid (see Thm. 1) and has non-trivial power (see Thm. 2).

4.4 mCMIkNN-Based Constraint-Based Causal Discovery

We examine the asymptotic consistency of mCMIkNN-based causal discovery, in particular, using the well-known PC algorithm [46]. Note that constraint-based methods for causal discovery cannot distinguish between different DAGs \(\mathcal {G}\) within the same equivalence class. Hence, the PC algorithm aims to find the Completed Partially Directed Acyclic Graph (CPDAG), denoted by \(\mathcal {G}_{CPDAG}\), that represents the Markov equivalence class of the true DAG \(\mathcal {G}\). In a first step, constraint-based methods apply CI tests to check whether \(X \! \mathrel {\perp \!\!\!\perp }\!Y \, \vert \, Z\) for \(X,Y \in \textbf{V}\) with \(d_X=d_Y=1\) and \(Z \subseteq \textbf{V}\setminus \{X,Y\}\), iterating with increasing \(d_Z\) given a nominal level \(\alpha \), to estimate the undirected skeleton of \(\mathcal {G}\) and the corresponding separation sets. In a second step, orienting as many of the undirected edges as possible through the repeated application of deterministic orientation rules yields \(\hat{\mathcal {G}}_{CPDAG}(\alpha )\) [26, 46].
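For intuition, the following much-simplified Python sketch of the skeleton phase shows where a CI test such as mCMIkNN plugs in; separation-set bookkeeping, the orientation rules, and the order-independent adjacency handling of PC-stable are omitted, and the interface is hypothetical.

```python
from itertools import combinations
import numpy as np

def pc_skeleton(data, ci_test, alpha=0.05):
    """Estimate the undirected skeleton; `data` maps variable names to
    (n, 1) sample arrays, `ci_test(x, y, z)` returns a p-value."""
    nodes = sorted(data)
    adj = {v: set(nodes) - {v} for v in nodes}  # start from the complete graph
    d_z = 0
    while any(len(adj[v]) - 1 >= d_z for v in nodes):
        for x, y in combinations(nodes, 2):
            if y not in adj[x]:
                continue
            for zs in combinations(sorted(adj[x] - {y}), d_z):
                # stack the conditioning columns; a marginal independence
                # test would handle the empty set zs == () in practice
                z = (np.hstack([data[v] for v in zs]) if zs
                     else np.zeros((len(data[x]), 0)))
                if ci_test(data[x], data[y], z) > alpha:  # accept H0: X indep Y | Z
                    adj[x].discard(y)
                    adj[y].discard(x)
                    break
        d_z += 1
    return adj
```

For example, ci_test could wrap the mcmi_knn_test sketch from Sect. 4.2.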

Theorem 3 (Consistency of mCMIkNN -based Causal Discovery)

Let \(\textbf{V}\) be a finite set of variables with joint distribution \(P_{\textbf{V}}\) and assume (A1) - (A4) hold. Further, assume the general assumptions of the PC algorithm hold, i.e., causal faithfulness and causal Markov condition, see [46]. Let \(\hat{\mathcal {G}}_{CPDAG,n}(\alpha _n)\) be the estimated CPDAG of the PC algorithm and \(\mathcal {G}_{CPDAG}\) the CPDAG of the true underlying DAG \(\mathcal {G}\). Then, for \(\alpha _n\! =\! \frac{1}{1+M_{perm,n}}\) with \(M_{perm,n}\! \rightarrow \infty \) as \(n\!\rightarrow \infty \),

$$\begin{aligned} \lim _{n\rightarrow \infty }\mathbb {P}_{P_{\textbf{V}}}\left( \hat{\mathcal {G}}_{CPDAG,n}(\alpha _n) =\mathcal {G}_{CPDAG}\right) = 1. \end{aligned}$$
(3)

The idea of the proof is to consider wrongly detected edges due to incorrect CI decisions and to show that they can be controlled asymptotically. For a detailed proof and more information on causal discovery, see Appendix C. As the upper bound on the errors is general for constraint-based methods, the consistency statement of Thm. 3 also holds for modified versions of the PC algorithm, e.g., its order-independent version PC-stable [8]. Hence, using mCMIkNN for constraint-based causal discovery allows for consistent estimation of \(\mathcal {G}_{CPDAG}\) for \(n\!\rightarrow \! \infty \).

5 Empirical Evaluation

We consider the mixed additive noise model (MANM) (Sect. 5.1) to synthetically examine mCMIkNN’s robustness (Sect. 5.2). Further, we compare mCMIkNN’s empirical performance against state-of-the-art competitors regarding CI decisions (Sect. 5.3), causal discovery (Sect. 5.4), and in a real-world scenario (Sect. 5.5).

5.1 Synthetic Data Generation

We generate synthetic data according to the MANM [24]. Hence, for all \(X \in \textbf{V}\), let X be generated from its J discrete parents \(\mathcal {P}^{dis}(X)\subseteq \textbf{V}\setminus \{X\}\), where \(J:=\#\mathcal {P}^{dis}(X)\), its K continuous parents \(\mathcal {P}^{con}(X)\subseteq \textbf{V}\setminus \{X\}\), where \(K:=\#\mathcal {P}^{con}(X)\), and a (continuous or discrete) noise term \(N_X\) according to \(X= \frac{1}{J}\sum _{j=1,\dots ,J} f_{j}(Z_j) + (\sum _{k=1,\dots ,K} f_{k}(Z_k)) \bmod d_X + N_X\) with appropriately defined functions \(f_j\), \(f_k\) between \(\mathbb {Z}\) and \(\mathbb {R}\). Hence, by construction, (A1) and (A2) hold true for all combinations of \(X,Y,Z \subseteq \textbf{V}\). For the experimental evaluation, we generate CGMs that either directly induce CI characteristics between variables X and Y conditioned on \(Z=\{Z_1, \dots ,Z_{d_Z}\}\), with \(d_Z\) between 1 and 7 (see Sect. 5.2 - 5.3), or are randomly generated with 10 to 30 variables and varying densities between 0.1 and 0.4 (see Sect. 5.4). Moreover, we consider different ratios of discrete variables between 0 and 1. We consider the cyclic model with \(d_X \in \{2,3,4\}\) for discrete X, and continuous functions drawn uniformly from \(\{id(\cdot ),(\cdot )^2,\cos (\cdot )\}\). Note that we scale the parents’ signals to reduce the noise for subsequent variables, avoiding high varsortability [41], and max-min normalize all continuous variables. For more information on the MANM and all parameters used for synthetic data generation, see Appendix D.1.

5.2 Calibration and Robustness of mCMIkNN

We provide recommendations for calibrating mCMIkNN and show its robustness, i.e., its ability to control type I and II errors in the finite case. Therefore, we restrict our attention to two simple CGMs \(\mathcal {G}\) with variables \(\textbf{V}=\{X,Y,Z_1,\dots ,Z_{d_Z}\}\), where, first, X and Y have common parents \(Z=\{Z_1,\dots ,Z_{d_Z}\}\) in \(\mathcal {G}\), i.e., \(H_0: X \! \mathrel {\perp \!\!\!\perp }\!Y \, \vert \, Z\), and, second, there exists an additional edge connecting X and Y in \(\mathcal {G}\), i.e., \(H_1: X \! \mathrel {\not \!\perp \!\!\!\perp }\! Y \, \vert \, Z\). Accordingly, we generate the data using the MANM with the parameters described in Sect. 5.1.

Fig. 1. Type I and II error rates of mCMIkNN for different dimensions \(d_Z \in \{1,3,5,7\}\) of Z (smaller is better) given varying sample sizes n, for settings with discrete variable ratios ranging from \(dvr=0.0\), i.e., purely continuous (left), to \(dvr=1.0\), i.e., purely discrete (right).

Calibration: We evaluate the accuracy of CI decisions for different combinations of \(k_{\scriptscriptstyle CMI}\) and \(k_{perm}\) by comparing the area under the receiver operating characteristic curve (ROC AUC), as it provides a balanced measure of type I and type II errors. In particular, we examine different combinations of \(k_{\scriptscriptstyle CMI}\) and \(k_{perm}\) in settings with varying \(d_Z \in \{1,3,5,7\}\), discrete variable ratios \(dvr \in \{0.0, 0.25, 0.5, 0.75, 1.0\}\), and sample sizes n ranging from 50 to \(1\,000\). Note that we set \(\alpha = 0.05\) and \(M_{perm}=100\), cf. [14]. We find that small values of \(k_{\scriptscriptstyle CMI}\) and \(k_{perm}\) are sufficient to calibrate the CI test while not affecting accuracy much in the finite case, such that we set \(k_{\scriptscriptstyle CMI}=25\) and \(k_{perm}=5\) in the subsequent experiments. Appendix D.2 provides detailed evaluation results; for more information on all parameters, see Appendix A.

Robustness: We evaluate mCMIkNN’s robustness regarding validity and power in the finite case by examining the type I and II error rates depicted in Fig. 1. In particular, we see that mCMIkNN is able to control type I errors for all discrete variable ratios dvr and sizes of the conditioning sets \(d_Z\) (cf. Appendix D.3). Moreover, the type II error rates decrease with an increasing number of samples n. Hence, mCMIkNN achieves non-trivial power, particularly for small sizes of the conditioning sets \(d_Z\). In this context, the higher type II errors for higher dimensions \(d_Z\) indicate that mCMIkNN suffers from the curse of dimensionality, cf. Cor. 3. In summary, the empirical results are in line with the theoretical results on asymptotic type I and II error control, cf. Thm. 1 and Thm. 2.

5.3 Conditional Independence Testing

Next, we compare mCMIkNN’s empirical performance to state-of-the-art CI tests valid for mixed discrete-continuous data. We choose a likelihood ratio test assuming conditional Gaussianity (CG) [1], a discretization-based approach where we discretize continuous variables before applying Pearson’s \(\chi ^2\) test (disc\(\chi ^2\)), a non-parametric CI test based upon adaptive histograms (aHist\(\chi ^2\)) [36], and a non-parametric kernel-based CI test (KCIT) [52]. In this experiment, we again consider the two CGMs used for the calibration in Sect. 5.2 and examine the respective ROC AUC scores over \(20\,000\) CI decisions (\(\alpha =0.01\)) in Fig. 2.

Fig. 2. ROC AUC scores (higher is better) of \(20\,000\) CI decisions of the CI tests mCMIkNN, CG, KCIT, disc\(\chi ^2\), and aHist\(\chi ^2\) with varying sample sizes n (left), dimensions of the conditioning sets \(d_Z\) (center), and ratios of discrete variables dvr (right). (Note: we limited the execution time to 10 min per CI test; approx. \(4\,900\) runs of aHist\(\chi ^2\) exceeded this limit, thus aHist\(\chi ^2\) is excluded for causal discovery.)

We compare the CI tests’ performance for various sample sizes (Fig. 2, left), sizes of the conditioning sets \(d_Z\) (center), and ratios of discrete variables (right). While the ROC AUC scores of all CI tests increase as n grows (left), mCMIkNN outperforms all competitors, particularly for small sample sizes, e.g., \(n \le 500\). With increasing sample sizes, the performance of KCIT catches up to the ROC AUC scores of mCMIkNN, cf. \(n = 1\,000\). For an increasing size of the conditioning sets \(d_Z\) (center), we observe that all methods suffer from the curse of dimensionality, while mCMIkNN achieves higher ROC AUC scores than the competitors. Moreover, mCMIkNN achieves the highest ROC AUC scores independently of the ratio of discrete variables dvr (right), beaten only by KCIT for some values of dvr. For a detailed evaluation and an examination of type I and II errors, see Appendix D.4.

5.4 Causal Discovery

We evaluate the consistency of causal discovery using the PC-stable algorithm from [8] (\(\alpha =0.05\), \(M_{perm}=100\)) to estimate \(\mathcal {G}_{CPDAG}\) of the DAG \(\mathcal {G}\) generated according to Sect. 5.1. We examine the \(\text {F1}\) scores [7] of the detected edges in the skeletons of \(\hat{\mathcal {G}}_{CPDAG,n}(0.05)\), estimated with PC-stable using the respective CI tests, in comparison to the true skeleton of \(\mathcal {G}\), see Fig. 3. While \(\text {F1}\) grows for all methods as n increases, mCMIkNN outperforms the competitors (left). Further, mCMIkNN achieves the highest \(\text {F1}\) scores for high discrete variable ratios (center left). In this context, \(\text {F1}\) scores are balanced towards type I errors, which are crucial in causal discovery. Further, constraint-based causal discovery requires higher sample sizes for consistency due to the multiple testing problem [17, 46]. All methods suffer from the curse of dimensionality, i.e., a decreasing \(\text {F1}\) score for increasing densities (center right) and numbers of variables (right), which yield larger conditioning set sizes \(d_Z\). For more information, see Appendix D.6.

Fig. 3. \(\text {F1}\) scores (higher is better) of PC-stable with the CI tests mCMIkNN, CG, KCIT, and disc\(\chi ^2\), computed over \(3\,000\) CGMs for varying sample sizes n, discrete variable ratios dvr, densities of CGMs, and numbers of variables N (left to right).

5.5 Real-World Scenario: Discrete Manufacturing

Finally, we apply mCMIkNN to causal discovery on real-world manufacturing data. To this end, we consider a simplified discrete manufacturing process whose underlying causal structures are confirmed by domain experts. In particular, we consider quality measurements \(Q_{con}\) and rejections \(R_{con}\) within a configuration phase used to adjust the processing speed \(S_{con}\) in order to reduce the number of rejected goods \(R_{prod}\) within a production phase. Besides these causal structures for configuration, rejections within the production phase \(R_{prod}\) vary with the respective locality, i.e., one of nine existing units U. In contrast to commonly applied discretization-based approaches, cf. [20], an experimental evaluation shows that mCMIkNN captures more of the CI characteristics present in the mixed discrete-continuous real-world data and, hence, yields better estimates of the causal structures when used in constraint-based causal discovery: \(\text {F1}=0.57\) for mCMIkNN vs. \(\text {F1}=0.4\) for disc\(\chi ^2\). For additional details, see Appendix E.

6 Conclusion

We addressed the problem of testing CI in mixed discrete-continuous data and its application in causal discovery. We introduced the non-parametric CI test mCMIkNN, and showed its validity and power theoretically and empirically. We demonstrated that mCMIkNN outperforms state-of-the-art approaches in the accuracy of CI decisions, particularly for low sample sizes.

While mild assumptions simplify the application of mCMIkNN in practice, we cannot derive bounds on type I and II error control for the finite case as provided in [28]; however, the empirical results show that mCMIkNN is robust in the finite case, too. Such bounds can be achieved under stronger assumptions, such as lower bounds on the probabilities of discrete values, cf. [2, 28], or smoothness assumptions for continuous variables, cf. [4, 53]. Further, the current implementation of mCMIkNN is restricted to metric spaces. To extend the implementation to categorical variables, an isometric mapping into a metric space can be examined, cf. [37]. Note that kNN methods are not invariant to the scaling of variables, and their computational complexity yields long runtimes, particularly for large sample sizes. For an evaluation of runtimes, see Appendix D.5. We consider parallel execution strategies to speed up the computation, e.g., parallelizing the execution of the \(M_{perm}\) permutations in Algorithm 2, cf. [43], or using GPUs [21].