1 Introduction

Nowadays there is an increasing availability of good quality data produced by official statistics, useful both for study purposes and for decision making. Simple but important examples are the EU-SILC survey on income, the ISTAT labour force survey and the ISTAT survey on consumption expenses. Typically, in those surveys data are collected with unequal first order inclusion probabilities. In all the above mentioned cases, as well as in many others, the construction of multivariate statistical models involving several variables is of primary interest. In this context Bayesian networks (BNs) play a very important role because of their flexibility and easy-to-read representation of relationships among variables by means of edges connecting them. Bayesian networks are multivariate statistical models satisfying sets of conditional independence statements encoded in a directed acyclic graph (DAG), see Cowell et al. (2007). The nodes of the graph correspond to random variables, while edges represent dependencies. In recent years BNs have been successfully applied to a large variety of contexts, among them official statistics. Data collected through a survey are typically affected by selection bias due to the sampling design, nonresponse and measurement error. BNs have proved very useful in missing data imputation (Di Zio et al. (2004), Thibaudeau and Winkler (2002)), contingency table estimation for complex survey sampling (Ballin and Scanu (2010)) and the treatment of measurement errors (Marella and Vicard (2013), Marella and Vicard (2015)). However, there are still serious obstacles complicating a wider application in official statistics and in all those contexts where complex sampling designs are used. In fact, missing item imputation and measurement error correction can be performed once the BN structure is available (either known in advance or learned from data), since information has to be propagated throughout the network. Therefore, it is necessary to develop structural learning algorithms accounting for the sampling design complexity.

Learning BNs from a sample can be a time consuming and challenging task even when data are independent and identically distributed (i.i.d.); for a survey on structural learning, see Drton and Maathuis (2017). In data driven learning, three broad classes of algorithms can be distinguished: constraint based, score-plus-search and hybrid algorithms. Constraint based algorithms carry out a series of independence tests and construct a graph satisfying the discovered independence statements. The second group of algorithms treats structure learning as a structural optimization problem, using a search strategy to select the structure optimizing a given score function which measures how well the network fits the data. Recently, the third group of algorithms, hybrid algorithms, has become widely used. They combine the ideas and the advantages of the previous two types of algorithms. One popular strategy is to use a constraint-based algorithm (in the first learning phase) to determine the initial network structure, and then a score-based algorithm (in the second learning phase) to find the highest-scoring network structure, see Tsamardinos et al. (2006). The main constraint based algorithm is the PC algorithm, see Spirtes et al. (2000). It has several advantages, among which its intuitive basis. The PC algorithm relies on conditional independence tests, usually performed by using the standard Pearson chi-square test statistic under the i.i.d. assumption, which is equivalent to a simple random sampling assumption. However, sample selection in surveys may involve more complex sampling designs based on stratification, different levels of clustering and inclusion probabilities proportional to an appropriate size measure. Complex designs can severely affect i.i.d. based methods, as shown in Skinner et al. (1989). In such circumstances, the standard test procedure is not valid even asymptotically. As a consequence, in these cases the PC algorithm fails to correctly identify the independence equivalence class of DAGs containing the true structure of the BN under examination.

Survey weights and design effects are appropriate tools by which complex sampling designs can be accommodated. As far as the chi-square statistic is concerned, corrections based on design effects have been proposed in Rao and Scott (1981) and Rao and Scott (1984). However, such corrections require design effect estimates or, alternatively, an estimate of the full covariance matrix of the cell proportion estimators, which are not generally published and then need to be estimated by resampling.

In this paper, a novel approach for inferring causal structure from complex survey data is investigated, requiring the combination of two different statistical cultures: sampling theory and structural learning. A modified version of the PC algorithm (PC-cs algorithm, for short) is proposed that uses independence tests accounting for the sample selection mechanism. The sampling design complexity is accounted for via a design-based approach, by including the sampling weights in the estimates of the BN parameters. In fact, such weights contain invaluable information about the relationship between the distribution of the sample data and the distribution in the population from which the sample is taken. After such parameters have been estimated, a procedure based on the chi-square statistic for testing the association in a two-way table is proposed; its limiting sampling distribution is estimated by resorting to resampling techniques for finite populations. The new test procedure is then applied to BN structural learning.

The paper is organized as follows. In Sect. 2 the PC algorithm for i.i.d. data and the basic assumptions on which it relies are briefly recalled. In Sect. 3 the modified version of the PC algorithm for complex survey data is introduced and described. A simulation study is performed in Sect. 4. Finally, advantages and disadvantages of the proposed approach are discussed in Sect. 5.

2 Discovering causal structure with the PC algorithm

2.1 Preliminary definitions

A DAG is a pair \(G=(V,E)\) consisting of a set of vertices V and a set E of directed edges between pairs of vertices.

A directed graph is acyclic in the sense that it is not possible to start from a node and return to the same node following the directions of the arrows. Each node represents a random variable, while a missing arrow between two nodes implies (un)conditional independence between the corresponding variables. Examples of DAGs are shown in Fig. 1a, b. Consider Fig. 1a. In the arrow \(X_2 \rightarrow X_3\), \(X_2\) is said to be a parent of \(X_3\) and \(X_3\) a child of \(X_2\). \(X_1\) is called an ancestor of \(X_3\) and \(X_3\) a descendant of \(X_1\), since \(X_1\) is connected to \(X_3\) by a directed path, i.e. a sequence of direction preserving arrows. Two nodes \(X_i\) and \(X_j\) are said to be adjacent if they are directly connected by a directed or undirected edge. A graph is complete if all pairs of vertices are joined by a directed or undirected edge. A v-structure is a triple of nodes \((X_i,X_k,X_j)\) such that the arrows \(X_i \rightarrow X_k\) and \(X_j \rightarrow X_k\) are in the DAG. In such a configuration, \(X_k\) is called an unshielded collider. If \(X_{i}\) and \(X_{j}\) are connected, the collider \(X_{k}\) is termed shielded. For example, in Fig. 1a the triple \((X_2,X_3,X_4)\) constitutes a v-structure where \(X_3\) is an unshielded collider. The skeleton of a DAG G is the undirected graph obtained from G by replacing all arrows with lines (undirected edges).

For example, the undirected graph in Fig. 1c is the skeleton of the graphs in Fig. 1a, b.

Fig. 1 a Example of DAG, b DAG Markov equivalent to (a); c skeleton of DAGs in (a) and (b); d completed partially directed acyclic graph (CPDAG) for DAGs (a), (b)

We next recall the basic assumptions and then the main steps of the PC algorithm.

2.2 Assumptions

Let P be the joint probability distribution associated with G. The following assumptions are made when applying the PC algorithm (for details see Zhang and Spirtes (2008) and Uhler et al. (2013)):

1. Sufficiency Condition. The set of observed variables V is causally sufficient, that is, every common direct cause relative to V of any pair of variables in V is also contained in V.

2. Causal Markov Condition. P is said to be Markov with respect to G if every node of G is probabilistically independent of its non-descendants given its parents in G. For example, in Fig. 1a, \(X_3\) is independent of \(X_1\) given its parent \(X_2\). The causal Markov condition defines the set of conditional independence relations entailed by the DAG.

3. Causal Faithfulness Condition. The joint probability distribution P is faithful to G if all conditional independencies can be read off G. This means that if the true causal structure G does not entail a conditional independence relation according to the causal Markov condition, then the conditional independence relation does not hold for the true probability distribution. In Zhang and Spirtes (2008) a decomposition of the faithfulness assumption useful for violation detection is proposed. The faithfulness condition implies adjacency-faithfulness and orientation-faithfulness. These do not constitute an exhaustive decomposition of the faithfulness assumption; however, the leftover part is irrelevant to the correctness of structural learning procedures such as the PC algorithm.

2.3 PC algorithm for i.i.d. data

The PC algorithm starts with a complete undirected graph on the set V, and ends with a class of Markov equivalent DAGs where all the associated models encode the same conditional independence information.

Two DAGs are Markov equivalent if and only if they have the same skeleton and the same v-structures, see Verma and Pearl (1990). For example, DAGs a and b in Fig. 1 are Markov equivalent. In fact, they have the same skeleton (Fig. 1c) and the same v-structure \(X_2 \rightarrow X_3 \leftarrow X_4\). A common tool for visualizing equivalence classes of DAGs is the completed partially directed acyclic graph (CPDAG), see Spirtes et al. (2000). A CPDAG is a summary graph that has: a directed edge where all DAGs in the equivalence class have the same directed edge; an undirected edge between \(X_i\) and \(X_j\) if in the equivalence class there exists at least one DAG with \(X_i \rightarrow X_j\) and one DAG with \(X_i \leftarrow X_j\). An example of CPDAG is shown in Fig. 1d.

The PC algorithm proceeds according to the following two phases:

1. Phase 1: Skeleton estimation. First, all pairs \((X_i, X_j)\) are tested for marginal independence, removing the edge between independent variables and saving the empty set as separating sets \(S_{ij}\) and \(S_{ji}\). Then all pairs \((X_i, X_j)\) still adjacent are tested for independence conditionally on one single node adjacent to \(X_i\). If \(X_i\) and \(X_j\) are judged to be independent given, say, \(X_k\), the edge between \(X_i\) and \(X_j\) is removed and \(X_k\) is saved as separating set \(S_{ij}\) and \(S_{ji}\). The algorithm proceeds by augmenting the conditioning set size one unit at a time, until all adjacency sets are smaller than the conditioning set size. The resulting graph is the skeleton.

2. Phase 2: Arrows orientation. The v-structures and their colliders are identified. A triple of vertices \((X_i,X_k,X_j)\) in the skeleton such that the pairs \((X_i,X_k)\) and \((X_j,X_k)\) are adjacent but \((X_i,X_j)\) is not, is oriented as a v-structure \(X_i \rightarrow X_k \leftarrow X_j\) if \(X_k\) is not in \(S_{ij}= S_{ji}\). Once all v-structures have been identified, it may be possible to orient some of the remaining edges without introducing additional v-structures or directed cycles, thus leading to a CPDAG.

As far as the faithfulness assumption is concerned, the adjacency-faithfulness condition is necessary to recover the skeleton of the true DAG and is strictly related to Phase 1 of the PC algorithm. The orientation-faithfulness condition, necessary for finding the correct orientation of the arrows, is related to Phase 2 of the algorithm.
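For concreteness, both phases can be run on i.i.d. discrete data with the function pc() of the R package pcalg, which is also used in Sect. 4. The following is a minimal sketch on the discrete example dataset gmD shipped with the package; the number of levels per variable is derived from the data, and the significance level is illustrative.

```r
## Two-phase PC algorithm on i.i.d. discrete data (sketch).
library(pcalg)

data(gmD)                              # gmD$x: data frame of 5 discrete variables
V <- colnames(gmD$x)
suffStat <- list(dm = gmD$x,           # data matrix for the chi-square CI tests
                 nlev = sapply(gmD$x, function(v) length(unique(v))),
                 adaptDF = FALSE)

## pc() carries out Phase 1 (skeleton) and Phase 2 (orientation);
## disCItest is the chi-square conditional independence test for i.i.d. data.
cpdag <- pc(suffStat, indepTest = disCItest, alpha = 0.05,
            labels = V, method = "stable")
plot(cpdag)                            # estimated CPDAG (requires Rgraphviz)
```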

3 PC algorithm for complex survey data

3.1 The problem

Under the assumptions in Sect. 2.2 and if the sample size is large enough, the original PC algorithm is able to infer from data the Markov equivalence class the true causal DAG belongs to. This means that, if the input of the PC algorithm is a sample from a population distribution P faithful to some DAG, then in the large sample limit the algorithm can identify any probabilistic independence claim with perfect reliability. In practice, we do not have direct access to the true population distribution, and need to carry out statistical inference based on a finite sample size. Hence, since the first phase of the PC algorithm consists of a series of conditional independence tests based on a finite sample size, it is possible that the original graph is not recovered even if the PC algorithm assumptions are verified at the population level. Therefore, it becomes very relevant to causal inference whether the population probability distribution, though faithful to the true causal structure, is far from or close to being unfaithful. As stressed in Zhang and Spirtes (2008), a population distribution is close-to-unfaithful to a causal structure if the structure does not entail some conditional independence relation according to the Causal Markov Condition, but the conditional independence almost holds, or in other words, the conditional dependence is by some measure very weak in the population. Precisely, the degree of dependence and the sample size determine how “weak” counts as “close to independence”. It is clear that at every finite sample size there are distributions faithful to the true causal structure but so close to being unfaithful as to possibly cause trouble for inference at that sample size. For instance, two variables, though entailed to be dependent conditional on some variables, can be close to being conditionally independent. As a consequence, due to the sample size, tests can fail to correctly identify such a dependence, leading to errors in judgment about the properties of the population.

The situation worsens for complex survey data, since the sampling design can modify the independence and conditional independence relations associated with the population probability distribution P. Roughly speaking, complex sampling schemes lead to unequal selection probabilities; ignoring this can result in biased estimates of the population distribution P. As a consequence, even if the population distribution is faithful to some DAG G, the sample distribution may not be faithful to the same DAG because of the sample selection mechanism. Specifically, the selection probabilities are often unequal in at least some stages of the sample selection. When such probabilities are correlated with the survey variables of interest, the observed outcomes are no longer representative of the population outcomes, and the model holding for the sample data is then different from the model holding in the population, see Pfeffermann (2001). In this case, conventional analysis, which ignores the sampling effects, may yield large bias and erroneous inference, as illustrated, for example, in the book edited by Skinner et al. (1989). In Marella and Pfeffermann (2019) the effect of alternative sampling designs on a trivariate normal population distribution has been evaluated. For instance, suppose that (X, Y, Z) is trivariate normal with mean vector \(\varvec{\mu }\) and covariance matrix \(\varvec{\Sigma }\), and that the sample inclusion probabilities have expectations

$$\begin{aligned} E_p(\pi _i \vert x_i,y_i,z_i)=\kappa \exp \{\gamma _X x_i+\gamma _Y y_i+\gamma _Z z_i\}, \end{aligned}$$
(1)

where \(\kappa\) guarantees that the expectation is less than or equal to one. It can be shown that the model holding for the sample outcomes is normal but with parameters \(\varvec{\mu }_{S}\) and \(\varvec{\Sigma }\), see Marella and Pfeffermann (2019). In this case, the sample model and the population model belong to the same family and differ only in some parameters. Other examples in Marella and Pfeffermann (2019) show that the population and the sample distributions can either belong to the same family, differing only in some or all of the parameters, or to different families.
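A small simulation sketch makes the distortion tangible. In the following R fragment, all quantities (the structure generating (X, Y, Z), the coefficients \(\gamma\) and the normalizing constant \(\kappa\)) are illustrative choices, not those of Marella and Pfeffermann (2019); the naive mean computed on the informatively selected units is visibly biased.

```r
## Effect of informative selection of the form (1) on a normal population (sketch).
set.seed(1)
N <- 100000
x <- rnorm(N)
y <- 0.5 * x + rnorm(N)
z <- 0.5 * y + rnorm(N)

gam <- c(0.4, 0.4, 0.4)                    # illustrative gamma coefficients
pi_i <- exp(gam[1] * x + gam[2] * y + gam[3] * z)
pi_i <- pi_i / max(pi_i)                   # kappa chosen so all probabilities <= 1

s <- rbinom(N, 1, pi_i) == 1               # Poisson sampling, informative probabilities
c(pop_mean = mean(x), naive_sample_mean = mean(x[s]))  # sample mean biased upwards
```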

In the sequel, we assume that the design variables used for sample selection are known for all the sample units, so that the sufficiency condition is satisfied, as in Ballin and Scanu (2010). Nevertheless, the Markov and the faithfulness conditions may fail if the sample is selected by a procedure that is biased towards two or more variables in the set V.

In the PC algorithm for complex survey data, the skeleton learning step (Phase 1) of the PC algorithm is modified by introducing a procedure for testing association in a two-way table for data coming from complex sample surveys. Such a procedure is introduced in Sect. 3.2, where the existence of a limiting distribution of the test statistic under the independence null hypothesis is proved. In Sect. 3.3 such a distribution is estimated by resampling methods for finite populations.

3.2 Independence test for complex sample surveys

In this section a test for independence based on the chi-square statistic in a two-way table for complex survey data is described. The test relies on some results in Conti et al. (2018), where a resampling technique allowing inference on the superpopulation parameters in a finite population setting is proposed.

Denote by A and B two characters of interest with H categories \((A^{1},\dots ,A^{H})\) and K categories \((B^{1},\dots ,B^{K})\), respectively. Furthermore, let the superpopulation parameters \(p^{hk}\), \(p^{h.}\), \(p^{.k}\) be defined as

$$\begin{aligned} p^{hk}=Prob(A=A^{h},B=B^{k}), \quad p^{h.}=Prob(A=A^{h}), \quad p^{.k}=Prob(B=B^{k}) \end{aligned}$$
(2)

for \(h=1,\dots ,H\) and \(k=1,\dots ,K\). Let \({{\mathcal {U}}}_N\) be a finite population of size N generated from the superpopulation model (2), labeled by the integers \(1,\dots ,N\). For each unit i, let \(Y_{i}^{h.}\) (\(Y_{i}^{.k}\)) be the indicator variable taking value 1 if unit i falls in category \(A^{h}\) (\(B^{k}\)) and 0 otherwise, for \(h=1,\dots ,H\) (\(k=1,\dots ,K\)). Let \(Y_{i}^{hk}=Y_{i}^{h.}Y_{i}^{.k}\), so that for each unit i the following equalities hold

$$\begin{aligned} \sum _{h=1}^{H}Y_{i}^{h.}=\sum _{k=1}^{K}Y_{i}^{.k}=1, \quad \sum _{h=1}^{H}Y_{i}^{hk}=Y_{i}^{.k}, \quad \sum _{k=1}^{K}Y_{i}^{hk}=Y_{i}^{h.}. \end{aligned}$$
(3)

For each unit i in the population \({{\mathcal {U}}}_N\), let \(D_i\) be the sample membership indicator, i.e. a Bernoulli random variable taking value 1 whenever i is in the sample and 0 otherwise, and let \({\varvec{D}}_{N}= (D_1 ,\dots ,D_N)\) be the sample membership random vector for the population units. An unordered, without replacement sampling design S is the probability distribution of \({\varvec{D}}_{N}\). In particular, \(\pi _i = E_S [ D_i ]\) is the first order inclusion probability of unit i. The subscript S denotes the sampling design used to select the population units. The sample selection, and therefore the \(\pi _i\)s, depends on the values of design variables (such as strata or cluster indicator variables, or size measures) that are statistically related to \(Y_i^{h.}\) and \(Y_i^{.k}\) but are not included in the inference model. Consequently, the distribution holding for the sample data may be very different from the distribution in the population. In other words, even if the \(Y_i^{h.}\), \(Y_i^{.k}\), \(i=1, \ldots , N\), are i.i.d. at the population level, they are not i.i.d. at the sample level due to the sampling design (Pfeffermann (1993)).

The effective sample size is the r.v. \(n_s = D_1 + \cdots + D_N\). In the sequel we confine ourselves to fixed size sampling designs, such that \(n_s \equiv n\). The assumptions on the sampling design according to which the sample is drawn are similar to those used in Conti (2014), Conti and Marella (2015a) and Conti et al. (2019) (assumptions A1-A6). From now on we assume that the sampling design possesses asymptotically maximal entropy. More precisely, consider the Poisson sampling design where the random variables \(D_i\)s are independent with \(E_{S}[D_i]=\pi _i\) and \(n=\pi _1 +\cdots + \pi _N\). The rejective sampling design (or normalized conditional Poisson sampling) is obtained by conditioning the Poisson design on \(n_S=n\). Denote by \(H (Pr_S) = -E_S \left[ \log Pr_S ( {\varvec{D}}_{N}) \right]\) the entropy of the sampling design S. The rejective design possesses maximal entropy among all sampling designs with fixed sample size and fixed first order inclusion probabilities. Denoting by \(Pr_R(\cdot )\) the sample probabilities for rejective sampling, the Kullback-Leibler divergence of \(Pr_S\) from \(Pr_R\) can be written as

$$\begin{aligned} \Delta _{KL}(Pr_S||Pr_R)=H(Pr_R)-H(Pr_S). \end{aligned}$$
(4)

In the sequel we assume that \(\Delta _{KL}(Pr_S||Pr_R) \rightarrow 0\) as \(n, N \rightarrow \infty\). This means that the sampling design S possesses high (asymptotically maximal) entropy.
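A short verification of (4): the rejective design has exponential form \(Pr_R({\varvec{d}}) = C \exp \{\theta _1 d_1 + \cdots + \theta _N d_N\}\) on the support of fixed size n samples, for a normalizing constant C. Since S and R share the first order inclusion probabilities \(\pi _i\),

$$\begin{aligned} E_S[\log Pr_R({\varvec{D}}_{N})] = \log C + \sum _{i=1}^{N}\theta _i \pi _i = E_R[\log Pr_R({\varvec{D}}_{N})] = -H(Pr_R), \end{aligned}$$

and hence \(\Delta _{KL}(Pr_S||Pr_R)=E_S[\log Pr_S({\varvec{D}}_{N})]-E_S[\log Pr_R({\varvec{D}}_{N})]=H(Pr_R)-H(Pr_S)\), which is (4).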

The properties of high entropy sampling designs are discussed in Berger (2011), Grafström (2010) and references therein. As a matter of fact, asymptotic results for high entropy sampling designs depend only on their first order inclusion probabilities: if two high entropy designs have the same first order inclusion probabilities, then they have the same asymptotic behaviour. High entropy sampling designs generate highly randomized samples, which in turn makes the design more robust. A discussion of high entropy designs and their relationship with robustness can be found in Grafström (2010). Examples of maximal asymptotic entropy sampling designs are simple random sampling, the Rao-Sampford design, the Chao design, the stratified design, etc.; see Berger (2011).

3.2.1 Parameters estimators and their distributions

Let \(p_{N}^{hk}\), \(p_{N}^{h.}\), \(p_{N}^{.k}\) be the finite population parameters defined as

$$\begin{aligned} p^{hk}_{N}=\frac{1}{N}\sum _{i=1}^{N}Y_{i}^{hk}, \quad p^{h.}_{N}=\frac{1}{N}\sum _{i=1}^{N}Y_{i}^{h.}=\sum _{k=1}^{K}p_{N}^{hk}, \quad p^{.k}_{N}=\frac{1}{N}\sum _{i=1}^{N}Y_{i}^{.k}=\sum _{h=1}^{H}p_{N}^{hk} \end{aligned}$$
(5)

where \(h=1,\dots , H\), \(k=1, \dots , K\). Under a complex sampling design, sample proportions would be inconsistent estimators of the population parameters, see Pfeffermann (1993). The parameters (5) can be estimated using the classical Hájek estimators (Hájek (1964))

$$\begin{aligned} \widehat{p}^{hk}=\frac{\sum _{i=1}^{N}\frac{D_{i} Y_{i}^{hk}}{\pi _i}}{\sum _{i=1}^{N}\frac{D_{i}}{\pi _i}},\quad \widehat{p}^{h.}=\frac{\sum _{i=1}^{N}\frac{D_{i} Y_{i}^{h.}}{\pi _i}}{\sum _{i=1}^{N}\frac{D_{i}}{\pi _i}}=\sum _{k=1}^{K}\widehat{p}^{hk},\quad \widehat{p}^{.k}=\frac{\sum _{i=1}^{N}\frac{D_{i} Y_{i}^{.k}}{\pi _i}}{\sum _{i=1}^{N}\frac{D_{i}}{\pi _i}}=\sum _{h=1}^{H}\widehat{p}^{hk} \end{aligned}$$
(6)

for \(h=1, \dots , H\), \(k=1, \dots , K\), where \(1/\pi _{i}\) is the sampling weight, that is, the reciprocal of the probability that unit i is included in the sample. Note that, in the simple random sampling case, the estimators (6) reduce to the proportions of sample units belonging to categories (h, k), h and k, respectively.
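As an implementation sketch, the estimators (6) amount to a weighted cross-tabulation. The following R fragment assumes two sample factors A and B and the vector pi of first order inclusion probabilities (all names illustrative):

```r
## Hajek estimators (6) as weighted cross-tabulation (sketch).
hajek_props <- function(A, B, pi) {
  w <- 1 / pi                          # sampling weights 1/pi_i
  p_hk <- xtabs(w ~ A + B) / sum(w)    # joint Hajek proportions p-hat^{hk}
  list(joint = p_hk,
       row = rowSums(p_hk),            # marginals p-hat^{h.}
       col = colSums(p_hk))            # marginals p-hat^{.k}
}
## With equal inclusion probabilities the weights cancel, and (6) reduces
## to the ordinary sample proportions.
```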

The existence of the limiting distribution of the Hájek estimators (6) is studied in Conti et al. (2018), under both the sample selection and the population generation randomness sources. Similar results appear in Boistard et al. (2017), where, however, no resampling scheme is proposed for recovering the limiting distribution of the chi-square test statistic (14) defined in Sect. 3.2.2. Here, similarly to Conti et al. (2018), we analyze the behaviour of the stochastic processes

$$\begin{aligned} W^{HK}=\{\sqrt{n} ( \widehat{p}^{hk} - p^{hk} ),\ h=1,\dots ,H,\ k=1,\dots ,K \}, \end{aligned}$$
(7)
$$\begin{aligned} W^{H}=\{\sqrt{n} ( \widehat{p}^{h.} - p^{h.} ),\ h=1,\dots ,H\}, \end{aligned}$$
(8)
$$\begin{aligned} W^{K}=\{\sqrt{n} ( \widehat{p}^{.k} - p^{.k} ),\ k=1,\dots ,K\}. \end{aligned}$$
(9)

Each of these processes can be partitioned as follows

$$\begin{aligned} W^{HK}&= \{W^{HK}_{n}+W^{HK}_{N},\ h=1,\dots ,H,\ k=1,\dots ,K \}\nonumber \\&= \{\sqrt{n} ( \widehat{p}^{hk} - p^{hk}_{N} )+ \sqrt{f}\sqrt{N}(p^{hk}_{N}-p^{hk}) \}, \end{aligned}$$
(10)
$$\begin{aligned} W^{H}&= \{W^{H}_{n}+W^{H}_{N},\ h=1,\dots ,H\}\nonumber \\&= \{\sqrt{n} ( \widehat{p}^{h.} - p^{h.}_{N} )+\sqrt{f}\sqrt{N}(p^{h.}_{N}-p^{h.})\}, \end{aligned}$$
(11)
$$\begin{aligned} W^{K}&= \{W^{K}_{n}+W^{K}_{N},\ k=1,\dots ,K\}\nonumber \\&= \{\sqrt{n} ( \widehat{p}^{.k} - p^{.k}_{N})+\sqrt{f}\sqrt{N}(p^{.k}_{N}-p^{.k})\} \end{aligned}$$
(12)

for \(h=1, \dots , H\), \(k=1, \dots , K\) and \(f=\frac{n}{N}\). The components \(W^{HK}_{n}\), \(W^{H}_{n}\), \(W^{K}_{n}\) depend on the sample selection randomness, while the components \(W^{HK}_{N}\), \(W^{H}_{N}\), \(W^{K}_{N}\) depend on the superpopulation randomness. Proposition 1 establishes the convergence of the processes (7)-(9) to Gaussian distributions.

Proposition 1

Under the assumptions A1-A6 of Proposition 1 in Conti et al. (2018), as n and N increase, the sequences:

1. \(W^{HK}_{n}\) and \(W^{HK}_{N}\) converge in distribution to degenerate multivariate normal distributions with zero mean vector and singular covariance matrices \(\varvec{\Sigma }^{HK}_1\) and \(\varvec{\Sigma }^{HK}_2\) of order HK, respectively. As a consequence, the whole process \(W^{HK}\) converges to a degenerate multivariate normal distribution with zero mean vector and singular covariance matrix \(\varvec{\Sigma }^{HK}=\varvec{\Sigma }^{HK}_1+f\varvec{\Sigma }^{HK}_2\).

2. \(W^{H}_{n}\) and \(W^{H}_{N}\) converge in distribution to degenerate multivariate normal distributions with zero mean vector and singular covariance matrices \(\varvec{\Sigma }^{H}_1\) and \(\varvec{\Sigma }^{H}_2\) of order H, respectively. As a consequence, the whole process \(W^{H}\) converges to a degenerate multivariate normal distribution with zero mean vector and singular covariance matrix \(\varvec{\Sigma }^{H}=\varvec{\Sigma }^{H}_1+f\varvec{\Sigma }^{H}_2\).

3. \(W^{K}_{n}\) and \(W^{K}_{N}\) converge in distribution to degenerate multivariate normal distributions with zero mean vector and singular covariance matrices \(\varvec{\Sigma }^{K}_1\) and \(\varvec{\Sigma }^{K}_2\) of order K, respectively. As a consequence, the whole process \(W^{K}\) converges to a degenerate multivariate normal distribution with zero mean vector and singular covariance matrix \(\varvec{\Sigma }^{K}=\varvec{\Sigma }^{K}_1+f\varvec{\Sigma }^{K}_2\).

The proof rests on the same ideas as the proof of Proposition 1 in Conti et al. (2018) and can be seen as a direct consequence of it. The proof guidelines are reported in the Appendix.

3.2.2 The independence test statistic

Suppose we wish to test the null hypothesis that the two categorical variables A and B are independent against the alternative hypothesis that they are associated. Formally,

$$\begin{aligned} {{\mathcal {H}}_{0}}:p^{hk}=p^{h.}p^{.k} \,\,\,\,\,\, \hbox {against} \,\,\,\,\,\, {{\mathcal {H}}_{1}}:p^{hk} \ne p^{h.}p^{.k}. \end{aligned}$$
(13)

The test statistic used is

$$\begin{aligned} \chi ^2_{H}=n\sum _{h=1}^{H}\sum _{k=1}^{K}\frac{(\widehat{p}^{hk}-\widehat{p}^{h.}\widehat{p}^{.k})^{2}}{\widehat{p}^{h.}\widehat{p}^{.k}} \end{aligned}$$
(14)

where the sampling weights in the Hájek estimators \(\widehat{p}^{hk}\), \(\widehat{p}^{h.}\) and \(\widehat{p}^{.k}\) in (6) compensate for the different selection probabilities. As stressed in Proposition 3, for complex survey data: (i) the statistic (14) does have a limiting distribution; (ii) the limiting distribution does not necessarily approach a chi-square distribution, due to the singularity of the covariance matrices \(\varvec{\Sigma }^{HK}\), \(\varvec{\Sigma }^{H}\) and \(\varvec{\Sigma }^{K}\) in Proposition 1. From Proposition 1, the following result holds.
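A minimal R sketch of (14), self-contained under the same illustrative names as above (factors A and B observed on the sample, inclusion probabilities pi):

```r
## Design-weighted chi-square statistic (14) (sketch).
chi2_H <- function(A, B, pi) {
  n <- length(A)
  w <- 1 / pi
  p_hk <- xtabs(w ~ A + B) / sum(w)   # Hajek estimates (6)
  p_h  <- rowSums(p_hk)
  p_k  <- colSums(p_hk)
  e <- outer(p_h, p_k)                # products p^{h.} p^{.k} expected under H0
  n * sum((p_hk - e)^2 / e)
}
```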

Proposition 2

Let \({\varvec{p}}^{HK}=(p^{11},\dots ,p^{1K},\dots ,p^{H1},\dots ,p^{HK})\), \({\varvec{p}}^{H.}=(p^{1.},\dots ,p^{H.})\), \({\varvec{p}}^{.K}=(p^{.1},\dots ,p^{.K})\) be vectors of length HK, H and K, respectively, and denote by \(\widehat{{\varvec{p}}}^{HK}\), \(\widehat{{\varvec{p}}}^{H.}\) and \(\widehat{{\varvec{p}}}^{.K}\) their estimates. Furthermore, let

$$\begin{aligned} \widehat{{\varvec{p}}}=(\widehat{{\varvec{p}}}^{HK},\widehat{{\varvec{p}}}^{H.},\widehat{{\varvec{p}}}^{.K}) \end{aligned}$$

and

$$\begin{aligned} {\varvec{p}}^{{{\mathcal {H}}_{0}}}=(p^{1.}p^{.1},\dots ,p^{H.}p^{.K},p^{1.},\dots ,p^{H.},p^{.1},\dots ,p^{.K}) \end{aligned}$$

be two vectors of length \(T=HK+H+K\). Under the null hypothesis \({{\mathcal {H}}_{0}}\) the statistic

$$\begin{aligned} {\varvec{Z}}=\sqrt{n}(\widehat{{\varvec{p}}}-{\varvec{p}}^{{{\mathcal {H}}_{0}}}) \end{aligned}$$
(15)

converges, as n and N go to infinity, to a degenerate multivariate normal distribution with T components, having zero mean vector and singular covariance matrix \(\varvec{\Sigma }^{T}\).

The proof of Proposition 2 is a consequence of Proposition 1 and is reported in the Appendix.

Proposition 3

Let f be the continuously differentiable function defined as follows

$$\begin{aligned} f : {\varvec{Z}}\rightarrow \chi ^{2}_{H}=n\sum _{h=1}^{H}\sum _{k=1}^{K}\frac{(\widehat{p}^{hk}-\widehat{p}^{h.}\widehat{p}^{.k})^{2}}{\widehat{p}^{h.}\widehat{p}^{.k}}. \end{aligned}$$
(16)

From Proposition 2, it follows that \(\chi ^{2}_{H}\) tends in distribution to a quadratic form in a degenerate multivariate normal distribution.

The proof of Proposition 3 is reported in the Appendix.

3.3 Estimation of test statistic limiting distribution under \({{\mathcal {H}}_{0}}\) via resampling

Resampling methods are computer-based methods for assigning measures of accuracy to statistical estimates and performing statistical inference. In the Bayesian network learning process, Efron's bootstrap (Efron (1979)) has been proposed as a computationally efficient approach for answering questions about specific network features. In Friedman et al. (1999) the aim is to improve the performance of the learning procedure with information from the bootstrap estimates, even when the amount of data is not enough to induce a high scoring network. The bootstrap method is then used to answer questions such as: (i) Is the existence of an edge between two nodes warranted? (ii) Is the Markov blanket of a given node robust? (iii) Can we say anything about the ordering of the variables?

Here resampling methods are used to recover the limiting sampling distribution of the test statistic (14) under the independence null hypothesis. In this regard, in sampling from finite populations, the original Efron's bootstrap can lead to biased results, since it does not take into account the dependence among units due to the sampling design. Adaptations taking into account the non-i.i.d. nature of the data have been proposed in the literature, which is mainly devoted to estimating variances of estimators; cf. Mashreghi et al. (2016). The main approaches are essentially two: ad hoc approaches and plug-in approaches (cf. Ranalli and Mecatti (2012), Chauvet (2007) and references therein). The basic idea of ad hoc approaches consists in resampling from the original sample through a special design accounting for the dependence among units. This approach is pursued in McCarthy and Snowden (1985) and Rao and Wu (1988), where the re-sampled data produced by the i.i.d. bootstrap are properly rescaled, as well as in Sitter (1992), Beaumont and Patak (2012), Chatterjee (2011) and Conti and Marella (2015b), where a rescaled bootstrap process based on asymptotic results is proposed. Among the ad hoc approaches we also quote the paper by Antal and Tillé (2011), where a mixed resampling design is proposed. Plug-in approaches are based on the idea of expanding the sample to a pseudo-population that plays the role of a prediction of the original population. Bootstrap samples are then drawn from such a pseudo-population according to some appropriate resampling design. The most intuitive choice consists in using the same sampling design used to draw the original sample from the population; cf. Gross (1980), Chao and Lo (1985), Booth et al. (1994), Holmberg (1998), Chauvet (2007), as well as Mashreghi et al. (2016).

Virtually all resampling techniques proposed for finite populations rest on the same justification: in the case of linear statistics, the variance of the resampled statistic should be very close to the usual variance estimator, possibly with approximated forms of the second order inclusion probabilities; cf. Antal and Tillé (2011). This is far from the arguments commonly used to justify the classical bootstrap and its variants, which are based on asymptotic considerations involving the whole sampling distribution of a statistic (cf., for instance, Bickel and Freedman (1981) and Lahiri (2003)): the asymptotic distribution of a bootstrapped statistic should coincide with that of the “original” statistic. This argument is actually used in Conti and Marella (2015b). In Conti et al. (2019) a class of resampling techniques for finite populations is proposed; cf. also Jiménez-Gamero et al. (2018), where the asymptotic results in Conti et al. (2019) are used to construct various testing procedures based on a design-based estimator of the finite population characteristic function.

However, the class of resampling techniques defined in Conti et al. (2019) does not work in the present case, due to the descriptive inference framework. A noticeable exception is the multinomial scheme introduced in Conti et al. (2018), together with the asymptotic results used in Sect. 3.2.1. Such a resampling procedure has been exploited in Conti et al. (2020) to approximate the asymptotic law of the Lorenz curve estimator when data are collected according to a complex sampling design.

It is based on a two-phase procedure consisting of: (i) constructing, on the basis of the sample data, a prediction of the population (i.e. the multinomial pseudo-population) by sampling N units independently from the original sample, where each unit \(i \in s\) is selected with probability \(\pi ^{-1}_{i}/\sum _{j \in s} \pi ^{-1}_{j}\); such a prediction is based on the sampling design and does not essentially involve the superpopulation model; (ii) drawing a sample of the same size as the original one from the pseudo-population according to an appropriate resampling design fulfilling the high entropy requirement. Notice that the idea of using pseudo-populations had been previously used by Mashreghi et al. (2016).

The asymptotic distribution of (14) under the null hypothesis is estimated by applying the following procedure.

1. Step 1. Generate \(M=1000\) bootstrap samples, each of size n equal to the original sample size, on the basis of the two-phase resampling procedure described above. Notice that the pseudo-population in phase (i) has to be generated under the null hypothesis \({{\mathcal {H}}_{0}}\). Specifically, for each unit \(i \in s\) such that \(Y^{h.}_{i}=1\) and \(Y^{.k}_{i}=1\), the original weight \(w_{i}=1/\pi _{i}\) is modified as follows

$$\begin{aligned} w^{*}_{i}=w_{i}\frac{\widehat{p}^{h.}\widehat{p}^{.k}}{\widehat{p}^{hk}}. \end{aligned}$$
(17)

Letting \(A^{hk}\) be the set \(\{i \in s: Y^{h.}_{i}=1,Y^{.k}_{i}=1\}\), the modified weights (17) guarantee that

$$\begin{aligned} \sum _{i \in A^{hk}}w^{*}_{i}=\frac{\sum _{i \in A^{h.}}w_{i}\sum _{i \in A^{.k}}w_{i}}{\sum _{i \in s}w_{i}}. \end{aligned}$$
(18)

2. Step 2. For each bootstrap sample, compute the corresponding Hájek estimators \(\widehat{p}^{hk}\), \(\widehat{p}^{h.}\), \(\widehat{p}^{.k}\) in (6).

3. Step 3. Compute the M quantities \(\chi ^{2,m}_{H}\), \(m=1, \dots ,M\), as in (14).

4. Step 4. Compute the empirical cumulative distribution function of the \(\chi ^{2,m}_{H}\)s

$$\begin{aligned} \widehat{T}_{n,M} (t) = \frac{1}{M} \sum _{m=1}^{M} I_{( \chi ^{2,m}_{H} \le t )} , \;\; t \in {\mathbb {R}}. \end{aligned}$$
(19)

Finally, compute the \(1-\alpha\) percentile of \(\widehat{T}_{n,M} (t)\)

$$\begin{aligned} \widehat{T}^{-1}_{n,M} (1-\alpha ) = \inf \{ t : \; \widehat{T}_{n,M} (t) \ge 1-\alpha \} , \;\; 0<\alpha <1 . \end{aligned}$$
(20)

If \(\chi ^{2}_{H}<\widehat{T}^{-1}_{n,M} (1-\alpha )\), then \({{\mathcal {H}}_{0}}\) is not rejected at significance level \(\alpha\).
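The following R fragment sketches Steps 1-4 under simplifying assumptions: it reuses the illustrative chi2_H() function of Sect. 3.2.2, and in phase (ii) it resamples from the pseudo-population by simple random sampling without replacement, one of the maximal entropy designs listed in Sect. 3.2; in applications the resampling design should mimic the original one.

```r
## Bootstrap estimate of the null distribution of (14) (sketch).
boot_null_dist <- function(A, B, pi, N, M = 1000) {
  n <- length(A)
  w <- 1 / pi
  ## Step 1, phase (i): modify the weights as in (17) so that the
  ## pseudo-population satisfies the independence null hypothesis
  p_hk <- xtabs(w ~ A + B) / sum(w)
  p_h  <- rowSums(p_hk)
  p_k  <- colSums(p_hk)
  a <- as.integer(A); b <- as.integer(B)        # category codes h and k
  w_star <- w * p_h[a] * p_k[b] / p_hk[cbind(a, b)]
  replicate(M, {
    ## multinomial pseudo-population of N unit labels, drawn with
    ## probabilities proportional to the modified weights ...
    pop <- sample(n, N, replace = TRUE, prob = w_star / sum(w_star))
    ## ... phase (ii): bootstrap sample of size n (srswor, so pi = n/N)
    bs <- sample(pop, n, replace = FALSE)
    ## Steps 2-3: recompute the statistic (14) on the bootstrap sample
    chi2_H(A[bs], B[bs], rep(n / N, n))
  })
}

## Step 4: reject H0 at level alpha if the observed statistic exceeds the
## (1 - alpha) percentile of the bootstrap distribution, e.g.
## quantile(boot_null_dist(A, B, pi, N), 0.95)
```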

4 Simulation studies

In this section the performance of the PC-cs algorithm is empirically tested via a simulation study. The study is organized as follows. First, in Sect. 4.1 a preliminary analysis is performed: the performance of the proposed algorithm when the PC algorithm basic assumptions are violated by the sampling design is investigated. The analysis is based on one sample replicate; small and larger networks are considered. In Sect. 4.2 a Monte Carlo simulation evaluating the accuracy of the proposed algorithm is carried out. The aim is to measure how often the PC-cs algorithm is able to eliminate the selection bias, recovering the correct dependence structure between the variables. The accuracy is evaluated over a large number of sample replicates. Small and larger networks with fixed or randomly chosen DAG and conditional probabilities are considered.

4.1 Evaluating the accuracy of the PC algorithm for complex survey data: a preliminary analysis

In Sect. 4.1.1 the PC-cs algorithm performance when the Markov assumption is violated by the sampling design is investigated. This means that the sample distribution violates the set of conditional independence relations entailed by the DAG to which the population probability distribution P is faithful. The PC-cs algorithm behaviour in case of orientation faithfulness and adjacency faithfulness violations is investigated in Sects. 4.1.2 and 4.1.3, respectively. Finally, in Sect. 4.1.4 the robustness of the PC-cs algorithm is evaluated on a ten-node network where the aforementioned PC algorithm assumptions can be simultaneously violated by the sample distribution.

4.1.1 Markov assumption violation

A finite population of size \(N=10,000\) has been generated according to the true causal DAG in Fig. 2a. The conditional probability distributions associated with the nodes are reported in Tables 1, 2, 3, 4, 5.

Fig. 2 a True graph, b finite population CPDAG

Table 1 Probability distribution of \(X_{1}\)
Table 2 Probability distribution of \(X_{2} \vert X_{1}\)
Table 3 Probability distribution of \(X_{3} \vert X_{2}\)
Table 4 Probability distribution of \(X_{4} \vert X_{2}\)
Table 5 Probability distribution of \(X_{5} \vert X_{4}\)

An estimate of the causal structure underlying the finite population has been obtained using the function pc() in the R package pcalg, with the argument method = "stable", implementing the PC algorithm based on conditional independence tests for i.i.d. data, see Kalisch et al. (2012). Notice that in high-dimensional settings the proposals of Lagani et al. (2017) and Tsagris (2019) seem to perform faster and better in terms of returning correct networks. Fig. 2b shows the resulting CPDAG with undirected edges only, representing the Markov equivalence class. We next proceed to generate a sample from our finite population by a complex sampling design. To this aim the variable \(X_{1}\) has been transformed into a continuous variable Z as follows

$$\begin{aligned} Z= \left\{ \begin{array}{ll} N(100,2)+15 & \quad \hbox {if} \quad X_{1}=0 \\ N(10,2)+5 & \quad \hbox {if} \quad X_{1}=1 \hbox { or } X_{1}=2 \end{array} \right. \end{aligned}$$
(21)

A sample of size \(n=3000\) has been drawn from the finite population according to a conditional Poisson sampling design, with inclusion probabilities proportional to the Z-values in (21). The effect of the survey design on the causal structure learned using the PC algorithm is shown in the CPDAG in Fig. 3a, where an additional edge is placed between the nodes \(X_{1}\) and \(X_{3}\): the conditional independence between \(X_{1}\) and \(X_{3}\) given \(X_{2}\) holding at the population level is destroyed by the sampling design.
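As an illustration, the selection step can be reproduced with the R package sampling; here Z is the population vector generated as in (21) and pop_data a hypothetical data frame holding the population values.

```r
## Conditional Poisson sample with inclusion probabilities proportional to Z (sketch).
library(sampling)

n <- 3000
pik <- inclusionprobabilities(Z, n)   # pi_i proportional to Z, summing to n
s <- UPmaxentropy(pik)                # conditional Poisson (maximum entropy) design
smp <- pop_data[s == 1, ]             # selected sample
w <- 1 / pik[s == 1]                  # sampling weights 1/pi_i
```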

In order to estimate the test statistic distribution under the independence null hypothesis, \(M=1000\) bootstrap replications have been drawn from the selected sample using the procedure described in Sect. 3.3, and the corresponding M bootstrap estimates \(\chi ^{2,m}_{H}\), \(m=1, \dots ,M\), have been computed. Although the minimum number of bootstrap samples suggested in Wilcox (2010) is 599, we use \(M=1000\) bootstrap samples as a compromise between the computational cost of the procedure and the accuracy of the estimate.

The significance level for the independence tests has been set equal to 0.05. In this case, the PC-cs algorithm is able to recover the true population equivalence class obtaining the CPDAG shown in Fig. 2b.

Fig. 3 a Markov condition violation, b orientation faithfulness violation

The simulation has been repeated considering an alternative sampling design, obtained from (21) by modifying the parameters of the normal distribution and the intercept value. Specifically, in (21) for \(X_1=1\) and \(X_1=2\) the Z variable is generated according to \(N(5,1)+ 10\); the effect of the sampling design on the association structure is an additional edge between \(X_3\) and \(X_4\) with respect to the CPDAG in Fig. 2b. Again the PC-cs algorithm is able to recover the true population equivalence class in Fig. 2b.

4.1.2 Orientation faithfulness violation

In order to investigate the performance of the PC-cs algorithm when the orientation faithfulness is violated, a finite population of size \(N=10,000\) has been generated according to the true causal DAG in Fig. 2a; the conditional probability distributions are in Tables 1, 4, 5 and Tables 6, 7.

Table 6 Probability distribution of \(X_{2} \vert X_{1}\)
Table 7 Probability distribution of \(X_{3} \vert X_{2}\)

A sample of size \(n=3000\) has been selected from the finite population according to a conditional Poisson sampling design with inclusion probabilities proportional to the Z-values in (21). In this case, the survey design produces a failure of the orientation-faithfulness assumption, as shown in Fig. 3b, where a v-structure on the triple \((X_{2},X_{4},X_{5})\) is placed. As before, \(M=1000\) bootstrap replications have been drawn and the corresponding M bootstrap estimates \(\chi ^{2,m}_{H}\), \(m=1, \dots ,M\), have been computed. Setting the significance level at 0.05, the PC algorithm for complex survey data is able to recover the true population equivalence class, obtaining the CPDAG shown in Fig. 2b.

A variation of the PC algorithm, called the conservative PC algorithm, has been proposed in Ramsey et al. (2006) to detect orientation-faithfulness failures. The main differences between the PC algorithm for complex survey data and the conservative PC algorithm can be summarized as follows:

1. The PC algorithm for complex survey data takes into account the sampling design via a design-based approach, by including the sampling weights in the estimates of the BN parameters. Hence, the PC-cs algorithm adjusts for sample selection bias at the top of the PC algorithm, i.e. in Phase 1.

2. The conservative PC algorithm assumes the Markov condition and adjacency faithfulness, and tests the orientation faithfulness condition by performing additional independence tests. The conservative PC algorithm thus adjusts at the bottom of the PC algorithm, i.e. in Phase 2. It works in a model-based approach avoiding the use of sampling weights, and therefore produces bias in the structure learning process if the sampling design is not ignorable, see Pfeffermann (1993).

In our example, the conservative PC algorithm marks the triple \((X_{2},X_{4},X_{5})\) as ambiguous since \(X_{4}\) is in some but not all separating sets. An ambiguous triple is not oriented as a v-structure. Furthermore, no later orientation rule that needs to know whether \((X_{2},X_{4},X_{5})\) is a v-structure or not is applied.

4.1.3 Adjacency faithfulness violation

In order to investigate the performance of the PC-cs algorithm when adjacency faithfulness is violated, a finite population of size \(N=10,000\) has been generated according to the true causal DAG in Fig. 4a, where a v-structure is introduced. The conditional probability distributions associated with the nodes are in Tables 8, 9, 10, 11, 12.

Table 8 Probability distribution of \(X_{1}\)
Table 9 Probability distribution of \(X_{2}\)
Table 10 Probability distribution of \(X_{3} \vert (X_{1},X_{2})\)
Table 11 Probability distribution of \(X_{4} \vert X_{2}\)
Table 12 Probability distribution of \(X_{5} \vert X_{4}\)

An estimate of the underlying causal structure in the finite population has been obtained using the function pc() in the R-package pcalg. The result is shown in Fig. 4b.

A sample of size \(n=3000\) has been drawn from the finite population according to a conditional Poisson sampling design, with inclusion probabilities proportional to the Z-values defined in (21). The effect of the sampling design is reported in Fig. 4c, where the edge between \(X_{1}\) and \(X_{3}\) is missing. On the basis of \(M=1000\) bootstrap replications and a significance level equal to 0.05, the PC algorithm for complex survey data is able to recover the true population equivalence class in Fig. 4b.

Fig. 4 a Superpopulation graph, b finite population CPDAG, c adjacency faithfulness violation

4.1.4 Robustness of the PC-cs algorithm: a ten-node network

In this section the performance of the PC-cs algorithm is evaluated on a ten-node network. Specifically, a finite population of size \(N=10000\) has been generated according to the network in Fig. 5a, where \(X_3\) and \(X_9\) are discrete variables taking the values (0, 1, 2), while the remaining nodes are dichotomous variables (0, 1).

Fig. 5 a Superpopulation graph, b finite population CPDAG

As before, an estimate of the finite population causal structure has been obtained using the function pc(). The result is shown in Fig. 5b. A sample of size \(n=3000\) has been drawn from the finite population according to a conditional Poisson sampling design. Inclusion probabilities are taken proportional to Z-values, defined as follows

$$\begin{aligned} Z= \left\{ \begin{array}{ll} N(10,2)+5 & \quad \hbox {if} \quad X_{3}+X_{9}=0 \\ N(20,4)+10 & \quad \hbox {if} \quad X_{3}+X_{9}=1 \\ N(30,6)+15 & \quad \hbox {if} \quad X_{3}+X_{9}=2 \end{array} \right. \end{aligned}$$
(22)

and the original PC algorithm is applied. The effect of the sampling design on the association structure is reported in Fig. 6a.

Fig. 6 a CPDAG from the PC algorithm, b CPDAG from the PC-cs algorithm

Clearly, as the number of nodes increases, the Markov condition, orientation faithfulness and adjacency faithfulness discussed in Sects. 4.1.1-4.1.3 can be simultaneously violated by the sample distribution. Specifically, as Fig. 6a shows,

1. the edge \((X_2,X_7)\) is missing;

2. the edge \((X_{3},X_{9})\) is added;

3. the directions of the edges \((X_2,X_4)\) and \((X_2,X_3)\) are wrong.

On the basis of \(M=1000\) bootstrap replications and a significance level equal to 0.05, the PC algorithm for complex survey data is able to recover the population equivalence class in Fig. 6b.

Finally, the structural Hamming distance (SHD, Tsamardinos et al. (2006)) has been computed. Recall that the structural Hamming distance between two CPDAGs is defined as the number of operations (additions, deletions, flips) required to make the CPDAGs match. The SHDs between the population CPDAG (Fig. 5b) and the CPDAGs obtained by the PC algorithm (Fig. 6a) and by the PC-cs algorithm (Fig. 6b) are 7 and 1, respectively, showing the good performance of the proposed methodology in taking into account the sampling design complexity.
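With pcalg the comparison can be written compactly; cpdag_pop, cpdag_pc and cpdag_pccs below are illustrative names for the graphs estimated on the population, by the PC algorithm on the sample and by the PC-cs algorithm, respectively.

```r
## Structural Hamming distance between estimated CPDAGs (sketch).
library(pcalg)

shd(cpdag_pop, cpdag_pc)    # 7: PC algorithm ignoring the sampling design
shd(cpdag_pop, cpdag_pccs)  # 1: PC-cs algorithm accounting for it
```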

4.2 Evaluating the accuracy of the PC algorithm for complex survey data: a Monte Carlo study

In this section the performance of the PC-cs algorithm is evaluated over a large number of sample replicates. Small and larger networks with fixed or randomly chosen DAG and conditional probabilities are considered. More specifically, in Sect. 4.2.1 a three-node network with fixed DAG and conditional probabilities is considered, and the effects of the sampling design on the association structure are graphically shown over 500 sample replicates. In Sect. 4.2.2 the ten-node network of Fig. 5a is analyzed over 350 sample replicates. Finally, in Sect. 4.2.3 the accuracy of the PC-cs algorithm is evaluated when the DAG and the conditional probabilities are randomly chosen. The number of sample replicates is 350, and sample sizes equal to 3000 and 6000 are considered.

4.2.1 A three-node network with fixed DAG and conditional probabilities

In this section a Monte Carlo simulation is performed to assess the PC-cs algorithm accuracy. A finite population of size \(N=10000\) has been generated according to the network in Fig. 7a. The probability distributions of the nodes \(X_{1}\), \(X_{2}\) and \(X_{3} \vert (X_{1},X_{2})\) are reported in Tables 8, 13 and 14, respectively. An estimate of the finite population causal structure has been obtained using the function pc() in the R package pcalg; the finite population CPDAG in Fig. 7a has been obtained. In order to investigate the effect of the sampling design on the structural learning process, 500 samples of size \(n=3000\) have been selected from the finite population according to (i) a simple random sampling design; (ii) a conditional Poisson sampling design with inclusion probabilities proportional to the Z-values, defined as follows

$$\begin{aligned} Z= \left\{ \begin{array}{ll} N(200,2)+10 & \quad \hbox {if} \quad X_{2}=0 \\ N(10,2)+5 & \quad \hbox {if} \quad X_{2}=1 \end{array} \right. \end{aligned}$$
(23)
Fig. 7 a True graph and finite population CPDAG, b, c sampling design effects

Table 13 Probability distribution of \(X_{2}\)
Table 14 Probability distribution of \(X_{3} \vert (X_{1},X_{2})\)

The significance level is fixed at 0.05. When the sample is selected according to a simple random sampling design, the PC algorithm is not able to recover the true association structure in \(3 \%\) of the selected samples.

The percentage of wrong graphs rises to \(10.7\%\) when the sample is selected according to a conditional Poisson sampling design. In Fig. 7 the survey design effects on the association structure are shown: the edge between the nodes \(X_{2}\) and \(X_{3}\) is missing in \(34\%\) of the wrong graphs (Fig. 7b), and an additional edge is placed between the nodes \(X_{1}\) and \(X_{2}\) in the remaining \(66\%\) (Fig. 7c). For each sample, a pseudo-population has been constructed and \(M=1000\) bootstrap samples have been drawn. The percentage of wrong graphs decreases to \(3.2\%\) when the PC algorithm for complex survey data is applied.

4.2.2 A ten-node network with fixed DAG and conditional probabilities

With regard to the ten-node network of Fig. 5a (Sect. 4.1.4), in order to assess the PC-cs algorithm accuracy, 350 samples of size \(n=3000\) have been selected from the finite population according to a conditional Poisson sampling design with inclusion probabilities proportional to the Z-values defined in (22). The mean, the standard deviation and the quartiles of the SHDs for the PC algorithm and the PC-cs algorithm over the 350 samples have been computed, denoted by \(SHD_{iid}\) and \(SHD_{cs}\), \(sd_{iid}\) and \(sd_{cs}\), and \(Q^1_{iid},Q^2_{iid},Q^3_{iid}\) and \(Q^1_{cs},Q^2_{cs},Q^3_{cs}\), respectively. They are shown in Table 15, where it can be noticed that \(SHD_{iid}=9.3\) while \(SHD_{cs}=3.4\), confirming the good performance of the PC-cs algorithm in taking into account the sample selection effects.

Table 15 Mean, standard deviation and quartiles of the SHDs for the PC algorithm and the PC-cs algorithm over the 350 samples

4.2.3 A four-node network with randomly chosen DAG and conditional probabilities

In this section the accuracy of the PC-cs algorithm is evaluated by a simulation where the DAG and the conditional probabilities are not fixed but randomly chosen. To this aim, the following procedure has been applied:

1. Step 1. Generate a random DAG with 4 nodes and sparsity parameter 0.6 by the function randomDAG in the R package pcalg (a sketch of Steps 1-3 is given after this list).

2. Step 2. Generate a multivariate dataset from the DAG defined in Step 1 by the function rmvDAG, with nodes corresponding to normal random variables.

3. Step 3. Discretize the variables \((X_1,X_2,X_3,X_4)\) according to the DAG dependence structure. First, the range of each variable without parents has been divided into 3 intervals according to the empirical quantiles of order 0.5 and 0.75. Next, if \(pa(X_j) \ne \emptyset\), the range of \(X_j \vert pa(X_j)\) has been divided into 2 intervals according to the 80th percentile of the data.

4. Step 4. Generate a finite population of size \(N=10000\) from the discretized variables in Step 3 and learn its structure by the function pc(). This represents the population CPDAG.

5. Step 5. Draw a sample of size n from the finite population according to a conditional Poisson sampling design. Inclusion probabilities are taken proportional to Z-values defined as

$$\begin{aligned} Z= \left\{ \begin{array}{ll} N(10,2)+5 & \quad \hbox {if} \quad X_{1}=0 \\ N(20,4)+10 & \quad \hbox {if} \quad X_{1}=1 \\ N(30,6)+15 & \quad \hbox {if} \quad X_{1}=2 \end{array} \right. \end{aligned}$$
(24)

6. Step 6. Perform structural learning using the original PC algorithm and the PC-cs algorithm with a 0.05 significance level.
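A sketch of Steps 1-3, using the pcalg functions named above (seed and cut points are illustrative):

```r
## Random DAG, Gaussian data and discretization (sketch of Steps 1-3).
library(pcalg)

set.seed(2)
g <- randomDAG(4, prob = 0.6)                 # Step 1: random DAG, sparsity 0.6
dat <- rmvDAG(10000, g, errDist = "normal")   # Step 2: N = 10000 Gaussian rows

## Step 3 (illustrative): a variable without parents is cut into 3 classes
## at the empirical quantiles of order 0.5 and 0.75
q <- quantile(dat[, 1], c(0.5, 0.75))
x1_disc <- cut(dat[, 1], breaks = c(-Inf, q, Inf), labels = c(0, 1, 2))
```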

Steps 1-6 have been repeated 350 times, for \(n=3000\) and \(n=6000\). The algorithms' performance has been compared in terms of: (i) the percentage of wrong graphs, denoted by \(W_{iid}\) and \(W_{cs}\) for the PC algorithm and the PC-cs algorithm, respectively; (ii) the mean of the SHDs, denoted by \(SHD_{iid}\) and \(SHD_{cs}\); (iii) the standard deviation of the SHDs, denoted by \(sd_{iid}\) and \(sd_{cs}\); (iv) the first, second and third quartiles of the SHDs, denoted by \(Q^1_{iid},Q^2_{iid},Q^3_{iid}\) and \(Q^1_{cs},Q^2_{cs},Q^3_{cs}\). Results are reported in Table 16.

Table 16 Percentage of wrong graphs and mean, standard deviation and quartiles of the SHDs for the PC algorithm and the PC-cs algorithm over the 350 samples, for \(n=3000, 6000\)

As the results in Table 16 show, the percentage of wrong graphs \(W_{cs}\), the structural Hamming distance \(SHD_{cs}\), the standard deviation and the interquartile difference all decrease when the PC-cs algorithm is applied and the complexity of the sampling design is taken into account in the learning process. Finally, the accuracy of the results improves as the sample size increases.

5 Conclusions

In this paper a modified version of the PC algorithm is proposed for inferring causal structure from complex survey data. The complexity of the sampling design is accounted for via a design-based approach. In the PC algorithm for complex survey data the chi-square statistic does not necessarily approach a chi-square distribution, and its limiting distribution under the null hypothesis is estimated by a resampling method allowing inference on the superpopulation parameters in a finite population setting. Corrections based on design effects have been proposed by Rao & Scott in Rao and Scott (1981) and Rao and Scott (1984). However, while the PC algorithm for complex survey data adjusts for the sample selection bias by including the sampling weights in the estimates of the BN parameters, the Rao & Scott corrections use the classical chi-square test statistic adjusted on the basis of the design effects. Furthermore, the second order Rao & Scott correction requires the availability of the full covariance matrix estimate of the cell proportion estimators. In secondary analyses this estimate is not necessarily provided, but cell design-effect estimates, possibly with marginal design-effect estimates, might be reported; approximate first-order corrections can then be obtained by using the design effect estimates. These results require that published two-way tables report at least the cell design effects and their marginals along with the cell estimates; otherwise variance estimates must be computed from microdata files using resampling methods for finite populations. Hence, both approaches require resorting to resampling methods: in the PC algorithm for complex survey data the distribution under the independence null hypothesis is estimated; in the Rao & Scott approach the variances need to be estimated. The extension of the proposed approach to complex multi-stage designs is under investigation. Finally, future research will be devoted to the development of a structural learning procedure for complex survey sampling in the score-plus-search framework.