1 Introduction

Nowadays there is an increasing availability of good quality data produced by official statistics, useful both for study purposes and for decision making. Simple but important examples are the EU-SILC survey on income, the ISTAT labour force survey and the ISTAT survey on consumption expenses. Typically, in those surveys data are collected with unequal first order inclusion probabilities. In all the above mentioned cases, as well as in many others, the construction of multivariate statistical models involving several variables is of primary interest. In this context Bayesian networks (BNs) play a very important role because of their flexibility and easy-to-read representation of relationships among variables by means of edges connecting them. Bayesian networks are multivariate statistical models satisfying sets of conditional independence statements encoded in a directed acyclic graph (DAG), see Cowell et al. (2007). The nodes of the graph correspond to random variables, while edges represent dependencies. In recent years BNs have been successfully applied to a large variety of contexts, among them official statistics. Data collected through a survey are typically affected by selection bias due to the sampling design, nonresponse and measurement error. BNs have proved very useful in missing data imputation (Di Zio et al. (2004), Thibaudeau and Winkler (2002)), contingency table estimation for complex survey sampling (Ballin and Scanu (2010)) and the treatment of measurement errors (Marella and Vicard (2013), Marella and Vicard (2015)). However, there are still serious obstacles complicating a wider application in official statistics and in all those contexts where complex sampling designs are used. In fact, missing item imputation and measurement error correction can be performed once the BN structure is available (either known in advance or learned from data), since information has to be propagated throughout the network. Therefore, it is necessary to develop structural learning algorithms accounting for the sampling design complexity.

Learning BNs from a sample can be a time consuming and challenging task even when data are independent and identically distributed (i.i.d.); for a survey on structural learning, see Drton and Maathuis (2017). In data driven learning, three broad classes of algorithms can be distinguished: constraint based, score-plus-search and hybrid algorithms. Constraint based algorithms carry out a series of independence tests and construct a graph satisfying the discovered independence statements. The second group of algorithms treats structure learning as a structural optimization problem, using a search strategy to select the structure optimizing a given score function which measures how well the network fits the data. Recently, the third group of algorithms, hybrid algorithms, has become widely used. They combine the ideas and the advantages of the previous two types of algorithms. One popular strategy is to use a constraint-based algorithm (in the first learning phase) to determine the initial network structure, and then a score-based algorithm (in the second learning phase) to find the highest-scoring network structure, see Tsamardinos et al. (2006). The main constraint based algorithm is the PC algorithm, see Spirtes et al. (2000). It has several advantages, among which its intuitive basis. The PC algorithm relies on conditional independence tests, usually performed by using the standard Pearson chi-square test statistic under the i.i.d. assumption, which is equivalent to a simple random sampling assumption. However, sample selection in surveys may involve more complex sampling designs based on stratification, different levels of clustering and inclusion probabilities proportional to an appropriate size measure. Complex designs can severely affect i.i.d. based methods, as shown in Skinner et al. (1989). In such circumstances, the standard test procedure is not valid even asymptotically. As a consequence, in these cases the PC algorithm fails to correctly identify the independence equivalence class of DAGs containing the true structure of the BN under examination.

Survey weights and design effects are appropriate tools by which complex sampling designs can be accommodated. As far as the chi-square statistic is concerned, corrections based on design effects have been proposed in Rao and Scott (1981) and Rao and Scott (1984). However, such corrections require design effect estimates or, alternatively, an estimate of the full covariance matrix of the cell proportion estimators, which are not generally published and then need to be estimated by resampling.

In this paper, a novel approach for inferring causal structure from complex survey data is investigated, requiring the combination of two different statistical cultures: sampling theory and structural learning. A modified version of the PC algorithm (PC-cs algorithm, for short) is proposed that uses independence tests accounting for the sample selection mechanism. The sampling design complexity is accounted for via a design-based approach, by including the sampling weights in the estimates of the BN parameters. In fact, such weights contain invaluable information about the relationship between the distribution of the sample data and the distribution in the population from which the sample is taken. After such parameters have been estimated, a procedure based on the chi-square statistic for testing the association in a two-way table is proposed; its limiting sampling distribution is estimated by resorting to resampling techniques for finite populations. The new test procedure is then applied to BN structural learning.

The paper is organized as follows. In Sect. 2 the PC algorithm for i.i.d. data and the basic assumptions on which it relies are briefly recalled. In Sect. 3 the modified version of the PC algorithm for complex survey data is introduced and described. A simulation study is performed in Sect. 4. Finally, advantages and disadvantages of the proposed approach are discussed in Sect. 5.

2 Discovering causal structure with the PC algorithm

2.1 Preliminary definitions

A DAG is a pair \(G=(V,E)\) consisting of a set of vertices V and a set E of directed edges between pairs of vertices.

A directed graph is acyclic in the sense that it is not possible to start from a node and return to the same node following the directions of the arrows. Each node represents a random variable, while a missing arrow between two nodes implies (un)conditional independence between the corresponding variables. Examples of DAGs are shown in Fig. 1a, b. Consider Fig. 1a. In the arrow \(X_2 \rightarrow X_3\), \(X_2\) is said to be a parent of \(X_3\) and \(X_3\) a child of \(X_2\). \(X_1\) is called an ancestor of \(X_3\) and \(X_3\) a descendant of \(X_1\), since \(X_1\) is connected to \(X_3\) by a directed path, i.e. a sequence of direction preserving arrows. Two nodes \(X_i\) and \(X_j\) are said to be adjacent if they are directly connected by a directed or undirected edge. A graph is complete if all pairs of vertices are joined by a directed or undirected edge. A v-structure is a triple of nodes \((X_i,X_k,X_j)\) such that the arrows \(X_i \rightarrow X_k\) and \(X_j \rightarrow X_k\) are in the DAG. In such a configuration, \(X_k\) is called an unshielded collider. If \(X_{i}\) and \(X_{j}\) are connected, the collider \(X_{k}\) is termed shielded. For example, in Fig. 1a the triple \((X_2,X_3,X_4)\) constitutes a v-structure where \(X_3\) is an unshielded collider. The skeleton of a DAG G is the undirected graph obtained from G by replacing all arrows with lines (undirected edges).

For example, the undirected graph in Fig. 1c is the skeleton of the graphs in Fig. 1a, b.

Fig. 1 a Example of DAG, b DAG Markov equivalent to (a); c skeleton of DAGs in (a) and (b); d completed partially directed acyclic graph (CPDAG) for DAGs (a), (b)

We next recall the basic assumptions and then the main steps of the PC algorithm.

2.2 Assumptions

Let P be the joint probability distribution associated with G. The following assumptions are made when applying the PC algorithm (for details see Zhang and Spirtes (2008) and Uhler et al. (2013)):

1. Sufficiency Condition. The set of observed variables V is causally sufficient, that is, every common direct cause relative to V of any pair of variables in V is also contained in V.

2. Causal Markov Condition. P is said to be Markov with respect to G if every node of G is probabilistically independent of its non-descendants given its parents in G. For example, in Fig. 1a, \(X_3\) is independent of \(X_1\) given its parent \(X_2\). The causal Markov condition defines the set of conditional independence relations entailed by the DAG.

3. Causal Faithfulness Condition. The joint probability distribution P is faithful to G if all conditional independencies can be read off G. This means that if the true causal structure G does not entail a conditional independence relation according to the causal Markov condition, then the conditional independence relation does not hold for the true probability distribution. In Zhang and Spirtes (2008) a decomposition of the faithfulness assumption useful for violation detection is proposed. The faithfulness condition implies adjacency-faithfulness and orientation-faithfulness. These do not constitute an exhaustive decomposition of the faithfulness assumption; however, the leftover part is irrelevant to the correctness of structural learning procedures such as the PC algorithm.

2.3 PC algorithm for i.i.d. data

The PC algorithm starts with a complete undirected graph on the set V, and ends with a class of Markov equivalent DAGs where all the associated models encode the same conditional independence information.

Two DAGs are Markov equivalent if and only if they have the same skeleton and the same v-structures, see Verma and Pearl (1990). For example, DAGs a and b in Fig. 1 are Markov equivalent. In fact, they have the same skeleton (Fig. 1c) and the same v-structure \(X_2 \rightarrow X_3 \leftarrow X_4\). A common tool for visualizing equivalence classes of DAGs is the completed partially directed acyclic graph (CPDAG), see Spirtes et al. (2000). A CPDAG is a summary graph that has: a directed edge where all DAGs in the equivalence class have the same directed edge; an undirected edge between \(X_i\) and \(X_j\) if in the equivalence class there exists at least one DAG with \(X_i \rightarrow X_j\) and one DAG with \(X_i \leftarrow X_j\). An example of CPDAG is shown in Fig. 1d.

The PC algorithm proceeds according to the following two phases:

1. Phase 1: Skeleton estimation. First, all pairs \((X_i, X_j)\) are tested for marginal independence, removing the edge between independent variables and saving the empty set as separating sets \(S_{ij}\) and \(S_{ji}\). Then all pairs \((X_i, X_j)\) still adjacent are tested for independence conditionally on one single node adjacent to \(X_i\). If \(X_i\) and \(X_j\) are judged to be independent given, say, \(X_k\), the edge between \(X_i\) and \(X_j\) is removed and \(X_k\) is saved as separating set \(S_{ij}\) and \(S_{ji}\). The algorithm proceeds by augmenting the conditioning set size one unit at a time, until all adjacency sets are smaller than the conditioning set size. The resulting graph is the skeleton.

2. Phase 2: Arrows orientation. The v-structures and their colliders are identified. A triple of vertices \((X_i,X_k,X_j)\) in the skeleton such that the pairs \((X_i,X_k)\) and \((X_j,X_k)\) are adjacent but \((X_i,X_j)\) is not, is oriented as a v-structure \(X_i \rightarrow X_k \leftarrow X_j\) if \(X_k\) is not in \(S_{ij}= S_{ji}\). Once all v-structures have been identified, it may be possible to orient some of the remaining edges without introducing additional v-structures or directed cycles, thus leading to a CPDAG.

As far as the faithfulness assumption is concerned, the adjacency-faithfulness condition is necessary to recover the skeleton of the true DAG and is strictly related to Phase 1 of the PC algorithm. The orientation-faithfulness condition, necessary for finding the correct orientation of the arrows, is related to Phase 2 of the algorithm.
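For concreteness, both phases can be run on i.i.d. discrete data with the function pc() of the R package pcalg, which is also used in Sect. 4. The following is a minimal sketch on the discrete example dataset gmD shipped with the package; the number of levels per variable is derived from the data, and the significance level is illustrative.

```r
## Two-phase PC algorithm on i.i.d. discrete data (sketch).
library(pcalg)

data(gmD)                              # gmD$x: data frame of 5 discrete variables
V <- colnames(gmD$x)
suffStat <- list(dm = gmD$x,           # data matrix for the chi-square CI tests
                 nlev = sapply(gmD$x, function(v) length(unique(v))),
                 adaptDF = FALSE)

## pc() carries out Phase 1 (skeleton) and Phase 2 (orientation);
## disCItest is the chi-square conditional independence test for i.i.d. data.
cpdag <- pc(suffStat, indepTest = disCItest, alpha = 0.05,
            labels = V, method = "stable")
plot(cpdag)                            # estimated CPDAG (requires Rgraphviz)
```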

3 PC algorithm for complex survey data

3.1 The problem

Under the assumptions in Sect. 2.2 and if the sample size is large enough, the original PC algorithm is able to infer from data the Markov equivalence class the true causal DAG belongs to. This means that, if the input of the PC algorithm is a sample from a population distribution P faithful to some DAG, then in the large sample limit the algorithm can identify any probabilistic independence claim with perfect reliability. In practice, we do not have direct access to the true population distribution, and need to carry out statistical inference based on a finite sample size. Hence, since the first phase of the PC algorithm consists of a series of conditional independence tests based on a finite sample size, it is possible that the original graph is not recovered even if the PC algorithm assumptions are verified at the population level. Therefore, it becomes very relevant to causal inference whether the population probability distribution, though faithful to the true causal structure, is far from or close to being unfaithful. As stressed in Zhang and Spirtes (2008), a population distribution is close-to-unfaithful to a causal structure if the structure does not entail some conditional independence relation according to the Causal Markov Condition, but the conditional independence almost holds, or in other words, the conditional dependence is by some measure very weak in the population. Precisely, the degree of dependence and the sample size determine how “weak” counts as “close to independence”. It is clear that at every finite sample size there are distributions faithful to the true causal structure but so close to being unfaithful as to possibly cause trouble for inference at that sample size. For instance, two variables, though entailed to be dependent conditional on some variables, can be close to being conditionally independent. As a consequence, due to the sample size, tests can fail to correctly identify such a dependence, leading to errors in judgment about the properties of the population.

The situation worsens for complex survey data, since the sampling design can modify the independence and conditional independence relations associated with the population probability distribution P. Roughly speaking, complex sampling schemes lead to unequal selection probabilities; ignoring this can result in biased estimates of the population distribution P. As a consequence, even if the population distribution is faithful to some DAG G, the sample distribution may not be faithful to the same DAG because of the sample selection mechanism. Specifically, the selection probabilities are often unequal in at least some stages of the sample selection. When such probabilities are correlated with the survey variables of interest, the observed outcomes are no longer representative of the population outcomes, and the model holding for the sample data is then different from the model holding in the population, see Pfeffermann (2001). In this case, conventional analysis, which ignores the sampling effects, may yield large bias and erroneous inference, as illustrated, for example, in the book edited by Skinner et al. (1989). In Marella and Pfeffermann (2019) the effect of alternative sampling designs on a trivariate normal population distribution has been evaluated. For instance, suppose that (X, Y, Z) is trivariate normal with mean vector \(\varvec{\mu }\) and covariance matrix \(\varvec{\Sigma }\), and that the sample inclusion probabilities have expectations

$$\begin{aligned} E_p(\pi _i \vert x_i,y_i,z_i)=\kappa \exp \{\gamma _X x_i+\gamma _Y y_i+\gamma _Z z_i\}, \end{aligned}$$
(1)

where \(\kappa\) guarantees that the expectation is less than or equal to one. It can be shown that the model holding for the sample outcomes is normal but with parameters \(\varvec{\mu }_{S}\) and \(\varvec{\Sigma }\), see Marella and Pfeffermann (2019). In this case, the sample model and the population model belong to the same family and differ only in some parameters. Other examples in Marella and Pfeffermann (2019) show that the population and the sample distributions can either belong to the same family, differing only in some or all of the parameters, or to different families.
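A small simulation sketch makes the distortion tangible. In the following R fragment, all quantities (the structure generating (X, Y, Z), the coefficients \(\gamma\) and the normalizing constant \(\kappa\)) are illustrative choices, not those of Marella and Pfeffermann (2019); the naive mean computed on the informatively selected units is visibly biased.

```r
## Effect of informative selection of the form (1) on a normal population (sketch).
set.seed(1)
N <- 100000
x <- rnorm(N)
y <- 0.5 * x + rnorm(N)
z <- 0.5 * y + rnorm(N)

gam <- c(0.4, 0.4, 0.4)                    # illustrative gamma coefficients
pi_i <- exp(gam[1] * x + gam[2] * y + gam[3] * z)
pi_i <- pi_i / max(pi_i)                   # kappa chosen so all probabilities <= 1

s <- rbinom(N, 1, pi_i) == 1               # Poisson sampling, informative probabilities
c(pop_mean = mean(x), naive_sample_mean = mean(x[s]))  # sample mean biased upwards
```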

In the sequel, we assume that the design variables used for sample selection are known for all the sample units, so that the sufficiency condition is satisfied, as in Ballin and Scanu (2010). Nevertheless, the Markov and the faithfulness conditions may fail if the sample is selected by a procedure that is biased towards two or more variables in the set V.

In the PC algorithm for complex survey data, the skeleton learning step (Phase 1) of the PC algorithm is modified by introducing a procedure for testing association in a two-way table for data coming from complex sample surveys. Such a procedure is introduced in Sect. 3.2, where the existence of a limiting distribution of the test statistic under the independence null hypothesis is proved. In Sect. 3.3 such a distribution is estimated by resampling methods for finite populations.

3.2 Independence test for complex sample surveys

In this section a test for independence based on the chi-square statistic in a two-way table for complex survey data is described. The test relies on some results in Conti et al. (2018), where a resampling technique allowing inference on the superpopulation parameters in a finite population setting is proposed.

Denote by A and B two characters of interest with H categories \((A^{1},\dots ,A^{H})\) and K categories \((B^{1},\dots ,B^{K})\), respectively. Furthermore, let the superpopulation parameters \(p^{hk}\), \(p^{h.}\), \(p^{.k}\) be defined as

$$\begin{aligned} p^{hk}=Prob(A=A^{h},B=B^{k}), \quad p^{h.}=Prob(A=A^{h}), \quad p^{.k}=Prob(B=B^{k}) \end{aligned}$$
(2)

for \(h=1,\dots ,H\) and \(k=1,\dots ,K\). Let \({{\mathcal {U}}}_N\) be a finite population of size N generated from the superpopulation model (2), labeled by the integers \(1,\dots ,N\). For each unit i, let \(Y_{i}^{h.}\) (\(Y_{i}^{.k}\)) be the indicator variable taking value 1 if unit i falls in category \(A^{h}\) (\(B^{k}\)) and 0 otherwise, for \(h=1,\dots ,H\) (\(k=1,\dots ,K\)). Let \(Y_{i}^{hk}=Y_{i}^{h.}Y_{i}^{.k}\), so that for each unit i the following equalities hold

$$\begin{aligned} \sum _{h=1}^{H}Y_{i}^{h.}=\sum _{k=1}^{K}Y_{i}^{.k}=1, \quad \sum _{h=1}^{H}Y_{i}^{hk}=Y_{i}^{.k}, \quad \sum _{k=1}^{K}Y_{i}^{hk}=Y_{i}^{h.}. \end{aligned}$$
(3)

For each unit i in the population \({{\mathcal {U}}}_N\), let \(D_i\) be the sample membership indicator, i.e. a Bernoulli random variable taking value 1 whenever i is in the sample and 0 otherwise, and let \({\varvec{D}}_{N}= (D_1 ,\dots ,D_N)\) be the sample membership random vector for the population units. An unordered, without replacement sampling design S is the probability distribution of \({\varvec{D}}_{N}\). In particular, \(\pi _i = E_S [ D_i ]\) is the first order inclusion probability of unit i. The subscript S denotes the sampling design used to select the population units. The sample selection, and therefore the \(\pi _i\)s, depends on the values of design variables (such as strata or cluster indicator variables, or size measures) that are statistically related to \(Y_i^{h.}\) and \(Y_i^{.k}\) but are not included in the inference model. Consequently, the distribution holding for the sample data may be very different from the distribution in the population. In other words, even if the \(Y_i^{h.}\), \(Y_i^{.k}\), \(i=1, \ldots , N\), are i.i.d. at the population level, they are not i.i.d. at the sample level due to the sampling design (Pfeffermann (1993)).

The effective sample size is the r.v. \(n_s = D_1 + \cdots + D_N\). In the sequel we confine ourselves to fixed size sampling designs, such that \(n_s \equiv n\). The assumptions on the sampling design according to which the sample is drawn are similar to those used in Conti (2014), Conti and Marella (2015a) and Conti et al. (2019) (assumptions A1-A6). From now on we assume that the sampling design possesses asymptotically maximal entropy. More precisely, consider the Poisson sampling design where the random variables \(D_i\)s are independent with \(E_{S}[D_i]=\pi _i\) and \(n=\pi _1 +\cdots + \pi _N\). The rejective sampling design (or normalized conditional Poisson sampling) is obtained by conditioning the Poisson design on \(n_S=n\). Denote by \(H (Pr_S) = -E_S \left[ \log Pr_S ( {\varvec{D}}_{N}) \right]\) the entropy of the sampling design S. The rejective design possesses maximal entropy among all sampling designs with fixed sample size and fixed first order inclusion probabilities. Denoting by \(Pr_R(\cdot )\) the sample probabilities for rejective sampling, the Kullback-Leibler divergence of \(Pr_S\) from \(Pr_R\) can be written as

$$\begin{aligned} \Delta _{KL}(Pr_S||Pr_R)=H(Pr_R)-H(Pr_S). \end{aligned}$$
(4)

In the sequel we assume that \(\Delta _{KL}(Pr_S||Pr_R) \rightarrow 0\) as \(n, N \rightarrow \infty\). This means that the sampling design S possesses high (asymptotically maximal) entropy.
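A short verification of (4): the rejective design has exponential form \(Pr_R({\varvec{d}}) = C \exp \{\theta _1 d_1 + \cdots + \theta _N d_N\}\) on the support of fixed size n samples, for a normalizing constant C. Since S and R share the first order inclusion probabilities \(\pi _i\),

$$\begin{aligned} E_S[\log Pr_R({\varvec{D}}_{N})] = \log C + \sum _{i=1}^{N}\theta _i \pi _i = E_R[\log Pr_R({\varvec{D}}_{N})] = -H(Pr_R), \end{aligned}$$

and hence \(\Delta _{KL}(Pr_S||Pr_R)=E_S[\log Pr_S({\varvec{D}}_{N})]-E_S[\log Pr_R({\varvec{D}}_{N})]=H(Pr_R)-H(Pr_S)\), which is (4).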

The properties of high entropy sampling designs are discussed in Berger (2011), Grafström (2010) and references therein. As a matter of fact, asymptotic results for high entropy sampling designs depend only on their first order inclusion probabilities: if two high entropy designs have the same first order inclusion probabilities, then they have the same asymptotic behaviour. High entropy sampling designs generate highly randomized samples, which in turn makes the design more robust. A discussion of high entropy designs and their relationship with robustness can be found in Grafström (2010). Examples of maximal asymptotic entropy sampling designs are simple random sampling, the Rao-Sampford design, the Chao design, the stratified design, etc.; see Berger (2011).

3.2.1 Parameters estimators and their distributions

Let \(p_{N}^{hk}\), \(p_{N}^{h.}\), \(p_{N}^{.k}\) be the finite population parameters defined as

$$\begin{aligned} p^{hk}_{N}=\frac{1}{N}\sum _{i=1}^{N}Y_{i}^{hk}, \quad p^{h.}_{N}=\frac{1}{N}\sum _{i=1}^{N}Y_{i}^{h.}=\sum _{k=1}^{K}p_{N}^{hk}, \quad p^{.k}_{N}=\frac{1}{N}\sum _{i=1}^{N}Y_{i}^{.k}=\sum _{h=1}^{H}p_{N}^{hk} \end{aligned}$$
(5)

where \(h=1,\dots , H\), \(k=1, \dots , K\). Under a complex sampling design, sample proportions would be inconsistent estimators of the population parameters, see Pfeffermann (1993). The parameters (5) can be estimated using the classical Hájek estimators (Hájek (1964))

$$\begin{aligned} \widehat{p}^{hk}=\frac{\sum _{i=1}^{N}\frac{D_{i} Y_{i}^{hk}}{\pi _i}}{\sum _{i=1}^{N}\frac{D_{i}}{\pi _i}},\quad \widehat{p}^{h.}=\frac{\sum _{i=1}^{N}\frac{D_{i} Y_{i}^{h.}}{\pi _i}}{\sum _{i=1}^{N}\frac{D_{i}}{\pi _i}}=\sum _{k=1}^{K}\widehat{p}^{hk},\quad \widehat{p}^{.k}=\frac{\sum _{i=1}^{N}\frac{D_{i} Y_{i}^{.k}}{\pi _i}}{\sum _{i=1}^{N}\frac{D_{i}}{\pi _i}}=\sum _{h=1}^{H}\widehat{p}^{hk} \end{aligned}$$
(6)

for \(h=1, \dots , H\), \(k=1, \dots , K\), where \(1/\pi _{i}\) is the sampling weight, that is, the reciprocal of the probability that unit i is included in the sample. Note that, in the simple random sampling case, the estimators (6) reduce to the proportions of sample units belonging to categories (h, k), h and k, respectively.
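As an implementation sketch, the estimators (6) amount to a weighted cross-tabulation. The following R fragment assumes two sample factors A and B and the vector pi of first order inclusion probabilities (all names illustrative):

```r
## Hajek estimators (6) as weighted cross-tabulation (sketch).
hajek_props <- function(A, B, pi) {
  w <- 1 / pi                          # sampling weights 1/pi_i
  p_hk <- xtabs(w ~ A + B) / sum(w)    # joint Hajek proportions p-hat^{hk}
  list(joint = p_hk,
       row = rowSums(p_hk),            # marginals p-hat^{h.}
       col = colSums(p_hk))            # marginals p-hat^{.k}
}
## With equal inclusion probabilities the weights cancel, and (6) reduces
## to the ordinary sample proportions.
```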

The existence of the limiting distribution of the Hájek estimators (6) is studied in Conti et al. (2018), under both the sample selection and the population generation randomness sources. Similar results appear in Boistard et al. (2017), where, however, no resampling scheme is proposed for recovering the limiting distribution of the chi-square test statistic (14) defined in Sect. 3.2.2. Here, similarly to Conti et al. (2018), we analyze the behaviour of the stochastic processes

$$\begin{aligned} W^{HK}=\{\sqrt{n} ( \widehat{p}^{hk} - p^{hk} ),\ h=1,\dots ,H,\ k=1,\dots ,K \}, \end{aligned}$$
(7)
$$\begin{aligned} W^{H}=\{\sqrt{n} ( \widehat{p}^{h.} - p^{h.} ),\ h=1,\dots ,H\}, \end{aligned}$$
(8)
$$\begin{aligned} W^{K}=\{\sqrt{n} ( \widehat{p}^{.k} - p^{.k} ),\ k=1,\dots ,K\}. \end{aligned}$$
(9)

Each of these processes can be partitioned as follows

$$\begin{aligned} W^{HK}&= \{W^{HK}_{n}+W^{HK}_{N},\ h=1,\dots ,H,\ k=1,\dots ,K \}\nonumber \\&= \{\sqrt{n} ( \widehat{p}^{hk} - p^{hk}_{N} )+ \sqrt{f}\sqrt{N}(p^{hk}_{N}-p^{hk}) \}, \end{aligned}$$
(10)
$$\begin{aligned} W^{H}&= \{W^{H}_{n}+W^{H}_{N},\ h=1,\dots ,H\}\nonumber \\&= \{\sqrt{n} ( \widehat{p}^{h.} - p^{h.}_{N} )+\sqrt{f}\sqrt{N}(p^{h.}_{N}-p^{h.})\}, \end{aligned}$$
(11)
$$\begin{aligned} W^{K}&= \{W^{K}_{n}+W^{K}_{N},\ k=1,\dots ,K\}\nonumber \\&= \{\sqrt{n} ( \widehat{p}^{.k} - p^{.k}_{N})+\sqrt{f}\sqrt{N}(p^{.k}_{N}-p^{.k})\} \end{aligned}$$
(12)

for \(h=1, \dots , H\), \(k=1, \dots , K\) and \(f=\frac{n}{N}\). The components \(W^{HK}_{n}\), \(W^{H}_{n}\), \(W^{K}_{n}\) depend on the sample selection randomness, while the components \(W^{HK}_{N}\), \(W^{H}_{N}\), \(W^{K}_{N}\) depend on the superpopulation randomness. Proposition 1 establishes the convergence of the processes (7)-(9) to Gaussian distributions.

Proposition 1

Under the assumptions A1-A6 of Proposition 1 in Conti et al. (2018), as n and N increase, the sequences:

1. \(W^{HK}_{n}\) and \(W^{HK}_{N}\) converge in distribution to degenerate multivariate normal distributions with zero mean vector and singular covariance matrices \(\varvec{\Sigma }^{HK}_1\) and \(\varvec{\Sigma }^{HK}_2\) of order HK, respectively. As a consequence, the whole process \(W^{HK}\) converges to a degenerate multivariate normal distribution with zero mean vector and singular covariance matrix \(\varvec{\Sigma }^{HK}=\varvec{\Sigma }^{HK}_1+f\varvec{\Sigma }^{HK}_2\).

2. \(W^{H}_{n}\) and \(W^{H}_{N}\) converge in distribution to degenerate multivariate normal distributions with zero mean vector and singular covariance matrices \(\varvec{\Sigma }^{H}_1\) and \(\varvec{\Sigma }^{H}_2\) of order H, respectively. As a consequence, the whole process \(W^{H}\) converges to a degenerate multivariate normal distribution with zero mean vector and singular covariance matrix \(\varvec{\Sigma }^{H}=\varvec{\Sigma }^{H}_1+f\varvec{\Sigma }^{H}_2\).

3. \(W^{K}_{n}\) and \(W^{K}_{N}\) converge in distribution to degenerate multivariate normal distributions with zero mean vector and singular covariance matrices \(\varvec{\Sigma }^{K}_1\) and \(\varvec{\Sigma }^{K}_2\) of order K, respectively. As a consequence, the whole process \(W^{K}\) converges to a degenerate multivariate normal distribution with zero mean vector and singular covariance matrix \(\varvec{\Sigma }^{K}=\varvec{\Sigma }^{K}_1+f\varvec{\Sigma }^{K}_2\).

The proof rests on the same ideas as the proof of Proposition 1 in Conti et al. (2018) and can be seen as a direct consequence of it. The proof guidelines are reported in the Appendix.

3.2.2 The independence test statistic

Suppose we wish to test the null hypothesis that the two categorical variables A and B are independent against the alternative hypothesis that they are associated. Formally,

$$\begin{aligned} {{\mathcal {H}}_{0}}:p^{hk}=p^{h.}p^{.k} \,\,\,\,\,\, \hbox {against} \,\,\,\,\,\, {{\mathcal {H}}_{1}}:p^{hk} \ne p^{h.}p^{.k}. \end{aligned}$$
(13)

The test statistic used is

$$\begin{aligned} \chi ^2_{H}=n\sum _{h=1}^{H}\sum _{k=1}^{K}\frac{(\widehat{p}^{hk}-\widehat{p}^{h.}\widehat{p}^{.k})^{2}}{\widehat{p}^{h.}\widehat{p}^{.k}} \end{aligned}$$
(14)

where the sampling weights in the Hájek estimators \(\widehat{p}^{hk}\), \(\widehat{p}^{h.}\) and \(\widehat{p}^{.k}\) in (6) compensate for the different selection probabilities. As stressed in Proposition 3, for complex survey data: (i) the statistic (14) does have a limiting distribution; (ii) the limiting distribution does not necessarily approach a chi-square distribution, due to the singularity of the covariance matrices \(\varvec{\Sigma }^{HK}\), \(\varvec{\Sigma }^{H}\) and \(\varvec{\Sigma }^{K}\) in Proposition 1. From Proposition 1, the following result holds.
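A minimal R sketch of (14), self-contained under the same illustrative names as above (factors A and B observed on the sample, inclusion probabilities pi):

```r
## Design-weighted chi-square statistic (14) (sketch).
chi2_H <- function(A, B, pi) {
  n <- length(A)
  w <- 1 / pi
  p_hk <- xtabs(w ~ A + B) / sum(w)   # Hajek estimates (6)
  p_h  <- rowSums(p_hk)
  p_k  <- colSums(p_hk)
  e <- outer(p_h, p_k)                # products p^{h.} p^{.k} expected under H0
  n * sum((p_hk - e)^2 / e)
}
```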

Proposition 2

Let \({\varvec{p}}^{HK}=(p^{11},\dots ,p^{1K},\dots ,p^{H1},\dots ,p^{HK})\), \({\varvec{p}}^{H.}=(p^{1.},\dots ,p^{H.})\), \({\varvec{p}}^{.K}=(p^{.1},\dots ,p^{.K})\) be vectors of length HK, H and K, respectively, and denote by \(\widehat{{\varvec{p}}}^{HK}\), \(\widehat{{\varvec{p}}}^{H.}\) and \(\widehat{{\varvec{p}}}^{.K}\) their estimates. Furthermore, let

$$\begin{aligned} \widehat{{\varvec{p}}}=(\widehat{{\varvec{p}}}^{HK},\widehat{{\varvec{p}}}^{H.},\widehat{{\varvec{p}}}^{.K}) \end{aligned}$$

and

$$\begin{aligned} {\varvec{p}}^{{{\mathcal {H}}_{0}}}=(p^{1.}p^{.1},\dots ,p^{H.}p^{.K},p^{1.},\dots ,p^{H.},p^{.1},\dots ,p^{.K}) \end{aligned}$$

be two vectors of length \(T=HK+H+K\). Under the null hypothesis \({{\mathcal {H}}_{0}}\) the statistic

$$\begin{aligned} {\varvec{Z}}=\sqrt{n}(\widehat{{\varvec{p}}}-{\varvec{p}}^{{{\mathcal {H}}_{0}}}) \end{aligned}$$
(15)

converges, as n and N go to infinity, to a degenerate multivariate normal distribution with T components, having zero mean vector and singular covariance matrix \(\varvec{\Sigma }^{T}\).

The proof of Proposition 2 is a consequence of Proposition 1 and is reported in the Appendix.

Proposition 3

Let f be the continuously differentiable function defined as follows

$$\begin{aligned} f : {\varvec{Z}}\rightarrow \chi ^{2}_{H}=n\sum _{h=1}^{H}\sum _{k=1}^{K}\frac{(\widehat{p}^{hk}-\widehat{p}^{h.}\widehat{p}^{.k})^{2}}{\widehat{p}^{h.}\widehat{p}^{.k}}. \end{aligned}$$
(16)

From Proposition 2, it follows that \(\chi ^{2}_{H}\) tends in distribution to a quadratic form in a degenerate multivariate normal distribution.

The proof of Proposition 3 is reported in the Appendix.

3.3 Estimation of test statistic limiting distribution under \({{\mathcal {H}}_{0}}\) via resampling

Resampling methods are computer-based methods for assigning measures of accuracy to statistical estimates and performing statistical inference. In the Bayesian network learning process, Efron's bootstrap (Efron (1979)) has been proposed as a computationally efficient approach for answering questions about specific network features. In Friedman et al. (1999) the aim is to improve the performance of the learning procedure with information from the bootstrap estimates, even when the amount of data is not enough to induce a high scoring network. The bootstrap method is then used to answer questions such as: (i) Is the existence of an edge between two nodes warranted? (ii) Is the Markov blanket of a given node robust? (iii) Can we say anything about the ordering of the variables?

Here resampling methods are used to recover the limiting sampling distribution of the test statistic (14) under the independence null hypothesis. In this regard, in sampling from finite populations, the original Efron's bootstrap can lead to biased results, since it does not take into account the dependence among units due to the sampling design. Adaptations taking into account the non-i.i.d. nature of the data have been proposed in the literature, which is mainly devoted to estimating variances of estimators; cf. Mashreghi et al. (2016). The main approaches are essentially two: ad hoc approaches and plug-in approaches (cf. Ranalli and Mecatti (2012), Chauvet (2007) and references therein). The basic idea of ad hoc approaches consists in resampling from the original sample through a special design accounting for the dependence among units. This approach is pursued in McCarthy and Snowden (1985) and Rao and Wu (1988), where the re-sampled data produced by the i.i.d. bootstrap are properly rescaled, as well as in Sitter (1992), Beaumont and Patak (2012), Chatterjee (2011) and Conti and Marella (2015b), where a rescaled bootstrap process based on asymptotic results is proposed. Among the ad hoc approaches we also quote the paper by Antal and Tillé (2011), where a mixed resampling design is proposed. Plug-in approaches are based on the idea of expanding the sample to a pseudo-population that plays the role of a prediction of the original population. Bootstrap samples are then drawn from such a pseudo-population according to some appropriate resampling design. The most intuitive choice consists in using the same sampling design used to draw the original sample from the population; cf. Gross (1980), Chao and Lo (1985), Booth et al. (1994), Holmberg (1998), Chauvet (2007), as well as Mashreghi et al. (2016).

Virtually all resampling techniques proposed for finite populations rest on the same justification: in the case of linear statistics, the variance of the resampled statistic should be very close to the usual variance estimator, possibly with approximated forms of the second order inclusion probabilities; cf. Antal and Tillé (2011). This is far from the arguments commonly used to justify the classical bootstrap and its variants, which are based on asymptotic considerations involving the whole sampling distribution of a statistic (cf., for instance, Bickel and Freedman (1981) and Lahiri (2003)): the asymptotic distribution of a bootstrapped statistic should coincide with that of the “original” statistic. This argument is actually used in Conti and Marella (2015b). In Conti et al. (2019) a class of resampling techniques for finite populations is proposed; cf. also Jiménez-Gamero et al. (2018), where the asymptotic results in Conti et al. (2019) are used to construct various testing procedures based on a design-based estimator of the finite population characteristic function.

However, the class of resampling techniques defined in Conti et al. (2019) does not work in the present case, due to the descriptive inference framework. A noticeable exception is the multinomial scheme introduced in Conti et al. (2018), together with the asymptotic results used in Sect. 3.2.1. Such a resampling procedure has been exploited in Conti et al. (2020) to approximate the asymptotic law of the Lorenz curve estimator when data are collected according to a complex sampling design.

It is based on a two-phase procedure consisting of: (i) constructing, on the basis of the sample data, a prediction of the population (i.e. the multinomial pseudo-population) by sampling N units independently from the original sample, where each unit \(i \in s\) is selected with probability \(\pi ^{-1}_{i}/\sum _{j \in s} \pi ^{-1}_{j}\); such a prediction is based on the sampling design and does not essentially involve the superpopulation model; (ii) drawing a sample of the same size as the original one from the pseudo-population according to an appropriate resampling design fulfilling the high entropy requirement. Notice that the idea of using pseudo-populations had been previously used by Mashreghi et al. (2016).

The asymptotic distribution of (14) under the null hypothesis is estimated by applying the following procedure.

1. Step 1. Generate \(M=1000\) bootstrap samples, each of size n equal to the original sample size, on the basis of the two-phase resampling procedure described above. Notice that the pseudo-population in phase (i) has to be generated under the null hypothesis \({{\mathcal {H}}_{0}}\). Specifically, for each unit \(i \in s\) such that \(Y^{h.}_{i}=1\) and \(Y^{.k}_{i}=1\), the original weight \(w_{i}=1/\pi _{i}\) is modified as follows

$$\begin{aligned} w^{*}_{i}=w_{i}\frac{\widehat{p}^{h.}\widehat{p}^{.k}}{\widehat{p}^{hk}}. \end{aligned}$$
(17)

Letting \(A^{hk}\) be the set \(\{i \in s: Y^{h.}_{i}=1,Y^{.k}_{i}=1\}\), the modified weights (17) guarantee that

$$\begin{aligned} \sum _{i \in A^{hk}}w^{*}_{i}=\frac{\sum _{i \in A^{h.}}w_{i}\sum _{i \in A^{.k}}w_{i}}{\sum _{i \in s}w_{i}}. \end{aligned}$$
(18)

2. Step 2. For each bootstrap sample, compute the corresponding Hájek estimators \(\widehat{p}^{hk}\), \(\widehat{p}^{h.}\), \(\widehat{p}^{.k}\) in (6).

3. Step 3. Compute the M quantities \(\chi ^{2,m}_{H}\), \(m=1, \dots ,M\), as in (14).

4. Step 4. Compute the empirical cumulative distribution function of the \(\chi ^{2,m}_{H}\)s

$$\begin{aligned} \widehat{T}_{n,M} (t) = \frac{1}{M} \sum _{m=1}^{M} I_{( \chi ^{2,m}_{H} \le t )} , \;\; t \in {\mathbb {R}}. \end{aligned}$$
(19)

Finally, compute the \(1-\alpha\) percentile of \(\widehat{T}_{n,M} (t)\)

$$\begin{aligned} \widehat{T}^{-1}_{n,M} (1-\alpha ) = \inf \{ t : \; \widehat{T}_{n,M} (t) \ge 1-\alpha \} , \;\; 0<\alpha <1 . \end{aligned}$$
(20)

If \(\chi ^{2}_{H}<\widehat{T}^{-1}_{n,M} (1-\alpha )\), then \({{\mathcal {H}}_{0}}\) is not rejected at significance level \(\alpha\).
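The following R fragment sketches Steps 1-4 under simplifying assumptions: it reuses the illustrative chi2_H() function of Sect. 3.2.2, and in phase (ii) it resamples from the pseudo-population by simple random sampling without replacement, one of the maximal entropy designs listed in Sect. 3.2; in applications the resampling design should mimic the original one.

```r
## Bootstrap estimate of the null distribution of (14) (sketch).
boot_null_dist <- function(A, B, pi, N, M = 1000) {
  n <- length(A)
  w <- 1 / pi
  ## Step 1, phase (i): modify the weights as in (17) so that the
  ## pseudo-population satisfies the independence null hypothesis
  p_hk <- xtabs(w ~ A + B) / sum(w)
  p_h  <- rowSums(p_hk)
  p_k  <- colSums(p_hk)
  a <- as.integer(A); b <- as.integer(B)        # category codes h and k
  w_star <- w * p_h[a] * p_k[b] / p_hk[cbind(a, b)]
  replicate(M, {
    ## multinomial pseudo-population of N unit labels, drawn with
    ## probabilities proportional to the modified weights ...
    pop <- sample(n, N, replace = TRUE, prob = w_star / sum(w_star))
    ## ... phase (ii): bootstrap sample of size n (srswor, so pi = n/N)
    bs <- sample(pop, n, replace = FALSE)
    ## Steps 2-3: recompute the statistic (14) on the bootstrap sample
    chi2_H(A[bs], B[bs], rep(n / N, n))
  })
}

## Step 4: reject H0 at level alpha if the observed statistic exceeds the
## (1 - alpha) percentile of the bootstrap distribution, e.g.
## quantile(boot_null_dist(A, B, pi, N), 0.95)
```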

4 Simulation studies

In this section the performance of the PC-cs algorithm is empirically tested via a simulation study. The study is organized as follows. First, in Sect. 4.1 a preliminary analysis is performed: the performance of the proposed algorithm when the PC algorithm basic assumptions are violated by the sampling design is investigated. The analysis is based on one sample replicate; small and larger networks are considered. In Sect. 4.2 a Monte Carlo simulation evaluating the accuracy of the proposed algorithm is carried out. The aim is to measure how often the PC-cs algorithm is able to eliminate the selection bias, recovering the correct dependence structure between the variables. The accuracy is evaluated over a large number of sample replicates. Small and larger networks with fixed or randomly chosen DAG and conditional probabilities are considered.

4.1 Evaluating the accuracy of the PC algorithm for complex survey data: a preliminary analysis

In Sect. 4.1.1 the PC-cs algorithm performance when the Markov assumption is violated by the sampling design is investigated. This means that the sample distribution violates the set of conditional independence relations entailed by the DAG to which the population probability distribution P is faithful. The PC-cs algorithm behaviour in case of orientation faithfulness and adjacency faithfulness violations is investigated in Sects. 4.1.2 and 4.1.3, respectively. Finally, in Sect. 4.1.4 the robustness of the PC-cs algorithm is evaluated on a ten-node network where the aforementioned PC algorithm assumptions can be simultaneously violated by the sample distribution.

4.1.1 Markov assumption violation

A finite population of size \(N=10,000\) has been generated according to the true causal DAG in Fig. 2a. The conditional probability distributions associated with the nodes are reported in Tables 1, 2, 3, 4, 5.

Fig. 2 a True graph, b finite population CPDAG

Table 1 Probability distribution of \(X_{1}\)
Table 2 Probability distribution of \(X_{2} \vert X_{1}\)
Table 3 Probability distribution of \(X_{3} \vert X_{2}\)
Table 4 Probability distribution of \(X_{4} \vert X_{2}\)
Table 5 Probability distribution of \(X_{5} \vert X_{4}\)

An estimate of the causal structure underlying the finite population has been obtained using the function pc() in the R package pcalg, with the argument method = "stable", implementing the PC algorithm based on conditional independence tests for i.i.d. data, see Kalisch et al. (2012). Notice that in high-dimensional settings the proposals of Lagani et al. (2017) and Tsagris (2019) seem to perform faster and better in terms of returning correct networks. Fig. 2b shows the resulting CPDAG with undirected edges only, representing the Markov equivalence class. We next proceed to generate a sample from our finite population by a complex sampling design. To this aim the variable \(X_{1}\) has been transformed into a continuous variable Z as follows

$$\begin{aligned} Z= \left\{ \begin{array}{ll} N(100,2)+15 & \quad \hbox {if} \quad X_{1}=0 \\ N(10,2)+5 & \quad \hbox {if} \quad X_{1}=1 \hbox { or } X_{1}=2 \end{array} \right. \end{aligned}$$
(21)

A sample of size \(n=3000\) has been drawn from the finite population according to a conditional Poisson sampling design, with inclusion probabilities proportional to the Z-values in (21). The effect of the survey design on the causal structure learned using the PC algorithm is shown in the CPDAG in Fig. 3a, where an additional edge is placed between the nodes \(X_{1}\) and \(X_{3}\): the conditional independence between \(X_{1}\) and \(X_{3}\) given \(X_{2}\) holding at the population level is destroyed by the sampling design.
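As an illustration, the selection step can be reproduced with the R package sampling; here Z is the population vector generated as in (21) and pop_data a hypothetical data frame holding the population values.

```r
## Conditional Poisson sample with inclusion probabilities proportional to Z (sketch).
library(sampling)

n <- 3000
pik <- inclusionprobabilities(Z, n)   # pi_i proportional to Z, summing to n
s <- UPmaxentropy(pik)                # conditional Poisson (maximum entropy) design
smp <- pop_data[s == 1, ]             # selected sample
w <- 1 / pik[s == 1]                  # sampling weights 1/pi_i
```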

In order to estimate the test statistic distribution under the independence null hypothesis, \(M=1000\) bootstrap replications have been drawn from the selected sample using the procedure described in Sect. 3.3, and the corresponding M bootstrap estimates \(\chi ^{2,m}_{H}\), \(m=1, \dots ,M\), have been computed. Although the minimum number of bootstrap samples suggested in Wilcox (2010) is 599, we use \(M=1000\) bootstrap samples as a compromise between the computational cost of the procedure and the accuracy of the estimate.

The significance level for the independence tests has been set equal to 0.05. In this case, the PC-cs algorithm is able to recover the true population equivalence class obtaining the CPDAG shown in Fig. 2b.

Fig. 3 a Markov condition violation, b orientation faithfulness violation

The simulation has been repeated considering an alternative sampling design, obtained from (21) by modifying the parameters of the normal distribution and the intercept value. Specifically, in (21) for \(X_1=1\) and \(X_1=2\) the Z variable is generated according to \(N(5,1)+ 10\); the effect of the sampling design on the association structure is an additional edge between \(X_3\) and \(X_4\) with respect to the CPDAG in Fig. 2b. Again the PC-cs algorithm is able to recover the true population equivalence class in Fig. 2b.

4.1.2 Orientation faithfulness violation

In order to investigate the performance of the PC-cs algorithm when the orientation faithfulness is violated, a finite population of size \(N=10,000\) has been generated according to the true causal DAG in Fig. 2a; the conditional probability distributions are in Tables 1, 4, 5 and Tables 6, 7.

Table 6 Probability distribution of \(X_{2} \vert X_{1}\)
Table 7 Probability distribution of \(X_{3} \vert X_{2}\)

A sample of size \(n=3000\) has been selected from the finite population according to a conditional Poisson sampling design with inclusion probabilities proportional to the Z-values in (21). In this case, the survey design produces a failure of the orientation-faithfulness assumption, as shown in Fig. 3b, where a v-structure on the triple \((X_{2},X_{4},X_{5})\) is placed. As before, \(M=1000\) bootstrap replications have been drawn and the corresponding M bootstrap estimates \(\chi ^{2,m}_{H}\), \(m=1, \dots ,M\), have been computed. Setting the significance level at 0.05, the PC algorithm for complex survey data is able to recover the true population equivalence class, obtaining the CPDAG shown in Fig. 2b.

A variation of the PC algorithm, called the conservative PC algorithm, has been proposed in Ramsey et al. (2006) to detect orientation-faithfulness failures. The main differences between the PC algorithm for complex survey data and the conservative PC algorithm can be summarized as follows:

1. The PC algorithm for complex survey data takes into account the sampling design via a design-based approach, by including the sampling weights in the estimates of the BN parameters. Hence, the PC-cs algorithm adjusts for sample selection bias at the top of the PC algorithm, i.e. in Phase 1.

2. The conservative PC algorithm assumes the Markov condition and adjacency faithfulness, and tests the orientation faithfulness condition by performing additional independence tests. The conservative PC algorithm thus adjusts at the bottom of the PC algorithm, i.e. in Phase 2. It works in a model-based approach avoiding the use of sampling weights, and therefore produces bias in the structure learning process if the sampling design is not ignorable, see Pfeffermann (1993).

In our example, the conservative PC algorithm marks the triple \((X_{2},X_{4},X_{5})\) as ambiguous since \(X_{4}\) is in some but not all separating sets. An ambiguous triple is not oriented as a v-structure. Furthermore, no later orientation rule that needs to know whether \((X_{2},X_{4},X_{5})\) is a v-structure or not is applied.

4.1.3 Adjacency faithfulness violation

In order to investigate the performance of the PC-cs algorithm when adjacency faithfulness is violated, a finite population of size \(N=10,000\) has been generated according to the true causal DAG in Fig. 4a, where a v-structure is introduced. The conditional probability distributions associated with the nodes are in Tables 8, 9, 10, 11, 12.

Table 8 Probability distribution of \(X_{1}\)
Table 9 Probability distribution of \(X_{2}\)
Table 10 Probability distribution of \(X_{3} \vert (X_{1},X_{2})\)
Table 11 Probability distribution of \(X_{4} \vert X_{2}\)
Table 12 Probability distribution of \(X_{5} \vert X_{4}\)

An estimate of the underlying causal structure in the finite population has been obtained using the function pc() in the R-package pcalg. The result is shown in Fig. 4b.

A sample of size \(n=3000\) has been drawn from the finite population according to a conditional Poisson sampling design, with inclusion probabilities proportional to the Z-values defined in (21). The effect of the sampling design is reported in Fig. 4c, where the edge between \(X_{1}\) and \(X_{3}\) is missing. On the basis of \(M=1000\) bootstrap replications and a significance level equal to 0.05, the PC algorithm for complex survey data is able to recover the true population equivalence class in Fig. 4b.

Fig. 4 a Superpopulation graph, b finite population CPDAG, c adjacency faithfulness violation

4.1.4 Robustness of the PC-cs algorithm: a ten-node network

In this section the performance of the PC-cs algorithm is evaluated on a ten-node network. Specifically, a finite population of size \(N=10000\) has been generated according to the network in Fig. 5a, where \(X_3\) and \(X_9\) are discrete variables taking the values (0, 1, 2), while the remaining nodes are dichotomous variables (0, 1).

Fig. 5 a Superpopulation graph, b finite population CPDAG

As before, an estimate of the finite population causal structure has been obtained using the function pc(). The result is shown in Fig. 5b. A sample of size \(n=3000\) has been drawn from the finite population according to a conditional Poisson sampling design. Inclusion probabilities are taken proportional to Z-values, defined as follows

$$\begin{aligned} Z= \left\{ \begin{array}{ll} N(10,2)+5 & \quad \hbox {if} \quad X_{3}+X_{9}=0 \\ N(20,4)+10 & \quad \hbox {if} \quad X_{3}+X_{9}=1 \\ N(30,6)+15 & \quad \hbox {if} \quad X_{3}+X_{9}=2 \end{array} \right. \end{aligned}$$
(22)

and the original PC algorithm is applied. The effect of the sampling design on the association structure is reported in Fig. 6a.

Fig. 6 a CPDAG from the PC algorithm, b CPDAG from the PC-cs algorithm

Clearly, as the number of nodes increases, the Markov condition, orientation faithfulness and adjacency faithfulness discussed in Sects. 4.1.1-4.1.3 can be simultaneously violated by the sample distribution. Specifically, as Fig. 6a shows,

1. the edge \((X_2,X_7)\) is missing;

2. the edge \((X_{3},X_{9})\) is added;

3. the directions of the edges \((X_2,X_4)\) and \((X_2,X_3)\) are wrong.

On the basis of \(M=1000\) bootstrap replications and a significance level equal to 0.05, the PC algorithm for complex survey data is able to recover the population equivalence class in Fig. 6b.

Finally, the structural Hamming distance (SHD, Tsamardinos et al. (2006)) has been computed. Recall that the structural Hamming distance between two CPDAGs is defined as the number of operations (additions, deletions, flips) required to make the CPDAGs match. The SHDs between the population CPDAG (Fig. 5b) and the CPDAGs obtained by the PC algorithm (Fig. 6a) and by the PC-cs algorithm (Fig. 6b) are 7 and 1, respectively, showing the good performance of the proposed methodology in taking into account the sampling design complexity.
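With pcalg the comparison can be written compactly; cpdag_pop, cpdag_pc and cpdag_pccs below are illustrative names for the graphs estimated on the population, by the PC algorithm on the sample and by the PC-cs algorithm, respectively.

```r
## Structural Hamming distance between estimated CPDAGs (sketch).
library(pcalg)

shd(cpdag_pop, cpdag_pc)    # 7: PC algorithm ignoring the sampling design
shd(cpdag_pop, cpdag_pccs)  # 1: PC-cs algorithm accounting for it
```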

4.2 Evaluating the accuracy of the PC algorithm for complex survey data: a Monte Carlo study

In this section the performance of the PC-cs algorithm is evaluated over a large number of sample replicates. Small and larger networks with fixed or randomly chosen DAG and conditional probabilities are considered. More specifically, in Sect. 4.2.1 a three-node network with fixed DAG and conditional probabilities is considered, and the effects of the sampling design on the association structure are graphically shown over 500 sample replicates. In Sect. 4.2.2 the ten-node network of Fig. 5a is analyzed over 350 sample replicates. Finally, in Sect. 4.2.3 the accuracy of the PC-cs algorithm is evaluated when the DAG and the conditional probabilities are randomly chosen. The number of sample replicates is 350, and sample sizes equal to 3000 and 6000 are considered.

4.2.1 A three-node network with fixed DAG and conditional probabilities

In this section a Monte Carlo simulation is performed to assess the PC-cs algorithm accuracy. A finite population of size \(N=10000\) has been generated according to the network in Fig. 7a. The probability distributions of the nodes \(X_{1}\), \(X_{2}\) and \(X_{3} \vert (X_{1},X_{2})\) are reported in Tables 8, 13 and 14, respectively. An estimate of the finite population causal structure has been obtained using the function pc() in the R package pcalg; the finite population CPDAG in Fig. 7a has been obtained. In order to investigate the effect of the sampling design on the structural learning process, 500 samples of size \(n=3000\) have been selected from the finite population according to (i) a simple random sampling design; (ii) a conditional Poisson sampling design with inclusion probabilities proportional to the Z-values, defined as follows

$$\begin{aligned} Z= \left\{ \begin{array}{ll} N(200,2)+10 & \quad \hbox {if} \quad X_{2}=0 \\ N(10,2)+5 & \quad \hbox {if} \quad X_{2}=1 \end{array} \right. \end{aligned}$$
(23)
Fig. 7 a True graph and finite population CPDAG, b, c sampling design effects

Table 13 Probability distribution of \(X_{2}\)
Table 14 Probability distribution of \(X_{3} \vert (X_{1},X_{2})\)

The significance level is fixed at 0.05. When the sample is selected according to a simple random sampling design, the PC algorithm is not able to recover the true association structure in \(3 \%\) of the selected samples.

The percentage of wrong graphs rises to \(10.7\%\) when the sample is selected according to a conditional Poisson sampling design. In Fig. 7 the survey design effects on the association structure are shown: the edge between the nodes \(X_{2}\) and \(X_{3}\) is missing in \(34\%\) of the wrong graphs (Fig. 7b), and an additional edge is placed between the nodes \(X_{1}\) and \(X_{2}\) in the remaining \(66\%\) (Fig. 7c). For each sample, a pseudo-population has been constructed and \(M=1000\) bootstrap samples have been drawn. The percentage of wrong graphs decreases to \(3.2\%\) when the PC algorithm for complex survey data is applied.

4.2.2 A ten-node network with fixed DAG and conditional probabilities

With regard to the ten-node network of Fig. 5a (Sect. 4.1.4), in order to assess the PC-cs algorithm accuracy, 350 samples of size \(n=3000\) have been selected from the finite population according to a conditional Poisson sampling design with inclusion probabilities proportional to the Z-values defined in (22). The mean, the standard deviation and the quartiles of the SHDs for the PC algorithm and the PC-cs algorithm over the 350 samples have been computed, denoted by \(SHD_{iid}\) and \(SHD_{cs}\), \(sd_{iid}\) and \(sd_{cs}\), and \(Q^1_{iid},Q^2_{iid},Q^3_{iid}\) and \(Q^1_{cs},Q^2_{cs},Q^3_{cs}\), respectively. They are shown in Table 15, where it can be noticed that \(SHD_{iid}=9.3\) while \(SHD_{cs}=3.4\), confirming the good performance of the PC-cs algorithm in taking into account the sample selection effects.

Table 15 Mean, standard deviation and quartiles of the SHDs for the PC algorithm and the PC-cs algorithm over the 350 samples

4.2.3 A four-node network with randomly chosen DAG and conditional probabilities

In this section the accuracy of the PC-cs algorithm is evaluated by a simulation where the DAG and the conditional probabilities are not fixed but randomly chosen. To this aim, the following procedure has been applied:

1. Step 1. Generate a random DAG with 4 nodes and sparsity parameter 0.6 by the function randomDAG in the R package pcalg (a sketch of Steps 1-3 is given after this list).

2. Step 2. Generate a multivariate dataset from the DAG defined in Step 1 by the function rmvDAG, with nodes corresponding to normal random variables.

3. Step 3. Discretize the variables \((X_1,X_2,X_3,X_4)\) according to the DAG dependence structure. First, the range of each variable without parents has been divided into 3 intervals according to the empirical quantiles of order 0.5 and 0.75. Next, if \(pa(X_j) \ne \emptyset\), the range of \(X_j \vert pa(X_j)\) has been divided into 2 intervals according to the 80th percentile of the data.

4. Step 4. Generate a finite population of size \(N=10000\) from the discretized variables in Step 3 and learn its structure by the function pc(). This represents the population CPDAG.

5. Step 5. Draw a sample of size n from the finite population according to a conditional Poisson sampling design. Inclusion probabilities are taken proportional to Z-values defined as

$$\begin{aligned} Z= \left\{ \begin{array}{ll} N(10,2)+5 & \quad \hbox {if} \quad X_{1}=0 \\ N(20,4)+10 & \quad \hbox {if} \quad X_{1}=1 \\ N(30,6)+15 & \quad \hbox {if} \quad X_{1}=2 \end{array} \right. \end{aligned}$$
(24)

6. Step 6. Perform structural learning using the original PC algorithm and the PC-cs algorithm with a 0.05 significance level.
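A sketch of Steps 1-3, using the pcalg functions named above (seed and cut points are illustrative):

```r
## Random DAG, Gaussian data and discretization (sketch of Steps 1-3).
library(pcalg)

set.seed(2)
g <- randomDAG(4, prob = 0.6)                 # Step 1: random DAG, sparsity 0.6
dat <- rmvDAG(10000, g, errDist = "normal")   # Step 2: N = 10000 Gaussian rows

## Step 3 (illustrative): a variable without parents is cut into 3 classes
## at the empirical quantiles of order 0.5 and 0.75
q <- quantile(dat[, 1], c(0.5, 0.75))
x1_disc <- cut(dat[, 1], breaks = c(-Inf, q, Inf), labels = c(0, 1, 2))
```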

Steps 1-6 have been repeated 350 times, for \(n=3000\) and \(n=6000\). The algorithms' performance has been compared in terms of: (i) the percentage of wrong graphs, denoted by \(W_{iid}\) and \(W_{cs}\) for the PC algorithm and the PC-cs algorithm, respectively; (ii) the mean of the SHDs, denoted by \(SHD_{iid}\) and \(SHD_{cs}\); (iii) the standard deviation of the SHDs, denoted by \(sd_{iid}\) and \(sd_{cs}\); (iv) the first, second and third quartiles of the SHDs, denoted by \(Q^1_{iid},Q^2_{iid},Q^3_{iid}\) and \(Q^1_{cs},Q^2_{cs},Q^3_{cs}\). Results are reported in Table 16.

Table 16 Percentage of wrong graphs and mean, standard deviation and quartiles of the SHDs for the PC algorithm and the PC-cs algorithm over the 350 samples, for \(n=3000, 6000\)

As the results in Table 16 show, the percentage of wrong graphs \(W_{cs}\), the structural Hamming distance \(SHD_{cs}\), the standard deviation and the interquartile difference all decrease when the PC-cs algorithm is applied and the complexity of the sampling design is taken into account in the learning process. Finally, the accuracy of the results improves as the sample size increases.

5 Conclusions

In this paper a modified version of the PC algorithm is proposed for inferring causal structure from complex survey data. The complexity of the sampling design is accounted for via a design-based approach. In the PC algorithm for complex survey data the chi-square statistic does not necessarily approach a chi-square distribution, and its limiting distribution under the null hypothesis is estimated by a resampling method allowing inference on the superpopulation parameters in a finite population setting. Corrections based on design effects have been proposed by Rao & Scott in Rao and Scott (1981) and Rao and Scott (1984). However, while the PC algorithm for complex survey data adjusts for the sample selection bias by including the sampling weights in the estimates of the BN parameters, the Rao & Scott corrections use the classical chi-square test statistic adjusted on the basis of the design effects. Furthermore, the second order Rao & Scott correction requires the availability of the full covariance matrix estimate of the cell proportion estimators. In secondary analyses this estimate is not necessarily provided, but cell design-effect estimates, possibly with marginal design-effect estimates, might be reported; approximate first-order corrections can then be obtained by using the design effect estimates. These results require that published two-way tables report at least the cell design effects and their marginals along with the cell estimates; otherwise variance estimates must be computed from microdata files using resampling methods for finite populations. Hence, both approaches require resorting to resampling methods: in the PC algorithm for complex survey data the distribution under the independence null hypothesis is estimated; in the Rao & Scott approach the variances need to be estimated. The extension of the proposed approach to complex multi-stage designs is under investigation. Finally, future research will be devoted to the development of a structural learning procedure for complex survey sampling in the score-plus-search framework.