1 Introduction

Lung cancer is the most frequent cause of cancer-related death in men worldwide and the second most common cause in women [4], despite significant improvements in diagnosis and treatment during the past decade. Lung cancer is typically divided into two major subtypes: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC), where the latter is more prevalent. The overall 5-year survival rate for lung cancer is 19%, but it increases to 57% when the disease is diagnosed at a localized stage [2], which is often not the case. Accurate prediction of lung cancer outcomes is important because it can facilitate patient care and clinical decision making. With rapid advancements in technology, large amounts of lung cancer data with clinical and molecular information are being collected and made available. This trend provides an opportunity for researchers to apply machine learning techniques to predict lung cancer outcomes.

Most machine learning methods for predicting clinical outcomes construct a single model M from training data. M is then applied to predict outcomes in future instances. We refer to such a model as a population-wide model because it predicts outcomes for a future population of instances. It may be difficult for population-wide models to perform well in domains in which instances are highly heterogeneous. Studies have revealed that heterogeneity exists both within individual lung cancer tumors and between patients  [20]. In such domains, a reasonable approach is to learn a model that is tailored to a particular instance (e.g., a patient), which we refer to as an instance-specific model. An instance-specific approach builds a model \(M_T\) for a given instance T from the features that we know about T (e.g., clinical and molecular features) and from a training set of data on many other instances. It then uses \(M_T\) to predict the outcome for T. This procedure repeats for each instance that is to be predicted.

In this paper, we use Bayesian network (BN) classifiers to predict patient survival in a cohort of 261 lung cancer patients with clinical and omics information. More specifically, we apply a score-based instance-specific BN learning method, called IGES, which we introduced in [14]. The IGES algorithm searches the space of BNs to learn a model that is specific to an instance T by guiding the search based on T’s features (i.e., variable-value pairs). We also apply a state-of-the-art score-based population-wide BN learning method, called GES [7], as a control method. These methods are summarized in Sect. 4. The main goal of this paper is to investigate the effectiveness of instance-specific modeling in predicting survival outcomes for lung cancer patients.

2 Related Work

Various population-wide methods such as neural networks, support vector machines, random forests, and Bayesian models have been applied in cancer research to predict cancer survival and reported AUROCs of \(\sim \)0.80  [1, 8, 19]. A review of such methods is provided in  [16]. None of these methods learn instance-specific models. Additionally, they use different sets of predictors and different patient cohorts, so a direct comparison to the results in the current paper is not possible.

Several machine learning methods have been developed to learn instance-specific models. Zheng and Webb  [22] introduced a lazy Bayesian rule learning (LBR) method to construct a model that is specific to a test instance. In an LBR rule, the antecedent is a conjunction of the variable-value pairs that are present in the test instance and the consequent is a local naïve Bayes classifier in which the target variable is the parent of the variables that do not appear in the antecedent. Visweswaran and Cooper  [21] developed an instance-specific Markov Blanket (ISMB) algorithm that searches over the space of Markov blankets (MBs) of the target variable by utilizing the features of a given test instance. ISMB first uses a greedy hill-climbing search to find a set of MBs that best fit the training data. Then, it greedily adds single edges to the MB structures from the previous step, if doing so improves the prediction of a given test instance. Ferreira et al.  [10] implemented two patient-specific decision path (PSDP) algorithms using two variable selection criteria: balanced accuracy and information gain. A decision path is a conjunction of features that are present in a given test instance and a leaf node that contains the probability distribution of the target variable.

Recently, Lengerich et al.  [17] introduced a personalized linear regression model that learns a specific set of parameters for each test case based on the idea that the similarity between personalized parameters is related to the similarity between the features. Accordingly, a regularizer is learned to match the pairwise distance between the features (i.e., variable-value pairs) and the personalized regression parameters. In other related work, Cai et al.  [5] developed a method to learn tumor-specific BN models from data; this method uses bipartite BNs and makes other assumptions that restrict its generality.

The IGES method [14] that we use in this paper is different from the methods mentioned above. IGES differs from a population-wide method in a principled way: it learns a model for each instance T that is optimized for T, whereas a population-wide method learns a single model that is optimized for a population of instances. Also, the IGES method learns a more general model than do LBR [22], ISMB [21], PSDP [10], and tumor-specific BN models [5], because it learns Bayesian network models, which can be used for both prediction and causal discovery [13,14,15].

3 Preliminaries

A Bayesian network (BN) encodes probabilistic relationships among a set of variables. A BN consists of a pair \((\mathcal {G},\varvec{\varTheta })\), where \(\mathcal {G}\) is a directed acyclic graph (DAG) and \(\varvec{\varTheta }\) is a set of parameters for \(\mathcal {G}\). A DAG \(\mathcal {G}\) is given as a pair \((\varvec{X}, \varvec{E})\), where \(\varvec{X} = \{X_1, X_2, ..., X_n\}\) denotes a set of nodes that correspond to domain variables and \(\varvec{E}\) is a set of directed edges between variables. The presence of an edge \(X_i \rightarrow X_j\) (\(X_i\) is called the parent and \(X_j\) is called the child) denotes that these variables are probabilistically dependent. The absence of an edge denotes that \(X_i\) and \(X_j\) are conditionally probabilistically independent. A set of DAGs that encode the same independence and dependence relationships are statistically indistinguishable from observational data; such DAGs are called Markov equivalent. The second component, \(\varvec{\varTheta }\), is a set of parameters that encodes the joint probability distribution over \(\varvec{X}\), which can be efficiently factored based on the parent-child relationships in \(\mathcal {G}\).
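
Concretely, this factorization takes the standard form

$$\begin{aligned} P(X_1, X_2, \ldots , X_n) = \prod _{i=1}^{n} P(X_i \mid \varvec{Pa}(X_i)) \,, \end{aligned}$$

where \(\varvec{Pa}(X_i)\) denotes the parents of \(X_i\) in \(\mathcal {G}\); thus, \(\varvec{\varTheta }\) only needs to specify each variable’s conditional distribution given its parents.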

As mentioned above, the edges present and absent in a DAG represent conditional dependence and independence relationships between variables, respectively. Any such relationship must hold for all combinations of values of the variables in the BN. There is a more refined form of conditional independence that holds only in a specific context, which is known as context-specific independence (CSI) [3]. Figure 1 shows a BN that includes two CSI relationships. The IGES method models such local independence structures for each instance T, which results in a more expressive BN structure and a more efficient BN parameterization.
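
As a hypothetical illustration of CSI (not the specific relationships of Fig. 1), suppose binary variables \(X_1\) and \(X_2\) are both parents of a variable Y, and that

$$\begin{aligned} P(Y \mid X_1 = 0, X_2) = P(Y \mid X_1 = 0) \,, \end{aligned}$$

while Y depends on \(X_2\) when \(X_1 = 1\). The DAG alone cannot express this constraint, because the edge \(X_2 \rightarrow Y\) must remain to capture the dependence in the context \(X_1 = 1\); encoding the CSI directly reduces the number of free parameters in Y’s conditional distribution.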

Fig. 1. An example Bayesian network with two context-specific independencies (CSIs).

4 Methodology

4.1 Population-Wide Greedy Equivalence Search (GES)

Score-based methods are one of the main approaches to learning BN structure from data. They involve (1) a scoring function that measures how well the data (and optional background knowledge) support a given DAG and (2) a search strategy that explores the space of possible DAGs. Since recovering the data-generating BN is an NP-hard problem [6], these methods often utilize a greedy search strategy. GES [7] is a state-of-the-art score-based BN learning algorithm that searches over the Markov equivalence classes of DAGs using two local graph operations: single edge addition and single edge removal. First, it greedily adds single edges to the current graph as long as doing so improves the score. It then greedily removes single edges as long as doing so results in a higher score. Under reasonable assumptions, the GES algorithm converges to the data-generating BN or one that is Markov equivalent to it [7].
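
As a rough illustration of this two-phase greedy strategy, the following Python sketch performs forward edge additions and backward edge removals with a pluggable scoring function. It is a simplification: the real GES algorithm searches over Markov equivalence classes (CPDAGs) with Insert and Delete operators rather than over individual DAGs, and the toy score below merely stands in for a decomposable score such as BDeu.

```python
# A minimal sketch, in the spirit of GES but simplified to DAG space.
from itertools import permutations

def is_acyclic(nodes, edges):
    """Check acyclicity with a depth-first search over the edge set."""
    graph = {v: [w for (u, w) in edges if u == v] for v in nodes}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in nodes}
    def dfs(v):
        color[v] = GRAY
        for w in graph[v]:
            if color[w] == GRAY or (color[w] == WHITE and not dfs(w)):
                return False
        color[v] = BLACK
        return True
    return all(dfs(v) for v in nodes if color[v] == WHITE)

def greedy_two_phase_search(nodes, score_fn):
    edges = set()
    # Forward phase: repeatedly add the single edge that most improves the score.
    improved = True
    while improved:
        improved = False
        best = (score_fn(edges), None)
        for u, v in permutations(nodes, 2):
            if (u, v) in edges or (v, u) in edges:
                continue
            cand = edges | {(u, v)}
            if is_acyclic(nodes, cand):
                s = score_fn(cand)
                if s > best[0]:
                    best, improved = (s, (u, v)), True
        if improved:
            edges.add(best[1])
    # Backward phase: repeatedly remove the single edge that most improves the score.
    improved = True
    while improved:
        improved = False
        best = (score_fn(edges), None)
        for e in edges:
            s = score_fn(edges - {e})
            if s > best[0]:
                best, improved = (s, e), True
        if improved:
            edges.remove(best[1])
    return edges

# Toy demonstration: a stand-in score that rewards recovering a known edge set.
true_edges = {("Smoking", "Cancer"), ("Cancer", "Survival")}
score = lambda e: len(e & true_edges) - len(e - true_edges)
print(greedy_two_phase_search(["Smoking", "Cancer", "Survival"], score))
```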

The Bayesian Dirichlet (BD) scoring function  [12] is used to score a BN with discrete variables and is calculated as follows:

$$\begin{aligned} \text {BD} (\mathcal {G},D) = P(\mathcal {G}) \cdot \prod _{i=1}^{n}\prod _{j=1}^{q_i}{\frac{\mathrm {\Gamma } (\alpha _{ij})}{\mathrm {\Gamma } (\alpha _{ij}+N_{ij})} \cdot \prod _{k=1}^{r_i}\frac{\mathrm {\Gamma } (\alpha _{ijk}+N_{ijk})}{\mathrm {\Gamma } (\alpha _{ijk})}} \,, \end{aligned}$$
(1)

where \(P(\mathcal {G})\) is the prior structure probability of \(\mathcal {G}\). The first product term is over all n variables, the second product is over the \(q_i\) parent instantiations of variable \(X_i\), and the third product is over all \(r_i\) values of variable \(X_i\). The term \(N_{ijk}\) is the number of cases in the data in which variable \(X_i = k\) and its parents \(\varvec{Pa}(X_i) = j\); also, \(N_{ij}=\sum _{k=1}^{r_i}{N_{ijk}}\). The term \(\alpha _{ijk}\) is a Dirichlet prior parameter that may be interpreted as representing “pseudo-counts” and \(\alpha _{ij} = \sum _{k=1}^{r_i}{\alpha _{ijk}}\). The pseudo-counts can be defined to be evenly distributed, in which case Eq. (1) represents the so-called BDeu score  [11]:

$$\begin{aligned} \alpha _{ijk} = \frac{\alpha }{r_i \cdot q_i} \,, \end{aligned}$$
(2)

where \(\alpha \) is a positive constant called the prior equivalent sample size (PESS).
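
For illustration, a minimal Python sketch of the BDeu family score for a single variable, following Eqs. (1) and (2), is shown below. It is computed in log space for numerical stability, omits the structure prior \(P(\mathcal {G})\), and uses toy variable names that are assumptions of this sketch rather than the paper’s dataset.

```python
# A minimal sketch of the log BDeu family score for one discrete variable.
from math import lgamma
from collections import Counter

def bdeu_family_score(data, child, parents, arity, pess=1.0):
    """log P(D | Pa(X_i) -> X_i): one variable's factor in Eq. (1).

    data  : list of dicts mapping variable name -> discrete value
    arity : dict mapping variable name -> number of values (r_i)
    pess  : prior equivalent sample size (alpha in Eq. (2))
    """
    r_i = arity[child]
    q_i = 1
    for p in parents:
        q_i *= arity[p]                 # q_i = number of parent instantiations
    a_ijk = pess / (r_i * q_i)          # Eq. (2): evenly spread pseudo-counts
    a_ij = pess / q_i                   # a_ij = sum over k of a_ijk
    # Tally N_ijk for each observed parent instantiation j and child value k.
    counts = {}
    for row in data:
        j = tuple(row[p] for p in parents)
        counts.setdefault(j, Counter())[row[child]] += 1
    score = 0.0
    for kc in counts.values():          # unseen instantiations contribute 0
        score += lgamma(a_ij) - lgamma(a_ij + sum(kc.values()))
        for n in kc.values():
            score += lgamma(a_ijk + n) - lgamma(a_ijk)
    return score

arity = {"Stage": 2, "Survival": 2}
data = [{"Stage": 0, "Survival": 1}, {"Stage": 0, "Survival": 1},
        {"Stage": 1, "Survival": 0}, {"Stage": 1, "Survival": 1}]
print(bdeu_family_score(data, "Survival", ["Stage"], arity, pess=1.0))
```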

4.2 Instance-Specific Greedy Equivalence Search (IGES)

We introduced an instance-specific version of GES called IGES in  [14]. Similar to GES, IGES is a two-stage greedy BN learning algorithm that uses an instance-specific score (IS-Score). First, it runs GES using the training data D and the BDeu score to learn a population-wide BN model \(\mathcal {G}_{\textit{PW}}\); this is the population-wide BN for all test instances. In the second stage, IGES uses the data D, the population-wide model \(\mathcal {G}_{\textit{PW}}\), and a single test instance T to learn an instance-specific model for T, which is denoted as \(\mathcal {G}_{\textit{IS}}\). To do so, IGES starts with \(\mathcal {G}_{\textit{PW}}\) and runs an adapted version of GES with a specialized IS-Score to maximize the marginal likelihood of the data given both BN models \(\mathcal {G}_{\textit{PW}}\) and \(\mathcal {G}_{\textit{IS}}\).

For each variable \(X_i\), the IS-Score is calculated at the parent-child level and is composed of two components: (1) the instance-specific structure that includes \(X_i\)’s parents in \(\mathcal {G}_{\textit{IS}}\), which we denote by \(\varvec{Pa}_{\textit{IS}}(X_i)\), and (2) the population-wide structure that includes \(X_i\)’s parents in \(\mathcal {G}_{\textit{PW}}\), which we denote by \(\varvec{Pa}_{\textit{PW}}(X_i)\). To score the instance-specific structure \(\varvec{Pa}_{\textit{IS}}(X_i) \rightarrow X_i\), we use the instances in the data that are similar to the current test instance T in terms of the values of the variables in \(\varvec{Pa}_{\textit{IS}}(X_i)\). Letting j denote the instantiation of \(\varvec{Pa}_{\textit{IS}}(X_i)\) that appears in T, we write \(D_{{Pa}_{\textit{IS}}(X_i)=j}\) for the set of such instances. Then the instance-specific score for \(\varvec{Pa}_{\textit{IS}}(X_i) \rightarrow X_i\) given data \(D_{{Pa}_{\textit{IS}}(X_i)=j}\) is as follows:

$$\begin{aligned} P(D_{{Pa}_{\textit{IS}}(X_i)=j}|\varvec{Pa}_{\textit{IS}}(X_i) \rightarrow X_i) ={\frac{\mathrm {\Gamma } (\alpha _{ij})}{\mathrm {\Gamma } (\alpha _{ij}+N_{ij})} \cdot \prod _{k=1}^{r_i}\frac{\mathrm {\Gamma } (\alpha _{ijk}+N_{ijk})}{\mathrm {\Gamma } (\alpha _{ijk})}}\,, \end{aligned}$$
(3)

where \(r_i\) denotes the number of values of \(X_i\), \(N_{ijk}\) is the number of instances in \(D_{{Pa}_{\textit{IS}}(X_i)= j}\) in which \(X_i=k\), and \(N_{ij}=\sum _{k=1}^{r_i}{N_{ijk}}\). The terms \(\alpha _{ijk}\) and \(\alpha _{ij}=\sum _{k=1}^{r_i}{\alpha _{ijk}}\) are the corresponding Dirichlet priors.

Since the instances in \(D_{{Pa}_{\textit{IS}}(X_i)= j}\) are being used to score the instance-specific structure, they should no longer be used to also score the population-wide structure; therefore, the score for \(\varvec{Pa}_{\textit{PW}}(X_i) \rightarrow X_i\) must be adjusted accordingly. We re-score \(\varvec{Pa}_{\textit{PW}}(X_i) \rightarrow X_i\) using the remaining instances in \(D_{{Pa}_{\textit{IS}}(X_i)\ne j}\) as follows:

$$\begin{aligned} P(D_{{Pa}_{\textit{IS}}(X_i)\ne j}|\varvec{Pa}_{\textit{PW}}(X_i) \rightarrow X_i) = \prod _{j^{\prime }=1}^{q_i}{\frac{\mathrm {\Gamma } (\alpha _{ij^{\prime }})}{\mathrm {\Gamma } (\alpha _{ij^{\prime }}+N_{ij^{\prime }})} \cdot \prod _{k=1}^{r_i}\frac{\mathrm {\Gamma } (\alpha _{ij^{\prime }k}+N_{ij^{\prime }k})}{\mathrm {\Gamma } (\alpha _{ij^{\prime }k})}}\,, \end{aligned}$$
(4)

where \(r_i\) and \(q_i\) are the number of possible values of \(X_i\) and \(\varvec{Pa}_{\textit{PW}}(X_i)\), respectively. \(N_{ij^{\prime }k}\) is the number of instances in \(D_{{Pa}_{\textit{IS}}(X_i)\ne j}\) for which \(X_i=k\) and \(\varvec{Pa}_{\textit{PW}}(X_i)=j^{\prime }\), and \(N_{ij^{\prime }}=\sum _{k=1}^{r_i}{N_{ij^{\prime }k}}\). The terms \(\alpha _{ij^{\prime }k}\) and \(\alpha _{ij^{\prime }}=\sum _{k=1}^{r_i}{\alpha _{ij^{\prime }k}}\) are the corresponding Dirichlet priors.

Finally, to obtain the overall score for variable \(X_i\) using all instances in D, we multiply these two scores. In essence, this method searches for the most probable context-specific BN in light of how well (1) the instance-specific model predicts data for instances like the current test instance, and (2) the population-wide model predicts data for the remaining instances. See  [14] for a detailed description of the IS-Score and the IGES method.
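
The following Python sketch reflects our reading of Eqs. (3) and (4): the training data are split by whether an instance matches the test instance T on the values of the instance-specific parents, the instance-specific family is scored on the matching subset, the population-wide family is scored on the remainder, and the two log scores are added. This is a sketch of the score combination only, not the authors’ implementation, and all variable names are illustrative assumptions.

```python
# A sketch of combining Eq. (3) and Eq. (4) in log space for one variable.
from math import lgamma
from collections import Counter

def local_term(rows, child, parents, arity, pess):
    """Shared log marginal-likelihood term used by both Eq. (3) and Eq. (4)."""
    r_i = arity[child]
    q_i = 1
    for p in parents:
        q_i *= arity[p]
    a_ijk, a_ij = pess / (r_i * q_i), pess / q_i
    counts = {}
    for row in rows:
        j = tuple(row[p] for p in parents)
        counts.setdefault(j, Counter())[row[child]] += 1
    score = 0.0
    for kc in counts.values():
        score += lgamma(a_ij) - lgamma(a_ij + sum(kc.values()))
        for n in kc.values():
            score += lgamma(a_ijk + n) - lgamma(a_ijk)
    return score

def is_score_for_variable(data, T, child, pa_is, pa_pw, arity, pess=1.0):
    j = tuple(T[p] for p in pa_is)                       # T's values of Pa_IS(X_i)
    match = [r for r in data if tuple(r[p] for p in pa_is) == j]
    rest = [r for r in data if tuple(r[p] for p in pa_is) != j]
    return (local_term(match, child, pa_is, arity, pess)    # Eq. (3)
            + local_term(rest, child, pa_pw, arity, pess))  # Eq. (4)

arity = {"Stage": 2, "Smoker": 2, "Survival": 2}
data = [{"Stage": s, "Smoker": m, "Survival": (s + m) % 2}
        for s in (0, 1) for m in (0, 1) for _ in range(3)]
T = {"Stage": 1}
print(is_score_for_variable(data, T, "Survival", ["Stage"], ["Smoker"], arity))
```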

5 Experimental Results

This section compares the performance of an instance-specific machine learning method with that of its population-wide counterpart in predicting 1-year survival in lung cancer patients, using the clinical and molecular data described in Sect. 5.1. We used Bayesian network classifiers as the machine learning models to predict the target variable. More specifically, we applied the IGES and GES methods to learn instance-specific and population-wide BN classifiers, respectively.

To predict the target variable, we first ran the IGES and GES methods to construct a BN structure over all variables (i.e., the predictors and the target). Then, we obtained the Markov blanket (MB) of the target variable, which consists of the variable’s parents, its children, and the other parents of its children. Finally, we calculated the probability distribution of the target variable given its MB and output the most probable outcome as the prediction. We report evaluation criteria that measure the effectiveness of the instance-specific BN model versus the population-wide BN model. In particular, as a measure of discrimination, we report the area under the ROC curve (AUROC) when predicting 1-year survival. We also report the differences between the variables in the MB of the target variable found by the instance-specific models and those found by the population-wide model.
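
As an illustration of this prediction pipeline, the sketch below extracts the MB of the target from a learned DAG and predicts by a simple conditional frequency estimate over training cases that match the test instance on the MB. The frequency-based prediction step is an assumed simplification for clarity; the learned BN’s parameterization would normally supply the conditional distribution of the target given its MB.

```python
# A sketch of Markov-blanket extraction and prediction from a learned DAG.
# `edges` holds (u, v) pairs meaning u -> v; variable names are hypothetical.
from collections import Counter

def markov_blanket(edges, target):
    """Parents, children, and the children's other parents of `target`."""
    parents = {u for (u, v) in edges if v == target}
    children = {v for (u, v) in edges if u == target}
    coparents = {u for (u, v) in edges if v in children and u != target}
    return parents | children | coparents

def predict(data, mb, target, instance):
    """Most probable target value among rows matching `instance` on the MB."""
    matching = [row[target] for row in data
                if all(row[v] == instance[v] for v in mb)]
    if not matching:                       # back off to the marginal
        matching = [row[target] for row in data]
    return Counter(matching).most_common(1)[0][0]

edges = {("Stage", "Survival"), ("Survival", "PDL1"), ("Assay", "PDL1")}
print(markov_blanket(edges, "Survival"))   # {'Stage', 'PDL1', 'Assay'}
```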

5.1 Data Description

This was a retrospective analysis of banked tumor specimens that were collected from patients with lung cancer at the University of Pittsburgh Medical Center (UPMC) in 2016. Baseline demographics, smoking history, staging, treatment, and survival data were collected through the UPMC Network Cancer Registry. We replaced the missing values of the predictor variables with a new category called “missing” and removed the cases for which the value of the outcome variable was not known. Demographic and clinical characteristics of the 261 patients are summarized in Table 1. DNA sequencing was performed using the Ion AmpliSeq\(^\mathrm{TM}\) Cancer Panel (Ion Torrent, Life Technologies, Fisher Scientific). Gene rearrangements of ALK, ROS1, and RET, and MET amplification were detected using FISH. PD-L1 SP263 and PD-L1 22C2 assays were performed on lung cancer samples to determine the PD-L1 tumor proportion score (TPS). Table 2 provides information about the type, name, and description of the variables that are included in the lung cancer dataset. The outcome-prediction research reported here was performed under the auspices of study protocol number PRO15070164 from the University of Pittsburgh Institutional Review Board (IRB).

Table 1. One-year survival given demographic and clinical characteristics. A 95% confidence interval is included for each sub-group of patients.
Table 2. Type, name, and description of the variables of the lung cancer dataset.

5.2 Results

We performed leave-one-out cross-validation on the lung cancer dataset. For each instance T, we used T as the test instance and all the remaining instances as the training set D. We applied the IGES search using T, D, and the IS-Score to learn an instance-specific BN model for T, called \(\mathcal {G}_T\), which is used to predict the outcome for T. We also applied the GES search using D and the BDeu score to learn a population-wide BN model; this model, denoted \(\mathcal {G}_{\textit{PW}}\), is likewise used to predict the outcome for T. We repeated this procedure for every instance in the dataset. Note that this leave-one-out cross-validation does not involve any hyperparameter tuning; therefore, we do not expect it to cause overfitting. We used an efficient implementation of GES, called FGES [18], which is available in the Tetrad system. We used prior equivalent sample sizes of \(\text {PESS}=\{0.1, 1.0, 10.0\}\) for both the IGES and GES methods. IGES also has a parameter, called \(\kappa \) (\(0.0 < \kappa \le 1.0\)), that penalizes the structural difference between the population-wide and instance-specific BN models, where a lower value imposes a larger penalty on differences (see [14] for details). We report the results for multiple values of \(\kappa \).
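
To make the evaluation protocol explicit, the sketch below pools leave-one-out predictions and computes a single AUROC over them. A scikit-learn logistic regression stands in for the GES and IGES Bayesian-network learners, which run in Tetrad and are not reproduced here, and the random features and labels are placeholders for the encoded clinical and molecular data.

```python
# A sketch of the leave-one-out evaluation protocol with a pooled AUROC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(261, 10)).astype(float)  # placeholder features
y = rng.integers(0, 2, size=261)                      # placeholder labels

probs = np.empty(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression(max_iter=1000)  # stand-in for the BN learner
    model.fit(X[train_idx], y[train_idx])
    probs[test_idx] = model.predict_proba(X[test_idx])[:, 1]

print("Pooled leave-one-out AUROC:", roc_auc_score(y, probs))
```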

Table 3 shows the AUROC results on the lung cancer dataset, using both the GES and IGES searches; boldface indicates results that are statistically significantly better, based on DeLong’s non-parametric test [9] at a 0.001 significance level. The results indicate that with \(\text {PESS}= 1.0\) and \(\kappa = 1.0\), the instance-specific search achieved the highest AUROC; moreover, IGES performed better for almost all values of \(\kappa \). Table 3 also suggests that it is important to choose the PESS carefully when applying a Bayesian method to a dataset with a small to moderate sample size, as is the case in this paper.

Table 3. AUROC of the GES and IGES methods on the lung cancer dataset. Boldface indicates statistically significantly better results.
Table 4. Comparison of the target variable’s Markov blanket (MB) in the instance-specific BNs vs. the population-wide BN for \(\text {PESS}=1.0\) and \(\kappa =1.0\).

Table 4a shows the results of comparing the target variable’s MB in the instance-specific models versus the population-wide models with \(\text {PESS}=1.0\) and \(\kappa =1.0\). It indicates that in \(16.9\%\) of the patient cases, the MB of the target variable was exactly the same in the instance-specific and population-wide BNs. Also, in \(10.7\%\) of the cases, the MB of the target variable contained 5 additional variables in the instance-specific models compared to the population-wide model. Table 4b shows the occurrence percentages of the 7 variables that appeared most frequently in the instance-specific MBs. Table 4 supports the existence of instance-specific structures for the lung cancer cases in the dataset we used.

6 Discussion and Conclusions

In this paper, we evaluated the performance of an instance-specific BN classifier, which uses the IGES algorithm, in predicting 1-year survival for 261 lung cancer patient cases. We compared the IGES results to those of its population-wide counterpart, GES. We compared IGES to GES for two reasons. First, the goal of this study was to evaluate the effectiveness of instance-specific modeling in predicting lung-cancer survival; therefore, we wanted the only difference between the two methods to be instance-specific versus population-wide modeling, while keeping the type of machine learning classifier the same (i.e., Bayesian networks). Since to date we have implemented instance-specific and population-wide algorithms only for learning BNs, we compared these two methods. Additionally, BNs continue to be an important machine learning method for clinical outcome prediction because they generally perform well and provide interpretable models that clinicians can understand relatively well. We compared the predictive performance using AUROC and examined the structural differences between the instance-specific and population-wide BNs. The results provide support that the instance-specific models often differ from the population-wide ones and have better predictive performance. Future extensions include (1) tuning hyperparameters of the methods, such as the PESS and \(\kappa \), and (2) implementing instance-specific versions of other machine learning classification methods and comparing them to their population-wide counterparts.