Introduction

The liver plays an important role in many essential body functions [1]. Thus, any lesion of the liver adversely affects important physiological functions such as excretion, secretion and detoxification, which eventually leads to poor patient health [2]. Recent research demonstrates that hepatitis, such as hepatitis B or hepatitis C, can cause liver failure, cirrhosis, or cancer [3].

Hepatitis is defined by the World Health Organization as an inflammation of the liver and is caused by a variety of pathogenic factors such as viruses, bacteria, parasites, chemical poisons, drugs, alcohol and autoimmune disease [4]. Hepatitis A virus (HAV), hepatitis B virus (HBV), hepatitis C virus (HCV), hepatitis D virus (HDV) and hepatitis E virus (HEV) are the five major pathogenic viruses that cause viral hepatitis [5]. Hepatitis such as HBV can gradually develop into chronic hepatitis, cirrhosis and hepatocellular carcinoma, which leads to a large number of deaths each year [6]. In particular, 80% of patients with HBV develop liver cancer in the absence of timely medical intervention [7, 8].

However, early intervention in patients with hepatitis can prevent further damage and ultimately reduce morbidity and mortality. The Model for End-Stage Liver Disease (MELD) is widely used in liver disease diagnosis and treatment because of its simplicity and objectivity [9]. Nevertheless, it remains challenging to identify the onset of liver failure caused by hepatitis due to the complex interactions between the liver and other organs [10].

Over the last few years, researchers have utilized machine learning to identify the onset of liver failure caused by hepatitis. Alexandra et al. employed the Chi-squared Automatic Interaction Detector (CHAID) to identify patients who should be screened for chronic hepatitis B or C. The results showed that the probability of HBV infection was higher in patients with ALT ≥ 0.56 μkat/l [11]. Chen et al. utilized the Support Vector Machine (SVM), Naive Bayesian Model (NBM), Random Forest (RF) and K-Nearest Neighbor (KNN) to predict the stage diagnosis of hepatitis. The experimental results showed that the RF classifier achieved the best performance among the four machine learning models, indicating that complex models could be a potentially useful tool to predict the stage of hepatic fibrosis [12]. Hashem et al. took advantage of multilinear regression (MR), decision tree (DT), particle swarm optimization (PSO) and genetic algorithm (GA) to forecast the risk of advanced fibrosis by combining serum biomarkers and clinical information. The study found that machine learning could be used as an alternative method to predict the risk of advanced liver fibrosis caused by chronic hepatitis C [13]. Tian et al. compared eXtreme Gradient Boosting (XGBoost), RF, DT and logistic regression (LR) to identify the optimal model for predicting HBsAg seroclearance. The study discovered that XGBoost achieved the best predictive performance for HBsAg seroclearance on clinical data [14]. Singh et al. proposed a hybrid approach to evaluate the stage of hepatitis disease. Simulation results indicated that the improved ensemble learning method outperformed the other existing individual methods in the diagnosis of hepatitis [15].

In general, complex machine-learning models such as RF and XGBoost perform better than simple models such as LR in the prediction of hepatitis [16]. However, a complex machine-learning model is a black box that does not reveal its internal mechanisms. Thus, hepatobiliary physicians cannot understand such a model by looking at its parameters (e.g., an XGBoost model). Due to this lack of interpretability, the application of complex machine-learning approaches in actual clinical settings is limited.

For the sake of a broader applicability of artificial intelligence (AI) to hepatitis diagnosis, it is imperative to provide hepatobiliary physicians with explanations of why a certain prediction is made, that is, the internal mechanisms that lead to the prediction [17]. Thus, an explainable AI (XAI) framework combining the SHapley Additive exPlanations (SHAP) [18], Partial Dependence Plots (PDP) [19] and Local Interpretable Model-agnostic Explanations (LIME) [20] methods is proposed to provide explanations for the complex models. Figure 1 represents the flow chart of the XAI-based diagnosis process for hepatitis. After the hepatitis patient has undergone the different examinations, the collected clinical data are sent to the XAI framework. Then, the proposed XAI approach presents computer-aided diagnosis (CAD) results to the doctors. Finally, the doctors diagnose and treat the hepatitis patient with the support of the CAD results.

Fig. 1 Flow chart of the XAI-based diagnosis process for hepatitis

The main contributions of this paper are summarized as follows: (1) To obtain a higher exacerbation risk identification accuracy for hepatitis, multiple complex models are explored, and a public benchmark data set from UCI is applied to assess their performance. (2) To achieve a broader applicability of the complex models to hepatitis diagnosis, an interpretable framework is proposed to provide global and local explanations that improve the clinical understanding of hepatitis exacerbation risk prediction. The rest of the paper is organized as follows. In “Methodology”, the research methodology that we apply is explained. In “Results”, we present the details of how the interpretability framework works. “Discussion” discusses the proposed interpretability framework. Finally, “Conclusion” concludes the paper and outlines future developments.

Fig. 2 The framework of XAI

Methodology

Figure 2 illustrates the framework of the proposed XAI for the CAD system. The proposed framework provides global and local explanations to improve the clinical understanding of hepatitis exacerbation risk prediction. The patient record is obtained through data collection and preprocessing. Then, the model is loaded to predict the exacerbation risk. Next, the model explanation method is applied to obtain global and local explanations. Finally, the prediction and explanation results are transmitted to the doctors for examination and further validation.

Data collection and preprocessing

To evaluate the feasibility of the proposed framework, a public hepatitis classification benchmark from the UCI Machine Learning Repository is used in the empirical study [21]. The benchmark contains a mixture of integer- and real-valued attributes describing patients affected by hepatitis. The task of our proposed framework is to predict the disease deterioration risk. The distribution of the survival (low-risk) and death (high-risk) groups in patients with hepatitis is shown in Table 1. The benchmark contains 155 patients with hepatitis and 19 features. The attribute descriptions and abbreviations are also given in Table 1.

Table 1 Distribution of survival and death groups in patients with hepatitis. Values are expressed as mean ± standard deviation

As indicated in Table 1, patients who die are labeled as class 1, while those who survive are labeled as class 2. The data set contains 75 instances with missing values. To deal with the missing values, nominal or binary features are set to the majority value, while continuous attributes are set to the average value. The proportion of hepatitis patients with a death outcome is 20.6%, while the proportion with a survival outcome is 79.4%. To overcome the model bias caused by this class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) is applied to balance the data set [22].
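A minimal preprocessing sketch is given below, assuming the UCI data have been saved locally as hepatitis.csv with the outcome stored in a column named Class; the attribute names used for imputation are an illustrative subset of those in Table 1.

```python
# Sketch of the preprocessing step described above (assumed file and column names).
import pandas as pd
from imblearn.over_sampling import SMOTE

df = pd.read_csv("hepatitis.csv")

# Majority-value imputation for nominal/binary attributes (illustrative subset)
for col in ["Sex", "Steroid", "Antivirals", "Ascites", "Spiders", "Varices"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Mean imputation for continuous attributes (illustrative subset)
for col in ["Age", "Bilirubin", "AlkPhosphate", "Sgot", "Albumin", "Protime"]:
    df[col] = df[col].fillna(df[col].mean())

X, y = df.drop(columns=["Class"]), df["Class"]

# SMOTE oversamples the minority (death) class to balance the two outcomes
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
```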

Model selection and prediction

The easiest way to obtain model explanations is to apply interpretable models (simple models) to the clinical data. Linear/logistic regression, decision tree, naive Bayes and k-nearest neighbor are the most commonly used explanatory models. However, simple models such as logistic regression can only represent linear relationships between the input and output, which often oversimplifies the complex relationships found in practice and usually yields unsatisfactory predictive performance. In a low-risk setting (e.g., a music recommender system), it may be good enough that a simple model performs well on a test dataset. In a high-risk medical setting, however, prediction performance underpins the reliability of the model, while explanations give clinicians a deeper understanding of the problem, the data and the reasons why a model might fail. Thus, both prediction performance and explanation are important to clinicians when designing a CAD system [23].

To achieve high prediction performance, the complex models SVM, XGBoost and RF are employed. SVM solves a convex optimization problem that partitions the data by searching for the hyperplane with the maximum margin [24]. XGBoost is an optimized distributed gradient boosting model, which is designed to be efficient, flexible and portable [25]. RF is an ensemble of decision trees and is widely used in the analysis and modeling of medical scenarios due to its speed, high accuracy and robustness [26]. In the field of data science, SVM, XGBoost and RF are among the most popular models, and they are also the most commonly used in hepatitis-assisted decision-making systems. However, in the high-risk medical setting it is not enough to know only what a black-box model predicts.
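A brief sketch of how these three models might be instantiated with scikit-learn and the xgboost package is given below; the hyperparameters are illustrative defaults rather than the tuned settings used in the experiments, and the resampled data come from the preprocessing step above.

```python
# Sketch of the complex-model setup (illustrative hyperparameters).
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "SVM": SVC(kernel="rbf", C=1.0, probability=True),
    "XGBoost": XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_res, y_res)        # X_res, y_res from the SMOTE step

rf = models["RF"]                  # retained for the explanation steps below
```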

Model-agnostic explanations

To explain the complex models, model-agnostic interpretation methods, a recent advance in machine learning, are applied to obtain explanations while retaining good prediction performance. Compared with model-specific explanation methods, model-agnostic interpretation is more flexible because it separates the model from its explanations [27]. Model-agnostic interpretation methods can be divided into two categories: local explanation and global explanation [28]. LIME is the most commonly used local explanation method, while PDP and SHAP are the most popular global interpretation approaches.

LIME, as a local explanation method, trains local surrogate models to provide interpretability for the complex models. First, LIME creates a new dataset by perturbing the data. Then, LIME trains an interpretable model, such as a decision tree, on the new dataset. Finally, the predictions of the black-box model are compared with those of the interpretable model. LIME is defined as follows:

$$ \gamma(x) = \underset{g \in G}{\arg\min} L(f,g,\pi_{x}) + {\Omega}(g) $$
(1)

where the loss function L measures how close the predictions of the interpretable model g are to those of the original complex model f; f is the original complex model; g denotes the interpretable model for the instance x (e.g., logistic regression); G indicates the family of interpretable models; πx represents the proximity of the sampled instances to the instance x; and Ω(g) is the complexity of the model g.
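As an illustration of Eq. (1), a minimal local-surrogate sketch is given below (not the full LIME algorithm): the instance is perturbed, the samples are weighted by their proximity πx, and a simple weighted linear model g is fitted to mimic the complex model f locally. The sampling scheme and kernel width are illustrative choices.

```python
# Minimal sketch of the local-surrogate idea in Eq. (1); not the lime package itself.
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(predict_fn, x, n_samples=500, sigma=0.75):
    """predict_fn: probability output of the black-box model f for one class."""
    rng = np.random.default_rng(0)
    Z = x + rng.normal(scale=0.1, size=(n_samples, x.shape[0]))   # perturbed samples
    pi_x = np.exp(-np.sum((Z - x) ** 2, axis=1) / sigma ** 2)     # proximity weights
    g = Ridge(alpha=1.0)                                          # interpretable model g
    g.fit(Z, predict_fn(Z), sample_weight=pi_x)                   # minimise L(f, g, pi_x)
    return g.coef_                                                # local feature weights
```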

PDP demonstrates the marginal effect of a single feature on the predicted outcome of a complex machine learning model. PDP represents the relationship (linear, monotonic or more complex) between the outcome and an input feature. The partial dependence function \(\hat{f}_{x_{s}}\) is defined as:

$$ \hat{f}_{x_{s}}(x_{s}) = \frac{1}{n} \sum\limits_{i=1}^{n} \hat{f}(x_{s},{x_{c}^{i}}) $$
(2)

where \(\hat{f}_{x_{s}}(x_{s})\) is the partial dependence function, which displays the global relationship of an input feature with the predicted outcome; s is a feature set containing only one or two features; xs denotes the set of features to be plotted by \(\hat{f}_{x_{s}}(x_{s})\); xc indicates the other features used in the machine learning model \(\hat{f}\); \({x_{c}^{i}}\) denotes the actual values in the dataset of the features in which we are not interested; and n is the number of instances in the dataset.
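The following sketch mirrors Eq. (2) directly: the feature of interest is swept over a grid while the remaining features keep their observed values, and the model predictions are averaged over all n instances. The fitted random forest and the resampled feature matrix from the earlier steps are assumed.

```python
# Sketch of Eq. (2): one-dimensional partial dependence for a single feature.
import numpy as np

def partial_dependence_1d(model, X, feature_idx, grid):
    X = np.asarray(X, dtype=float)
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value                 # fix x_s, keep the other x_c^i
        pd_values.append(model.predict_proba(X_mod)[:, 1].mean())  # average over n instances
    return np.array(pd_values)

# Example call for an (assumed) bilirubin column index j:
# pdp = partial_dependence_1d(rf, X_res, j, np.linspace(0.3, 4.0, 20))
```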

SHAP uses Shapley values to measure the impact of each feature on the complex model. The Shapley value is defined as the (weighted) average of marginal contributions [29]. It characterizes the impact of a feature value on the prediction across all possible coalitions [30]. The Shapley value is defined as:

$$ \phi_{j}(x) = \sum\limits_{s \subseteq \{x_{1},x_{2},...,x_{m}\} \setminus \{x_{j}\}} \frac{|s|!\,(m - |s| - 1)!}{m!} \left( val(s \cup \{x_{j}\}) - val(s) \right) $$
(3)

where ϕj(x) is the Shapley value of xj; xj represents a feature value; s is a feature subset of the model; m is the number of features; and val(s) is the prediction for the feature values in the set s.

Results

Prediction results

We implement the interpretable framework on the Python 3.6.4 development platform. We calculate the overall accuracy of the simple and complex models using K-fold cross-validation, with K set to 5, 10 or 20. The evaluation of the simple and complex models on the hepatitis data is shown in Table 2. When K = 5, the prediction accuracies of the developed LR, Classification and Regression Tree (CART), KNN, NBM, SVM, XGBoost and RF models with K-fold cross-validation are 85.4%, 85.4%, 77.2%, 74.0%, 88.2%, 87.4% and 88.2%, respectively; SVM and RF perform better than the other models. When K = 10, the accuracies are 85.7%, 87.0%, 79.2%, 72.7%, 86.9%, 89.8% and 91.0%, respectively; RF performs better than the other models. When K = 20, the accuracies are 87.5%, 85.9%, 78.9%, 76.7%, 88.3%, 89.6% and 91.9%, respectively.
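A sketch of this evaluation protocol is given below, assuming the models dictionary and the resampled data from the earlier steps; the simple models are built analogously from their scikit-learn implementations.

```python
# Sketch of the K-fold evaluation for K in {5, 10, 20}.
from sklearn.model_selection import cross_val_score

for k in (5, 10, 20):
    for name, model in models.items():
        scores = cross_val_score(model, X_res, y_res, cv=k, scoring="accuracy")
        print(f"K={k:2d}  {name:8s}  accuracy={scores.mean():.3f}")
```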

Table 2 Comparison results on the hepatitis dataset using K-fold cross-validation

We find that the developed complex models such as SVM, XGBoost and RF achieve better performance than the simple ones. In particular, RF obtains the best predictive performance, mainly because the collected data fit RF better. However, RF is a black-box model. To obtain an explanation of RF, the global and local interpretation methods are applied while retaining its good prediction performance.

Global explanations

To gain insight into the impact of each predictor on the output of the complex model, we compute the mean SHAP values of the random forest. Figure 3 demonstrates the average feature impact of the developed RF classifier. We can see that ascites, spiders, bilirubin, albumin, malaise, varices and spleen palpable have more impact than the other features. The explanations of the feature impact are broadly in accordance with the literature and the prior knowledge of hepatobiliary physicians.
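A sketch of how these SHAP summaries might be produced with the shap package is shown below; the fitted random forest and resampled feature matrix from the earlier steps are assumed, and the class index selected for plotting depends on the shap version and the class ordering.

```python
# Sketch of the global SHAP analysis of the random forest.
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_res)   # per-class SHAP values for a classifier

# Older shap releases return one array per class; index 1 is assumed here to
# select the class of interest.
shap.summary_plot(shap_values[1], X_res, plot_type="bar")   # mean |SHAP| (cf. Fig. 3)
shap.summary_plot(shap_values[1], X_res)                    # beeswarm plot (cf. Fig. 4)
```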

Fig. 3 Average feature impact of the developed RF classifier

Figure 4 represents the averaged feature-importance estimates extracted from the random forest classifier. The horizontal axis (x-axis) represents the Shapley value, which denotes the marginal contribution of a feature value to the output averaged across all possible coalitions. A Shapley value less than 0, equal to 0 or greater than 0 indicates a negative contribution, no contribution or a positive contribution, respectively. The left longitudinal coordinate (y-axis) lists the features sorted by importance, and the right longitudinal coordinate expresses the value of the features from low to high. We can see that ascites is the most important feature on average, and the developed random forest classifier is more likely to consider a hepatitis patient as high risk when the feature value of ascites becomes larger. Compared with traditional feature importance, the interpretable framework we propose can assist hepatobiliary physicians in predicting the deterioration risk of hepatitis.

Fig. 4 Averaged feature-importance estimates of the random forest

SHAP helps hepatobiliary physicians probe the feature contributions of the developed model, but it is also clinically meaningful to explore how each feature affects the model's decision-making. Thus, PDP is applied to visualize the linear, monotonic or more complex relationship between the output and a feature. To visualize the PDP for continuous features, we examine the effects of bilirubin and alkphosphate on the predicted output.

Figure 5 shows the relationship between bilirubin and the prediction of patient outcomes. It can be seen that there is a complex relationship between the output and the feature bilirubin. First, the impact of bilirubin on the output increases as the value changes from 0.3 to 0.9. Then, the impact falls as the value changes from 0.9 to 2.5. Finally, the impact remains the same when the value is greater than 2.5.

Fig. 5 Partial dependence plot of the feature Bilirubin

Similarly, Fig. 6 depicts the relationship between alkphosphate and the prediction of patient outcomes; there is also a complex relationship between the output and this feature. First, the impact of alkphosphate on the output falls as the value changes from 26 to 106. Then, the impact increases as the value changes from 106 to 250. Finally, the impact remains the same when the value exceeds 250.

Fig. 6 Partial dependence plot of the feature AlkPhosphate

Local explanations

After considering the global explanations of the predicted outcomes for hepatitis patients, it is also essential to understand whether the condition of a specific hepatitis patient will worsen. To explain how individual outcome predictions for hepatitis patients are made by the black-box machine learning model, LIME is employed to train a local surrogate model instead of a global surrogate model. Figure 7 shows the LIME explanations for one instance randomly selected from the hepatitis dataset.

Fig. 7 LIME explanations for one instance from the hepatitis dataset

The top left diagram shows the predicted outcome of the hepatitis patient with its probability. Class 1 indicates a hepatitis patient with a death outcome, while class 2 represents a hepatitis patient with a survival outcome. The developed RF classifier predicts that the randomly selected hepatitis patient survives with 93% probability (7% probability of death). The orange color represents the target class 1, whereas the blue color represents the target class 2. The weight of each feature for its predicted class is denoted by color; these weights represent the local positive or negative contributions assigned to each feature. The greater the weight, the longer the color bar.
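A sketch of this local explanation step with the lime package is shown below; the fitted random forest, the preprocessed feature matrix from the earlier steps and the chosen record index are assumptions for illustration.

```python
# Sketch of the LIME explanation for a single (arbitrarily chosen) patient record.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

X_arr = np.asarray(X_res, dtype=float)

explainer = LimeTabularExplainer(
    training_data=X_arr,
    feature_names=list(X.columns),                   # attribute names from Table 1
    class_names=["die (class 1)", "live (class 2)"],
    discretize_continuous=True,
)

exp = explainer.explain_instance(X_arr[42], rf.predict_proba, num_features=10)
print(exp.as_list())            # local feature weights for the explained class
```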

Discussion

We investigate the use of XAI frameworks and an example of such an application to support the healthcare of hepatitis patients. Our research can be summarized as follows. First, both interpretable and complex models are utilized to identify the exacerbation risk in patients with hepatitis; in particular, to improve the prediction accuracy, complex models based on decision trees are introduced. Second, global and local explanation methods are employed to avoid the obscurity of the complex models. Third, predictors such as ascites, spiders, bilirubin, albumin, malaise, varices and spleen palpable appear to have greater clinical significance than the other predictors. This can help hepatobiliary physicians gain insight into the predictions made by the clinical decision support system and thereby make more accurate clinical diagnoses.

Lundberg et al. employed a single complex model (XGBoost) and the explanation method SHAP to predict intraoperative hypoxaemia events from electronically recorded data before they occur [17]. Due to the complexity of clinical decision-making, it is often more convincing to adopt multiple models and interpretation methods. In contrast to the Prescience system developed by Lundberg et al. (2018), we stress the integration of multiple complex models and interpretable methods to improve the clinical understanding of hepatitis exacerbation risk.

There are some limitations in our research. First, to make the experiment objective and fair, the present study uses a hepatitis benchmark from the UCI Machine Learning Repository, in which the number of patients is relatively small. To ensure the generalization ability of the model, K-fold cross-validation is applied; however, more hepatitis data would be needed to conclusively validate the results. In particular, real-time hepatitis data are the key to monitoring the exacerbation risk in patients with hepatitis. In the future, we will collect data from more hepatitis patients in the real world. Second, we employ three typical model-agnostic methods to improve the explanation of the complex models. However, other interpretability methods, such as counterfactual explanation, which may help improve the explanations, are ignored for now because the construction of counterfactual samples in medicine often requires substantial human and material resources.

Conclusion

In this study, we propose an interpretable machine learning framework, which combines complex models with recently developed explanation methods, to reliably forecast the exacerbation risk of hepatitis. To evaluate the feasibility of the proposed framework, a hepatitis benchmark from the UCI Machine Learning Repository is used. The results show that random forest achieves the best overall accuracy (91.9%). The detailed evaluation of the proposed framework is shown in Table 2. The explanation results generated by the proposed framework agree with the characteristics of hepatitis, which may improve the diagnostic accuracy of clinicians. In addition, our proposed framework could help hepatobiliary physicians choose the right model structure when designing a CAD system. Our work highlights the value of XAI frameworks in interpreting black-box models such as RF, which supports the use of AI in healthcare. Further research can focus on the collection of real-time hepatitis data and the exploration of novel model-agnostic methods.