1 Introduction

Artificial intelligence (AI) has recently advanced rapidly in many applications, e.g., autonomous driving, big data, pattern recognition, intelligent search, image understanding, automatic programming, and robotics [1]. These applications in turn drive AI techniques to develop and innovate. The increasing availability of healthcare data and the rapid development of big data analytic methods have made possible the recent successful applications of AI in healthcare [2]. Machine learning, one of the core technologies of AI, is now widely used across many fields. In recent years, the healthcare industry has produced a huge amount of digital data by drawing on all sources of healthcare data, such as Electronic Health Records [3] and Personal Health Records [4]. At the same time, machine learning is well poised to assist clinical researchers in deciphering complex predictive patterns in healthcare data [5]. Together, these developments provide the basis for disease prognosis with machine learning techniques.

The incidence of thyroid cancer has nearly tripled since 1975 [6]. In papillary thyroid microcarcinoma (PTMC), the reported prevalence of subclinical central lymph node metastasis (CLNM) is 30%–65% [7], and PTMC can lead to cancer recurrence. It is therefore urgent to introduce machine learning into the field of thyroid disease. To address the prognostic problem in thyroid disease, we propose a disease diagnosis model and apply it to thyroid disease diagnosis in the First Hospital of Jilin University.

The technical contributions of this paper are summarized as follows:

  1. We propose MsaDtd, an algorithm that converts the original feature space into a larger feature space and uses an improved decision tree for disease diagnosis to predict LNM in patients with PTMC.

  2. We use MS-Apriori to obtain composite features, taking rare items into account by setting multiple minimum item supports (MIS), and introduce fuzzy logic to handle continuous attributes, aiming to avoid the cost of generating a large number of frequent items.

  3. We use the clinical-pathological data of 5425 PTMC patients in the First Hospital of Jilin University to validate MsaDtd. Experimental analysis indicates that the algorithm predicts LNM effectively and accurately.

2 Related Work

Prediction of thyroid diseases using machine learning has been an ongoing effort in recent years. Chen et al. [8] presented a three-stage expert system based on a hybrid support vector machine (SVM). By combining feature selection and parameter optimization, the resulting FS-PSO-SVM expert system achieved excellent performance in distinguishing among hyperthyroid, hypothyroid, and normal cases. Makas et al. [9] developed seven distinct types of neural networks to identify thyroid disease and retrained the networks with the particle swarm optimization (PSO), artificial bee colony (ABC), and migrating birds optimization (MBO) algorithms. The accuracy of the resulting network exceeded that of similar studies. Pourahmad et al. [10] used a back-propagation feedforward neural network to diagnose malignancy in thyroid tumors. Thirteen batch learning algorithms were investigated, and three different numbers of hidden-layer neurons were compared to achieve the best performance. Kaya et al. [11] applied the Extreme Learning Machine (ELM) to the diagnosis of thyroid disease. Their study indicated that the classification performance and speed of ELM were higher than those of other machine learning methods. Maysanjaya et al. [12] used a Multilayer Perceptron in the WEKA tool to identify the thyroid state (normal, hypothyroid, hyperthyroid), with a prediction accuracy as high as 96.74%.

Much research has addressed the diagnosis of thyroid diseases, but few studies address the prognosis of LNM in patients with PTMC, which is essential to preventing cancer recurrence. This paper therefore designs an intelligent decision model, MsaDtd, to predict lymph node metastasis (LNM) in patients with PTMC.

3 The Prognosis Algorithm Based on MS-Apriori and Decision Tree

We design a disease diagnostic algorithm by mapping the prognosis of LNM in patients with PTMC to a binary classification problem. The symptoms of patients are mapped to independent variables \( \varvec{u} = (u_{1} ;u_{2} ; \ldots ;u_{d} ) \) and diagnostic results are mapped to dependent variables \( y \in \{ 0,1\} \).

3.1 MS-Apriori Rule Mining

Apriori plays a major role in identifying frequent itemsets and deriving rule sets from them [13]. However, because Apriori sets a single minimum support for the entire database, it falls short when mining knowledge patterns of rare events. To solve this problem, we use MS-Apriori, which sets a separate MIS for different items.

For continuous attribute values, this paper introduces fuzzy logic to map attribute values to different subintervals through membership functions, aiming to avoid the cost of generating a large number of frequent items.
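As a minimal sketch of this discretization step, the following Python code maps a continuous value to the fuzzy subinterval in which it has the highest membership degree. The triangular membership function, the label names, and the breakpoints for maximum tumor diameter are illustrative assumptions; the membership functions actually used are not specified here.

```python
def triangular(x, left, peak, right):
    """Membership degree of x in a triangular fuzzy set (left, peak, right)."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets for maximum tumor diameter in cm (assumed values).
MTD_SETS = {
    "MTD_small": (0.0, 0.0, 0.5),
    "MTD_medium": (0.2, 0.5, 0.8),
    "MTD_large": (0.5, 1.0, 1.0),
}


def fuzzify(x, fuzzy_sets):
    """Return the label of the fuzzy set with the highest membership for x."""
    degrees = {label: triangular(x, *abc) for label, abc in fuzzy_sets.items()}
    return max(degrees, key=degrees.get)


if __name__ == "__main__":
    for diameter in (0.1, 0.5, 0.9):
        print(diameter, fuzzify(diameter, MTD_SETS))
```

Each continuous attribute then contributes only a handful of items (one per fuzzy subinterval) instead of one item per distinct value, which is what keeps the number of candidate itemsets manageable.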

The association rule mining process is as follows. An item type \( v_{i} \) is defined as each value type under each attribute in the clinical-pathological data. The set of items in the whole database, I, is given in Eq. (1), and the set of item types, V, is given in Eq. (2).

$$ I = \{ a_{1} ,a_{2} , \ldots ,a_{m} \} \,{ = }\,IA_{1} \cup IA_{2} \cup \ldots \cup IA_{d} ,\;m\, = \,n*d $$
(1)
$$ V = \{ v_{i} \} ,\;i = 1,2, \ldots ,u $$
(2)

The whole database contains m items, u item types, and d feature dimensions. \( IA = \{ a_{i} \} \;(i = 1,2, \ldots ,n) \) represents the fuzzy itemset under one attribute. We require that different values of the same attribute never belong to the same frequent itemset, so the items in a frequent itemset must satisfy the condition in Eq. (3).

$$ a_{x} \cap a_{y} = \emptyset ,\;a_{x} \in IA_{i} ,a_{y} \in IA_{j} ,i = j $$
(3)

In addition, the minimum support of a frequent itemset is the smallest MIS among its items. A frequent itemset is defined in Eq. (4), the MIS of a frequent itemset c in Eq. (5), and the MIS of an item in Eq. (6).

$$ c = \{ a_{1} ,a_{2} , \ldots a_{k} \} ,\;1 \le k \le d $$
(4)
$$ MIS(c) = \min (MIS(a_{1} ),MIS(a_{2} ), \ldots ,MIS(a_{k} )) $$
(5)
$$ MIS(v_{i} ) = \frac{{v_{i} \cup LM_{yes} }}{N} $$
(6)

Here \( v_{i} \) represents an item, corresponding to a value type in the clinical-pathological data, and \( LM_{yes} \) is the label of patients with lymph node metastasis. N is the total number of instances, and the numerator of Eq. (6) counts the instances that contain both \( v_{i} \) and \( LM_{yes} \). The MIS of \( v_{i} \) is thus set to the probability that item \( v_{i} \) and item \( LM_{yes} \) appear in the same itemset.
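The MIS assignment of Eqs. (5) and (6) can be sketched as follows. The representation of each instance as a Python set of item strings, the function names, and the toy records are assumptions made only for illustration.

```python
def item_mis(transactions, item, label="LM_yes"):
    """MIS of an item, Eq. (6): share of instances containing item and label."""
    n = len(transactions)
    both = sum(1 for t in transactions if item in t and label in t)
    return both / n


def itemset_mis(transactions, itemset, label="LM_yes"):
    """MIS of an itemset, Eq. (5): the smallest MIS among its items."""
    return min(item_mis(transactions, a, label) for a in itemset)


# Toy instances (made up): each is a set of items plus its class label.
toy = [
    {"male", "CI_yes", "LM_yes"},
    {"male", "CI_no", "LM_no"},
    {"female", "CI_yes", "LM_yes"},
    {"female", "CI_no", "LM_no"},
    {"male", "CI_yes", "LM_no"},
]

if __name__ == "__main__":
    print(item_mis(toy, "CI_yes"))                 # 2 of 5 instances
    print(itemset_mis(toy, {"male", "CI_yes"}))    # limited by "male"
```

Because an itemset inherits the lowest MIS of its members, an itemset that contains a rare item is held to that item's lower bar rather than to a single global minimum support.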

The frequent itemset \( c_{j} \) is converted to the rule \( Rule_{j} \) as shown in Eqs. (7) and (8).

$$ c_{j} :a_{1} \cup a_{2} \cup \ldots \cup LM_{yes} /LM_{no} $$
(7)
$$ Rule_{j} \rightarrow LM_{yes} ,\;Rule_{j} :a_{1} \cup a_{2} \cup \ldots \cup a_{k - 1} $$
(8)

We rank the rules by the cosine measure and delete disturbance rules by defining a threshold. The cosine measure of positive tuple rules is defined by Eq. (9).

$$ \mathrm{cosine}(Rule_{j} ,LM_{yes} ) = \frac{{P(Rule_{j} \cup LM_{yes} )}}{{\sqrt {P(Rule_{j} )*P(LM_{yes} )} }} $$
(9)

\( P(Rule_{j} \cup LM_{yes} ) \) represents the probability that \( Rule_{j} \) and \( LM_{yes} \) belong to the same frequent itemset. The cosine measure of negative tuple rules is defined in Eq. (10).

$$ \mathrm{cosine}(Rule_{j} ,LM_{no} ) = \frac{{P(Rule_{j} \cup LM_{no} )}}{{\sqrt {P(Rule_{j} )*P(LM_{no} )} }} $$
(10)
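A small sketch of the cosine ranking in Eqs. (9) and (10) is given below. It estimates each probability as a relative frequency over the instances. The set-based instance representation, the function names, and the toy data are assumptions for illustration, and the threshold value is arbitrary.

```python
import math


def cosine(transactions, rule_items, label):
    """Cosine measure between a rule antecedent and a class label."""
    n = len(transactions)
    rule_items = set(rule_items)
    n_rule = sum(1 for t in transactions if rule_items <= t)
    n_label = sum(1 for t in transactions if label in t)
    n_both = sum(1 for t in transactions if rule_items <= t and label in t)
    if n_rule == 0 or n_label == 0:
        return 0.0
    return (n_both / n) / math.sqrt((n_rule / n) * (n_label / n))


def rank_rules(transactions, rules, threshold):
    """Keep rules whose cosine reaches the threshold, sorted best first."""
    scored = [(cosine(transactions, items, label), items, label)
              for items, label in rules]
    kept = [r for r in scored if r[0] >= threshold]
    return sorted(kept, key=lambda r: r[0], reverse=True)


# Toy instances (made up): item sets including the class label.
toy = [
    {"male", "CI_yes", "LM_yes"},
    {"male", "CI_no", "LM_no"},
    {"female", "CI_yes", "LM_yes"},
    {"female", "CI_no", "LM_no"},
    {"male", "CI_yes", "LM_no"},
]
toy_rules = [({"CI_yes"}, "LM_yes"), ({"male"}, "LM_yes"), ({"CI_no"}, "LM_no")]

if __name__ == "__main__":
    for score, items, label in rank_rules(toy, toy_rules, threshold=0.5):
        print(round(score, 3), sorted(items), "->", label)
```

The measure equals 1 only when the antecedent and the label occur in exactly the same instances, so a threshold close to 1 keeps only rules whose antecedent is nearly equivalent to the label.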

Algorithm 1 outlines the process of rule mining with MS-Apriori. The support difference constraint (SDC) limits how far a rare item and a common item may differ in support while still appearing in the same frequent itemset, and threshold is used to delete disturbance rules.

[Algorithm 1: rule mining by MS-Apriori]
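The SDC check can be illustrated with a short sketch. It assumes that SDC is the support difference constraint of the standard MS-Apriori formulation, under which an itemset is admissible only if the gap between the largest and smallest item supports does not exceed a bound. The function name, the support values, and the bound are hypothetical.

```python
def satisfies_sdc(itemset, support, sdc):
    """True if max item support minus min item support is at most sdc."""
    sups = [support[a] for a in itemset]
    return max(sups) - min(sups) <= sdc


# Hypothetical item supports (fractions of instances containing each item).
support = {"male": 0.60, "CI_yes": 0.55, "HT_yes": 0.05}

if __name__ == "__main__":
    print(satisfies_sdc(["male", "CI_yes"], support, sdc=0.3))
    print(satisfies_sdc(["male", "HT_yes"], support, sdc=0.3))
```

In a level-wise miner, a check of this kind would typically be applied when candidates are generated, so that itemsets mixing very frequent and very rare items are discarded before their support is counted.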

3.2 Decision Tree Construction

We obtain the sorted rule set \( R \, = \, \{ \, rule|cosine\left( {rule} \right) \, \ge \, threshold\} \), which is closely related to LNM diagnosis, by mining association rules in the clinical-pathological data. Next, we build a decision tree that is used to predict LNM.

By converting each rule in the rule set R into a candidate attribute of the decision tree, the algorithm generates the attribute set A. Information gain is used as the criterion for deciding which rule is selected as the splitting attribute during classification. A rule \( rule_{i} \) can be applied to an instance when the instance contains all items required by \( rule_{i} \). Treated as a new attribute, \( rule_{i} \) takes the value \( LM_{yes} \) or \( LM_{no} \) when it applies: if the rule is a positive tuple rule, the value of \( rule_{i} \) is \( LM_{yes} \) after applying the rule, and if it is a negative tuple rule, the value is \( LM_{no} \). If the rule cannot be applied, the value is No. The dataset D is thereby converted to \( S = \{ (\varvec{x}_{i} ,y_{i} )\} ,i = (1,2, \ldots ,n),y_{i} \in \{ 0,1\} \). The class labels of the dataset are LNM and normal, and we denote the corresponding subsets by \( S_{1} \) and \( S_{0} \). The information entropy of S is defined in Eq. (11).
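The conversion of rules into attributes described above can be sketched as follows. Representing an instance as a set of items and a rule as an (items, label) pair is an assumption for illustration, as are the function names and toy values.

```python
def apply_rule(instance_items, rule_items, rule_label):
    """Value of the rule-derived attribute for one instance.

    Returns rule_label ("LM_yes" or "LM_no") if the instance contains every
    item of the rule, and "No" if the rule cannot be applied.
    """
    return rule_label if set(rule_items) <= set(instance_items) else "No"


def transform(instances, rules):
    """Map each instance to its vector of rule-derived attribute values."""
    return [[apply_rule(x, items, label) for items, label in rules]
            for x in instances]


# Toy rules and instances (made up).
rules = [({"male", "CI_yes"}, "LM_yes"), ({"CI_no"}, "LM_no")]
instances = [{"male", "CI_yes"}, {"female", "CI_no"}, {"female", "CI_yes"}]

if __name__ == "__main__":
    for row in transform(instances, rules):
        print(row)
```

Each column of the transformed data corresponds to one mined rule, so a split in the decision tree tests a combination of original items rather than a single one.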

$$ H(S) = - \sum\limits_{i = 0}^{1} {p_{i} \log_{2} p_{i} } $$
(11)
$$ p_{i} = \frac{{S_{i} }}{S},\;i = 0,1 $$
(12)

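where \( p_{i} \) represents the probability that an instance in S belongs to class \( S_{i} \), estimated by Eq. (12) as the fraction of the instances of S that fall in \( S_{i} \). The information gain for attribute \( r \in A \) at node N is defined in Eq. (13), in which \( S_{j} \) denotes the subset of instances at node N that share the j-th value of r.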

$$ Gain(S,r,N) = H(S) - \sum\limits_{j = 0}^{1} {\frac{{S_{j} }}{S}H(S_{j} )} $$
(13)

The attribute with the maximum information gain is selected as the splitting attribute at node N. The instances are recursively partitioned into smaller subsets by analyzing the affiliation between instances and the rules mined by MS-Apriori. Partitioning stops when every subset belongs to a single class or when no instances or attributes remain for partitioning, at which point the model used to predict LNM is complete.
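The split selection based on Eqs. (11) to (13) can be sketched as below. The column-dictionary layout, the function names, and the toy values are assumptions for illustration; the sketch covers only the choice of one splitting attribute, not the recursive tree construction.

```python
import math
from collections import Counter


def entropy(labels):
    """Information entropy of a list of class labels, Eq. (11)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting on an attribute, Eq. (13)."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def best_attribute(columns, labels):
    """Name of the rule-derived attribute with the maximum information gain."""
    return max(columns, key=lambda name: information_gain(columns[name], labels))


# Toy rule-derived attributes and class labels (made up).
labels = [1, 1, 0, 0]
columns = {
    "rule_a": ["LM_yes", "LM_yes", "No", "No"],
    "rule_b": ["LM_yes", "No", "LM_yes", "No"],
}

if __name__ == "__main__":
    for name, values in columns.items():
        print(name, information_gain(values, labels))
    print("split on:", best_attribute(columns, labels))
```

In this toy case rule_a separates the two classes perfectly while rule_b carries no information about the label, so rule_a is chosen as the splitting attribute.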

4 Experiments

4.1 Data Pre-processing

This study is conducted in the Department of Thyroid Surgery of the First Hospital of Jilin University. A total of 5425 patients with PTMC who underwent thyroidectomy with neck dissection from 2011 to 2015 are studied. Among these patients, 4855 cases met the criteria, including 323 cases treated with lateral neck dissection.

Features used in this study include gender, age, capsule invasion (CI), maximum tumor diameter (MTD), multifocality, Hashimoto thyroiditis (HT), and central lymph node number (CN). These features are described in Table 1. For lateral lymph node metastasis (LLNM), two additional features are used: CLNM and lateral lymph node number (LN).

Table 1. Description of feature

In this paper, we use box plots to analyze the data. We identify noisy values by the interquartile range (IQR) and set them to null, because the box plot identifies abnormal values objectively and quartiles have a certain degree of robustness. For missing values, to avoid the information loss caused by deleting instances, we infer them from the majority of the existing data using mean/mode imputation (MMI). Because the data are unbalanced, training a predictive model on them directly introduces bias. To address this skew, we use balancing techniques. On the CLNM dataset, we use KNN-NearMiss-2, a supervised under-sampling technique based on K-nearest neighbors. On the LLNM dataset, we use the SMOTE over-sampling technique because of the small number of instances.
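The noise-handling and imputation steps can be sketched with the Python standard library as follows. The whisker factor k = 1.5 is the conventional box-plot choice and is an assumption here, as are the function names and the toy values. The quartiles come from `statistics.quantiles` with its default method, which may differ slightly from the quartile definition used in the study. The balancing step (NearMiss-2 or SMOTE) is not shown.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper box-plot fences computed from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean_numeric(values, k=1.5):
    """Set values outside the fences to None, then impute None with the mean."""
    observed = [v for v in values if v is not None]
    low, high = iqr_bounds(observed, k)
    masked = [v if v is not None and low <= v <= high else None for v in values]
    mean = statistics.fmean([v for v in masked if v is not None])
    return [mean if v is None else v for v in masked]


def impute_mode(values):
    """Impute missing categorical values with the most frequent value."""
    mode = statistics.mode([v for v in values if v is not None])
    return [mode if v is None else v for v in values]


if __name__ == "__main__":
    print(clean_numeric([1, 2, 3, 4, 5, 6, 7, 8, 100]))
    print(impute_mode(["yes", None, "yes", "no"]))
```

Masking outliers before computing the mean keeps a single extreme value from distorting the imputed value for every missing entry in the column.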

4.2 Results and Discussion

The proposed predictor is applied to the clinical-pathological data of the First Hospital of Jilin University. To illustrate the performance of MsaDtd, we compare it with a range of baseline algorithms: Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR), and Bernoulli Bayes (BNB). We use 10-fold cross-validation to validate the MsaDtd algorithm on the CLNM and LLNM datasets.
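For reference, the 10-fold protocol can be sketched as an index generator. The shuffling seed, the function name, and the way folds are formed are assumptions; the exact splitting procedure used in the experiments is not stated.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


if __name__ == "__main__":
    for fold, (train, test) in enumerate(k_fold_indices(25, k=10)):
        print(fold, len(train), len(test))
```

Every instance appears in exactly one test fold, and the reported metrics are then averaged over the ten folds. When resampling is used to balance classes, it is generally applied to the training folds only, so that the test folds keep the original class distribution.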

Tables 2 and 3 show the results of the various algorithms on the CLNM and LLNM datasets, respectively. On the CLNM dataset, MsaDtd achieves Accuracy, Precision, Recall, F1, and AUC values of 76.09%, 72.16%, 63.63%, 72.63%, and 82.06%. The accuracy of the improved decision tree is higher than that of the traditional decision tree and the other classifiers, exceeding the traditional decision tree by 2.47%. On the LLNM dataset, the average prediction Accuracy, Recall, Precision, F1, and AUC of MsaDtd are 87.21%, 82.75%, 85.86%, 86.85%, and 88.37%. Our method outperforms the traditional decision tree on every metric: Accuracy, Recall, Precision, F1, and AUC increase by 3.51%, 4.21%, 1.91%, 3.09%, and 5.17%, respectively. Our method also has the highest Accuracy, Recall, Precision, and AUC among the methods compared.
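The threshold-based metrics reported here follow the usual confusion-matrix definitions, sketched below for binary labels with 1 denoting LNM. The function name and the toy predictions are illustrative assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(classification_metrics(y_true, y_pred))
```

F1 is the harmonic mean of precision and recall, so for a single evaluation it always lies between the two. Values averaged over cross-validation folds need not satisfy this exactly.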

Table 2. Performance comparison with baseline algorithms on CLNM dataset
Table 3. Performance comparison with baseline algorithms on LLNM dataset

Figures 1 and 2 plot the ROC curves of MsaDtd and the baseline algorithms on the two datasets. On the CLNM dataset, the ROC area of MsaDtd is 6.69% higher than that of LR, which has the highest ROC area among the baseline algorithms. On the LLNM dataset, the ROC area of MsaDtd is 88.37%, which is higher than that of all the other methods. These results show the superior performance of the proposed predictor.
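The ROC area can be computed directly from predicted scores. The sketch below uses the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve; the function name and the toy scores are assumptions, and the pairwise loop is written for clarity rather than speed.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    instance receives the higher score; ties count as half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Unlike accuracy, this quantity does not depend on a single decision threshold, which makes it a useful complement when the classes are unbalanced.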

Fig. 1. ROC curve comparison with baseline algorithms on CLNM dataset

Fig. 2. ROC curve comparison with baseline algorithms on LLNM dataset

To our knowledge, almost no specialized algorithm has been proposed in recent years for the prognosis of lymph node metastasis (LNM) in patients with PTMC, so we compare our method with the classification algorithm DeepPPI-Con [14], which achieves superior performance on protein-protein interactions. The results in Table 4 indicate that our method is significantly superior to DeepPPI. On the CLNM dataset, the Accuracy, Precision, F1, and AUC of MsaDtd are 10.43%, 8.38%, 4.35%, and 7.48% higher than those of DeepPPI. On the LLNM dataset, they are higher by 5.38%, 6.53%, 3.8%, and 2.06%.

Table 4. Performance comparison with DeepPPI on CLNM and LLNM dataset

5 Conclusion

In this paper, we propose MsaDtd, an algorithm that improves the decision tree with MS-Apriori, and apply it to the prognosis of thyroid disease by building a predictor of LNM in patients with PTMC. Fuzzy logic is introduced to handle continuous attributes, preventing the generation of too many frequent items. The rules mined by MS-Apriori are sorted and filtered to remove distracting rules and thereby improve prediction accuracy. By applying the rules, the algorithm obtains new features that transform the feature space, making full use of composite features. This improves the robustness and generalization capability of our algorithm. The decision tree is then built, and thyroid disease is predicted, by analyzing the affiliation between instances and rules. Clinicians can use the information given by the predictor to adopt specific protocols throughout treatment. For patients prone to LNM, clinicians should take customized interventions to reduce the risk of cancer recurrence.