Prognosis of Thyroid Disease Using MS-Apriori Improved Decision Tree

Hao, Yuwei; Zuo, Wanli; Shi, Zhenkun; Yue, Lin; Xue, Shuai; He, Fengling

doi:10.1007/978-3-319-99365-2_40

Yuwei Hao ORCID: orcid.org/0000-0003-3959-5833^16,17,
Wanli Zuo^16,17,
Zhenkun Shi^16,17,
Lin Yue¹⁸,
Shuai Xue¹⁹ &
…
Fengling He^16,17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11061)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

Lymph node metastasis (LNM) in papillary thyroid microcarcinoma (PTMC) can lead to cancer recurrence, so predicting it supports preventive measures that reduce the recurrence rate of thyroid cancer. This paper presents MsaDtd (Decision tree Diagnosis based on MS-Apriori), a decision tree improved by MS-Apriori for the prognosis of LNM in patients with PTMC. The method converts the original feature space into a richer feature space: MS-Apriori generates association rules that account for rare items through multiple minimum supports, and fuzzy logic maps attribute values to different subintervals. We then rank the rules, considering both positive and negative tuples, and filter out disturbance rules to improve accuracy. Finally, a decision tree predicts LNM by analyzing the affiliation between each instance and the rules. Clinical-pathological data were obtained from the First Hospital of Jilin University. The results show that MsaDtd achieves better prediction performance than the other methods compared on the prognosis of LNM.

1 Introduction

Artificial intelligence (AI) has recently advanced rapidly in applications such as autonomous driving, big data, pattern recognition, intelligent search, image understanding, automatic programming, and robotics [1]. These applications in turn drive AI techniques to develop and innovate. The increasing availability of healthcare data and the rapid development of big data analytic methods have made possible the recent successful applications of AI in healthcare [2]. Machine learning, one of the core technologies of AI, is now widely used across many fields. In recent years, the healthcare industry has produced a huge amount of digital data from sources such as Electronic Health Records [3] and Personal Health Records [4]. At the same time, machine learning is well poised to assist clinical researchers in deciphering complex predictive patterns in healthcare data [5]. Together, these developments provide the basis for disease prognosis with machine learning techniques.

The incidence of thyroid cancer has nearly tripled since 1975 [6]. In PTMC, the reported prevalence of subclinical central lymph node metastasis (CLNM) is 30%–65% [7], and PTMC can lead to cancer recurrence. It is therefore urgent to apply machine learning to thyroid disease. To address the prognostic problem, we propose a disease diagnosis model and apply it to thyroid disease diagnosis at the First Hospital of Jilin University.

The technical contributions of this paper are summarized as follows:

1. We propose MsaDtd, an algorithm that converts the original feature space into a larger feature space and uses an improved decision tree to predict LNM in patients with PTMC.
2. We use MS-Apriori to obtain composite features, taking rare items into account by setting multiple minimum item supports (MIS), and introduce fuzzy logic to handle continuous attributes, which avoids the cost of generating a large number of frequent itemsets.
3. We validate MsaDtd on clinical-pathological data from 5425 PTMC patients at the First Hospital of Jilin University. Experimental analysis indicates that the algorithm predicts LNM effectively and accurately.

2 Related Work

Prediction of thyroid diseases using machine learning has been an ongoing effort in recent years. Chen et al. [8] presented a three-stage expert system based on a hybrid support vector machine (SVM). By combining feature selection and parameter optimization, the resulting FS-PSO-SVM expert system achieved excellent performance in distinguishing among hyperthyroid, hypothyroid and normal cases. Makas et al. [9] developed seven distinct types of neural networks to identify thyroid disease and retrained the networks with particle swarm optimization (PSO), artificial bee colony (ABC) and migrating birds optimization (MBO) algorithms; the accuracy of the resulting network exceeded that of similar studies. Pourahmad et al. [10] used back-propagation feedforward neural networks to diagnose malignancy in thyroid tumors, investigating thirteen batch learning algorithms and comparing three different numbers of hidden-layer neurons to achieve the best performance. Kaya et al. [11] applied the Extreme Learning Machine (ELM) to the diagnosis of thyroid disease and reported that the classification performance and speed of ELM were higher than those of other machine learning methods. Maysanjaya et al. [12] used a Multilayer Perceptron with the WEKA tool to identify the thyroid condition (normal, hypothyroid, hyperthyroid), reaching a prediction accuracy of 96.74%.

Much research has addressed the diagnosis of thyroid diseases, but few studies address the prognosis of LNM in patients with PTMC, which is essential to preventing cancer recurrence. This paper therefore designs an intelligent decision model, MsaDtd, to predict lymph node metastasis (LNM) in patients with PTMC.

3 The Prognosis Algorithm Based on MS-Apriori and Decision Tree

We design a disease diagnostic algorithm by mapping the prognosis of LNM in patients with PTMC to a binary classification problem. The symptoms of patients are mapped to independent variables $ \varvec{u} = (u_{1} ;u_{2} ; \ldots ;u_{d} ) $ and diagnostic results are mapped to dependent variables $ y \in \{ 0,1\} $.

3.1 MS-Apriori Rule Mining

Apriori plays a major role in identifying frequent itemsets and deriving rule sets from them [13]. However, Apriori falls short when mining rare patterns of rare events, because a single minimum support is set for the entire database. To solve this problem, we use MS-Apriori, which sets a separate minimum item support (MIS) for each item.

For attribute values, this paper introduces fuzzy logic: a membership function maps each attribute value to one of several subintervals, which avoids the cost of generating a large number of frequent itemsets.
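The mapping from a continuous value to a fuzzy item can be sketched as follows. This is a minimal illustration, not the membership function used in the paper, which is not specified here: the trapezoidal shape, the `AGE_SETS` breakpoints and the helper names `trapezoid`, `memberships` and `fuzzify` are all assumptions made for the example.

```python
# Hypothetical fuzzy sets for the "age" attribute. The breakpoints are
# illustrative assumptions, not values taken from the paper.
AGE_SETS = {
    "age=young": (0, 0, 35, 45),
    "age=middle": (35, 45, 55, 65),
    "age=old": (55, 65, 120, 120),
}


def trapezoid(x, a, b, c, d):
    """Membership rises on [a, b], equals 1 on [b, c], and falls on [c, d]."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def memberships(x, fuzzy_sets):
    """Degree of membership of x in every fuzzy set of one attribute."""
    return {name: trapezoid(x, *params) for name, params in fuzzy_sets.items()}


def fuzzify(x, fuzzy_sets):
    """Map a continuous value to the item with the highest membership."""
    degrees = memberships(x, fuzzy_sets)
    return max(degrees, key=degrees.get)


ages = [30, 42, 50, 70]
items = [fuzzify(a, AGE_SETS) for a in ages]
print(items)
```

Each continuous value becomes a single categorical item, so the number of candidate items per attribute is bounded by the number of fuzzy sets rather than by the number of distinct raw values.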

The association rule mining process is as follows. Each distinct value type under each attribute in the clinical-pathological data defines an item type v_i. The set of items in the whole database, I, is given in Eq. (1), and the set of item types, V, in Eq. (2).

$$ I = \{ a_{1} ,a_{2} , \ldots ,a_{m} \} \,{ = }\,IA_{1} \cup IA_{2} \cup \ldots \cup IA_{d} ,\;m\, = \,n*d $$

(1)

$$ V = \{ v_{i} \} ,\;i = 1,2, \ldots ,u $$

(2)

There are m items, u item types and d feature dimensions in the whole database. $ IA = \{ a_{i} \} \;(i = 1,2, \ldots ,n) $ represents the fuzzy itemset under one attribute. We require that different values of the same attribute never belong to the same frequent itemset, so the items in a frequent itemset must satisfy the condition in Eq. (3).

$$ a_{x} \cap a_{y} = \emptyset ,\;a_{x} \in IA_{i} ,a_{y} \in IA_{j} ,i = j $$

(3)

In addition, the minimum support of a frequent itemset is the smallest MIS among its items. A frequent itemset is defined in Eq. (4), the MIS of a frequent itemset c in Eq. (5), and the MIS of an item in Eq. (6).

$$ c = \{ a_{1} ,a_{2} , \ldots a_{k} \} ,\;1 \le k \le d $$

(4)

$$ MIS(c) = \min (MIS(a_{1} ),MIS(a_{2} ), \ldots ,MIS(a_{k} )) $$

(5)

$$ MIS(v_{i} ) = \frac{{v_{i} \cup LM_{yes} }}{N} $$

(6)

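Here v_i represents an item, corresponding to a value type in the clinical-pathological data; $ LM_{yes} $ is the label indicating that the patient has lymph node metastasis; and N is the total number of instances. In other words, the MIS of v_i is set to the probability that item v_i and item $ LM_{yes} $ appear in the same frequent itemset, so the numerator of Eq. (6) is read here as the number of instances that contain both v_i and $ LM_{yes} $.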
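Eqs. (3), (5) and (6) can be illustrated on a toy database. The sketch below assumes that each instance is a set of `attribute=value` items plus a label item, and reads the numerator of Eq. (6) as a co-occurrence count; the data and the helper names (`item_mis`, `itemset_mis`, `valid_itemset`, `is_frequent`) are hypothetical and serve only the example.

```python
# Toy transactions: each instance is a set of "attribute=value" items plus
# its label. The data are invented for illustration.
TRANSACTIONS = [
    {"gender=F", "CI=yes", "LM_yes"},
    {"gender=F", "CI=no", "LM_no"},
    {"gender=M", "CI=yes", "LM_yes"},
    {"gender=F", "CI=no", "LM_no"},
    {"gender=F", "CI=yes", "LM_no"},
]


def attribute(item):
    """Attribute name of an item such as 'CI=yes'."""
    return item.split("=")[0]


def valid_itemset(itemset):
    """Eq. (3): no two items of one itemset may share an attribute."""
    attrs = [attribute(a) for a in itemset]
    return len(attrs) == len(set(attrs))


def item_mis(transactions, item, label="LM_yes"):
    """Eq. (6): fraction of instances containing both the item and the label."""
    both = sum(1 for t in transactions if item in t and label in t)
    return both / len(transactions)


def itemset_mis(transactions, itemset, label="LM_yes"):
    """Eq. (5): the MIS of an itemset is the smallest MIS of its items."""
    return min(item_mis(transactions, a, label) for a in itemset)


def support(transactions, itemset):
    """Fraction of instances that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def is_frequent(transactions, itemset):
    """An itemset is frequent if it is valid and meets its own MIS."""
    return valid_itemset(itemset) and (
        support(transactions, itemset) >= itemset_mis(transactions, itemset)
    )


candidate = {"gender=F", "CI=yes"}
print(itemset_mis(TRANSACTIONS, candidate), is_frequent(TRANSACTIONS, candidate))
```

Because each item carries its own threshold, an item that rarely co-occurs with $ LM_{yes} $ gets a low MIS and can still take part in frequent itemsets, which a single global minimum support would exclude.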

Each frequent itemset c_j is converted to a rule Rule_j, as shown in Eqs. (7) and (8).

$$ c_{j} :a_{1} \cup a_{2} \cup \ldots \cup LM_{yes} /LM_{no} $$

(7)

$$ Rule_{j} \rightarrow LM_{yes} ,\;Rule_{j} :a_{1} \cup a_{2} \cup \ldots \cup a_{k - 1} $$

(8)

We rank the rules by the cosine measure and delete disturbance rules whose measure falls below a defined threshold. The cosine measure of positive tuple rules is defined by Eq. (9).

$$ \mathrm{cosine}(Rule_{j} ,LM_{yes} ) = \frac{{P(Rule_{j} \cup LM_{yes} )}}{{\sqrt {P(Rule_{j} ) \cdot P(LM_{yes} )} }} $$

(9)

$ P(Rule_{j} \cup LM_{yes} ) $ represents the probability that $ Rule_{j} $ and $ LM_{yes} $ belong to the same frequent itemset. The cosine measure of negative tuple rules is defined in Eq. (10).

$$ \mathrm{cosine}(Rule_{j} ,LM_{no} ) = \frac{{P(Rule_{j} \cup LM_{no} )}}{{\sqrt {P(Rule_{j} ) \cdot P(LM_{no} )} }} $$

(10)
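The ranking and filtering step of Eqs. (9) and (10) can be sketched as follows. This is an illustration under assumptions: probabilities are estimated as relative frequencies over a toy database, a rule is a pair of antecedent items and label, and the names `prob`, `cosine` and `rank_rules` as well as the threshold value 0.5 are hypothetical.

```python
from math import sqrt

# Toy transactions, invented for illustration.
TRANSACTIONS = [
    {"gender=F", "CI=yes", "LM_yes"},
    {"gender=F", "CI=no", "LM_no"},
    {"gender=M", "CI=yes", "LM_yes"},
    {"gender=F", "CI=no", "LM_no"},
    {"gender=F", "CI=yes", "LM_no"},
]


def prob(transactions, items):
    """Relative frequency of instances that contain all the given items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def cosine(transactions, antecedent, label):
    """Eqs. (9)/(10): P(rule and label) / sqrt(P(rule) * P(label))."""
    joint = prob(transactions, set(antecedent) | {label})
    denom = sqrt(prob(transactions, antecedent) * prob(transactions, [label]))
    return joint / denom if denom > 0 else 0.0


def rank_rules(transactions, rules, threshold):
    """Keep rules whose cosine reaches the threshold, best first."""
    scored = [(cosine(transactions, ante, lab), ante, lab) for ante, lab in rules]
    kept = [s for s in scored if s[0] >= threshold]
    return sorted(kept, key=lambda s: s[0], reverse=True)


rules = [
    (frozenset({"CI=yes"}), "LM_yes"),    # positive tuple rule
    (frozenset({"gender=F"}), "LM_yes"),  # positive tuple rule
    (frozenset({"CI=no"}), "LM_no"),      # negative tuple rule
]
ranked = rank_rules(TRANSACTIONS, rules, threshold=0.5)
for score, antecedent, label in ranked:
    print(sorted(antecedent), "->", label, round(score, 4))
```

In this toy example the rule built on `gender=F` scores below the threshold and is discarded as a disturbance rule, while the two rules built on capsule invasion are kept.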

Algorithm 1 outlines the process of rule mining by MS-Apriori. The support difference constraint (SDC) prevents a rare item and a common item from appearing in the same frequent itemset, and the threshold is used to delete disturbance rules.

3.2 Decision Tree Construction

By mining association rules in the clinical-pathological data, we obtain the sorted rule set $ R = \{ rule \mid cosine(rule) \ge threshold \} $, which is closely related to LNM diagnosis. Next, we build a decision tree that is used to predict LNM.

The algorithm generates the attribute set A by converting each rule in the rule set R into a candidate attribute of the decision tree. Information gain is the criterion for deciding which rule is selected as the splitting attribute during classification. A rule $ rule_{i} $ can be applied to an instance when the instance contains all items required by $ rule_{i} $. Treated as a new attribute, $ rule_{i} $ takes the value $ LM_{yes} $ or $ LM_{no} $: if the rule is a positive tuple rule, its value after application is $ LM_{yes} $; if it is a negative tuple rule, its value is $ LM_{no} $. If the rule cannot be applied, the value is No. The dataset D is thereby converted to $ S = \{ (\varvec{x}_{i} ,y_{i} )\} ,i = (1,2, \ldots ,n),y_{i} \in \{ 0,1\} $. The dataset has two labels, LNM and normal, and we denote the corresponding subsets by S₁ and S₀. The information entropy of S is defined in Eq. (11).

$$ H(S) = - \sum\limits_{i = 0}^{1} {p_{i} \log_{2} p_{i} } $$

(11)

$$ p_{i} = \frac{{|S_{i} |}}{{|S|}},\;i = 0,1 $$

(12)

Here p_i represents the probability that an instance of S belongs to class S_i, estimated by Eq. (12) as the fraction of instances of S that fall in S_i. The information gain of attribute $ r \in A $ at node N is defined in Eq. (13), where S_j denotes the subset of S on which r takes its j-th value.

$$ Gain(S,r,N) = H(S) - \sum\limits_{j = 0}^{1} {\frac{{|S_{j} |}}{{|S|}}H(S_{j} )} $$

(13)

The attribute with the maximum information gain is selected as the splitting attribute at node N. The instances are recursively partitioned into smaller subsets by analyzing the affiliation between instances and the rules mined by MS-Apriori. The recursion stops when every subset belongs to a single class, or when no instances or attributes remain for partitioning; the resulting tree is the model used to predict LNM.
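The selection of a splitting rule can be sketched as follows. The example assumes that an instance is a set of items and that a rule-derived attribute takes the rule's label when the rule applies and `"No"` otherwise, as described above; the toy instances, the two rules and the helper names `apply_rule`, `entropy` and `information_gain` are hypothetical.

```python
from collections import Counter
from math import log2


def apply_rule(instance, antecedent, label):
    """Value of a rule-derived attribute: the label if the rule applies."""
    return label if set(antecedent) <= instance else "No"


def entropy(labels):
    """Eq. (11): Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Eq. (13): entropy reduction from splitting on one attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Toy instances and labels (1 = LNM, 0 = normal), invented for illustration.
instances = [
    {"CI=yes", "multifocal=yes"},
    {"CI=yes", "multifocal=no"},
    {"CI=no", "multifocal=yes"},
    {"CI=no", "multifocal=no"},
]
labels = [1, 1, 0, 0]

rules = {
    "rule_CI": ({"CI=yes"}, "LM_yes"),
    "rule_multifocal": ({"multifocal=yes"}, "LM_yes"),
}

# Transform the feature space: one new attribute per rule.
features = {
    name: [apply_rule(x, ante, lab) for x in instances]
    for name, (ante, lab) in rules.items()
}
gains = {name: information_gain(vals, labels) for name, vals in features.items()}
best = max(gains, key=gains.get)
print(gains, best)
```

Here the rule on capsule invasion separates the two classes perfectly and is chosen as the splitting attribute, whereas the rule on multifocality carries no information about the label in this toy data.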

4 Experiments

4.1 Data Pre-processing

This study was conducted in the Thyroid Surgery department of the First Hospital of Jilin University. A total of 5425 patients with PTMC who underwent thyroidectomy with neck dissection from 2011 to 2015 were studied. Among these patients, 4855 cases met the criteria, including 323 cases treated with lateral neck dissection.

Features used in this study include gender, age, capsule invasion (CI), maximum tumor diameter (MTD), multifocality, Hashimoto thyroiditis (HT), and central lymph node number (CN). These features are described in Table 1. For lateral lymph node metastasis (LLNM), two additional features are added: CLNM and lateral lymph node number (LN).

Table 1. Description of feature

In this paper, we use box plots to analyze the data. We identify noisy values by the interquartile range (IQR) and set them to null, because a box plot identifies abnormal values objectively and quartiles are reasonably robust. Deleting instances with missing values would lose information, so we instead estimate missing values from the majority of the existing data, using mean/mode imputation (MMI). Because the data are unbalanced, training a predictive model on them directly introduces bias. To address this skew, we use balancing techniques. On the CLNM dataset we use KNN-NearMiss-2, a supervised under-sampling technique based on K-nearest neighbors. On the LLNM dataset we use the SMOTE over-sampling technique, because of the small number of instances.
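The outlier and missing-value steps can be sketched as follows. The sketch assumes the conventional box-plot fences at 1.5 times the IQR beyond the quartiles, a multiplier the text does not state; the sample values and the helper names `iqr_bounds`, `null_outliers` and `impute` are hypothetical. The balancing step (NearMiss-2, SMOTE) is not shown.

```python
from statistics import mean, mode, quantiles


def iqr_bounds(values, k=1.5):
    """Box-plot fences: k * IQR below Q1 and above Q3 (k = 1.5 is assumed)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def null_outliers(values, k=1.5):
    """Replace values outside the fences by None; keep existing None."""
    present = [v for v in values if v is not None]
    low, high = iqr_bounds(present, k)
    return [v if v is not None and low <= v <= high else None for v in values]


def impute(values, numeric=True):
    """Mean imputation for numeric features, mode imputation otherwise."""
    present = [v for v in values if v is not None]
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]


# Hypothetical maximum tumor diameters with one missing and one noisy value.
mtd = [0.3, 0.4, None, 0.5, 0.5, 0.6, 0.7, 9.0]
cleaned = null_outliers(mtd)
imputed = impute(cleaned)
print(cleaned)
print(imputed)

# Hypothetical categorical feature with one missing value.
ht = impute(["yes", "no", None, "yes"], numeric=False)
print(ht)
```

Setting outliers to null before imputation means the noisy value does not distort the mean that is later used to fill the gaps.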

4.2 Results and Discussion

The proposed predictor is applied to the clinical-pathological data of the First Hospital of Jilin University. To illustrate the performance of MsaDtd, we compare it with a range of baseline algorithms: Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR), and Bernoulli Bayes (BNB). We use 10-fold cross-validation to validate MsaDtd on the CLNM and LLNM datasets.
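The 10-fold protocol can be sketched as a generator of train and test index lists. This is a generic illustration: the shuffling, the fixed seed and the function name `k_fold_indices` are assumptions, and no stratification is applied.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # the seed is an arbitrary assumption
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Every instance appears in exactly one test fold, so the metrics averaged over the ten folds use each instance once for evaluation.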

Tables 2 and 3 show the results of the algorithms on the CLNM and LLNM datasets, respectively. On the CLNM dataset, MsaDtd achieves Accuracy, Precision, Recall, F₁ and AUC values of 76.09%, 72.16%, 63.63%, 72.63%, and 82.06%. Its accuracy of 76.09% is higher than that of the traditional decision tree and the other classifiers, and exceeds the traditional decision tree by 2.47%. On the LLNM dataset, the average prediction Accuracy, Recall, Precision, F₁, and AUC of MsaDtd are 87.21%, 82.75%, 85.86%, 86.85% and 88.37%. Our method outperforms the traditional decision tree on every metric: Accuracy, Recall, Precision, F₁, and AUC increase by 3.51%, 4.21%, 1.91%, 3.09% and 5.17%, respectively. Our method has the highest Accuracy, Recall, Precision and AUC among the methods compared.
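For reference, the threshold-based metrics reported above can be computed from the confusion counts as in the sketch below. The standard definitions are used, the labels and predictions are invented, and the function name `classification_metrics` is hypothetical; AUC is omitted because it requires ranked scores rather than hard predictions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels and predictions (1 = LNM, 0 = normal).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

F₁ is the harmonic mean of precision and recall, so it always lies between the two.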

Table 2. Performance comparison with baseline algorithms on CLNM dataset

Table 3. Performance comparison with baseline algorithms on LLNM dataset

Figures 1 and 2 plot the ROC curves of MsaDtd and the baseline algorithms on the two datasets. On the CLNM dataset, the ROC area of MsaDtd is 6.69% higher than that of LR, which has the highest ROC area among the baseline algorithms. On the LLNM dataset, the ROC area of MsaDtd is 88.37%, higher than that of all the other methods. These results show the superior performance of the proposed predictor.

To our knowledge, almost no specialized algorithm for the prognosis of lymph node metastasis (LNM) in patients with PTMC has been proposed in recent years, so we compare our method with DeepPPI-Con [14], a classification algorithm that achieves superior performance on protein-protein interaction prediction. The results in Table 4 indicate that our method is clearly superior to DeepPPI. On the CLNM dataset, the Accuracy, Precision, F₁ and AUC of MsaDtd are 10.43%, 8.38%, 4.35% and 7.48% higher than those of DeepPPI; on the LLNM dataset, they are 5.38%, 6.53%, 3.8% and 2.06% higher.

Table 4. Performance comparison with DeepPPI on CLNM and LLNM dataset

5 Conclusion

In this paper, we propose MsaDtd, an algorithm that improves the decision tree with MS-Apriori, and apply it to the prognosis of thyroid disease by building a predictor of LNM in patients with PTMC. Fuzzy logic is introduced to handle continuous attributes, which prevents the generation of too many frequent itemsets. Sorting and filtering the rules mined by MS-Apriori removes disturbance rules and aims to improve prediction accuracy. By applying the rules, the algorithm obtains new features that transform the feature space and make full use of composite features, which improves the robustness and generalization capability of the algorithm. A decision tree then predicts thyroid disease by analyzing the affiliation between instances and rules. Clinicians can use the information given by the predictor to adopt specific protocols throughout treatment. For patients prone to LNM, clinicians should take customized interventions to reduce the risk of cancer recurrence.

References

1. Fan, M., Hu, J., Cao, R., et al.: A review on experimental design for pollutants removal in water treatment with the aid of artificial intelligence. Chemosphere 200, 330–343 (2018)
2. Jiang, F., Jiang, Y., Zhi, H., et al.: Artificial intelligence in healthcare: past, present and future. Stroke Vasc. Neurol. 2(4), 230–243 (2017)
3. Jiang, H., Zhang, Z., Tao, L.: A semantic-based EMRs integration framework for diagnosis decision-making. In: Buchmann, R., Kifor, C.V., Yu, J. (eds.) KSEM 2014. LNCS (LNAI), vol. 8793, pp. 380–387. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12096-6_34
4. Fang, R., Pouyanfar, S., Yang, Y., et al.: Computational health informatics in the big data age: a survey. ACM Comput. Surv. 49(1), 12 (2016)
5. Vemulapalli, V., Qu, J., Garren, J.M., et al.: Non-obvious correlations to disease management unraveled by Bayesian artificial intelligence analyses of CMS data. Artif. Intell. Med. 74, 1–8 (2016)
6. Tomaszewski, J.J., Uzzo, R.G., Egleston, B., et al.: Coupling of prostate and thyroid cancer diagnoses in the United States. Ann. Surg. Oncol. 22(3), 1043–1049 (2015)
7. Akın, Ş., Yazgan, A.D., Akın, S., et al.: Prediction of central lymph node metastasis in patients with thyroid papillary microcarcinoma. Turk. J. Med. Sci. 47(6), 1723 (2017)
8. Chen, H.L., Yang, B., Wang, G., et al.: A three-stage expert system based on support vector machines for thyroid disease diagnosis. J. Med. Syst. 36(3), 1953–1963 (2012)
9. Makas, H., Yumusak, N.: A comprehensive study on thyroid diagnosis by neural networks and swarm intelligence. In: International Conference on Electronics, Computer and Computation, pp. 180–183. IEEE, Ankara (2014)
10. Pourahmad, S., Azad, M., Paydar, S.: Diagnosis of malignancy in thyroid tumors by multi-layer perceptron neural networks with different batch learning algorithms. Glob. J. Health Sci. 7(6), 46–54 (2015)
11. Kaya, Y.A.: Fast intelligent diagnosis system for thyroid diseases based on extreme learning machine. Arch. Otolaryngol. Head Neck Surg. 15(1), 41–49 (2014)
12. Maysanjaya, I.M.D., Nugroho, H.A., Setiawan, N.A.: A comparison of classification methods on diagnosis of thyroid diseases. In: International Seminar on Intelligent Technology and ITS Applications, pp. 89–92. IEEE, Surabaya (2015)
13. Chaudhary, R., Sharma, S., Sharma, V.K.: Improving the performance of MS-Apriori algorithm using dynamic matrix technique and map-reduce framework. Int. J. Innov. Res. Sci. Technol. 2(5), 2349–6010 (2015)
14. Du, X., Sun, S., Hu, C., et al.: DeepPPI: boosting prediction of protein-protein interactions with deep neural networks. J. Chem. Inf. Model. 57(6), 1499–1510 (2017)

Acknowledgement

Project supported by the Natural Science Foundation of Jilin Province (No. 20180101330JC), the National Natural Science Foundation of China (No. 60973040), the Fundamental Research Funds for the Central Universities (No. 2412017QD028), the China Postdoctoral Science Foundation (No. 2017M621192), and the Scientific and Technological Development Program of Jilin Province (No. 20180520022JH).

Author information

Authors and Affiliations

College of Computer Science and Technology, Jilin University, Changchun, 130012, China
Yuwei Hao, Wanli Zuo, Zhenkun Shi & Fengling He
Key Laboratory of Symbol Computation and Knowledge Engineering, Jilin University, Ministry of Education, Changchun, 130012, China
Yuwei Hao, Wanli Zuo, Zhenkun Shi & Fengling He
School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China
Lin Yue
The First Hospital of Jilin University, Changchun, 130021, China
Shuai Xue

Corresponding author

Correspondence to Zhenkun Shi.

Editor information

Editors and Affiliations

University of Bristol, Bristol, United Kingdom
Weiru Liu
Università di Trento, Povo, Italy
Fausto Giunchiglia
Jilin University, Changchun, China
Bo Yang

About this paper

Cite this paper

Hao, Y., Zuo, W., Shi, Z., Yue, L., Xue, S., He, F. (2018). Prognosis of Thyroid Disease Using MS-Apriori Improved Decision Tree. In: Liu, W., Giunchiglia, F., Yang, B. (eds) Knowledge Science, Engineering and Management. KSEM 2018. Lecture Notes in Computer Science, vol 11061. Springer, Cham. https://doi.org/10.1007/978-3-319-99365-2_40

DOI: https://doi.org/10.1007/978-3-319-99365-2_40
Published: 12 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99364-5
Online ISBN: 978-3-319-99365-2
eBook Packages: Computer Science, Computer Science (R0)
