Abstract
Determining whether correct disease codes are included in discharge summaries is important for hospital management because submission of medical fee statements with incorrect disease codes can result in loss of insurance reimbursement. Because medical information managers in large hospitals must evaluate more than 1000 summaries per month, automated checking of discharge summaries would reduce their workload, allowing the managers to focus on complicated cases. This paper proposes a method of constructing classifiers for discharge summaries. In the first step, morphological analysis generates a term matrix from text data extracted from the hospital information system. Subsequently, important keywords are selected by correspondence analysis, training examples are generated, and machine learning methods are applied to the training examples. Several machine learning methods were compared using discharge summaries stored in the information system of Shimane University Hospital. The random forest method was found to be the best classifier, outperforming deep learning, SVM and decision tree methods, and achieved a classification accuracy greater than 90%.
1 Introduction
Computerization of patient records enables the storage of “big unstructured text data” in hospital information systems (HIS). For example, Shimane University Hospital treats about 1000 patients in its outpatient clinics and about 600 patients in inpatient wards. The HIS of this hospital stores about 200 GB of text data per year, including patient records, discharge summaries and radiology and pathology reports. Text mining of these resources can enable decisions about clinical actions, research and hospital management.
This paper proposes a five-step method of constructing classifiers for discharge summaries. In the first step, discharge summaries are extracted from the HIS. In the second step, morphological analysis is applied to a set of summaries and a term matrix is generated. In the third step, correspondence analysis is applied to the term matrix with class labels, and two-dimensional coordinates are assigned to each keyword; measurements of distances between categories and assigned points can generate a ranking of keywords for each category. In the fourth step, keywords are selected as attributes according to their rank, and training examples for classifiers are generated. Finally, learning methods are applied to the training examples. Experimental validation was performed using four methods: random forest, deep learning (multi-layer perceptron), SVM and decision tree induction. The random forest achieved the best performance, followed by the deep learning method.
The paper is organized as follows. Section 2 explains our motivation. Section 3 describes the proposed mining process. Section 4 presents the experimental results. Section 5 discusses these results. Finally, Sect. 6 provides the conclusions of this study.
2 Motivation
The principal purpose of applying AI to hospital data is to enhance the efficiency of the medical staff in a clinical environment. One of the more laborious tasks for doctors and nurses is documentation, including detailed descriptions of patient records. Careful documentation is needed for several purposes, including submission to insurance companies for reimbursement and exchange of information among hospitals and clinics. The accuracy of medical documents should be evaluated, mainly because most medical payments are based on the submission of medical fee statements, with the information from these statements obtained from medical documents. Large-scale hospitals in Japan must submit statements according to the Diagnosis Procedure Combination (DPC system) [4]. A DPC code is assigned to the condition to which the majority of medical resources were devoted during the hospitalization of a patient. For each day of hospital stay, a payment point is assigned for each DPC code. Thus, medical payments by DPC code depend on the length of hospital stay, diagnosis and medical procedures, and differ from the traditional medical payment system, which depends on a set repayment for each medical procedure.Footnote 1
Because DPC codes in the HIS are used to classify each medical payment during hospitalization, the assigned code may differ from medically classified diseases, making it difficult for doctors to assign DPC codes.
Thus, before submitting requests for payment, medical information managers must review clinical records to determine whether the assigned DPC codes are correct. Managers mainly check the validity of assigned DPC codes by reviewing discharge summaries and patient records. For example, at Shimane University Hospital, an average of 40 patients are discharged per day, or about 1200 per month, meaning that medical information managers must check 1200 discharge summaries and patient records per month; efficient checking is therefore very important. At Shimane University Hospital, six managers check patient records and DPC codes.
An automated document classification system with correct DPC codes will help medical information managers at large hospitals submit accurate fee statements, enabling them to focus on complicated cases.
3 Methods
3.1 Discharge Summary
A discharge summary has been defined as a document that outlines the details of the hospitalization and care of a patient [1]. This summary is prepared when a patient is released from a health care facility and is incorporated into the permanent medical records of that patient. Ideally, a discharge summary should include an explanation for the patient’s admission; records of patient complaints, physical findings, laboratory results and radiographic studies while hospitalized; a list of changes in medications at discharge; and recommendations for follow-up care. For optimal patient care, the discharge summary should be transmitted to or reviewed with the patient’s primary care provider.
A discharge summary covers all clinical processes during patient hospitalization. It is written in a more formal style than regular patient records; thus, a conventional text mining approach can extract sufficient keywords from the text of each discharge summary. Figure 1 shows an example of a discharge summary.
3.2 Motivation for Feature Selection
Feature selection is important even for deep learners [11, 12]. Although deep learners show better performance in image analysis, the differences between deep learners and other classification methods are generally very small. This may be due to the lack of a suitable network structure and the absence of suitable features for classification. Empirical results showing that deep learners are good at image recognition suggest that some type of topological relationship should be explicitly embedded into the training data. This study proposes a new feature selection method based on correspondence analysis, which maps attributes to points in a low-dimensional coordinate space. The method can extract the topological relationships between keywords and data concepts.
3.3 Mining Process
Figure 2 shows the proposed total mining process, whose workflow consists of five steps.
3.3.1 Morphological Analysis
Target discharge summaries are extracted from the HIS, followed by morphological analysis [5], which outputs a term matrix, consisting of a contingency table for keywords and concepts.
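The study performed this step in R with RMeCab; as an illustrative stand-in, the sketch below builds such a term matrix in Python. The tokenizer, summaries and DPC codes are toy stand-ins (real morphological analysis would segment Japanese text).

```python
# Sketch only: the paper used RMeCab in R. The summaries and DPC codes
# below are invented, and whitespace splitting stands in for real
# Japanese morphological analysis.
from collections import Counter, defaultdict

def build_term_matrix(docs):
    """docs: list of (dpc_code, text). Returns {dpc_code: Counter of keywords},
    i.e., a contingency table of keyword counts per concept (DPC code)."""
    matrix = defaultdict(Counter)
    for code, text in docs:
        matrix[code].update(text.split())
    return matrix

docs = [
    ("020110xx97xxx0", "cataract lens surgery eye"),
    ("020110xx97xxx0", "cataract eye vision"),
    ("160620xx01xxxx", "knee joint fixation rehabilitation"),
]
tm = build_term_matrix(docs)
print(tm["020110xx97xxx0"]["cataract"])  # 2
```

Each row of the resulting table counts how often a keyword occurs in summaries of a given DPC code, which is the input to the correspondence analysis of the next step.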
3.3.2 Correspondence Analysis
The term matrix is subjected to correspondence analysis. Although higher-dimensional coordinates can be selected, they produce a very large table. This study therefore focused on two-dimensional analysis, which can easily be used for visualization. Two-dimensional coordinates are thus assigned to each keyword and concept.
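The paper used the MASS package in R for this step; the NumPy sketch below shows the standard correspondence-analysis computation (standardized residuals followed by SVD) on a toy keyword-by-class table. The table values and dimensions are invented for illustration.

```python
# Minimal correspondence-analysis sketch (the study used MASS in R).
# N is a toy keyword-by-class contingency table, not the hospital data.
import numpy as np

def correspondence_analysis(N, dims=2):
    """Return principal coordinates for rows (keywords) and columns (classes)."""
    P = N / N.sum()                                      # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)                  # row / column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U[:, :dims] * sv[:dims] / np.sqrt(r)[:, None]     # keyword coordinates
    cols = Vt.T[:, :dims] * sv[:dims] / np.sqrt(c)[:, None]  # class coordinates
    return rows, cols, sv

N = np.array([[8., 1., 0.],
              [6., 2., 1.],
              [0., 7., 2.],
              [1., 1., 9.]])
kw_xy, class_xy, sv = correspondence_analysis(N)
```

Keeping `dims=2` gives exactly the two-dimensional coordinates used for the visualization and ranking steps below.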
3.3.3 Ranking
The coordinates of each concept and keyword are used to calculate the Euclidean distance between them. These distances are used to rank the keywords for each concept, with smaller distances indicating a higher ranking.
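The ranking step can be sketched as follows; the 2-D coordinates here are hypothetical stand-ins for the values produced by the correspondence analysis above.

```python
# Sketch of the distance-based ranking step. The coordinates are
# hypothetical; real values come from the correspondence analysis.
from math import dist

keyword_xy = {"cataract": (1.8, -1.1), "lens": (1.6, -0.8),
              "knee": (-0.4, 1.5), "rehabilitation": (-0.5, 1.6)}
class_xy = {"cataract_dpc": (1.75, -1.0), "knee_injury_dpc": (-0.45, 1.55)}

def rank_keywords(class_point, keyword_xy):
    """Keywords sorted by Euclidean distance to the class point, closest first."""
    return sorted(keyword_xy, key=lambda k: dist(keyword_xy[k], class_point))

ranking = rank_keywords(class_xy["cataract_dpc"], keyword_xy)
print(ranking[0])  # "cataract": the eye-related keywords rank highest
```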
3.3.4 Keyword Selection
Prior to analysis, the number of keywords is determined (e.g., 100), and all keywords ranked at or above that cutoff are selected for classification. Because some keywords may be selected for more than one class, overlapping keywords are kept only once. Training examples with a classification label and the values of the selected keywords (binary attributes) are then constructed.
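The selection and training-table construction can be sketched as below; the rankings, keyword names and DPC labels are toy values invented for illustration.

```python
# Sketch of keyword selection and training-example construction.
# Rankings and labels are toy stand-ins for the real CA output.
def select_keywords(rankings, k):
    """rankings: {dpc: [keywords, closest first]}. Union of each code's top-k,
    with keywords shared by several codes kept only once."""
    selected = []
    for ranked in rankings.values():
        for kw in ranked[:k]:
            if kw not in selected:       # drop overlapping keywords
                selected.append(kw)
    return selected

def to_example(dpc, tokens, selected):
    """One training row: binary keyword indicators plus the class label."""
    return {kw: int(kw in tokens) for kw in selected} | {"label": dpc}

rankings = {"cataract_dpc": ["cataract", "lens", "eye"],
            "knee_injury_dpc": ["knee", "rehabilitation", "eye"]}
selected = select_keywords(rankings, k=3)   # shared keyword "eye" kept once
row = to_example("cataract_dpc", {"cataract", "surgery"}, selected)
```

This mirrors the paper's observation in Sect. 4.1.3 that selecting 250 keywords per code yields far fewer unique keywords after overlap removal.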
3.3.5 Classification
Finally, classification learning methods are applied. This study compared five classification methods: random forest [9], deep learning (multi-layer perceptron; darch), support vector machine (SVM) [7], backpropagation neural network (BNN) [15] and decision tree induction (rpart) [14].
4 Experimental Evaluation
The 20 most frequent DPC codes in the fiscal year 2015 were selected, and discharge summaries that included these codes were extracted from the HIS of Shimane University Hospital. Table 1 shows the statistics of these 20 DPC codes, as well as the average number of characters used in the summaries.
Except for data extraction from the HIS, all processes were performed using R 3.5.0, with analysis and evaluation run on two HP ProLiant ML110 Gen9 computers (Xeon E5-2640 v3, 2.6 GHz, 8 cores, 64 GB DRAM).
4.1 Mining Process
4.1.1 Correspondence Analysis
Morphological analysis was performed using RMeCab [5]. A bag of keywords was generated and used to construct a contingency table for these summaries. Correspondence analysis was applied to the table using the MASS package on R3.5.0. Two-dimensional coordinates were assigned to each keyword and each class.Footnote 2
Figure 3 shows the two-dimensional plot of the correspondence analysis. Because the discharge summaries are written in Japanese, all the keywords in the figure are shown in Japanese. English translations of some important keywords of frequent diseases are shown in Table 3. All the keywords in Fig. 3 are arranged along a horseshoe curve, a feature characteristic of both correspondence analysis and principal component analysis [2, 3, 13]. These findings indicate that the correspondence analysis captured the correspondence between keywords and DPC codes.
The information important for classification in Fig. 3 is plotted near the target classes, with the target class values (DPC codes) plotted as numerical codes.
For example, the two right bottom numbers denote a cataract, with the keywords for eye symptoms and surgical operations plotted near these classes. In contrast, the right upper class is “Injury to the Elbow and Knee”, with the keywords for rehabilitation and fixation of joints located nearby.
4.1.2 Ranking
Next, the distances between the coordinates of a keyword and those of a class are calculated, and the keywords ranked for each class. Because target classes and keywords are assigned to two dimensional planes, the distances between classes and keywords can be calculated from the assigned coordinates.
Figure 4 shows the distribution of the distances. The horizontal axis denotes the distance between keywords and classes, and the vertical axis denotes the number of attributes at a given distance. Because distances close to 0 indicate that keywords and classes are very close, the figure shows that, except for cataracts and injury to the elbow and knee, the keywords lie very close to the coordinates of each class. Thus, keyword selection may be subtle for these classes, and surrogate splits may be useful for the decision tree induction and random forest methods.
Using their rankings, a preset number of keywords was selected to generate a table for learning classification.
4.1.3 Keyword Selection
Table 2 shows the total numbers of keywords selected for the top 10 and 20 DPC codes. The selection of 250 keywords for each DPC code would result in a total of 5000 keywords. After the removal of overlapping keywords, only 1932 keywords were used for classification. Some important keywords may be deleted due to overlap if these keywords are frequently used in at least two diseases.
Table 3 shows the top 10 keywords for the three top DPC codes. For comparison, the results obtained by tf-idf are also shown. Interestingly, keywords selected by correspondence analysis differed from those selected by tf-idf, suggesting that frequency based information may not play important roles in the classification of discharge summaries.
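For reference, the tf-idf ranking used in this comparison can be sketched as below, treating all summaries of one DPC code as a single class-level "document". The counts are toy values, and the common log(N/df) form of idf is assumed.

```python
# Sketch of tf-idf ranking over class-level documents. Counts are toy
# values; idf is assumed to be the common log(N / df) form.
from math import log

def tfidf_ranking(term_matrix):
    """term_matrix: {dpc: {keyword: count}} -> {dpc: keywords, best first}."""
    n_classes = len(term_matrix)
    df = {}                                   # classes containing each keyword
    for counts in term_matrix.values():
        for kw in counts:
            df[kw] = df.get(kw, 0) + 1
    out = {}
    for dpc, counts in term_matrix.items():
        total = sum(counts.values())
        score = {kw: (c / total) * log(n_classes / df[kw])
                 for kw, c in counts.items()}
        out[dpc] = sorted(score, key=score.get, reverse=True)
    return out

tm = {"cataract_dpc": {"cataract": 5, "eye": 3, "patient": 9},
      "knee_injury_dpc": {"knee": 4, "rehabilitation": 2, "patient": 8}}
ranking = tfidf_ranking(tm)
```

Note how the frequent but unspecific keyword "patient" is pushed to the bottom of both rankings by its zero idf, which is the frequency-based behavior contrasted with correspondence analysis above.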
4.1.4 Classification
Finally, decision tree (package: rpart [14]), random forest (package: randomForest [9]), SVM (package: kernlab [7]), BNN (package: nnet [15]) and deep learner (multi-layer perceptron; package: darchFootnote 3) methods were applied to the generated training examples. For darch, the numbers of intermediate-layer neurons were set to 20, (40, 20) and (80, 20), with 100 epochs. For all other packages, the default parameter settings were used.
4.1.5 Evaluation Process
The evaluation process was based on repeated two-fold cross validation [8].Footnote 4 First, the dataset was randomly split 1:1 into training samples and test samples. The training samples were used to construct a classifier, and the derived classifiers were evaluated with the remaining test samples. These procedures were repeated 100 times, and the averaged accuracy was calculated.
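The protocol above can be sketched as follows. The majority-class "learner" and the toy data are stand-ins for the real classifiers and summaries; only the split-and-repeat structure mirrors the paper.

```python
# Sketch of repeated two-fold cross-validation (the paper used 100 trials).
# The majority-class learner and the data are illustrative stand-ins.
import random
from collections import Counter
from statistics import mean

def two_fold_accuracy(data, fit, predict, rng):
    rows = data[:]
    rng.shuffle(rows)
    half = len(rows) // 2
    train, test = rows[:half], rows[half:]            # random 1:1 split
    model = fit(train)                                # train on one half ...
    return mean(predict(model, x) == y for x, y in test)  # ... test on the other

def repeat_cv(data, fit, predict, trials=100, seed=0):
    rng = random.Random(seed)
    return mean(two_fold_accuracy(data, fit, predict, rng)
                for _ in range(trials))               # averaged accuracy

data = [((1, 0), "a")] * 6 + [((0, 1), "b")] * 4
fit = lambda train: Counter(y for _, y in train).most_common(1)[0][0]
predict = lambda model, x: model                      # always predict majority
acc = repeat_cv(data, fit, predict, trials=10)
```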
The number of keywords varied from 1 to 1000, selected according to their ranking by correspondence analysis.
4.2 Classification Results
Table 4 shows the evaluation results for the top 20 diseases. With four or fewer keywords, all the classifiers showed an accuracy of about 70%. With five or more keywords, however, the accuracy of SVM decreased, whereas the other methods showed monotonic increases in accuracy, plateauing at about 200 keywords. The random forest method performed better than the other classifiers, followed by darch deep learning. When more than 250 keywords were selected, the performance of darch decreased, whereas the performances of random forest and decision trees continued to increase. Although BNN showed poorer accuracy than darch (default setting) with 5 to 100 selected keywords, the accuracy of BNN approached that of the darch classifiers with larger numbers of keywords. Interestingly, the accuracy of the decision tree method increased monotonically, becoming maximal when all the keywords were used for analysis.
5 Discussion
5.1 Misclassified Cases
Figures 5 and 6 show confusion matrices of the random forest and darch (multi-layer perceptron) methods, in which the DPC codes were ordered so that similar codes correspond to similar diseases.Footnote 5 Shaded regions indicate misclassified patients. Whereas the errors of the random forest method lie near the diagonal, the errors of darch were more scattered. This finding suggests that the random forest method would classify patients almost correctly if similar DPC codes were grouped into one generalized class. In contrast, the darch method made unexpected errors.
5.2 Classification Accuracy of Decision Trees
Two results were unexpected: (1) the accuracy of decision trees increased monotonically, and (2) the random forest method was more accurate than the other methods. Because the random forest method can be considered a refinement of decision tree induction, representation by decision trees may provide insight into hidden structures present in the discharge summaries.
Figure 7 shows the decision tree obtained with 1000 selected keywords extracted by morphological analysis; 23 attributes were used in the tree. First, because the class boundaries represented by such a tree cannot be expressed as a linear combination of keywords, SVM may not achieve high classification accuracy. Second, the selection process based on correspondence analysis may not be appropriate for selecting keywords for SVM.
Figure 8 shows the location of each keyword used in the decision tree based on its ranking in each classification class. Not all of these keywords were highly ranked, perhaps because the differences in distances among the attributes were very small. Future studies are needed to assess the nature of the ranking.
A review of the decision tree by medical experts found that the tree was very compact but reasonable and that the selection of keywords was very interesting and explainable. This selection may reflect the differences in the description of disease summaries among the target diseases. Further evaluation should include a detailed examination of discharge summaries.
5.3 Execution Time
Two HP ProLiant ML110 Gen9 workstations (Xeon E5-2640 v3, 2.6 GHz, 8 cores, 64 GB DRAM) were used.
Table 5 shows an empirical comparison of the repeated twofold cross validations (100 trials). The times needed for random forest and SVM were 183 and 156 min for 250 keywords, whereas darch (20) required 672 min. For 1000 keywords, the times needed for random forest, SVM and darch (20) were 261, 288 and 1101 min, respectively. The times required by the random forest and BNN methods were close to those of the deep learners. For darch, a larger number of intermediate layers resulted in greater computation times, although the growth rate was smaller than that of BNN.
5.4 Comparison with tf-idf
A major approach to text classification is ranking with tf-idf [6, 10]. Thus, tf-idf ranking was compared with the above approach using the same scheme as in Sect. 4. Interestingly, deep learning with tf-idf ranking showed much poorer performance than with ranking by correspondence analysis, whereas random forest with tf-idf ranking was slightly better than with ranking by correspondence analysis for fewer than 200 keywords (Table 6). Because the average accuracy of deep learning differed by only a few percent from that of the random forest method, ranking by correspondence analysis is the better approach for text classification by deep learning, at least in this domain. Ranking by correspondence analysis includes geometric information about keywords and concepts, and the embedding of such geometric knowledge appears to be important for deep learning.
6 Conclusion
This study proposes a five-step method for constructing classifiers for discharge summaries. In the first step, discharge summaries are obtained from the HIS. In the second step, morphological analysis is applied to a set of summaries to generate a term matrix. In the third step, correspondence analysis is applied to the classification labels and term matrix, generating two-dimensional coordinates; measurements of the distances between categories and assigned points enable ranking of keywords. In the fourth step, keywords are selected as attributes according to rank, and training examples for classifiers are generated. Finally, learning methods are applied to the training examples. This method was experimentally validated using discharge summaries from Shimane University Hospital for the 2015 fiscal year. Optimal performance was provided by the random forest method, with a classification accuracy of about 93%, followed by deep learning with a classification accuracy of about 91%. In contrast, decision tree methods with many keywords were slightly less accurate than the neural network and deep learning methods. The selected keywords and tree structure were deemed reasonable by domain experts, perhaps because the hidden structure of knowledge in the dataset may be close to the structure approximated by a set of trees, and because deep learning may generate such structures in its networks. Our future work will attempt to validate this hypothesis.
Notes
Outpatient clinics utilize action-based payment systems, even in large hospitals.
The method can also generate \(p (p\ge 3)\)-dimensional coordinates. However, higher dimensional coordinates did not provide better performance than the experiments shown below.
Darch has been removed from the CRAN package repository; see https://github.com/maddin79/darch.
Two-fold cross-validation was selected because, when repeated, its estimates of parameters such as accuracy showed the lowest variance while also minimizing bias.
DPC codes are a three-level hierarchical system, with each DPC code defined as a tree. The first level denotes the type of disease, the second level denotes the primary treatment selected for that patient, and the third-level shows any additional therapy. Thus, in the tables, characteristics of codes were representative of similarities.
References
Discharge summary. http://medical-dictionary.thefreedictionary.com/discharge+summary. Accessed Feb 14, 2021
De'ath, G. (1999). Principal curves: A new technique for indirect and direct gradient analysis. Ecology, 80(7), 2237–2253.
Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84(406), 502–516.
IgakuTsushinsha (ed.) (2020). Quick Reference of DPC points (in Japanese). IgakuTsushinsha, Tokyo
Ishida, M. (2016). Rmecab. http://rmecab.jp/wiki/index.php?RMeCabFunctions
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11–21.
Karatzoglou, A., Smola, A., Hornik, K., & Zeileis, A. (2004). kernlab - an S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1–20. http://www.jstatsoft.org/v11/i09/
Kim, J. H. (2009). Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Computational Statistics and Data Analysis, 53(11), 3735–3745. https://doi.org/10.1016/j.csda.2009.04.009.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomforest. R News, 2(3), 18–22. http://CRAN.R-project.org/doc/Rnews/
Luhn, H. P. (1957). A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4), 309–317.
Mares, M. A., Wang, S., & Guo, Y. (2016). Combining multiple feature selection methods and deep learning for high-dimensional data. Transactions on Machine Learning and Data Mining, 9, 27–45.
Nezhad, M. Z., Zhu, D., Li, X., Yang, K., & Levy, P. (2017). SAFS: A deep feature selection approach for precision medicine. arXiv:1704.05960
Podani, J., & Miklós, I. (2002). Resemblance coefficients and the horseshoe effect in principal coordinates analysis. Ecology, 83(12), 3331–3343.
Therneau, T. M., & Atkinson, E. J. (2015). An Introduction to Recursive Partitioning Using the RPART Routines. https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4. ISBN 0-387-95457-0
Acknowledgements
This research was supported by a Grant-in-Aid for Scientific Research (B) 18H03289 from the Japan Society for the Promotion of Science(JSPS).
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there are no conflicts of interest.