1 Introduction

Computerization of patient records enables the storage of “big unstructured text data” in hospital information systems (HIS). For example, Shimane University Hospital treats about 1000 patients in its outpatient clinics and about 600 patients in its inpatient wards. The HIS of this hospital stores about 200 GB of text data per year, including patient records, discharge summaries and radiology and pathology reports. Text mining of these resources can support decision making in clinical care, research and hospital management.

This paper proposes a five-step method of constructing classifiers for discharge summaries. In the first step, discharge summaries are extracted from the HIS. In the second step, morphological analysis is applied to a set of summaries and a term matrix is generated. In the third step, correspondence analysis is applied to the term matrix with class labels, and two-dimensional coordinates are assigned to each keyword; measuring the distances between categories and the assigned points yields a ranking of keywords for each category. In the fourth step, keywords are selected as attributes according to their rank, and training examples for the classifiers are generated. Finally, learning methods are applied to the training examples. Experimental validation was performed using five methods: random forest, deep learning (multi-layer perceptron), SVM, backpropagation neural network (BNN) and decision tree induction. The random forest achieved the best performance, followed by the deep learning method.

The paper is organized as follows. Section 2 explains our motivation. Section 3 describes the proposed mining process. Section 4 shows the experimental results. Section 5 discusses these results. Finally, Sect. 6 provides the conclusions of this study.

2 Motivation

The principal purpose of applying AI to hospital data is to enhance the efficiency of the medical staff in a clinical environment. One of the more laborious tasks for doctors and nurses is documentation, including detailed descriptions of patient records. Careful documentation is needed for several purposes, including submission to insurance companies for reimbursement and the exchange of information among hospitals and clinics. The accuracy of medical documents must be evaluated, mainly because most medical payments are based on the submission of medical fee statements, whose information is obtained from medical documents. Large-scale hospitals in Japan must submit statements according to the Diagnosis Procedure Combination (DPC) system [4]. A DPC code is assigned to the condition to which the majority of medical resources were devoted during the hospitalization of a patient. For each day of hospital stay, a payment point is assigned to each DPC code. Thus, medical payments under the DPC system depend on the length of hospital stay, the diagnosis and the medical procedures performed, and differ from the traditional payment system, in which a fixed payment is made for each medical procedure.

Because DPC codes in the HIS are used to classify medical payments during hospitalization, the assigned code may differ from the medically defined disease classification, which makes assigning DPC codes difficult for doctors.

Thus, before submitting requests for payment, medical information managers must review clinical records to determine whether the assigned DPC codes are correct. Managers check the validity of assigned DPC codes mainly by reviewing discharge summaries and patient records. For example, at Shimane University Hospital, an average of 40 patients are discharged per day, or about 1200 patients per month, which means that medical information managers must check about 1200 discharge summaries and patient records per month; efficient checking is therefore very important. At Shimane University Hospital, six managers check patient records and DPC codes.

An automated document classification system with correct DPC codes will help medical information managers at large hospitals submit accurate fee statements, enabling them to focus on complicated cases.

3 Methods

3.1 Discharge Summary

A discharge summary has been defined as a document that outlines the details of the hospitalization and care of a patient [1]. This summary is prepared when a patient is released from a health care facility and is incorporated into the permanent medical records of that patient. Ideally, a discharge summary should include an explanation for the patient’s admission; records of patient complaints, physical findings, laboratory results and radiographic studies while hospitalized; a list of changes in medications at discharge; and recommendations for follow-up care. For optimal patient care, the discharge summary should be transmitted to or reviewed with the patient’s primary care provider.

A discharge summary covers all clinical processes during a patient's hospitalization. It is written in a more formal style than regular patient records. Thus, a conventional text mining approach can be used to extract sufficient keywords from the text of each discharge summary. Figure 1 shows an example of a discharge summary.

Fig. 1 An example of a discharge summary

3.2 Motivation for Feature Selection

Feature selection is important even for deep learners [11, 12]. Although deep learners clearly outperform other methods in image analysis, in many other domains the differences between deep learners and other classification methods are very small. This may be due to the lack of a suitable network structure and the absence of features suitable for classification. The empirical finding that deep learners excel at image recognition suggests that some type of topological relationship should be explicitly embedded into the training data. This study proposes a new feature selection method based on correspondence analysis, which maps attributes to points in a multi-dimensional coordinate space. The method can extract the topological relationships between keywords and the concepts in the data.

Fig. 2 Data mining process

3.3 Mining Process

Figure 2 shows the proposed total mining process, whose workflow consists of five steps.

3.3.1 Morphological Analysis

Target discharge summaries are extracted from the HIS, followed by morphological analysis [5], which outputs a term matrix, consisting of a contingency table for keywords and concepts.
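A minimal R sketch of this step is shown below; it assumes that each discharge summary has been exported to a plain-text file under a directory summaries/ and that a label file lists one DPC code per summary. The directory name, label file and RMeCab options are assumptions for illustration, and RMeCab requires a local MeCab installation.

```r
library(RMeCab)

# Term-document matrix: rows are keywords (nouns), columns are summaries.
tdm <- docMatrix("summaries", pos = c("名詞"))

# Some RMeCab versions prepend meta rows such as [[TOTAL-TOKENS]]; drop them.
tdm <- tdm[!grepl("^\\[\\[", rownames(tdm)), ]

# Hypothetical label file: one DPC code per summary, in column order of tdm.
dpc <- factor(readLines("summary_dpc_codes.txt"))

# Keyword-by-concept contingency table (keywords x DPC codes).
term_matrix <- t(rowsum(t(tdm), group = dpc))
```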

3.3.2 Correspondence Analysis

The term matrix is subjected to correspondence analysis. Although higher-dimensional coordinates could be used, they would produce a very large table. This study therefore focused on two-dimensional analysis, which is also convenient for visualization. Two-dimensional coordinates are thus assigned to each keyword and concept.
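A sketch of this step using the MASS package is shown below, continuing from the term_matrix built above; corresp() returns the two-dimensional row and column scores used here as keyword and concept coordinates.

```r
library(MASS)

# Two-dimensional correspondence analysis of the keyword x DPC-code table.
ca <- corresp(term_matrix, nf = 2)

keyword_xy <- ca$rscore   # coordinates of each keyword (row scores)
class_xy   <- ca$cscore   # coordinates of each DPC code (column scores)

# biplot(ca) produces a plot similar to Fig. 3.
```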

3.3.3 Ranking

The coordinates of each concept and keyword are used to calculate the Euclidean distance between them. These distances are used to rank the keywords for each concept, with smaller distances indicating a higher rank.
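A sketch of the ranking computation, using the coordinates obtained in the previous step:

```r
# Euclidean distance from every keyword to one class point.
dist_to_class <- function(class_point, keyword_xy) {
  sqrt(rowSums(sweep(keyword_xy, 2, class_point)^2))
}

# Distance matrix (keywords x classes) and per-class keyword ranking;
# rank 1 is the keyword closest to the class.
d <- apply(class_xy, 1, dist_to_class, keyword_xy = keyword_xy)
keyword_rank <- apply(d, 2, rank, ties.method = "first")
```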

3.3.4 Keyword Selection

Prior to analysis, the number of keywords to retain is fixed, e.g., 100. For each class, all keywords ranked up to that number are selected for classification. Because some keywords may be selected for more than one class, duplicate keywords are removed. Training examples consisting of a classification label and the values of the selected keywords (binary attributes) are then constructed.
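The sketch below selects, say, 100 keywords per class from the ranking above, removes duplicates and builds the binary training table; the cut-off and object names are illustrative.

```r
n_keep <- 100

# Keywords ranked within the top n_keep for at least one class.
selected <- unique(unlist(lapply(seq_len(ncol(keyword_rank)), function(j)
  rownames(keyword_rank)[keyword_rank[, j] <= n_keep])))

# Binary attribute: does the summary contain the keyword at least once?
X <- t(tdm[selected, , drop = FALSE] > 0) * 1
train <- data.frame(X, class = dpc, check.names = TRUE)
```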

3.3.5 Classification

Finally, classification learning methods are applied. This study compared five classification methods: random forest [9], deep learning (multi-layer perceptron; darch), support vector machine (SVM) [7], backpropagation neural network (BNN) [15] and decision tree induction (rpart) [14].
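A sketch of the learning step with the packages named above is shown below (the deep learner is deferred to Sect. 4). The nnet size and MaxNWts values are assumptions for illustration, not the settings used in the experiments.

```r
library(rpart); library(randomForest); library(kernlab); library(nnet)

fit_tree <- rpart(class ~ ., data = train)          # decision tree
fit_rf   <- randomForest(class ~ ., data = train)   # random forest
fit_svm  <- ksvm(class ~ ., data = train)           # support vector machine
fit_bnn  <- nnet(class ~ ., data = train,           # backpropagation NN
                 size = 10, MaxNWts = 100000)       # illustrative settings
```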

Table 1 DPC codes of the top 20 diseases treated during fiscal year 2015

4 Experimental Evaluation

The 20 most frequent DPC codes in the fiscal year 2015 were selected, and discharge summaries that included these codes were extracted from the HIS of Shimane University Hospital. Table 1 shows the statistics of these 20 DPC codes, as well as the average number of characters used in the summaries.

Except for data extraction from the HIS, all processes were performed in R 3.5.0, with analysis and evaluation run on two HP ProLiant ML110 Gen9 workstations (Xeon E5-2640 v3, 2.6 GHz, 8 cores, 64 GB DRAM).

4.1 Mining Process

4.1.1 Correspondence Analysis

Morphological analysis was performed using RMeCab [5]. A bag of keywords was generated and used to construct a contingency table for these summaries. Correspondence analysis was applied to the table using the MASS package in R 3.5.0. Two-dimensional coordinates were assigned to each keyword and each class.

Figure 3 shows the two-dimensional plot of the correspondence analysis. Because the discharge summaries are written in Japanese, all the keywords in the figure are shown in Japanese. English translations of some important keywords for frequent diseases are shown in Table 3. All the keywords in Fig. 3 are arranged along a horseshoe curve, a feature characteristic of both correspondence analysis and principal component analysis [2, 3, 13]. These findings indicate that the correspondence analysis captured the correspondence between keywords and DPC codes.

The information important for classification in Fig. 3 is plotted near the target classes, with the target class values (DPC codes) plotted as numerical codes.

For example, the two numbers at the bottom right denote cataract classes, with the keywords for eye symptoms and surgical operations plotted near these classes. In contrast, the class at the upper right is “Injury to the Elbow and Knee”, with the keywords for rehabilitation and fixation of joints located nearby.

Fig. 3 Results of correspondence analysis

Fig. 4 Distribution of distances between keywords and target classes for the top ten diseases and top 250 keywords. The horizontal axis denotes the distance between keywords and classes, and the vertical axis denotes the number of attributes at the given distance. Because distances close to 0 indicate that a keyword and a class are very close, the figure shows that, except for cataracts and injury to the elbow and knee, the keywords lie very close to the coordinates of each class

Table 2 Numbers of selected keywords and numbers of actually used keywords

4.1.2 Ranking

Next, the distances between the coordinates of each keyword and those of each class are calculated, and the keywords are ranked for each class. Because the target classes and keywords are assigned to the same two-dimensional plane, the distances between classes and keywords can be calculated directly from the assigned coordinates.

Figure 4 shows the distribution of the distances. The horizontal axis denotes the distance between keywords and classes, and the vertical axis denotes the number of attributes at the given distance. Because distances close to 0 indicate that keywords and classes are very close, the figure shows that, except for cataracts and injury to the elbow and knee, the keywords lie very close to the coordinates of each class. Thus, the selection of keywords may be somewhat subtle, and surrogate splits may be useful for the decision tree induction and random forest methods.
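A sketch that reproduces Fig. 4-style histograms from the distance matrix d and the ranking keyword_rank of Sect. 3.3.3 (the 250-keyword cut-off follows the figure caption):

```r
op <- par(mfrow = c(2, 5))
for (j in seq_len(min(10, ncol(d)))) {
  top <- keyword_rank[, j] <= 250          # top 250 keywords for class j
  hist(d[top, j], breaks = 30, main = colnames(d)[j],
       xlab = "Distance to class", ylab = "Number of keywords")
}
par(op)
```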

Using their rankings, a preset number of keywords was selected to generate a table for learning classification.

Table 3 Top 10 keywords selected for each of the top three diseases (English translation)

4.1.3 Keyword Selection

Table 2 shows the total numbers of keywords selected for the top 10 and 20 DPC codes. The selection of 250 keywords for each DPC code would result in a total of 5000 keywords. After the removal of overlapping keywords, only 1932 keywords were used for classification. Some important keywords may be deleted due to overlap if these keywords are frequently used in at least two diseases.

Table 3 shows the top 10 keywords for the three top DPC codes. For comparison, the results obtained by tf-idf are also shown. Interestingly, the keywords selected by correspondence analysis differed from those selected by tf-idf, suggesting that frequency-based information may not play an important role in the classification of discharge summaries.

4.1.4 Classification

Finally, decision tree (package: rpart [14]), random forest (package: randomForest [9]), SVM (package: kernlab [7]), BNN (package: nnet [15]) and deep learner (multi-layer perceptron; package: darch) methods were applied to the generated training examples. For Darch, the numbers of intermediate neurons were set at 20, (40, 20) and (80, 20), with 100 epochs. For all other packages, the default parameter settings were used.
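A hedged sketch of the Darch settings described above is given below; the argument names follow the darch 0.12 interface as we understand it and may need adjustment for other versions, and passing the class labels as a factor is an assumption (some versions may require one-hot encoding).

```r
library(darch)

X <- as.matrix(train[, setdiff(names(train), "class")])
y <- train$class

for (hidden in list(20, c(40, 20), c(80, 20))) {
  # Input layer = number of keywords, output layer = number of DPC codes.
  fit <- darch(x = X, y = y,
               layers = c(ncol(X), hidden, nlevels(y)),
               darch.numEpochs = 100)
}
```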

4.1.5 Evaluation Process

The evaluation process was based on repeated two-fold cross validation [8]. First, the dataset was randomly split 1:1 into training and test samples. The training samples were used to construct a classifier, and the derived classifier was evaluated on the remaining test samples. This procedure was repeated 100 times, and the averaged accuracy was calculated.
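A sketch of this evaluation scheme, shown here for the random forest learner on the training table of Sect. 3.3; the seed is arbitrary.

```r
library(randomForest)
set.seed(1)

accs <- replicate(100, {
  # Random 1:1 split into training and test halves.
  idx  <- sample(nrow(train), size = floor(nrow(train) / 2))
  fit  <- randomForest(class ~ ., data = train[idx, ])
  pred <- predict(fit, newdata = train[-idx, ])
  mean(pred == train$class[-idx])   # accuracy on the test half
})
mean(accs)   # averaged accuracy over the 100 trials
```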

The number of keywords varied from 1 to 1000, selected according to their ranking by correspondence analysis.

Table 4 Experimental results (averaged accuracies)

4.2 Classification Results

Table 4 shows the evaluation results for the top 20 diseases. With four or fewer keywords, all the classifiers showed an accuracy of about 70%. With five or more keywords, however, SVM showed a decrease in accuracy, whereas the other methods showed monotonic increases in accuracy, plateauing at around 200 keywords. The random forest method performed better than the other classifiers, followed by Darch deep learning. When more than 250 keywords were selected, the performance of Darch decreased, whereas the performances of random forest and decision trees continued to increase monotonically. Although BNN showed poorer accuracy than Darch (default setting) with 5 to 100 selected keywords, the accuracy of BNN approached that of the Darch classifiers with larger numbers of keywords. Interestingly, the accuracy of the decision tree method increased monotonically, becoming maximal when all the keywords were used for analysis.

5 Discussion

5.1 Misclassified Cases

Figures 5 and 6 show the confusion matrices of random forest and darch (multi-layer perceptron), with the DPC codes set in order so that similar codes correspond to similar diseases. Shaded regions indicate misclassified patients. Whereas the errors made by the random forest method are located near the diagonal, the errors made by darch are more scattered. This finding suggests that the random forest method would be almost correct in classifying a patient if similar DPC codes were grouped into one generalized class. In contrast, the Darch method made unexpected errors.
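A sketch of how a confusion matrix such as those in Figs. 5 and 6 can be produced for a single train/test split (object names follow the evaluation sketch in Sect. 4.1.5):

```r
idx  <- sample(nrow(train), size = floor(nrow(train) / 2))
fit  <- randomForest(class ~ ., data = train[idx, ])
pred <- predict(fit, newdata = train[-idx, ])

# Rows: predicted DPC codes, columns: true codes; off-diagonal cells are
# misclassifications. Ordering factor levels by DPC code keeps similar
# diseases adjacent, as in the figures.
cm <- table(predicted = pred, actual = train$class[-idx])
```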

Fig. 5 Confusion matrix obtained with the random forest method

Fig. 6 Confusion matrix obtained with the deep learner method

Fig. 7 Decision tree with 1000 keywords from the top 20 diseases

Fig. 8 Ranking of keywords in the decision tree for the top 20 diseases

5.2 Classification Accuracy of Decision Trees

Two results were unexpected: (1) the accuracy of decision trees increased monotonically, and (2) the random forest method was more accurate than the other methods. Because the random forest method can be considered a refinement of decision tree induction, representation by decision trees may provide insight into hidden structures present in the discharge summaries.

Figure 7 shows the decision tree obtained with 1000 selected keywords extracted by morphological analysis, with 23 attributes used in its description. First, because the class boundaries represented by such a tree cannot be captured by a linear combination of keywords, SVM may not achieve high classification accuracy on these data. Second, the selection process based on correspondence analysis may not be appropriate for selecting keywords for SVM.

Figure 8 shows the location of each keyword used in the decision tree according to its ranking for each class. Not all of the keywords used in the tree were among the top-ranked ones, perhaps because the differences in distance among the attributes were very small. Future studies are needed to assess the nature of this ranking.

A review of the decision tree by medical experts found that the tree was very compact but reasonable and that the selection of keywords was very interesting and explainable. This selection may reflect the differences in the description of disease summaries among the target diseases. Further evaluation should include a detailed examination of discharge summaries.

5.3 Execution Time

Two HP ProLiant ML110 Gen9 workstations (Xeon E5-2640 v3, 2.6 GHz, 8 cores, 64 GB DRAM) were used.

Table 5 Times required for construction of classifications for the top 20 diseases

Table 5 shows an empirical comparison of the repeated two-fold cross validations (100 trials). The times needed for random forest and SVM were 183 and 156 min, respectively, for 250 keywords, whereas Darch (20) required 672 min. For 1000 keywords, the times needed for random forest, SVM and Darch (20) were 261, 288 and 1101 min, respectively. The times required by the random forest and BNN methods were close to those of the deep learners. In the case of Darch, increasing the number of intermediate layers resulted in greater computation times, although the growth rate was smaller than that of BNN.

Table 6 Comparison of accuracies (tf-idf)

5.4 Comparison with tf-idf

A major approach in text classification is ranking with tf-idf, introduced by Luhn [10] and Sparck Jones [6]. Thus, tf-idf ranking was compared with the above approach using the same scheme as in Sect. 4. Interestingly, deep learning with tf-idf ranking showed much poorer performance than with ranking by correspondence analysis, whereas random forest with tf-idf ranking was slightly better than with ranking by correspondence analysis for fewer than 200 keywords (Table 6). Because the average accuracy of deep learning differed by only a few percent from that of the random forest method, ranking by correspondence analysis appears to be the better approach for text classification by deep learning, at least in this applied domain. Ranking by correspondence analysis includes geometric information about keywords and concepts, and embedding such geometric knowledge appears to be important for deep learning.
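For reference, a sketch of a standard tf-idf ranking over the same term matrix is given below; the weighting (tf × log(N/df), ranked by the total tf-idf of a keyword within each DPC code) is an assumption and may differ in detail from the weighting used in the experiments.

```r
# tf: keyword frequency normalized per summary; idf: log(N / document freq.).
tf  <- t(t(tdm) / colSums(tdm))
idf <- log(ncol(tdm) / rowSums(tdm > 0))
tfidf <- tf * idf

# Per-class score: total tf-idf of each keyword over the summaries of a
# DPC code; rank 1 is the highest-scoring keyword for that code.
score <- t(rowsum(t(tfidf), group = dpc))
tfidf_rank <- apply(-score, 2, rank, ties.method = "first")
```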

6 Conclusion

This study proposes a five-step method for constructing classifiers for discharge summaries. In the first step, discharge summaries are obtained from the HIS. In the second step, morphological analysis is applied to a set of summaries to generate a term matrix. In the third step, correspondence analysis is applied to the classification labels and term matrix, generating two-dimensional coordinates; measuring the distances between categories and the assigned points enables a ranking of keywords. In the fourth step, keywords are selected as attributes according to their rank, and training examples for the classifiers are generated. Finally, learning methods are applied to the training examples. The method was experimentally validated using discharge summaries from Shimane University Hospital for the 2015 fiscal year. The best performance was provided by the random forest method, with a classification accuracy of about 93%, followed by deep learning with a classification accuracy of about 91%. In contrast, decision tree methods with many keywords were slightly less accurate than the neural network and deep learning methods. The selected keywords and tree structure were deemed reasonable by domain experts, perhaps because the hidden structure of knowledge in the dataset is close to the structure approximated by a set of trees and because deep learning may generate such structures in its networks. Our future work will attempt to validate this hypothesis.