
1 Introduction

Through thousands of years of clinical practice, traditional Chinese medicine has selected and accumulated a large number of formulae. By the end of the late Qing Dynasty, more than 100,000 ancient formulae had been recorded; together with newly developed formulae, hospitals' self-made formulae, and the Chinese patent medicines on the market, the total is vaster still. These formulae embody the therapeutic experience and wisdom of ancient and modern physicians. If they can be effectively sorted out and mined, especially their functions and indications, they will provide important support for improving clinical efficacy and developing new drugs.

To sort out and study the formulae of past dynasties, colleges and research institutes around the country have established various prescription databases and analysis systems [1, 2], which still face problems such as insufficient data standardization and difficulty in efficient retrieval and analysis [3]. Among the various kinds of information contained in formulae, the structuring and standardization of drug composition information has received the most attention and achieved the best solutions [4, 5]. By contrast, there are few studies on extracting and standardizing the remaining formula information (including indications and efficacy). The main reasons are that these descriptions span a long history and exhibit polysemy, synonymy, ambiguity, and overlapping meanings; that the classification system of formulae is not completely unified; and that high-quality, large-scale training corpora are lacking. At present, there is little research on information extraction and standardization based on deep learning natural language processing technology.

In recent years, the extensive application of deep learning models in the medical field and the release and improvement of national standards related to formulae have provided the conditions for automatic processing of prescription information. In this paper, the classification systems and prescription data of the seventh edition of the National Medical Insurance Catalogue [6], GB/T 31773-2015 Coding Rules and Coding of Traditional Chinese Medicine Prescriptions [7], and a prescription textbook [8] are manually integrated to form a prescription efficacy classification data set, and a variety of deep learning text classification models are applied to automatically judge the classification of a prescription from the text of its name, composition, and main efficacy. The classification performance of the models is compared, analyzed, and discussed, providing a useful reference for the further automation of information processing for ancient formulae.

2 Related Work

Text classification refers to the process of automatically determining text categories based on text content under a given classification system. The core task is to extract features from the text and then select appropriate classification algorithms and models to model those features, thereby achieving classification. Figure 1 shows the processing procedure of text classification.

Fig. 1. Processing procedure of text classification.

Text classification models can be divided into traditional machine learning models and deep learning models. Commonly used traditional machine learning models include Naive Bayes [9], Support Vector Machine (SVM) [10], and Decision Tree [11]. Deep learning models include TextCNN, TextRNN, TextRCNN, HAN [12], BERT [13], etc.

At present, text classification has been widely applied in many fields, such as spam filtering [14], public opinion analysis [15], and news classification [16]. In the medical field, Yu [17] proposed a named entity recognition model to automatically identify time and place information in COVID-19 patient trajectory texts. Li [18] proposed a three-stage hybrid method based on a gated attention bidirectional long short-term memory (ABLSTM) network and a regular expression classifier for medical text classification, improving the quality and transparency of medical text classification solutions. Prabhakar [19] proposed a medical text classification paradigm using two novel deep learning architectures to reduce human effort. Cui [20] developed a new text classifier based on regular expressions; machine-generated regular expressions can be effectively combined with machine learning techniques to perform medical text classification and have potential practical value. Zheng [21] proposed a deep neural network model called ALBERT-TextCNN for multi-label medical text classification; its overall F1 value reached 90.5%, effectively improving multi-label classification of medical texts. Li [22] proposed a two-level text classification model based on an attention mechanism for effective classification of biomedical texts.

In summary, text classification has been widely used in the medical field. In this paper, we use deep learning models to classify prescriptions by efficacy; the following sections introduce the models used and analyze the experimental results in detail.

3 Methodology

In this study, we aim to classify prescriptions by efficacy into multiple categories using deep learning methods, including TextCNN, TextRCNN, RNN_Attention, BERT, and their combination models.

3.1 Convolutional Neural Network Model

CNN (Convolutional Neural Network) [23] is widely used in the field of image recognition. A CNN consists of three kinds of layers: convolutional layers, pooling layers, and fully connected layers. The convolutional layer extracts features; the pooling layer down-samples the feature maps without damaging the recognition result; and the fully connected layer performs the classification.

TextCNN, proposed by Kim [24] in 2014, pioneered the use of CNNs to encode n-gram features for text classification. Whereas convolutions over images are two-dimensional, TextCNN uses one-dimensional convolutions of shape (filter_size × embedding_dim), with one dimension fixed to the embedding size, so each filter extracts filter_size-gram information. After data input, the embedding layer converts words into word vectors, producing a two-dimensional matrix; sentence features are then extracted by the one-dimensional convolutional layer; the max-pooling layer maps sentences of different lengths to fixed-length representations; and finally the fully connected layer produces the probability distribution over classes. The structure of TextCNN is shown in Fig. 2.

Fig. 2. The structure of TextCNN.
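As a concrete illustration, the following minimal PyTorch sketch implements this pipeline; the hyperparameters (embedding dimension, filter sizes, number of filters) are illustrative assumptions rather than the settings reported in Table 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Sketch of TextCNN: embedding -> 1-D convolutions over n-gram
    windows -> max-over-time pooling -> fully connected layer."""
    def __init__(self, vocab_size, embed_dim=300, num_classes=21,
                 filter_sizes=(2, 3, 4), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Each filter spans `fs` tokens and the full embedding width,
        # i.e. a one-dimensional convolution of shape (filter_size, embed_dim).
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_filters, (fs, embed_dim)) for fs in filter_sizes)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, x):                       # x: (batch, seq_len)
        e = self.embedding(x).unsqueeze(1)      # (batch, 1, seq_len, embed_dim)
        feats = [F.relu(conv(e)).squeeze(3) for conv in self.convs]
        # Max pooling maps variable-length sentences to fixed-size vectors.
        pooled = [F.max_pool1d(f, f.size(2)).squeeze(2) for f in feats]
        return self.fc(torch.cat(pooled, dim=1))  # class logits
```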

3.2 Recurrent Neural Network Model

RNN (Recurrent Neural Network) [25] is a kind of neural network with short-term memory. In RNN models, neurons accept not only the information of other neurons but also their own information, forming a network structure with loops. Compared with feedforward neural networks, RNNs are more in line with the structure of biological neural networks. RNNs have been widely used in speech recognition, language modeling, and natural language generation.

TextRNN [26] takes advantage of RNNs to solve text classification problems, inferring the label or label set of a given text (sentence, document, etc.). TextRNN has a variety of structures; its classical structure comprises an embedding layer, a Bi-LSTM layer, concatenated outputs, a fully connected layer, and softmax. The structure of TextRNN is shown in Fig. 3.

Fig. 3. The structure of TextRNN.
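A minimal PyTorch sketch of this classical structure, under assumed dimensions, might look as follows (softmax is left to the loss function, as is idiomatic):

```python
import torch.nn as nn

class TextRNN(nn.Module):
    """Sketch of TextRNN: embedding -> Bi-LSTM -> concatenated
    last-step hidden states -> fully connected layer."""
    def __init__(self, vocab_size, embed_dim=300, hidden=128, num_classes=21):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(hidden * 2, num_classes)

    def forward(self, x):                      # x: (batch, seq_len)
        out, _ = self.lstm(self.embedding(x))  # (batch, seq_len, 2*hidden)
        # Forward and backward states at the final time step, concatenated.
        return self.fc(out[:, -1, :])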

TextRCNN [27] (TextRNN + CNN) first uses a bidirectional RNN to obtain the forward and backward context representations of each word, so that each word is represented by the splice of its word vector with the forward and backward context vectors, and then connects the same convolutional and pooling layers as TextCNN.
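A common simplified realization of this idea, sketched below under assumed dimensions, splices the word embeddings with the Bi-LSTM context outputs and applies max-over-time pooling; the exact layering in the authors' implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextRCNN(nn.Module):
    """Sketch of TextRCNN: word vector spliced with bidirectional
    context vectors, then max-over-time pooling and classification."""
    def __init__(self, vocab_size, embed_dim=300, hidden=128, num_classes=21):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(embed_dim + 2 * hidden, num_classes)

    def forward(self, x):
        e = self.embedding(x)                 # (batch, seq_len, embed_dim)
        ctx, _ = self.lstm(e)                 # forward/backward contexts
        rep = torch.cat((ctx, e), dim=2)      # splice word + context vectors
        rep = F.relu(rep).permute(0, 2, 1)    # (batch, dim, seq_len)
        pooled = F.max_pool1d(rep, rep.size(2)).squeeze(2)
        return self.fc(pooled)
```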

RNN_Attention [28] introduces an attention mechanism on top of an RNN to reduce the loss of detailed information when processing long texts.
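The sketch below illustrates one common form of this idea: attention weights over all Bi-LSTM hidden states replace last-step pooling. The scoring scheme (a learned query vector with tanh) is an assumption, not necessarily the variant used in [28].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNAttention(nn.Module):
    """Sketch of RNN + attention: a weighted sum over all hidden states,
    so details from long texts contribute to the final representation."""
    def __init__(self, vocab_size, embed_dim=300, hidden=128, num_classes=21):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.w = nn.Parameter(torch.randn(hidden * 2))  # attention query
        self.fc = nn.Linear(hidden * 2, num_classes)

    def forward(self, x):
        h, _ = self.lstm(self.embedding(x))    # (batch, seq_len, 2*hidden)
        scores = torch.tanh(h) @ self.w        # (batch, seq_len)
        alpha = F.softmax(scores, dim=1).unsqueeze(2)
        return self.fc((h * alpha).sum(dim=1))  # attention-weighted sum
```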

3.3 Bidirectional Encoder Representation from Transformers Model

BERT (Bidirectional Encoder Representation from Transformers) [13] is a pre-trained language representation model. It uses an MLM (Masked Language Model) objective for pre-training and stacks deep bidirectional Transformer components to generate deep bidirectional language representations that fuse contextual information.

Each token of the input (the yellow blocks in Fig. 4) has a corresponding representation composed of three parts: token embedding, segment embedding, and position embedding. The final input vectors are obtained by summing the three embeddings position-wise, and are then fed to the model for classification. The structure of the BERT-classify model is shown in Fig. 4.

Fig. 4. The structure of the BERT-classify model. (Color figure online)

Bert-CNN, Bert-RNN, Bert-RCNN, and Bert-DPCNN are combination models built on the BERT pre-trained model. The data is first passed through BERT, and the BERT output (that is, the output of the last Transformer layer) is then used as the input (embedding_inputs) of the downstream model, e.g., the convolutional layers.
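A sketch of the Bert-CNN combination is given below, using the Hugging Face transformers library; the checkpoint name and head hyperparameters are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel

class BertCNN(nn.Module):
    """Sketch of Bert-CNN: last-layer Transformer outputs serve as the
    embedding inputs of a TextCNN-style classification head."""
    def __init__(self, num_classes=21, filter_sizes=(2, 3, 4), num_filters=100):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        dim = self.bert.config.hidden_size      # 768 for BERT-base
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_filters, (fs, dim)) for fs in filter_sizes)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h = h.unsqueeze(1)                      # (batch, 1, seq_len, dim)
        feats = [F.relu(c(h)).squeeze(3) for c in self.convs]
        pooled = [F.max_pool1d(f, f.size(2)).squeeze(2) for f in feats]
        return self.fc(torch.cat(pooled, dim=1))
```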

4 Experiment

4.1 Data Sources

The experimental data were collected from Chinese patent medicines [6], national standard formulae [7], and the seventh edition of the Chinese formula textbook [8]. A total of 2,618 prescription records were manually integrated: 1,391 Chinese patent medicine records, 1,089 national standard formula records, and 138 Chinese formula textbook records. Each record contains the name, composition, efficacy, and indications of the prescription. The experimental data format is as follows: {Prescription: Bufei Huoxue Capsule} {Composition: Membranous milkvetch root 720 g, radix paeoniae rubra 720 g, malaytea scurfpea fruit 360 g} {Efficiency: benefiting qi for activating blood circulation, invigorating lung and nourishing kidney} {Indication: cor pulmonale (remission stage) diagnosed as qi deficiency and blood stasis syndrome. The symptoms include cough, shortness of breath, asthma, chest tightness, palpitation, cold and weak limbs, soreness and weakness of waist or knees, cyanosis of lips, pale tongue with white coating or dark purple tongue.} {Real label: 1} {Category: reinforcing agents}
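Rendered as a Python record, one such entry might look as follows; the field names follow the example above, while the dict layout itself is an assumption for illustration.

```python
# One record from the integrated data set (layout is illustrative).
record = {
    "prescription": "Bufei Huoxue Capsule",
    "composition": "Membranous milkvetch root 720 g, radix paeoniae rubra 720 g, "
                   "malaytea scurfpea fruit 360 g",
    "efficiency": "benefiting qi for activating blood circulation, "
                  "invigorating lung and nourishing kidney",
    "indication": "cor pulmonale (remission stage) diagnosed as qi deficiency "
                  "and blood stasis syndrome ...",
    "label": 1,
    "category": "reinforcing agents",
}
```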

4.2 Data Integration and Standardization

We excluded Tibetan medicine, Mongolian medicine, Uygur medicine, and other ethnic medicines. Referring to the national classification standard, we consolidated the efficacy categories that had been divided along departmental lines. For example, some Chinese patent medicines are classified by modern disease names, such as drugs for nasal diseases, drugs for ear diseases, and anti-tumor drugs; their efficacy and indications are described mainly in terms of modern diseases, which deviates greatly from the terminology used for conventional formulae. Referring to the primary and secondary classifications of the national standard prescription classification, the fine-grained efficacy categories of Chinese patent medicines were merged into the primary categories with the same efficacy.

From the 2,618 manually integrated prescription records, we retained the data with efficacy-based classification and removed duplicates and records classified by modern disease names, leaving 2,368 records. We then removed 12 records with incomplete efficacy and indication information and 2 records of emetic formulae (too few samples in that category), leaving a total of 2,354 records. After integration and screening, the prescriptions fall into 21 efficacy categories. The distribution of sample categories is shown in Table 1.

Table 1. Distribution of sample categories.

4.3 Experimental Parameters

The 2,354 prescription records were randomly shuffled and divided into training, validation, and test sets at a ratio of 8:1:1.
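A minimal sketch of this split is shown below; the seed value is an illustrative assumption.

```python
import random

def split_8_1_1(records, seed=42):
    """Shuffle records and split into train/validation/test at 8:1:1."""
    random.Random(seed).shuffle(records)
    n = len(records)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])

# With 2,354 records this yields roughly 1,883 / 235 / 236 items.
```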

The model parameters used in this paper are shown in Table 2. To reduce the risk of over-fitting, training follows the early-stopping principle, with the detection parameter detect_imp = 1,000 and dropout = 0.1.
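Interpreting detect_imp as the patience window, the early-stopping rule can be sketched as follows; the exact variable names and semantics in the original code base are unknown.

```python
# Sketch of the early-stopping rule: halt training if the validation loss
# has not improved for detect_imp batches (assumed interpretation).
best_loss, last_improve, detect_imp = float("inf"), 0, 1000

def should_stop(step, val_loss):
    global best_loss, last_improve
    if val_loss < best_loss:
        best_loss, last_improve = val_loss, step
    return step - last_improve > detect_imp
```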

Table 2. Model parameters.

4.4 Statistics for Model Evaluation Measures

To evaluate performance, our experiments use the classic evaluation metrics of text classification, namely Accuracy, Precision, Recall, and F1-Measure. Because the multi-category data in this study are limited and the samples per category are imbalanced, weighted precision, weighted recall, and weighted F1 are used as the overall evaluation indicators. The formulas are as follows:

$$\begin{aligned} Accuracy =\frac{TP+TN}{TP+TN+FP+FN}\times 100\% \end{aligned}$$
(1)
$$\begin{aligned} Precision =\frac{TP}{TP+FP}\times 100\% \end{aligned}$$
(2)
$$\begin{aligned} Recall =\frac{TP}{TP+FN}\times 100\% \end{aligned}$$
(3)
$$\begin{aligned} F1\text{-}Measure =2\times \frac{Precision\times Recall}{Precision+Recall}\times 100\% \end{aligned}$$
(4)

TP, TN, FP, and FN denote samples correctly predicted as positive, samples correctly predicted as negative, samples wrongly predicted as positive, and samples wrongly predicted as negative, respectively. In the multi-class task, the category under evaluation is treated as positive and all other categories as negative. The precision, recall, and F1 of each category are computed first, and the weighted values are then obtained by weighting each category according to its sample proportion.
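These weighted indicators can be computed as sketched below; scikit-learn is one possible implementation, not necessarily the one used by the authors.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Per-category metrics averaged with weights proportional to
    category frequency, matching the overall indicators above."""
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return {"accuracy": acc, "weighted_precision": p,
            "weighted_recall": r, "weighted_f1": f1}
```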

4.5 Results

Including the 3 non-pre-trained models, 8 models were used to classify the efficacy of formulae. The experimental results are shown in Table 3.

Table 3. Experimental results.

Among the non-pre-trained models, TextRCNN outperforms the other two: its accuracy is 3 to 8 percentage points higher, reaching 72.77%, and its loss of 0.87 is only slightly worse than TextCNN's 0.82. Its weighted precision, weighted recall, and weighted F1 are the best of the three. The RNN_Attention model performed worst, with an accuracy of only 64.68%.

The combination models built on the BERT pre-trained language model outperform the non-pre-trained models. Among them, Bert-CNN achieves the best efficacy classification on our data: its accuracy reaches 77.87%, and its loss is only slightly inferior to TextCNN's. Based on these results, Bert-CNN is the most suitable model for this task.

The precision of each efficacy category under the Bert-CNN model is as follows. Tranquilizing formulae: 66.67%; reinforcing formulae: 73.17%; astringent preparations: 100.00%; settlement formulae: 20.00%; diaphoretic formulae: 63.16%; resuscitating formulae: 50.00%; regulating qi formulae: 77.27%; blood regulating formulae: 88.00%; antipyretic agents: 76.60%; anthelmintic: 100.00%; dampness-dispelling formulae: 90.91%; summer-heat clearing formulae: 100.00%; interior-warming formulae: 100.00%; digestive formulae: 50.00%; purgative formulae: 100.00%; carbuncle therapeutic formulae: 100.00%; wind-calming medicine: 85.71%; moistening formulae: 0.00%; turbidity lipid-lowering formulae: 100.00%; detumescence formulae: 75.00%; and phlegm, cough, asthma formulae: 90.00%.

5 Discussion

5.1 Analysis of the Overall Results of Experimental Classification

Model Performance. The classification accuracy of most models is above 70%. Among the non-pre-trained models, TextRCNN performs best and RNN_Attention performs worst. The attention mechanism in RNN_Attention is combined with an RNN, which must use the history at time t-1 to compute the state at time t, so it cannot be parallelized, limiting its effectiveness. In TextRCNN, the convolutional layer of the CNN is replaced by a bidirectional RNN followed by a pooling layer, so its predictions are better. In general, the BERT pre-trained language model and its combination models are superior to the non-pre-trained models, and Bert-CNN has the best classification effect. The main structure of BERT is the Transformer encoder, which has a stronger text encoding ability than RNN and LSTM.

Classification Data. Categories with larger sample sizes reach a precision of about 80%, while results on smaller categories are unstable. Beyond sample size, performance also depends on whether a category has clear feature words, which are easier for the machine to learn. 7 of the 21 categories reached a precision of 100%, mainly because their feature terms are comparatively distinctive and easy to learn. Moistening formulae has the lowest precision, followed by digestive formulae and resuscitating formulae (50%), which are easily confused with other categories.

5.2 Summary and Cause Analysis of Typical Misclassified Data

By analyzing the cases of machine misclassification, summarized in Table 4, we can draw the following conclusions. Although the models differ in structure and in the features they recognize and learn, the results of some models are very similar; at the same time, each model has its own characteristics, with its own emphasis in recognition and learning. Taking Dingkun Dan as an example, the models' predictions are mainly split between "reinforcing formulae" (label 1) and "blood regulating formulae" (label 7), in roughly equal numbers.

Table 4. Typical error cases.

A text classification algorithm based only on surface text features has difficulty understanding the connotation of a text without knowledge of TCM theory. For example, Maimendong Decoction, a "moistening formula" (label 17), contains fewer feature words related to "dryness" than to "heat" and "fire", which causes most models to recognize it as an "antipyretic agent" (label 8). Similarly, the models cannot understand that "Qingbanxia" dries dampness and resolves phlegm, or that "ophiopogon" nourishes yin and moistens dryness.

The classification given by existing standards is itself sometimes debatable. For example, the standard classification of Xingsusan Powder is "moistening formulae" (label 17), while most models classify it as "phlegm, cough, asthma formulae" (label 20) or "diaphoretic formulae" (label 4); from the perspective of traditional Chinese medicine theory, these results are also acceptable.

6 Conclusion

In this paper, a standard data set for prescription efficacy classification was constructed. By comparing and analyzing several deep learning text classification models, we found that the Bert-CNN model performs best, improving the accuracy of automatic classification of prescription function.

Future work will address the following issues: (1) collecting more data and using methods such as K-fold cross-validation to improve data utilization; (2) adding a medical pre-training model to improve recognition of medical terms and incorporating prior knowledge of traditional Chinese medicine, ultimately improving classification accuracy; and (3) inviting more experts in relevant fields to discuss the efficacy classification of prescriptions, both to optimize the classification criteria and to adjust the standard classification results.