1 Introduction

Traditional Chinese Medicine (TCM) was developed through thousands of years of empirical testing and refinement, and it played an important role in health maintenance for the ancient Chinese people [7]. It is a theoretical system that formed and developed gradually through long-term medical practice. TCM is convenient, inexpensive, and has few side effects, which makes it suitable for use in hospitals, including community hospitals with limited resources.

A TCM prescription consists of a variety of herbs and has been the main means of treating diseases for thousands of years. Over the long course of Chinese history, many prescriptions have been devised, and more than 100,000 have been recorded [24]. An example of a prescription from the Dictionary of Traditional Chinese Medicine Prescriptions is given in Fig. 1 [22, 38].

Fig. 1 An example of a prescription in the Dictionary of Traditional Chinese Medicine Prescriptions. The composition herbs are the most important part of a prescription, and in general the term prescription refers to its composition herbs

There are four important diagnostic methods in TCM: observing, listening, inquiry, and pulse feeling. Observing assesses the state of health or disease through objective observation of all visible signs and effluents of the whole body and its parts. Face diagnosis is a common form of observing, in which the pathological state of various organs is inferred from changes in facial features [39]. Facial appearance signals information about an individual [16]. The face is rich in capillaries and acts like a mirror that reflects a person’s physiological and pathological state. From the viewpoint of TCM, the characteristics of the various regions of the face represent the health status of the internal organs. A doctor can therefore judge the physical condition of a patient by observing the patient’s facial features.

Computer-aided diagnosis (CAD) based on artificial intelligence (AI) is an extremely important research field in intelligent healthcare [6]. According to a survey, deep learning algorithms, especially convolutional neural networks (CNNs), have been widely used in recent years for medical image processing tasks such as disease classification, lesion detection, and substructure segmentation, owing to their excellent performance in computer vision [20]. From the 306 papers reviewed in that survey, it is evident that deep learning has pervaded every aspect of medical image analysis [20]. End-to-end training of convolutional neural networks has become a good choice for medical image processing tasks.

However, to the best of our knowledge, no prior work has mined the relationship between a patient’s face and TCM prescriptions. In real TCM practice, doctors prescribe based on features of the face, tongue, pulse, and voice, together with symptoms. Generating TCM prescriptions from face images is of great significance for assisting doctors in diagnosis and treatment. Generated prescriptions can serve as recommendations, which is especially useful as a reference for young doctors. After making some modifications, doctors can apply them in practice. Compared with prescribing from scratch, this lowers treatment costs and improves the efficiency of prescribing. A large number of data samples can be used to learn the relationship between patients’ faces and TCM prescriptions. Learning how to prescribe from patients’ diagnostic data can provide a reference for TCM doctors when they observe and diagnose patients.

In this paper, we propose to use deep learning, specifically convolutional neural networks, to generate TCM prescriptions from a patient’s face image. The main contributions are as follows:

  1. A conventional convolutional neural network is designed to encode the features of the patient’s face image and generate TCM prescriptions.

  2. Considering that different facial organs (eyes, nose, mouth) and regions (cheeks, chin) represent the status of internal organs (heart, liver, spleen, lungs, and kidneys), a multi-scale convolutional neural network based on three-grained face is proposed. It extracts features of the facial organs, the facial regions, and the entire face to learn to generate TCM prescriptions.

  3. We conduct experiments to verify the effectiveness of convolutional neural networks for face feature encoding and prescription generation.

The rest of this paper is organized as follows. Section 2 discusses related work on TCM prescriptions and medical image processing. Section 3 describes the task and the methodology. Section 4 presents and analyzes the experimental results. Section 5 provides some discussion, and Section 6 concludes the paper.

2 Related work

Deep learning in medical image processing

Deep learning and convolutional neural networks have become popular topics in medical image processing, and many studies apply deep learning to this field. In terms of disease classification, there are studies on breast cancer image classification [2, 8], lung pattern classification [1], and Alzheimer’s disease classification [12]. In the detection of lesions and diseases, there are studies on cancerous tissue recognition [28], cardiovascular disease detection [31], pancreatic cancer prediction [25], and melanoma recognition [40]. In the segmentation of organs and substructures, there are studies on skin lesion segmentation [23, 42], microvasculature segmentation of arterioles [17], and tumor segmentation [47]. In addition, there are many other applications, such as studies of the visual attention of patients with dementia [5], diagnosis of cirrhosis stage [36], and construction of brain maps [46].

TCM prescriptions

On the other hand, some work has been devoted to the study of TCM prescriptions. Some studies analyzed and explored TCM prescriptions and discovered regularities in them [21, 34, 45, 48, 49]. Some studies used topic models to discover prescribing patterns [37, 38]. Other studies address the classification and recognition of TCM medicines [9, 33] and knowledge graphs for TCM [32, 41].

In practice, TCM doctors can judge the health of various internal organs by observing the patient’s face. Combining this with other characteristics, doctors write a TCM prescription based on their knowledge. Our work attempts to simulate and learn this process. Using deep learning techniques, we can learn how to prescribe from a large amount of medical data. At the current stage, the medical data from which we learn are patients’ face images and the corresponding prescriptions. The study is of great importance for assisting doctors in diagnosis and treatment.

3 Methodology

3.1 Data collection and task description

The dataset used in our study was collected from cooperating hospitals. After preprocessing, it contains 9,653 data pairs in total. Each data pair consists of a patient’s face image and the corresponding TCM prescription.

All Chinese herbal medicines are included in a unified dictionary H = {h1, h2,..., hn}. The i-th element hi of H represents the i-th Chinese herbal medicine, and n is the number of Chinese herbal medicines; in our dataset, n is 559. Each TCM prescription can be represented by a vector y = [y1, y2,..., yn], where each element yi is either 1 or 0, indicating whether the i-th Chinese herbal medicine is prescribed. Each patient’s face image is represented by a pixel matrix x of size 224x224x3. X denotes all face images in the dataset and Y denotes all prescriptions in the dataset.
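As an illustration of this encoding, the following minimal sketch builds the multi-label vector y for one prescription. The four-herb dictionary and the helper name `encode_prescription` are hypothetical and chosen only for the example; the dictionary H used in the paper has n = 559 entries.

```python
def encode_prescription(herbs, dictionary):
    """Return the multi-hot vector y for a prescription.

    y[i] is 1 if dictionary[i] is prescribed, otherwise 0.
    """
    index = {herb: i for i, herb in enumerate(dictionary)}
    y = [0] * len(dictionary)
    for herb in herbs:
        y[index[herb]] = 1  # raises KeyError if the herb is not in the dictionary
    return y


# Toy dictionary for illustration only (hypothetical subset of H).
H = ["Radix Glycyrrhizae", "Poria Cocos", "Perilla Stem", "Curcuma Zedoary"]
y = encode_prescription(["Poria Cocos", "Radix Glycyrrhizae"], H)
print(y)  # [1, 1, 0, 0]
```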

The task of this paper is to take a patient’s face image (pixel matrix x) as input and output the patient’s corresponding prescription y. Because the prescription y is a multi-label vector, the task is a multi-label learning problem. Multi-label learning studies the problem where each example is represented by a single instance while associated with a set of labels simultaneously [44].

3.2 Construction of conventional convolutional neural network

Deep convolutional neural networks are widely used in the field of image processing. They can extract latent features from the original pixel matrix with RGB color channels for use in various image tasks such as classification, detection, and segmentation. Classical convolutional neural network structures include AlexNet [19], VGGNet [26], GoogleNet [29], ResNets [11, 35, 43], DenseNet [14] and SENet [13].

The convolutional neural network used for prescription generation is composed of several convolutional modules and fully connected layers. Each convolutional module includes a convolutional layer and a pooling layer. To extract features from the image, the convolutional layer scans the image matrix with a set of convolutional kernels to produce a feature map C. A convolutional kernel is a weight matrix, which we denote by K. With the ReLU [10] activation function, this operation can be abstracted as the following function:

$$ C(x,K)=ReLU(Conv(x,K)). $$
(1)

In order to extract more important features and reduce the computational complexity, the max-pooling layer is used to downsample the feature map C, which can be represented by the following function (the parameters of max-pooling layer are omitted):

$$ \hat{C}(x,K)=Max(C(x,K)). $$
(2)

Three consecutive convolution and pooling operations can be abstracted into the following function:

$$ \hat{C}^{3}(x,K)=\hat{C}(\hat{C}(\hat{C}(x,K))). $$
(3)
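To make Eqs. 1 and 2 concrete, the following sketch applies one convolution with ReLU and one max-pooling step to a small single-channel image. It is a simplified illustration under stated assumptions: a single channel instead of RGB, no padding, and a 2x2 pooling window, none of which are specified in the text above. The function names `conv2d_relu` and `max_pool` are hypothetical.

```python
def conv2d_relu(x, k):
    """Sliding-window 2-D convolution of x with kernel k (no padding), then ReLU (Eq. 1)."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            s = sum(x[r + i][c + j] * k[i][j] for i in range(kh) for j in range(kw))
            row.append(max(0.0, float(s)))  # ReLU
        out.append(row)
    return out


def max_pool(fm, size=2):
    """Non-overlapping max-pooling with a size x size window (Eq. 2)."""
    return [
        [max(fm[r + i][c + j] for i in range(size) for j in range(size))
         for c in range(0, len(fm[0]) - size + 1, size)]
        for r in range(0, len(fm) - size + 1, size)
    ]


# 6x6 toy image whose intensity grows from left to right.
x = [[c for c in range(6)] for _ in range(6)]
# 3x3 kernel that responds to horizontal intensity changes.
K = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

C = conv2d_relu(x, K)      # 4x4 feature map
C_hat = max_pool(C)        # 2x2 feature map after downsampling
print(C_hat)  # [[6.0, 6.0], [6.0, 6.0]]
```

Stacking three such convolution and pooling stages, each with its own kernel, corresponds to Eq. 3.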

To encode the features, several fully connected layers are usually appended after the convolution modules. The weight parameters of the fully connected layer are denoted by W1. The operation of a fully connected layer with a ReLU activation function can be abstracted as the following function:

$$ f(\hat{C}^{3},W1)=ReLU(FC(\hat{C}^{3},W1)). $$
(4)

The last layer is the output layer, which is a fully connected layer with sigmoid activation function. The weight is represented by W2. It outputs the probability of whether each Chinese herbal medicine is prescribed, which can be abstracted as the following function:

$$ \begin{array}{lll} P(x,{\varTheta} )&=sigmoid(FC(f,W2))\\ &=[P(h_{1}|x,{\varTheta} ),...,P(h_{n}|x,{\varTheta})]. \end{array} $$
(5)

Θ = {K, W1, W2} is the set of all parameters, and the convolutional kernel K for each convolutional operation described above is different.

The loss function of the convolutional neural network is the average of n binary cross-entropy terms, one per Chinese herbal medicine. Each term measures the difference between the predicted probability P(hi|x, Θ) that the i-th Chinese herbal medicine is prescribed and the actual label yi. The neural network minimizes the loss function by optimizing all parameters using stochastic gradient descent [3], which can be abstracted as the following functions (m is the size of the dataset):

$$ \begin{array}{lll} J(\varTheta, x)=&\frac{1}{n}\sum \limits_{i=1}^{n}[-y_{i}log(P(h_{i}|x,{\varTheta}))-\\ &(1-y_{i})log(1-P(h_{i}|x,{\varTheta}))]; \end{array} $$
(6)
$$ {\varTheta}^{*} = arg \mathop {min }\limits_{\varTheta}\frac{1}{m}\sum \limits_{j=1}^{m}J(\varTheta, x_{j}). $$
(7)
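The loss in Eqs. 6 and 7 can be sketched numerically as follows. The function names and the clipping constant `eps` are assumptions added for the example (clipping only guards against log(0) and is not part of the equations).

```python
import math


def prescription_loss(y, p, eps=1e-7):
    """Mean binary cross-entropy over the n herbs of one sample (Eq. 6)."""
    total = 0.0
    for y_i, p_i in zip(y, p):
        p_i = min(max(p_i, eps), 1.0 - eps)  # assumed clipping to avoid log(0)
        total += -y_i * math.log(p_i) - (1 - y_i) * math.log(1.0 - p_i)
    return total / len(y)


def dataset_loss(ys, ps):
    """Average of the per-sample losses over m samples (the objective in Eq. 7)."""
    return sum(prescription_loss(y, p) for y, p in zip(ys, ps)) / len(ys)


y = [1, 0]
uninformative = prescription_loss(y, [0.5, 0.5])  # equals ln 2
confident = prescription_loss(y, [0.9, 0.1])      # equals -ln 0.9
print(round(uninformative, 4), round(confident, 4))  # 0.6931 0.1054
```

A prediction that assigns high probability to prescribed herbs and low probability to the others yields a smaller loss, which is what minimizing Eq. 7 rewards.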

The structure of the convolutional neural network is shown in Fig. 2. It contains three convolution modules for extracting features, a fully connected layer for encoding features, and the final output layer. All convolution kernels are of size 3x3. The input of the network is the patient’s face image matrix x, of size 224x224x3. The output layer has n units, where n is the size of the Chinese herbal medicine dictionary H, and each unit gives the probability that a certain Chinese herbal medicine is prescribed. The real output y also has n dimensions, and each value is 0 or 1, indicating whether the medicine is prescribed. The loss is the average cross-entropy loss calculated from the network output P and the real output Y. P contains the probabilities of being prescribed for all Chinese herbal medicines in dictionary H. Finally, according to dictionary H, the final prescription is obtained by selecting from P the medicines whose probability exceeds a threshold t.

Fig. 2 A conventional convolutional neural network for generating TCM prescriptions

3.3 Construction of multi-scale convolutional neural network based on three-grained face

Different regions of a face image have different local statistics [30]. To deal with this problem, Taigman et al. [30] use locally connected layers, which are like convolutional layers except that every location in the feature map learns a different set of filters. However, the use of locally connected layers greatly increases the number of model parameters, and only a large amount of data can support this approach. Instead, we extract the features of different facial regions using different small convolutional networks.

According to TCM, the characteristics of the various regions of the face represent the health of the internal organs of the human body. To encode the features of each region of the face more effectively, we propose a multi-scale convolutional neural network based on three-grained face. “Three-grained” refers to the organ block, the local region block, and the face block, each of which captures characteristics of the face at a different granularity. The organ block includes the left eye, right eye, nose, and mouth. The local region block includes the left cheek, right cheek, and chin. The face block is the entire face. The network is expected to extract and encode more effective facial features from these granularities, thereby improving the effectiveness of prescription generation.

In the data preprocessing stage, the patient’s face is segmented to obtain images of its different regions. An example of the region images obtained by cutting a face [15] is given in Fig. 3. The region images are smaller than the full face image. The organ block images \(X_{organ}\) include a left-eye image \(X_{o-1}\), a right-eye image \(X_{o-2}\), a nose image \(X_{o-3}\), and a mouth image \(X_{o-4}\), each of size 56x56x3. The local region block images \(X_{region}\) include a left-cheek image \(X_{r-1}\), a right-cheek image \(X_{r-2}\), and a chin image \(X_{r-3}\), each of size 112x112x3. The face block is the entire face \(X_{face}\), whose image size is 224x224x3.

Fig. 3 Different organ and local region images of the segmented face. Left: The organ block images. Right: The local region block images

3.3.1 Extracting feature of facial organ

First, feature extraction is performed on the organ block. Each of the four organ block images is passed through a convolution, and the four resulting feature maps are concatenated. The operation can be abstracted as the following functions:

$$ \begin{array}{lll} C_{o-i}=C(X_{o-i},K), i\in\{1,2,3,4\}; \end{array} $$
(8)
$$ \begin{array}{lll} Concat_{o}=Concat(&C_{o-1},C_{o-2},C_{o-3},C_{o-4}). \end{array} $$
(9)

In computer vision applications, there is often not enough data, and models easily overfit. Dropout [27] is commonly used to prevent overfitting. Dropout randomly discards neural units during the training phase. This prevents units from co-adapting too much and forces the network to learn more robust features. It reduces the size of the network during training and yields a number of thinner networks whose combination has an effect similar to ensembling [27]. After dropout is applied to the feature map Concato, a convolution operation is performed again to obtain a feature map Co, which contains the features of the organ block. The above operations can be abstracted as the following function:

$$ C_{o}=C(Concat_{o},K). $$
(10)

3.3.2 Extracting feature of facial local region

Second, feature extraction is performed on the local region block. Each of the three local region block images is passed through convolution and max-pooling, and the three resulting feature maps are concatenated together with the feature map extracted from the organ block. The above operation can be abstracted as the following functions:

$$ C_{r-i}=\hat{C}(X_{r-i},K),i\in\{1,2,3\}; $$
(11)
$$ \begin{array}{lll} Concat_{o\_r}=Concat(&C_{o},C_{r-1},C_{r-2},C_{r-3}). \end{array} $$
(12)

After dropout is applied to the feature map \(Concat_{o\_r}\), convolution and max-pooling operations are performed to obtain a feature map \(C_{o\_r}\), which fuses the features of the organ block and the local region block. The above operation can be abstracted as the following function:

$$ C_{o\_r}=\hat{C}(Concat_{o\_r},K); $$
(13)

3.3.3 Extracting feature of entire face

Finally, feature extraction is performed on the face block. The entire face image is passed through several convolution and max-pooling operations, and the resulting face block feature map is concatenated with the feature map \(C_{o\_r}\). The above operation can be abstracted as the following functions:

$$ C_{face}=\hat{C}(\hat{C}(\hat{C}(X_{face},K))); $$
(14)
$$ Concat_{o\_r\_f}=Concat(C_{o\_r},C_{face}). $$
(15)

After dropout is applied to the feature map \(Concat_{o\_r\_f}\), two fully connected layers encode it into the final features, which fuse the features of the organ block, the region block, and the face block. The above operation can be abstracted as the following function, where W3 and W4 are the weights of the fully connected layers.

$$ C_{o\_r\_f}=f(f(Concat_{o\_r\_f},W3),W4) $$
(16)

3.3.4 Training based on three-grained face features (organ, local region, entire face)

The convolutional neural network has three output layers. The first output Porgan predicts from the feature map Co, which contains the features of the organ block. The second output Pregion predicts from the feature map \(C_{o\_r}\), which contains the features of the organ block and the region block. The third output Pface predicts from the final feature \(C_{o\_r\_f}\), which contains the features of the organ block, the region block, and the face block. The above operations can be abstracted as the following functions, where Wo1, Wo2 and Wo3 represent the weights of the output layers.

$$ \begin{array}{lll} P_{organ}=f(C_{o},W_{o1}) \end{array} $$
(17)
$$ \begin{array}{lll} P_{region}=f(C_{o\_r},W_{o2}) \end{array} $$
(18)
$$ \begin{array}{lll} P_{face}=f(C_{o\_r\_f},W_{o3}) \end{array} $$
(19)

Porgan, Pregion, and Pface each denote the probabilities of being prescribed for all Chinese herbal medicines in dictionary H. Among them, Pface is the main output of the neural network and is used for the final generation decision. Porgan and Pregion are auxiliary outputs, which are used to assist the training of the entire network. The final loss is the sum of three losses, calculated from Porgan, Pregion, and Pface, respectively, and the real output Y. We use stochastic gradient descent to optimize the parameters so that the final loss is minimized. The loss functions are as follows, where Θ denotes the set of all parameters of the neural network and n is the dimension of each real prescription y.

$$ \begin{array}{lll} J1({\varTheta} )=\frac{1}{n}\sum \limits_{i=1}^{n}[&-Y_{i}log(P_{organ})-\\&(1-Y_{i})log(1-P_{organ})] \end{array} $$
(20)
$$ \begin{array}{lll} J2({\varTheta} )=\frac{1}{n}\sum\limits_{i=1}^{n}[&-Y_{i}log(P_{region})-\\&(1-Y_{i})log(1-P_{region})] \end{array} $$
(21)
$$ \begin{array}{lll} J3({\varTheta} )=\frac{1}{n}\sum \limits_{i=1}^{n}[&-Y_{i}log(P_{face})-\\&(1-Y_{i})log(1-P_{face})] \end{array} $$
(22)
$$ {\varTheta}^{*} = arg \mathop {min }\limits_{\varTheta}{J1({\varTheta})+J2({\varTheta})+J3({\varTheta})}. $$
(23)
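As a small numerical illustration of Eqs. 20 to 23, the sketch below sums the three cross-entropy losses with equal weight, as the equations state. The function names and the toy probabilities are hypothetical; in the actual network the three probability vectors come from the three output layers.

```python
import math


def bce(y, p):
    """Mean binary cross-entropy between labels y and probabilities p."""
    return sum(-yi * math.log(pi) - (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p)) / len(y)


def total_loss(y, p_organ, p_region, p_face):
    """J1 + J2 + J3: the auxiliary and main losses are added with equal weight."""
    return bce(y, p_organ) + bce(y, p_region) + bce(y, p_face)


y = [1, 0, 0]
# Toy outputs: the organ head is uninformative, the face head is the most confident.
loss = total_loss(y, [0.5, 0.5, 0.5], [0.7, 0.3, 0.3], [0.9, 0.1, 0.1])
print(round(loss, 4))  # 1.1552
```

Only Pface is used when a prescription is generated; the two auxiliary terms affect training alone.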

The structure of the multi-scale convolutional neural network based on three-grained face is shown in Fig. 4. The input organ block images are of size 56x56x3, the input region block images are of size 112x112x3, and the input face block image is of size 224x224x3. All convolution kernels are of size 3x3.

Fig. 4 A multi-scale convolutional neural network based on three-grained face for generating TCM prescriptions

The network is divided into three parts. The first part extracts the features of the organ block to obtain the output Porgan. The second part extracts the features of the region block, merges them with the features of the organ block, and continues feature extraction to obtain the output Pregion. The third part extracts the features of the face block, merges them with the features of the organ block and the region block, and continues feature extraction to obtain the output Pface. Each of the three outputs gives the probabilities of being prescribed for all Chinese herbal medicines in dictionary H. The loss used to train the entire network is the sum of three losses, calculated from Porgan, Pregion, and Pface, respectively, and the real output Y. The final generated prescription is obtained from the output Pface by selecting the medicines whose probability exceeds the threshold t.
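The three branches can only be concatenated if their feature maps have the same spatial size. The following sketch walks through the spatial sizes implied by Eqs. 8 to 15 under two assumptions that are not stated in the text: every 3x3 convolution preserves the spatial size (zero padding), and every max-pooling step halves it (2x2 window). Under these assumptions the input sizes 56, 112 and 224 line up exactly.

```python
def conv_same(size):
    """Spatial size after a 3x3 convolution, assuming size-preserving padding."""
    return size


def pool(size, window=2):
    """Spatial size after non-overlapping max-pooling (assumed 2x2 window)."""
    return size // window


# Organ branch: convolution on each 56x56 crop, concatenation, convolution (Eqs. 8-10).
c_o = conv_same(conv_same(56))

# Region branch: convolution + pooling on each 112x112 crop (Eq. 11).
c_r = pool(conv_same(112))
assert c_o == c_r  # needed for the concatenation in Eq. 12

# Fused organ + region features: convolution + pooling (Eq. 13).
c_o_r = pool(conv_same(c_o))

# Face branch: three convolution + pooling stages on the 224x224 image (Eq. 14).
c_face = 224
for _ in range(3):
    c_face = pool(conv_same(c_face))
assert c_o_r == c_face  # needed for the concatenation in Eq. 15

print(c_o, c_r, c_o_r, c_face)  # 56 56 28 28
```

The concatenations themselves are assumed to stack feature maps along the channel axis, so only the spatial sizes need to agree.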

3.4 Data augmentation

In the real world, patients’ medical data is precious and difficult to collect, so the collected face images and prescriptions are very limited. A limited dataset easily causes a model to overfit, which is one reason for not choosing an overly complex network. Data augmentation is an effective way to cope with insufficient data. It can reduce overfitting and improve the model’s predictive performance.

To make full use of the limited data, we perform data augmentation. Some of the original face images are selected at random, transformed randomly (for example, by rotation or zoom), and saved as new face images. The prescription of the original patient is used as the label of the new face image. Data augmentation increases the size and diversity of the dataset. The original dataset contains 9,653 samples; after data augmentation, the dataset size increases to 18,463. Some parameters used in data augmentation are shown in Table 1.

Table 1 Parameters of data augmentation
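The augmentation procedure can be sketched as below. This is only an illustration: the image is a single-channel nested list, the transform uses nearest-neighbor sampling with zero fill, and the parameter values `max_angle` and `zoom_range` are hypothetical placeholders rather than the values in Table 1. Going from 9,653 to 18,463 samples corresponds to 8,810 generated images.

```python
import math
import random


def rotate_zoom(img, angle_deg, zoom):
    """Rotate and zoom a 2-D image about its center (nearest neighbor, zero fill)."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Map each output pixel back to the source pixel it is copied from.
            dy, dx = (r - cy) / zoom, (c - cx) / zoom
            sr = int(round(cy + cos_a * dy - sin_a * dx))
            sc = int(round(cx + sin_a * dy + cos_a * dx))
            if 0 <= sr < h and 0 <= sc < w:
                out[r][c] = img[sr][sc]
    return out


def augment(pairs, n_new, max_angle=10.0, zoom_range=(0.9, 1.1), seed=0):
    """Append n_new randomly transformed copies; each copy keeps its source label."""
    rng = random.Random(seed)
    new_pairs = []
    for _ in range(n_new):
        img, y = rng.choice(pairs)
        angle = rng.uniform(-max_angle, max_angle)  # hypothetical range
        zoom = rng.uniform(*zoom_range)             # hypothetical range
        new_pairs.append((rotate_zoom(img, angle, zoom), y))
    return pairs + new_pairs


toy_image = [[1, 2], [3, 4]]
dataset = augment([(toy_image, [1, 0])], n_new=3)
print(len(dataset))  # 4
```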

4 Experiment

4.1 Dataset

The “face image - TCM prescription” dataset was collected from several cooperating hospitals. Owing to the limited collection conditions, the raw data contain some noise. For example, the same medicine may appear under different names. The experimental dataset is obtained after preprocessing, one step of which is face detection. When collecting face images, we tried to ensure image quality by asking the patient to face the camera and by placing the face as accurately and clearly as possible in the middle of the image so that it fills the entire frame. Nevertheless, some background noise remains, so we use face detection to reduce it. We first use the dlib library [18] to detect the approximate position of the face. Because the bounding box returned by dlib is tight, part of the face may be lost if the detection is inaccurate, and the bounding box does not contain the forehead. Therefore, we enlarge the bounding box by a certain percentage. The final detection results can be seen in the face images in Table 8.

The size of experimental dataset Dorigin is 9653. After data augmentation, the size of dataset is increased to 18463, and the dataset is denoted as Daug. In order to train multi-scale convolutional neural network based on three-grained face, the face images are segmented into different face areas: eyes, nose, mouth, cheeks, and chin. The specific description of the dataset is shown in Tables 2 and 3.

Table 2 Face images information
Table 3 Prescription information

To make the experimental results more reliable, we train and evaluate each model with a 5-fold cross-validation procedure: training is repeated five times, and 500 samples are held out as the test set each time. (Conventional 5-fold cross-validation divides the data into five equal parts and uses each part in turn as the test set; because the dataset is limited, only 500 samples are held out each time here.) The 500 test samples of the five runs do not overlap. The average of the five evaluation results is reported as the final result.
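The splitting scheme described above can be sketched as follows for the original dataset of 9,653 samples. The function name, the shuffling, and the seed are assumptions made for the example; how augmented images are assigned to the training sets is not specified here and is therefore not modeled.

```python
import random


def disjoint_test_splits(n_samples, n_rounds=5, test_size=500, seed=0):
    """Return one (train_indices, test_indices) pair per round.

    The test sets of different rounds do not overlap; all remaining
    samples of a round form its training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    splits = []
    for k in range(n_rounds):
        test = indices[k * test_size:(k + 1) * test_size]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        splits.append((train, test))
    return splits


splits = disjoint_test_splits(9653)
for train, test in splits:
    print(len(train), len(test))  # 9153 500 on every round
```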

4.2 Experimental setup

Based on the conventional convolutional neural network, the multi-scale convolutional neural network based on three-grained face, and data augmentation, five models are run for TCM prescription generation. They are briefly described as follows:

Random forest (baseline):

A random forest [4] classifier is used to generate TCM prescriptions. The features are the face image matrices and the labels are the multi-label vectors representing the TCM prescriptions.

Conventional CNN:

A CNN is constructed as described in Section 3.2 and trained on face images and TCM prescriptions to obtain a model for generating TCM prescriptions. The experimental dataset used is Dorigin.

Conventional CNNaug:

The method is the same as conventional CNN, but the experimental dataset used is Daug.

Multi-scale CNN based on three-grained face:

A CNN is constructed as described in Section 3.3 and trained on images of different face regions and TCM prescriptions to obtain a model for generating TCM prescriptions. The experimental dataset used is Dorigin.

Multi-scale CNNaug based on three-grained face:

The method is the same as multi-scale CNN based on three-grained face, but the experimental dataset used is Daug.

The structure and some parameters of the conventional CNN and the multi-scale CNN based on three-grained face have been described in Sections 3.2 and 3.3. More specific parameters are shown in Table 4. The optimization algorithm is stochastic gradient descent (SGD) with a learning rate decay of 1e-6 and a momentum of 0.9.

Table 4 Parameters of neural networks

4.3 Evaluation metrics

To measure the similarity between the generated TCM prescription and the actual TCM prescription, we use precision, recall, and f1_score, defined in the following formulas. \(n\_true_{i}\) denotes the number of Chinese herbal medicines appearing in both the i-th generated prescription and the i-th real prescription. \(n\_predict_{i}\) denotes the number of Chinese herbal medicines appearing in the i-th generated prescription. \(n\_real_{i}\) denotes the number of Chinese herbal medicines appearing in the i-th real prescription. \(precision_{i}\) measures how precise the generated prescription is, that is, the fraction of its Chinese herbal medicines that also appear in the real prescription. \(recall_{i}\) measures how complete the generated prescription is, that is, the fraction of the Chinese herbal medicines in the real prescription that it contains. \(f1\_score_{i}\) is the harmonic mean of \(precision_{i}\) and \(recall_{i}\), balancing these two indicators.

$$ precision_{i} = \frac{n\_true_{i}}{n\_predict_{i}} $$
(24)
$$ recall_{i} = \frac{n\_true_{i}}{n\_real_{i}} $$
(25)
$$ f1\_score_{i} = \frac{2*{precision}_{i}*recall_{i}}{precision_{i}+recall_{i}} $$
(26)

The indicators are calculated for each sample generated by the model, and then averaged to obtain the indicators used to evaluate the quality of the model:

$$ precision=\frac{1}{m}\sum \limits_{i=1}^{m}{precision_{i}}, $$
(27)
$$ recall=\frac{1}{m}\sum \limits_{i=1}^{m}{recall_{i}}, $$
(28)
$$ f1\_score=\frac{1}{m}\sum \limits_{i=1}^{m}{f1\_score_{i}}, $$
(29)

where m is the number of samples being evaluated. The models are evaluated on the test set, so m = 500.

For each example xi, f1_scorei is the harmonic mean of precisioni and recalli. But note that precision, recall, f1_score are averages, so f1_score is not the harmonic mean of precision and recall.
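The following sketch computes Eqs. 24 to 29 on two toy samples and shows why the averaged f1_score differs from the harmonic mean of the averaged precision and recall. The function names are hypothetical, and returning 0 when a denominator is zero is an assumption, since the handling of empty prescriptions is not described in the text.

```python
def sample_scores(pred, real):
    """Precision, recall and F1 for one generated prescription (Eqs. 24-26)."""
    n_true = len(set(pred) & set(real))
    precision = n_true / len(pred) if pred else 0.0  # assumed convention for empty sets
    recall = n_true / len(real) if real else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def averaged_scores(preds, reals):
    """Average the per-sample indicators over all samples (Eqs. 27-29)."""
    scores = [sample_scores(p, r) for p, r in zip(preds, reals)]
    m = len(scores)
    return tuple(sum(s[k] for s in scores) / m for k in range(3))


preds = [{"a"}, {"a", "b", "c", "d"}]
reals = [{"a", "b"}, {"a"}]
precision, recall, f1 = averaged_scores(preds, reals)
harmonic_of_means = 2 * precision * recall / (precision + recall)

print(precision, recall)            # 0.625 0.75
print(round(f1, 4))                 # 0.5333
print(round(harmonic_of_means, 4))  # 0.6818
```

Here the first sample has precision 1 and recall 0.5, and the second has precision 0.25 and recall 1, so their individual F1 values (about 0.667 and 0.4) average to a value well below the harmonic mean of the two averages.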

4.4 Results and analysis

4.4.1 Training process

To prevent overfitting, the models use data augmentation and dropout. In addition, an early stopping strategy (“EarlyStopping”) is used. During training, 10% of the training data is set aside as a validation set that does not participate in training. The loss of the model on the validation set is monitored during training. Once the validation loss stops decreasing, training continues for a fixed number of further iterations (10 in our experiments) and is then stopped. This prevents the model from overfitting the training set and yields better predictions on the test set.
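The stopping rule can be sketched as a function of the recorded validation losses. This is a minimal illustration with a hypothetical function name; it treats any strict decrease as an improvement, which is an assumption about a detail the text does not specify.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training stops.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs; otherwise it runs through all epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)


# Toy curve: the loss improves for 5 epochs, then rises steadily.
losses = [1.0, 0.9, 0.8, 0.7, 0.6] + [0.61 + 0.01 * i for i in range(20)]
print(early_stopping_epoch(losses))  # 15
```

In the toy curve the best loss occurs at epoch 5, so with a patience of 10 training stops at epoch 15 even though more epochs were available.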

Figs. 5 and 6 show how the training and validation losses change during training for one of the runs of the 5-fold cross-validation. Although the maximum number of epochs is set to 300 to ensure a sufficient number of iterations, training usually stops after about 30-70 iterations, because later iterations would overfit the training set. With data augmentation, the relative gap between the training loss and the validation loss is smaller for the multi-scale CNN based on three-grained face than for the conventional CNN, which indicates that the multi-scale CNN based on three-grained face generalizes better.

Fig. 5 Learning curve of conventional CNN and multi-scale CNN based on three-grained face prescriptions

Fig. 6 Learning curve of conventional CNN and multi-scale CNN based on three-grained face (with data augmentation)

4.4.2 Influence of threshold parameter

The output layer of the neural network has 559 neurons, one for each of the 559 Chinese herbal medicines, so 559 probability values are obtained. The final prescription is predicted from these values using a threshold t: a Chinese herbal medicine is prescribed if its probability is greater than t.
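The decision rule can be sketched in a few lines. The four-herb dictionary and the probability values are hypothetical; the default t = 0.25 is the threshold finally chosen in this section.

```python
def decode_prescription(probs, dictionary, t=0.25):
    """Return the herbs whose predicted probability is greater than t."""
    return [herb for herb, p in zip(dictionary, probs) if p > t]


# Toy dictionary and network output for illustration only.
herbs = ["Radix Glycyrrhizae", "Poria Cocos", "Perilla Stem", "Curcuma Zedoary"]
probabilities = [0.80, 0.30, 0.25, 0.05]

prescription = decode_prescription(probabilities, herbs)
print(prescription)  # ['Radix Glycyrrhizae', 'Poria Cocos']
```

A probability exactly equal to t does not pass the test, because the rule requires the probability to be strictly greater than t.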

A common choice for the threshold is 0.5. Furthermore, when all the unseen instances in the test set are available, the threshold can be set to minimize the difference in a certain multi-label indicator between the training set and the test set [44]. As shown in Figs. 7 and 8, different thresholds lead to different evaluation results (the results in the figures are averages over the 5-fold cross-validation). With a larger threshold, precision is higher because the model includes only the medicines it is most confident about and therefore prescribes fewer medicines. With a smaller threshold, recall is higher because the generated prescription tries to be as complete as possible, at the expense of some precision. The f1_score of each sample is the harmonic mean of its precision and recall, which balances accuracy and completeness. Note that the f1_score reported in the experimental results is not the harmonic mean of the reported precision and recall, because all three are averages over samples. We choose 0.25 as the final threshold because the f1_score is relatively high at this value and the difference between precision and recall is small, which ensures that precision and recall are both high.
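The trade-off described above can be reproduced on a single toy sample. The probabilities, the set of truly prescribed herbs, and the swept thresholds are all hypothetical values chosen for illustration; they are not taken from the experiments.

```python
def precision_recall(probs, real, t):
    """Precision and recall of the herbs selected with threshold t for one sample."""
    pred = {i for i, p in enumerate(probs) if p > t}
    n_true = len(pred & real)
    precision = n_true / len(pred) if pred else 0.0  # assumed convention for empty sets
    recall = n_true / len(real) if real else 0.0
    return precision, recall


probs = [0.9, 0.6, 0.4, 0.3, 0.1]  # toy network output for five herbs
real = {0, 1, 3}                   # indices of the herbs actually prescribed

sweep = {t: precision_recall(probs, real, t) for t in (0.05, 0.25, 0.5, 0.8)}
for t, (p, r) in sweep.items():
    print(t, round(p, 3), round(r, 3))
# 0.05 0.6 1.0
# 0.25 0.75 1.0
# 0.5 1.0 0.667
# 0.8 1.0 0.333
```

As the threshold rises, fewer herbs are selected, so precision does not decrease while recall does not increase in this example.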

Fig. 7 Influence of threshold on conventional CNN and multi-scale CNN based on three-grained face

Fig. 8 Influence of threshold on conventional CNN and multi-scale CNN based on three-grained face (with augmentation)

4.4.3 Performance comparison

The experimental results of the five models are shown in Table 5. To make the results more reliable, each reported value is the average of the five results obtained by 5-fold cross-validation. The values after “±” indicate the standard deviation of the five results.

Table 5 Experimental performances of different models

Random forest is an ensemble learning technique that usually performs well. However, the experimental results show that the other four models outperform the random forest baseline, indicating that the convolutional neural network is better suited to this task than random forest. Neural networks can extract and represent useful features from large and complex data. This task requires extracting and representing a large number of raw image features, so building the model with a convolutional neural network designed for image processing is the better choice.

Conventional CNNaug performs better than conventional CNN, and multi-scale CNNaug based on three-grained face performs better than multi-scale CNN based on three-grained face. The models perform better with data augmentation because it increases the size and diversity of the data, which allows the convolutional neural network to learn more during training and reduces overfitting.

Multi-scale CNN based on three-grained face performs better than conventional CNN, and multi-scale CNNaug based on three-grained face performs better than conventional CNNaug. A reasonable explanation is that the multi-scale CNN based on three-grained face extracts features at different granularities (organs, local regions, and the entire face), and can therefore extract and use local and global features more effectively.

As shown in Table 8, three samples were taken to show the actual predicted results. For each example, the patient’s face image and the corresponding prescriptions are shown. Chinese herbal medicines in red bold type appear in both the real prescription and the predicted prescription, and reflect the precision of the model: the more medicines in red bold, the higher the precision. Chinese herbal medicines in cyan bold type appear in the real prescription but not in the predicted prescription, and reflect the recall of the model: the fewer medicines in cyan bold, the higher the recall. The predicted results share certain similarities with the actual prescriptions, which shows that the model has indeed learned something. Among the four models, the results of multi-scale CNNaug based on three-grained face (we omit “multi-scale” in Table 8 only for neat alignment) are the most precise and complete. For common Chinese herbal medicines, such as Radix Glycyrrhizae and Poria Cocos, the model predicts more accurately. For some unusual Chinese herbal medicines, such as Perilla Stem and Curcuma Zedoary, the model cannot predict accurately. A reasonable explanation is that common Chinese herbal medicines appear frequently in the training samples, so the model can learn useful distinguishing features from a large amount of training data. Unusual Chinese herbal medicines, in contrast, are used only occasionally and by few patients. With so little data, the model has difficulty learning and cannot find distinguishing features.

4.4.4 Effect of different image size

The input size of conventional CNN is 224x224. The “multi-scale” in multi-scale CNN based on three-grained face refers to the three input scales 56x56, 112x112, and 224x224, but the actual input is still a single 224x224 face image. During preprocessing, this face image is segmented into 112x112 local region block images and 56x56 organ block images. We therefore say that the input size of the CNNs used above is 224x224. In practice, however, the size of a patient’s face image is not fixed.

To verify that the CNN models can adapt to images of different sizes, we retrain the networks with different input sizes on Dorgin and report the experimental results in Table 6. The evaluation results (precision, recall, f1-score) are calculated by 5-fold cross validation. For multi-scale CNN based on three-grained face (we omit “multi-scale” in Table 6 only for neat alignment), the image size refers to the size of the face image; the local region block images are half the size of the face image, and the organ block images are half the size of the local region block images. The “average” in Table 6 is the average of the results over the different image sizes (32, 56, 84, 112, 168, 224).
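The size relation between the three granularities can be written down as a short sketch. The function name `branch_sizes` and the use of integer division are assumptions made for illustration; all six face sizes used here are divisible by 4, so no rounding occurs.

```python
def branch_sizes(face_size):
    """Side lengths of the three inputs for a given face image size.

    The local region block is half the face size and the organ block is
    half the local region size, as described in the text.
    """
    region = face_size // 2
    organ = region // 2
    return {"face": face_size, "region": region, "organ": organ}

# Face image sizes evaluated in Table 6.
IMAGE_SIZES = (32, 56, 84, 112, 168, 224)

for size in IMAGE_SIZES:
    s = branch_sizes(size)
    print(f"face {s['face']}x{s['face']}, "
          f"region {s['region']}x{s['region']}, "
          f"organ {s['organ']}x{s['organ']}")
```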

Table 6 Experimental performances of different image sizes

Table 6 shows that the models obtain similar results for different input image sizes, indicating that the models are robust. On average, multi-scale CNN based on three-grained face performs better than conventional CNN. For smaller image sizes (32, 56, 84), multi-scale CNN based on three-grained face is slightly worse than conventional CNN, but for larger image sizes (112, 168, 224) it is higher than conventional CNN on all three evaluation indicators. We conjecture that this is because multi-scale CNN based on three-grained face needs to capture features at three granularities, and fine-grained features are difficult to mine when the image is small. On average, however, multi-scale CNN based on three-grained face is still superior, and in practice patient images are unlikely to be very small.

In addition, Table 6 shows that for conventional CNN the performance on 224x224 images is worse than on 168x168 images. This is mainly because our dataset is not very large and a 224x224 image contains many more features than a 168x168 image. When the ratio of the number of features to the number of samples is relatively large, the model overfits more easily. By comparison, although our multi-scale CNN based on three-grained face is more complicated than conventional CNN, it alleviates the overfitting: the performance gap between 224x224 and 168x168 images becomes smaller, and the recall on 224x224 images is even slightly higher than on 168x168 images. This illustrates the advantages of our model from another perspective. We speculate that this is due to the nature of the model. Its design is motivated by the practical experience of Traditional Chinese Medicine, namely mining the facial features of the patient at three granularities of the face, so that more comprehensive and useful information can be obtained.

4.4.5 Ablation study

To analyze the contribution of each of the three branches, we performed an ablation study. We ran experiments with CNN based on three-grained face, CNN based on double-grained face (one block removed), and CNN based on single-grained face (two blocks removed). The experimental results are shown in Table 7.
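The ablation settings can be enumerated with a short sketch. The branch names and the function `ablation_configs` are hypothetical, and the sketch lists every non-empty combination of blocks, which is not necessarily the exact set of rows reported in Table 7.

```python
from itertools import combinations

# Hypothetical names for the three branches of the model.
BRANCHES = ("face", "region", "organ")

def ablation_configs(branches=BRANCHES):
    """List every non-empty subset of branches, largest first.

    Three kept blocks correspond to the full three-grained model, two to
    the double-grained variants and one to the single-grained variants.
    """
    configs = []
    for k in range(len(branches), 0, -1):
        for kept in combinations(branches, k):
            removed = tuple(b for b in branches if b not in kept)
            configs.append({"kept": kept, "removed": removed})
    return configs

for cfg in ablation_configs():
    print(f"keep {cfg['kept']}, remove {cfg['removed']}")
```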

Table 7 Ablation study on removing different blocks

The best model is CNN based on three-grained face. If either the organ block or the region block is removed, performance decreases slightly. If both the organ block and the region block are removed, performance decreases further. This shows that the organ block and the region block are both needed for our task: the model learns more detailed features from them, which help prediction.

In addition, removing the face block greatly reduces the performance of the model, which means that the face block is the most important block. This is in line with the doctor’s intuition and with our motivation for the model design: when examining a patient, the doctor first looks at the whole face and forms an overall impression from it, and then observes details of the organs and regions. The face block is therefore the most important source of features, while the organ block and the region block are complementary and supply detailed features.

5 Discussion

Our results show that convolutional neural networks are capable of mining prescription information from a patient’s face image to generate a prescription, and that the multi-scale convolutional neural network based on three-grained face generates prescriptions that are closer to the real prescriptions, as shown by the actual prediction results in Table 8 and the evaluation results in Tables 5 and 6. With such a prescription generation system, doctors can obtain a recommended prescription, modify it, and then apply it to the actual treatment.

Table 8 Real predicted results of different models

Generating TCM prescriptions from face images using deep learning provides a candidate result. Although the predicted result is not a definitive conclusion, it offers an option and an opinion for reference, which gives the doctor a concrete starting point. In practice, different TCM doctors do not always give the same prescription to a patient, and there may be multiple suitable prescriptions for the same patient. It is possible that system-generated prescriptions can inspire doctors to develop new useful prescriptions.

6 Conclusion

In this paper, we propose using convolutional neural networks to generate TCM prescriptions from a patient’s face image. To extract and use the features of the patient’s face more fully and effectively, we propose a multi-scale convolutional neural network based on three-grained face and compare it with the conventional convolutional neural network. In addition, we use data augmentation to increase the size and diversity of the data and thereby improve performance.

To the best of our knowledge, little work has been done on generating TCM prescriptions. Chinese herbal medicine is a medical asset accumulated through the long-term practice of the ancient Chinese people, and it is extremely rich and precious. It is of great significance to fully mine and learn from patients’ prescription data using deep learning techniques.

In fact, when treating patients, TCM doctors need to integrate multiple features (face, tongue, pulse, voice, symptoms) with their own experience to reach a solution, which overcomes the limitations of using face images alone. Because of the limited data, in this preliminary work we only consider using the patient’s face image to generate TCM prescriptions. In future work, we plan to collect more patient data of more types.