1 Introduction

Deep neural networks (DNNs) have driven substantial change and progress in many academic and industrial fields. In particular, convolutional neural networks (CNNs) [1] have advanced rapidly in academia thanks to the ImageNet dataset and to networks such as GoogleNet [2], ResNet [3], and VGGNet [4]. Recently, to provide intelligent services built on these developments, CNNs are being embedded in various mobile devices such as smartphones, tablet personal computers, drones, and embedded boards. CNNs perform well on image classification tasks, such as detecting dangerous situations by predicting people’s behavior from videos captured by mobile device cameras, or providing information about pictures taken on a mobile device. However, CNNs generally have a complex network structure that consists of heavy layers, such as convolutional, pooling, ReLU activation, and fully-connected layers, with a large number of parameters. Because of this complexity and computational load, CNNs are usually trained and run on large-scale cloud networks. However, training and running on the cloud face two obstacles: 1) personal information becomes vulnerable, because information is easily exchanged and may be leaked through hacking; 2) training and running can be unstable, because cloud communication depends heavily on the state of the mobile network. Thus, some researchers [5,6,7,8,9,10,11,12, 18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] have studied how to compress the structure of CNNs and how to train them efficiently and lightly, so that the network (from here on, “network” is used interchangeably with “CNN or DNN”) can be trained and run directly on mobile devices. The representative methods for compressing the network structure are weight matrix reconstruction [5,6,7,8], quantization [9, 10], and pruning [11, 12].
Also, a popular method for efficient and light learning is transfer learning, which trains a network by utilizing the information of an existing network. It reduces the learning cost when new classes or patterns appear that were unknown while the existing network was trained.

A variety of approaches to transfer learning, such as incremental-transfer [18], parameter-transfer [19,20,21,22,23], and feature-transfer [24,25,26,27,28,29,30,31,32,33], have been developed; they transfer additional network structures, feature values, classifier weights, etc. for new classes. The incremental-transfer method (transferring an additional network structure) reconfigures an existing network by connecting an additional network for the new classes in parallel with the existing network. However, this method has the disadvantage of a structural overhead for designing the additional network. The parameter-transfer method (transferring characteristic values) constructs a network by utilizing the common features extracted from the feature filters of an existing network. The feature-transfer method (transferring the weights) trains an existing network using only the common weights obtained by analyzing the differences between weights. The analysis is based on the distribution of the input data of the existing classes and the new class. However, these methods incur unnecessary computation overhead for analyzing the weight differences.

To solve these problems of transfer learning, we propose an on-device partial learning technique that requires no additional network structure and reduces the unnecessary computation overhead. First, we select the weights of the existing network that need to be trained when a new class appears. The weights are analyzed according to the importance of their contribution to the outputs; a weight is important if it contributes greatly to the outputs for the existing classes. The information of the weights is measured by modifying the entropy concept, which is widely used in information theory [34,35,36,37,38,39]. Entropy is defined as the expected amount of information, or the potential amount of information, and is therefore calculated from the probability of an event occurring. However, the entropy formula cannot be applied directly to networks, because the magnitude of a weight plays a critical role in whether the weight is important to the output. Thus, we combine the entropy concept with the magnitude of the weights to develop a new importance metric, which we name the qualitative entropy. Based on the qualitative entropy, we partially train the existing network by training only the weights that are of little importance to the existing classes. In the experimental section, we demonstrate and analyze the qualitative entropy technique for a CNN image classifier. We also show that our method outperforms conventional transfer learning in learning time and performance on two datasets, the Modified National Institute of Standards and Technology (MNIST) dataset [40] and the ImageNet dataset.

The remainder of this paper is organized as follows. Section 2 introduces various transfer learning techniques that are closely related to our work. We also describe various applications of entropy to CNNs to help in understanding the derivation of the qualitative entropy metric. In Section 3, we derive the qualitative entropy metric step by step. Section 4 demonstrates and analyzes our method and compares its classification performance with that of the transfer learning method. Section 5 concludes this paper.

2 Related Works

2.1 Transfer Learning for CNN

In order to reduce the cost of rebuilding an existing CNN as new classes occur, many researchers [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] have worked on developing transfer learning techniques. Transfer learning makes it easier to learn new classes by using the knowledge of an existing network. There are three approaches to transfer learning: incremental-transfer, parameter-transfer, and feature-transfer.

Incremental-transfer builds an architecture consisting of a basic inference network and a small incremental network. The basic inference network is fixed after being trained on a large dataset. The small incremental network is trained on a customized dataset that users provide. The trained incremental network is connected in parallel with the basic inference network, so that the network can classify the new classes without hurting the existing network. This approach increased the classification accuracy from 76.3% to 93.2% in experiments using the 19 hand-printed character image datasets provided by the National Institute of Standards and Technology [41] and the users’ customized dataset. However, the approach requires computation overhead to train the new networks connected to the existing networks. Parameter-transfer extracts the common features of the weight information in a classifier by comparing and analyzing the principal components of the weights. To transfer the common features of the network to the new class, Killian et al. [19] used a Hidden Parameter Markov Decision Process (HiP-MDP). Shin et al. [20] used unsupervised CNN pre-training with supervised fine-tuning; the features found are used to train the network for new classes. Fernandes et al. [21] used a method that encourages the source and target to share the same coefficient signs. However, it is difficult to achieve real benefits with these techniques, because they consume computational resources through whole-matrix operations. To solve this problem, Long et al. [25] proposed a feature-transfer technique that uses domain adaptation in neural networks, which can jointly learn adaptive classifiers and features from labeled data in the source domain and unlabeled data in the target domain. Once the adaptation is done, the network is elaborately trained using a residual transfer network [42, 43].
They used the 20 Newsgroups dataset [44] to model the approximate network, and then trained the network using SRAA 2 [45] and Reuters-21578 [46]. Tjandra et al. [27] proposed feature transfer learning to assist the training process of an end-to-end network. The method first removes the data of existing classes that are not categorized in the existing network, and then trains the network by giving more benefit to the weights of the new class. They evaluated the accuracy of the method on part-of-speech tagging [47], named entity recognition [48], relation extraction [49], and semantic role labeling [50, 51]. They also demonstrated that integrating and leveraging information from the new class is more useful for improving performance than excluding misleading training cases from the existing classes. Even though the feature transfer approaches show good accuracy, they incur unnecessary computational costs by fine-tuning all weights of the existing networks. Thus, these methods are not applicable to learning directly on mobile devices, as mentioned in the previous section.

To solve this problem, we train only part of the weights in the existing network, without adding a structure or training all weights for the new classes. To do so, we apply the concept of entropy to select the weights for partial learning. The applications of entropy to deep learning systems are reviewed in the following section.

2.2 Applications of Entropy for CNN

Entropy is defined as the expected amount of information under uncertainty, and it has been used as a criterion for detecting and selecting important or unimportant information in many fields, such as data mining, pruning, weight quantization, and sampling in CNNs. Bereziński et al. [52] proposed an entropy-based approach for detecting malware from abnormal patterns in a computer network. They use the entropy of probes passing bi-directionally across the router to estimate traffic characteristics such as flow duration, packets, and in(out)-degree. Abnormal probes are detected by comparing the amount of information of each characteristic with the estimated entropy. Abnormal phenomena in a computer network are usually very small, and the abnormal information in network traffic, represented by packet or byte counts, is hidden. In this situation, the entropy concept works well for detecting abnormal patterns, because it reveals the potential amount of information hidden between packets. Han et al. [53] proposed an entropy-based filter pruning method to accelerate and compress existing CNNs. The importance of each filter in a convolution layer is evaluated by its entropy: a filter with low entropy is considered to hold less information and is therefore considered less important. They remove the filters with smaller entropy values, taking into account the amount of information transferred from the filter to the next layer. The entropy technique showed good pruning performance on the VGGNet and ResNet structures trained with the ImageNet [54] dataset. Park et al. [55] proposed an entropy-based quantization technique to reduce the inference cost of neural networks. They cluster the weights according to their importance, using the entropy concept to improve quantization quality.
The weights are grouped so that the entropy of the weights in each cluster is uniform, unlike random binarization techniques such as binary quantization [13,14,15]. Using entropy for quantization can improve the compression performance, because the hidden information in the weights is considered during clustering. The entropy technique provided good compression performance on the AlexNet [16], GoogleNet, and ResNet structures with the ImageNet dataset. Zilly et al. [56] proposed an entropy sampling technique to reduce the computational complexity of retinal image segmentation using a CNN structure based on ensemble learning. They estimate the probability of each pixel from a 256-bin histogram of the retinal image. The entropy of the pixel is calculated from the estimated probability and is used to find the important pixels, whose entropy is larger than the average entropy of all pixels. They showed the superior performance of their method on a CNN trained with the DRISHTI-GS [17] dataset.

The methods above show that entropy is a useful approach for determining the importance of information in many fields. However, entropy cannot be applied directly across fields, because the information depends on the network structure or the dataset, so the entropy must be modified. In the following section, we show how entropy is modified to find the weights suitable for partial learning.

3 Qualitative Entropy Based Partial Learning

Figure 1 shows the overall process of the qualitative entropy-based partial learning method, which selects the subset of weights to train so that the existing network can learn the new classes. Entropy is employed to calculate the expected amount of information that the weights contribute to generating the outputs, as shown in Equation (1). If the amount of information of a weight is lower than the entropy of the set of weights, the weight is considered less important.

$$ E\left[{N}_r^L\right]=\sum \limits_k\mathit{\Pr}\left({\omega}_{rk}^{L-1}\right)\cdotp \mathrm{A}\left[{\omega}_{rk}^{L-1}\right] $$
(1)

where \( {N}_r^L \) is the rth node in the Lth layer, \( E\left[{N}_r^L\right] \) is the entropy of \( {N}_r^L \), and \( {\omega}_{rk}^L \) is the weight connecting the kth node in the Lth layer to the rth node of the next layer. \( \mathit{\Pr}\left({\omega}_{rk}^L\right) \) is the probability of \( {\omega}_{rk}^L \) among the weights connected to that rth node, and \( \mathrm{A}\left[{\omega}_{rk}^L\right] \) is the amount of information that \( {\omega}_{rk}^L \) holds. Each entropy is the expected value of the amount of information held by the weights \( {\omega}_{rk}^{L-1} \) connected to \( {N}_r^L \). The amount of information of each \( {\omega}_{rk}^{L-1} \) can be evaluated by Equation (2).

$$ \mathrm{A}\left[{\omega}_{rk}^L\right]=-\log \mathit{\Pr}\left({\omega}_{rk}^L\right) $$
(2)
Fig. 1 The overall process schema for partial learning based on qualitative entropy.

If \( \mathrm{A}\left[{\omega}_{rk}^{L-1}\right] \) is smaller than \( E\left[{N}_r^L\right] \), less information is transmitted to \( {N}_r^L \), and \( {\omega}_{rk}^{L-1} \) is interpreted as unimportant for generating outputs during learning. However, as Equations (1) and (2) show, the entropy is computed from the probability distribution of the weights only. If the probability distributions of the weights are similar, the weights have similar entropy. Even when weights have similar entropy, their influence on the output node can differ, because the weights are trained in a black-box manner [57].

For example, as shown in Fig. 2, four nodes (i1, i2, i3, i4) each have the value 1 and are fully connected to the nodes of L2. If the weights connected to n1 of L2 are 1, 1, 1, and 2, the entropy of the node is 0.431. If the weights connected to n2 of L2 are 1, 1, 1, and 3, the entropy is also 0.431. The entropy values of n1 and n2 are the same because the entropy is calculated from the probability distribution only. Even though n1 and n2 have the same entropy, n2 has a greater impact on o1 and o2 in L3. To solve this problem, we consider the quality of the weights, so that weights carrying much information are not mistaken as insignificant because of their high probability.
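The arithmetic of this example can be checked with a short script. The sketch below is only an illustration under two assumptions that are not stated in the text: the probability of a weight is taken as the relative frequency of its value among the weights feeding the node, and the logarithm in Equation (2) is taken to base 10. Both choices were picked because they reproduce the quoted value; the function names are hypothetical.

```python
import math
from collections import Counter


def weight_probabilities(weights):
    """Relative frequency of each weight value among a node's incoming weights
    (assumed reading of Pr in Equation (1))."""
    counts = Counter(weights)
    n = len(weights)
    return [counts[w] / n for w in weights]


def information(p):
    """Equation (2): amount of information, with a base-10 logarithm (assumption)."""
    return -math.log10(p)


def node_entropy(weights):
    """Equation (1): sum over incoming weights of Pr(w) * A[w]."""
    return sum(p * information(p) for p in weight_probabilities(weights))


n1 = [1, 1, 1, 2]  # weights connected to n1 in the example
n2 = [1, 1, 1, 3]  # weights connected to n2 in the example

# Both nodes give approximately 0.4316, because only the value frequencies matter
print(round(node_entropy(n1), 4), round(node_entropy(n2), 4))
```

Under these assumptions both nodes evaluate to about 0.4316, which agrees with the 0.431 quoted above up to rounding, and it makes visible that replacing the weight 2 by 3 does not change the result.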

Fig. 2 An example for calculating the qualitative entropy.

The quality of a weight is normalized into [0, 1] using the sigmoid function, which avoids over-emphasizing the contribution of weights that lie outside a certain range of the weight distribution. In this paper, the normalized quality of a weight, shown in Equation (3), is called the qualitative characteristic for convenience.

$$ \mathrm{Q}\left[{\omega}_{rk}^L\right]=\frac{1}{1+{e}^{-{\omega}_{rk}^L}} $$
(3)

where \( \mathrm{Q}\left[{\omega}_{rk}^L\right] \) is the quality of \( {\omega}_{rk}^L \). The qualitative characteristic depends on the degree of importance of each weight connected to a node. The qualitative information of each weight is obtained by plugging Equations (2) and (3) into Equation (4).

$$ \mathrm{QA}\left[{\omega}_{rk}^L\right]=\mathrm{Q}\left[{\omega}_{rk}^L\right]\cdotp \mathrm{A}\left[{\omega}_{rk}^L\right] $$
(4)

where \( \mathrm{QA}\left[{\omega}_{rk}^L\right] \) is the qualitative amount of information of \( {\omega}_{rk}^L \). Using Equation (4), the qualitative entropy can be derived as Equation (5).

$$ {\displaystyle \begin{array}{c}\mathrm{QE}\left[{N}_r^L\right]=-\sum \limits_k\mathit{\Pr}\left({\omega}_{rk}^{L-1}\right)\cdotp \log \mathit{\Pr}\left({\omega}_{rk}^{L-1}\right)\cdotp \mathrm{A}\left[{\omega}_{rk}^{L-1}\right]\\ {}=-\sum \limits_k\mathit{\Pr}\left({\omega}_{rk}^{L-1}\right)\cdotp \mathrm{QA}\left[{\omega}_{rk}^{L-1}\right]\end{array}} $$
(5)

where \( \mathrm{QE}\left[{N}_r^L\right] \) is the qualitative entropy of \( {N}_r^L \). For most CNNs, the probability distribution of the weights converges to a bell-shaped Gaussian distribution by the central limit theorem [58], because a CNN has a huge number of weights, at least 5000, in the fully-connected layer. Therefore, Equation (5) can be simplified as Equation (6).

$$ {\displaystyle \begin{array}{c}\mathrm{QE}\left[{N}_r^L\right]=-\sum \limits_k\left(\frac{1}{{\sigma}_{\omega}^{L-1}\sqrt{2\pi }}{e}^{-\frac{{\left({\omega}_{rk}^{L-1}-{\mu}_{\omega}^{L-1}\right)}^2}{2{\left({\sigma}_{\omega}^{L-1}\right)}^2}}\right)\log \left(\frac{1}{{\sigma}_{\omega}^{L-1}\sqrt{2\pi }}{e}^{-\frac{{\left({\omega}_{rk}^{L-1}-{\mu}_{\omega}^{L-1}\right)}^2}{2{\left({\sigma}_{\omega}^{L-1}\right)}^2}}\right)\cdotp \mathrm{A}\left[{\omega}_{rk}^{L-1}\right]\\ {}=\frac{1+\ln \left(2\pi {\left({\sigma}_{\omega}^{L-1}\right)}^2\right)}{2}\cdotp \sum \limits_k\mathrm{A}\left[{\omega}_{rk}^{L-1}\right]\\ {}=\ln \left({\sigma}_{\omega}^{L-1}\sqrt{2\pi e}\right)\cdotp \sum \limits_k\mathrm{A}\left[{\omega}_{rk}^{L-1}\right]\end{array}} $$
(6)

where \( {\sigma}_{\omega}^L \) is the standard deviation of the weights in the Lth layer and \( {\mu}_{\omega}^L \) is the average of the weights in the Lth layer. If Equation (6) is applied to the example of Figure 2, the values for n1 and n2 become 0.116 and 0.212, respectively. From this result, we conclude that n2 has a greater impact than n1. As shown in Figure 2, if the values of nodes at a lower layer change, the subsequent nodes are affected. Consequently, by partially training only the weights that have little influence on the output, the classification performance for the existing classes can be preserved. The weights with a small amount of information can be selected by Equation (7), obtained from Equations (4) and (6).

$$ \mathrm{QA}\left[{\omega}_{rk}^L\right]\le \mathrm{QE}\left[{N}_r^L\right] $$
(7)
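Equations (3), (4), and (7) can be sketched together in a few lines. The code below is a minimal illustration, not the authors' implementation. As assumptions, it takes the probability of a weight as the relative frequency of its value among the node's incoming weights and uses a base-10 logarithm in Equation (2); the qualitative entropy is passed in by the caller as `qe`, and the value 0.3 used here is a hypothetical threshold chosen only to show the comparison.

```python
import math
from collections import Counter


def quality(w):
    """Equation (3): sigmoid of the weight value."""
    return 1.0 / (1.0 + math.exp(-w))


def qualitative_information(weights):
    """Equation (4): QA[w] = Q[w] * A[w], with A[w] = -log10 Pr(w) (base 10 assumed)."""
    counts = Counter(weights)
    n = len(weights)
    return [quality(w) * -math.log10(counts[w] / n) for w in weights]


def select_for_training(weights, qe):
    """Equation (7): mark the weights whose QA does not exceed the node's QE."""
    return [qa <= qe for qa in qualitative_information(weights)]


qa_n1 = qualitative_information([1, 1, 1, 2])  # weights of n1 in Fig. 2
qa_n2 = qualitative_information([1, 1, 1, 3])  # weights of n2 in Fig. 2

# Hypothetical threshold, supplied only to demonstrate the selection rule
mask_n1 = select_for_training([1, 1, 1, 2], qe=0.3)
print([round(q, 3) for q in qa_n1], mask_n1)
```

With these assumptions the three weights of value 1 are marked for training and the weight of value 2 is left frozen, since its qualitative amount of information is the largest of the four.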

However, partial learning does not apply the above method to the last layer. Since the last layer has no subsequent nodes to which information is transmitted, its weights influence the output nodes independently. Therefore, it is efficient to train only the weights associated with the new classes. Returning to the example of Figure 2, in node n1, the qualitative amount of information of \( {\omega}_{11} \) obtained from Equation (4) is 0.091, and the qualitative amount of information of \( {\omega}_{14} \) is 0.530. The qualitative entropy of n1 is 1.789. In node n2, the qualitative amount of information of \( {\omega}_{21} \) is 0.091, and the qualitative amount of information of \( {\omega}_{24} \) is 0.573. The qualitative entropy of n2 is 4.011. Because the qualitative entropy takes both qualitative and probabilistic characteristics into consideration, we can identify the weights that carry little information for generating outputs and train only them. Our algorithm is summarized below.

figure c

4 Experiment

We analyze the mechanism of our method and show its performance using two networks, LeNet-5 [59] and AlexNet. LeNet-5 consists of two convolutional layers and two fully-connected layers, which have 500, 25,000, 400,000, and 5000 weights, respectively. The network is trained with the MNIST dataset, which has 10 handwritten image classes with 6000 images each. AlexNet consists of five convolutional layers and three fully-connected layers, which have 3500, 307,000, 885,000, 663,000, 442,000, 38,000,000, 17,000,000, and 4,000,000 weights, respectively. The network is trained with the ImageNet dataset, which has 1 M images classified into 1000 categories. We modify the TensorFlow framework [60] by adding a mask that ignores the weights that are not selected. We use the NVIDIA Titan X Pascal graphics processing unit and the NVIDIA Jetson TX1.
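The masking idea can be illustrated without TensorFlow. The following minimal sketch, in plain Python, assumes one boolean mask entry per weight (True for weights selected for training) and applies a gradient step only to those weights. The function name `masked_sgd_step`, the learning rate, and the toy values are illustrative assumptions and are not taken from the modified framework used in the experiments.

```python
def masked_sgd_step(weights, grads, mask, lr=0.01):
    """Apply one gradient-descent step to the selected weights only.

    Weights whose mask entry is False are returned unchanged (frozen).
    """
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]


weights = [1.0, 1.0, 1.0, 2.0]     # toy incoming weights of one node
grads = [0.5, -0.2, 0.1, 0.4]      # toy gradients, hypothetical values
mask = [True, True, True, False]   # the last weight is not selected

updated = masked_sgd_step(weights, grads, mask, lr=0.1)
print(updated)
```

In this toy run the first three weights move against their gradients while the fourth keeps its value of 2.0, which is the behavior that preserves the weights considered important for the existing classes.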

4.1 Analysis of our Learning Method

Figure 3 shows the distribution of the weights connecting fully-connected layer1 to fully-connected layer2 in LeNet-5 and AlexNet, respectively. Since the weight distributions are bell-shaped, the qualitative entropy is calculated using Equation (6), which results in Figure 4.
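The Gaussian factor that appears in Equation (6) depends only on the standard deviation of the layer's weights. The sketch below computes that factor for a synthetic layer and checks numerically that the two closed forms in Equation (6) agree. The weights are random stand-ins drawn from a Gaussian with an arbitrary standard deviation, not values from LeNet-5 or AlexNet, and the function name is hypothetical.

```python
import math
import random
import statistics


def gaussian_entropy_factor(weights):
    """Factor ln(sigma * sqrt(2*pi*e)) from Equation (6), using the population std."""
    sigma = statistics.pstdev(weights)
    return math.log(sigma * math.sqrt(2 * math.pi * math.e))


random.seed(0)
# Synthetic stand-in for the weights of one fully-connected layer (assumption)
layer_weights = [random.gauss(0.0, 0.05) for _ in range(5000)]

factor = gaussian_entropy_factor(layer_weights)

# Alternative form from the second line of Equation (6): (1 + ln(2*pi*sigma^2)) / 2
sigma = statistics.pstdev(layer_weights)
alt = (1 + math.log(2 * math.pi * sigma ** 2)) / 2
print(factor, alt)
```

The two printed values coincide up to floating-point error, which is the identity used to pass from the second to the third line of Equation (6).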

Fig. 3 Distribution of weights of fully-connected layer1 of LeNet-5 and AlexNet.

Fig. 4 The entropy of a node and the amount of information of the weights connected to the node in (a) LeNet-5 and (b) AlexNet; the qualitative entropy of a node and the qualitative amount of information of the weights connected to the node in (c) LeNet-5 and (d) AlexNet.

Figure 4(a) and 4(b) show the entropy and the amount of information obtained by using Equations (1) and (2) at a node, in which the qualitative property is not applied yet. Figure 4(c) and 4(d) show the qualitative entropy and the qualitative amount of information obtained by using Equations (4) and (6) at a node.

As shown in Figure 4(a) and (b), more than 95% of the weights are selected, because the weights whose amount of information is smaller than the entropy are the majority. In this case, weights with high probability can be mistaken as unimportant to the outputs, because Equations (1) and (2) consider only the probability distribution, regardless of the magnitude of the weights. In other words, most of the weights are selected as not to be retrained for the new classes. The entropy in (a) and (b) is almost equal, at 1.45, because the probability distributions of both networks are almost the same Gaussian distribution, as shown in Figure 3. On the other hand, the qualitative entropy in Figure 4(c) and (d) is 0.015 and 0.09, respectively. Although the probability distributions in (c) and (d) are almost the same, unlike in (a) and (b), the weights are properly divided by the qualitative entropy, as seen in Table 1 and Table 2.

Table 1 The weights selected by Equation (7) for LeNet-5
Table 2 The weights selected by Equation (7) for AlexNet

Table 1 shows the number of weights selected for each layer in LeNet-5. The selected weights in fully-connected layer1 and fully-connected layer2 are 272,195 out of 400,000 and 3634 out of 5000, respectively. As a result, 68.10% of the weights in the fully-connected layers are selected. Table 2 shows the number of weights selected for each layer in AlexNet. The selected weights in fully-connected layer1, fully-connected layer2, and fully-connected layer3 are 23,943,187 out of 38,000,000, 11,053,214 out of 17,000,000, and 2,800,745 out of 4,000,000, respectively. As a result, 64.06% of the total weights of the fully-connected layers are selected. The larger the network, the smaller the ratio of selected weights, because the weight distribution gets closer to a bell-shaped Gaussian by the central limit theorem.
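The two percentages can be recomputed directly from the per-layer counts quoted above. The helper name below is hypothetical, and the script uses only the numbers stated in this paragraph.

```python
def selected_ratio(selected, totals):
    """Percentage of selected weights over all fully-connected layers."""
    return 100.0 * sum(selected) / sum(totals)


# LeNet-5: fully-connected layer1 and layer2
lenet = selected_ratio([272_195, 3_634], [400_000, 5_000])

# AlexNet: fully-connected layer1, layer2, and layer3
alexnet = selected_ratio(
    [23_943_187, 11_053_214, 2_800_745],
    [38_000_000, 17_000_000, 4_000_000],
)

print(f"LeNet-5: {lenet:.3f}%  AlexNet: {alexnet:.3f}%")
```

Both results agree with the reported 68.10% and 64.06% to within 0.01 percentage points.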

In the following section, we show the performance of partial learning, which trains the weights selected by the qualitative entropy.

4.2 Performance of Partial Learning

Table 3 shows the classification accuracy of partial learning as new classes are added, using MNIST. First, five classes are trained using LeNet-5 as the initial network structure.

Table 3 Performance of partial learning when adding one new class at a time for the MNIST dataset

The classification accuracy of the initial network is 99.20%. Starting from this structure, we analyze the performance by adding one class at a time. The accuracy is 99.06% (six classes in total), 98.25% (seven classes), 97.61% (eight classes), 93.19% (nine classes), and 89.74% (ten classes), respectively, as each new class is added. When one new class is added, there is almost no accuracy difference between the six-class network and the original network. When two and three new classes are added, the accuracy difference is less than 1% and 1.6%, respectively. From the fourth new class onward, the performance degrades because the network runs out of information resources: the network must adopt the new classes while also keeping its performance on the existing ones.

For the ImageNet dataset, as shown in Table 4, the accuracy is 62.8% when 500 classes are trained using AlexNet as the initial network structure. Again, we analyze the performance as classes are added, this time in steps of 50. From 500 up to 700 classes, the accuracies are 62.2% (550 classes in total), 61.5% (600 classes), 61.4% (650 classes), and 61.0% (700 classes), with almost no accuracy degradation. When 250 and 300 classes are added to the network, the accuracies are 59.5% (750 classes in total) and 56.3% (800 classes in total), an accuracy loss of 3.3% and 6.5%, respectively. From 350 new classes onward, the performance degrades gradually.

Table 4 Performance of partial learning when adding new classes for the ImageNet dataset

The experiments show that partial learning performs better than transfer learning when up to three new classes are added to the existing network. Our method is acceptable for partial retraining when the additional new classes amount to about 40% of the classes of the existing network.

Table 5 shows the training time required for partial learning and transfer learning on LeNet-5 using MNIST. The time gaps between our method and transfer learning are 207 s, 241 s, 274 s, 309 s, and 348 s, respectively, as one to five new classes are added to the existing network. The time gap increases as the network gets bigger, because partial learning increasingly reduces the computational complexity. For transfer learning, the training time increases as new classes are added, because the size of the network structure increases. As the network gets bigger, the number of weights increases exponentially. Table 6 shows the training time for AlexNet trained on the ImageNet dataset. As in Table 5, the time gap increases linearly, because the time required for learning is usually determined by the amount of data. This experiment takes more time than the previous one, because AlexNet is 140× bigger than LeNet-5 and the ImageNet dataset is more complex than MNIST. In larger networks, the difference can be even greater. Tables 5 and 6 show how many unnecessary computations traditional transfer learning performs, because it re-initializes all the weight information in the network and relearns it whenever a new class is added to the trained network. The tables also show that our partial learning technique reduces this computational overhead.

Table 5 Training time of partial learning and transfer learning for LeNet-5
Table 6 Training time of partial learning and transfer learning for AlexNet

4.3 Analyzing Embedded Memory

Table 7 shows the weights selected for partial learning on the embedded device and the corresponding FLOPs. For LeNet-5, 232,195 parameters are selected in Fully-connected1 and 3034 in Fully-connected2, for a total of 235,229 weights. As a result, the computational cost decreases by 1.7×, from 810 K FLOPs to 470.5 K FLOPs. For AlexNet, 17,523,673, 6,112,835, and 2,175,981 weights are selected from Fully-connected1, Fully-connected2, and Fully-connected3, respectively, and a total of 8,844,199 weights are partially learned. As a result, the computational cost of 117 M FLOPs is reduced to 51.2 M FLOPs, which improves the computational performance by 2.3×.
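The reduction factors follow from the FLOP counts quoted in this paragraph. The short check below uses only those four numbers; the helper name is hypothetical.

```python
def reduction_factor(flops_before, flops_after):
    """Ratio of the original FLOP count to the FLOP count after partial learning."""
    return flops_before / flops_after


lenet_factor = reduction_factor(810e3, 470.5e3)   # 810 K -> 470.5 K FLOPs
alexnet_factor = reduction_factor(117e6, 51.2e6)  # 117 M -> 51.2 M FLOPs

print(f"LeNet-5: {lenet_factor:.2f}x  AlexNet: {alexnet_factor:.2f}x")
```

Rounded to one decimal place, the ratios are 1.7× for LeNet-5 and 2.3× for AlexNet, as stated.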

Table 7 The number of weights selected for partial learning and FLOPs on the embedded device.

To test the performance of partial learning on embedded devices, the experiments of Table 5 and Table 6 are repeated on the Jetson TX1. Table 8 shows the resulting training times of LeNet-5 on the embedded device. When up to three classes are added, partial learning takes 1121 s, 1328 s, and 1404 s, respectively. Our method is faster than transfer learning by 5188 s, 6028 s, and 6847 s, respectively. Similarly, Table 9 shows the results of the Table 6 experiment performed on the embedded device. The time gaps between our method and transfer learning are 126,854 s and 140,057 s, respectively. Our experiments are limited to three cases for Table 8 and two cases for Table 9 because the Titan X Pascal GPU has an FP32 performance of 12.15 TFLOPs, while the Jetson TX1 has only 500 GFLOPs, a performance difference of about 25×. As the experiments show, the more complex the data and the network, the more effective the partial learning technique.

Table 8 Training time performance of partial learning and transfer learning for LeNet-5 on embedded device.
Table 9 Training time performance of partial learning and transfer learning for AlexNet on embedded device.

5 Conclusion

In this paper, we proposed an on-device qualitative entropy-based partial learning method for adapting an existing network to new classes. We derived the qualitative entropy metric, which selects the weights to be trained for new classes, by employing and modifying the entropy concept. To derive a mathematical formula for the metric, we assumed that the statistical distribution of the weights is Gaussian. The derivation considers not only the probabilistic information but also the qualitative characteristics of the weights. As shown in the experimental section, the existing network is partially trained based on the qualitative entropy metric, and the method outperforms existing transfer learning in terms of learning cost and embedded memory with no loss of accuracy.

Even though our method achieved good performance compared to the existing method, further work remains. First, we need to improve our technique by analyzing and optimizing the algorithm so that it can be used more practically on mobile devices. Second, we need to generalize our methodology by extending it to other types of learning networks, such as recurrent neural networks and generative adversarial networks.