1 Introduction

Visual cognitive development is vital for intelligent robots operating in unstructured and dynamic environments. However, because the number of available labeled samples is limited, an intelligent robot encounters many novel categories in realistic environments. An ideal robot can independently transfer knowledge and experience from the base classes to novel ones. This ability to transfer knowledge predictively cuts down the training cost and extends the range of cognition. Since the robot obtains only a few labeled samples of the novel categories, apart from some supplementary information, few-shot learning is often used.

The basic idea of few-shot learning is to exploit knowledge from the base classes to support the novel ones. Most existing methods fall into two groups: metric-learning-based methods [1, 8, 9] and meta-learning methods [10, 11, 12]. However, these methods mainly focus on inter-class dissimilarity, whereas the relationship among classes is also important for knowledge transfer. To use samples more efficiently, some researchers exploit a knowledge graph to build a relation map among categories [2, 3, 5]. Structural information is thus taken into account and the propagation becomes more reasonable. However, pure knowledge information exhibits a deviation between the semantic space and the visual space, and this one-sidedness of a pure knowledge graph leads to unsatisfactory accuracy. Besides, because the information is sparse, the portion of the knowledge graph needed to support the novel classes is huge. A more sophisticated inference mechanism is therefore needed.

Humans can rapidly adapt to an unfamiliar environment. This ability rests mainly on two kinds of information: common sense and visual experience. With common sense, they know the description of a novel class and its relationship with the base ones. With the experience of learning the base classes, they quickly grasp how to classify the novel ones. Together, these two kinds of information enable humans to develop their visual cognitive ability accurately and quickly.

Fig. 1. We jointly explore the source information from visual experience and common sense to predict the general classifier of novel categories.

Motivated by this, we propose a model called the knowledge-experience fusion graph network (KEFG) for few-shot learning. The goal of KEFG is to jointly explore common sense (knowledge) and visual experience to accomplish the cognitive self-development of robots. For convenience and in line with everyday usage, common sense is referred to below simply as knowledge, and visual experience as experience. Specifically, KEFG obtains experience from the originally trained recognition model, which is based on a Convolutional Neural Network (CNN). It recalls the visual representations of the base classes and generates predictive classifiers for the novel ones. KEFG further explores the prestored knowledge graph from WordNet [7] and builds a task-specific subgraph for efficiency. Using a GCN [4], each novel class generates its own classifier following the mechanism of related base classes on the fusion graph. To evaluate the effectiveness of KEFG, cross-task experiments are conducted to transfer the cognition ability from ImageNet 2012 to two typical datasets: the fine-grained, medium-size dataset Caltech-UCSD Birds 200 (CUB) [13] and the coarse-grained, small-size dataset miniImageNet [8]. The results show satisfactory performance.

The main contributions are as follows: (1) The knowledge graph provides a developmental map suitable for cognitive development, and KEFG builds a developmental framework to transfer the base information to specific tasks. (2) KEFG jointly explores information from the visual space and the word space, which cuts down the number of nodes needed to support the inference and decreases the deviation. (3) The experiments show that KEFG performs well not only on the coarse-grained small-size dataset but also on the fine-grained medium-size dataset.

2 Methodology

The set of all categories comprises the training set \(C_{train}\), the support set \(C_{support}\), and the testing set \(C_{test}\). \(C_{train}\) has sufficient labeled images. \(C_{support}\) and \(C_{test}\) share the same categories, called novel classes, while the training categories are called base classes. If the support set contains K labeled samples for each of N novel classes, the problem is called an N-way K-shot few-shot problem. KEFG is built on an undirected knowledge graph, denoted as \(G = (V,E)\). V is the node set, in which each node represents a class. \(E=\{e_{i,j}\}\) is the edge set. The classification weights are defined as \(w=\{w_{i}\}_{i=1}^{N}\), where, here and below, N denotes the total number of categories.

2.1 Information Injected Module

KEFG employs the knowledge graph from WordNet. Rather than taking the whole graph, KEFG adds the novel classes to the fixed graph of the base classes in ImageNet 2012. If there are N classes, KEFG takes a subgraph with N nodes. To convert the class descriptions into vectors, we use the GloVe text model and obtain S input features per class. The feature matrix of knowledge is \(V_{K}\in R^{N\times S}\). KEFG uses only hyponymy as the principle for edge construction. The edge matrix from the knowledge space is denoted as \(E_{K}\in R^{N\times N}\).

KEFG learns from the experience of the original model, denoted as \(C( F ( \cdot |\theta )|w)\). It consists of a feature extractor \(F(\cdot |\theta )\) and a category classifier \(C(\cdot |w)\), where \(\theta \) and w are the parameters of the feature extractor and the classifier, respectively. The feature extractor \(F(x|\theta )\) takes an image x as input and outputs its feature vector z. The parameter \(w^{train}\) refers to the classification weights of the classes in the training set. The final classification score is computed as \(s=z^{T}w\).

The feature extractor \(F(\cdot |\theta )\) can compute feature representations of the images in \(C_{support}\). Following the template-matching principle, the feature representation of a novel class can serve well as its classification weights. Thus the initial weights can be represented as the average of the features.

$$\begin{aligned} v_{i}^{E}= {\left\{ \begin{array}{ll} w_{i}^{train},&{} x_{i}\in C_{train}\\ \frac{1}{P}\sum _{p=1}^{P}F(x_{i,p}|\theta ),&{} x_{i}\in C_{test} \end{array}\right. } \end{aligned}$$
(1)

where \(v_{i}^{E}\) refers to the visual feature of the ith class, \(x_{i,p}\) refers to the pth image in the ith class, and P is the total number of images in the ith class.
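As an illustration of Eq. (1), the following minimal sketch builds the initial visual features: base classes keep their trained classifier weights, and each novel class takes the mean of its support features. The function names, the dictionary layout, and the toy values are assumptions made for this example, and the feature vectors are assumed to have already been produced by the feature extractor.

```python
def mean_feature(features):
    """Average a list of equal-length feature vectors (novel-class branch of Eq. 1)."""
    if not features:
        raise ValueError("need at least one support feature")
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def initial_visual_features(base_weights, support_features):
    """Build v^E for every class.

    base_weights:     dict, class name -> trained classifier weight w_i^train
    support_features: dict, class name -> list of feature vectors F(x_{i,p} | theta)
    (Both layouts are hypothetical; the paper does not prescribe a data structure.)
    """
    # Base classes reuse their trained weights unchanged.
    v_e = {name: list(w) for name, w in base_weights.items()}
    # Novel classes use the mean of their few support features.
    for name, feats in support_features.items():
        v_e[name] = mean_feature(feats)
    return v_e


# Toy example with 2-dimensional features (values are made up).
base = {"dog": [0.5, 1.0]}
support = {"finch": [[1.0, 0.0], [3.0, 2.0]]}
v_e = initial_visual_features(base, support)
print(v_e)
```

With more shots per novel class the mean becomes a less noisy template, which is consistent with the template-matching argument above.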

Motivated by the denoising autoencoder network, KEFG injects the word embedding into the initial classification weights to generate more general classifiers. The feature of the ith class in the fusion graph is represented as follows

$$\begin{aligned} v_{i}=\alpha \frac{v_{i}^{K}}{\Vert v_{i}^{K}\Vert _{2}}+\beta \frac{v_{i}^{E}}{\Vert v_{i}^{E}\Vert _{2}} \end{aligned}$$
(2)

where \(\alpha \) and \(\beta \) refer to the proportions of the two sources of information.
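The fusion in Eq. (2) can be sketched as a weighted sum of two L2-normalized vectors. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes the knowledge and experience vectors have already been brought to the same dimension, a step the text does not spell out.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def fuse(v_k, v_e, alpha, beta):
    """Eq. 2: alpha * v_k / ||v_k|| + beta * v_e / ||v_e||.

    Assumes v_k and v_e have equal length (an assumption of this sketch).
    """
    if len(v_k) != len(v_e):
        raise ValueError("knowledge and experience vectors must match in size")
    k = l2_normalize(v_k)
    e = l2_normalize(v_e)
    return [alpha * a + beta * b for a, b in zip(k, e)]


# Toy example: equal proportions of knowledge and experience.
fused = fuse([3.0, 4.0], [0.0, 2.0], alpha=0.5, beta=0.5)
print(fused)
```

Normalizing both vectors first means that \(\alpha \) and \(\beta \) alone control the balance, regardless of the raw magnitudes of the word embedding and the visual feature.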

Besides the hyponymy relationship, KEFG also introduces cosine similarity into the graph. The edges are denoted as follows

$$\begin{aligned} e_{i,j}= {\left\{ \begin{array}{ll} 1,&{} Simi(x_{i},x_{j})> S \text { or }Hypo(x_{i},x_{j})\\ 0,&{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

\(Simi(x_{i},x_{j})\) refers to the cosine similarity. S is the similarity threshold used to judge whether two classes are related, and it is a hyper-parameter. \(Hypo(x_{i},x_{j})\) indicates whether a hyponymy relationship exists between the ith class and the jth class.
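The edge rule in Eq. (3) can be sketched as follows. The function names and the toy data are hypothetical, the hyponymy relation is assumed to be given as a list of index pairs, and the similarity is assumed to be computed on per-class feature vectors, since the text does not state which features enter \(Simi\). The diagonal is left at zero in this sketch because Eq. (4) adds self-loops separately.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_edges(features, hyponym_pairs, threshold):
    """Eq. 3: e_ij = 1 if Simi > threshold or the pair is in a hyponymy relation."""
    n = len(features)
    linked = {frozenset(pair) for pair in hyponym_pairs}
    edges = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            similar = cosine_similarity(features[i], features[j]) > threshold
            if similar or frozenset((i, j)) in linked:
                # The graph is undirected, so keep the matrix symmetric.
                edges[i][j] = edges[j][i] = 1
    return edges


# Toy example: classes 0 and 1 look alike; classes 0 and 2 are linked by hyponymy.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
edges = build_edges(features, hyponym_pairs=[(0, 2)], threshold=0.8)
print(edges)
```

Lowering the threshold adds more similarity edges and makes the graph denser, so the threshold trades graph size against the amount of information each node can gather.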

2.2 Information Transfer Module

Within the GCN framework, KEFG propagates information among nodes by exploring the relationships among classes. The mechanism of GCN is described as

$$\begin{aligned} H^{(l+1)}=\mathrm {ReLU}(\hat{D}^{ - \frac{1}{2}}\hat{E}\hat{D}^{ - \frac{1}{2}}H^{(l)}U^{(l)}) \end{aligned}$$
(4)

where \(H^{(l)}\) denotes the output of the lth layer. \(\hat{E}=E+I\), where \(E\in R^{N\times N}\) is the symmetric adjacency matrix and \(I\in R^{N\times N}\) is the identity matrix. \(\hat{D}\) is the diagonal degree matrix with \(\hat{D}_{ii}=\sum _{j}\hat{E}_{ij}\), and \(U^{(l)}\) is the weight matrix of the lth layer.
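A single propagation step of Eq. (4) can be sketched with plain Python lists, which keeps the example free of external dependencies. The helper names and toy matrices are hypothetical, and the degree is computed from \(\hat{E}\) (the adjacency with self-loops), which follows the standard GCN formulation and is an assumption of this sketch.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def normalized_adjacency(edges):
    """Compute D^{-1/2} (E + I) D^{-1/2}, with D the degree matrix of E + I."""
    n = len(edges)
    e_hat = [[edges[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in e_hat]  # at least 1 because of the self-loop
    return [[e_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(edges, h, u):
    """One layer of Eq. 4: ReLU(normalized adjacency @ H @ U)."""
    z = matmul(matmul(normalized_adjacency(edges), h), u)
    return [[max(0.0, x) for x in row] for row in z]


# Toy example: two connected nodes with 2-dimensional features.
edges = [[0, 1], [1, 0]]
h = [[1.0, 0.0], [0.0, 1.0]]
u = [[1.0, -1.0], [0.0, 2.0]]
out = gcn_layer(edges, h, u)
print(out)
```

In the toy example both nodes end up with the same output, because each node averages its own feature with that of its only neighbor before the weight matrix is applied.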

The fusion graph is trained to minimize the loss between the predicted classification weights and the ground-truth weights.

$$\begin{aligned} L=\frac{1}{M}\sum _{i=1}^{M}(w_{i}-w^{train}_{i})^{2} \end{aligned}$$
(5)

where \(w_{i}\) refers to the GCN output for the ith base class, \(w^{train}_{i}\) denotes the corresponding ground truth obtained from the category classifier, and M is the number of base classes.
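The loss in Eq. (5) can be sketched as below. Since the weights are vectors, the square is read here as the squared Euclidean distance between the predicted and ground-truth weight vectors; this reading, the function name, and the toy values are assumptions of the sketch.

```python
def weight_loss(predicted, target):
    """Eq. 5: mean over base classes of the squared distance between weights.

    predicted: list of M weight vectors produced by the GCN for the base classes
    target:    list of M ground-truth weight vectors w^train
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need the same non-zero number of vectors on both sides")
    total = 0.0
    for w, w_train in zip(predicted, target):
        total += sum((a - b) ** 2 for a, b in zip(w, w_train))
    return total / len(predicted)


# Toy example with M = 2 base classes (values are made up).
loss = weight_loss([[1.0, 2.0], [0.0, 0.0]], [[1.0, 0.0], [3.0, 4.0]])
print(loss)
```

Only base classes contribute to the loss, because ground-truth classifier weights exist only for them; the novel classes receive their weights through propagation on the graph.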

KEFG further applies the general classifiers to the original model. By computing the classification scores \(s=z^{T}w\), KEFG distinguishes novel classes from only a few samples and transfers the original model to other datasets efficiently.
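The scoring step \(s=z^{T}w\) amounts to a dot product between an image feature and each class weight, followed by picking the highest score. The following sketch uses a hypothetical function name and made-up weights, and assumes the image feature has already been extracted.

```python
def classify(z, class_weights):
    """Score a feature vector z against every class weight and pick the best.

    class_weights: dict, class name -> weight vector (hypothetical layout)
    Returns the predicted class name and the dict of scores s = z^T w.
    """
    scores = {name: sum(a * b for a, b in zip(z, w))
              for name, w in class_weights.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy example: one novel class and one base class (values are made up).
weights = {"finch": [1.0, 0.0], "dog": [0.0, 1.0]}
label, scores = classify([0.2, 0.9], weights)
print(label, scores)
```

Because the predicted weights simply replace the rows of the original classifier, the feature extractor itself does not need to be retrained for the novel classes.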

3 Experiments

3.1 Experimental Setting

The base classes are the training set of ImageNet 2012. We test the developmental ability on CUB and miniImageNet. The knowledge graph is extracted from WordNet. CUB includes 200 fine-grained classes of birds, of which we take only 10 classes that are disjoint from the 1000 training classes of ImageNet 2012. MiniImageNet consists of 100 categories. For fairness, we take only 90 base classes as the training set. The remaining 10 classes in miniImageNet constitute the novel task with few examples. The original recognition model is a ResNet50 [6] pre-trained on the base classes.

3.2 Comparison

Table 1. Comparison with prior models

The comparison between KEFG and other existing methods is reported in Table 1, where the performance is evaluated by the average top-1 accuracy. KEFG achieves the best or competitive performance on both the 10-way 1-shot and 10-way 5-shot recognition tasks. In particular, KEFG shows a remarkable improvement on the fine-grained dataset: on CUB, the accuracy increases by up to almost twenty percentage points. Note that the training set of KEFG comes entirely from ImageNet 2012, so the relationship between the base and novel classes is weaker. We attribute this excellent transferability to two aspects. First, the prestored graph knowledge provides an excellent developmental graph for the model. Second, the combination of knowledge and experience provides abundant information for the novel classes to refer to. Thus the transfer accuracy increases notably.

Table 2. Comparison on details

In Table 2, we analyze the details. Both DGP and SGCN exploit only the knowledge graph for inference. The experiments show that KEFG considerably reduces the size of the subgraph and cuts down the training time as well. SGCN and DGP exploit only the inheritance relationship in the knowledge graph, so the subgraph involved in the inference must be large to gather abundant information. KEFG, on the other hand, takes visual similarity into account, which leads to a denser graph. Thus the novel nodes can gather more information from a smaller graph.

Fig. 2. In the test, w refers to the knowledge while v refers to the visual experience.

Table 3. Ablation study

We further test the effectiveness of the fusion idea. The recognition accuracy changes with the fusion proportions of knowledge and experience. Fig. 2 shows that the accuracy increases rapidly when the two sources of information are combined. After the peak accuracy of 73.78\(\%\) for 1-shot and 79.47\(\%\) for 5-shot, the accuracy declines as the combination becomes weaker. Table 3 shows that KEFG improves the performance by almost 30% over using only knowledge or only experience, because the novel nodes gather more supplementary information from their neighbors. Information from both the word space and the visual space is taken into account. Besides, not only the parent and offspring nodes but also the visually similar nodes are connected to the novel ones. Furthermore, we combine the experience with Gaussian noise to test the effectiveness of the knowledge. Table 3 shows that combining experience with knowledge yields about 4% higher accuracy than combining it with Gaussian noise. Thus the knowledge information contributes meaningfully to the inference.

4 Conclusion

In this paper, we propose KEFG, which takes advantage of information from both knowledge and experience to realize visual cognitive development. To take the interrelationship among categories into account, KEFG is based on the framework of the graph convolutional network. In the experiments, the proposed model outperforms previous state-of-the-art methods and markedly reduces the training time. In future work, we will focus on improving the fusion mechanism to further improve the performance of our model.