1 Introduction

With the big-data era approaching, large-scale web image collections pose a great challenge to image understanding and retrieval. As a result, related tasks such as image classification and automatic image annotation have been well explored.

Previous works on automatic image annotation mainly focus on better probabilistic representations and the adoption of learning-based methods [10, 14, 19]. However, prior or domain knowledge has largely been ignored. Humans use knowledge when learning the visual appearance of objects [15]. For example, babies sometimes learn new things from what their parents or teachers tell them. In our opinion, knowledge is therefore useful for learning, and ontologies are particularly important.

In this paper, we propose to augment statistical learning methods with ontology and apply this idea to image attribute learning. An attribute, defined as an inherent characteristic in the Webster dictionary, is an intrinsic human-nameable quality of an image. Attribute-based ideas have been shown to be helpful in various applications like face verification [18], image retrieval [28, 37], action recognition [21], robotics and mobile communications [16], and zero-shot transfer learning [16, 19, 25, 36, 38].

However, the knowledge embedded in inter-attribute relationships is rarely considered, and human effort is usually required, for example to label the attributes. To address these issues, we propose a method called Image Attribute Learning with Ontology Guided Fused Lasso (IAL-OGFL). Ontologies are used for mining inter-attribute similarity, and the graph-guided fused lasso [17] is exploited for sparse feature selection.

Why Ontology?

Knowledge representation is central to the application of knowledge-based methods. According to the “modelling view” of knowledge acquisition proposed by Clancey [4], a knowledge base is not a repository of knowledge extracted from one expert’s mind, but the result of a modeling activity whose object is the observed behavior of an intelligent agent embedded in an external environment. This implies that learning from experiential knowledge alone may not give good results. For example, many recent papers acquire priors for attribute learning from a manual class-attribute correlation matrix [13]. Since the matrix is produced by skilled workers, it is not authoritative and is hard to reuse when the class or attribute labels change, so it is a suitable candidate for improvement with ontology. From another perspective, knowledge representation formalisms can be classified into five categories according to the kinds of primitives used (Fig. 1) [12]. Fig. 1 shows that interpretation at the logical and epistemological levels is arbitrary, and that interpretation at the conceptual and linguistic levels is subjective. At the ontological level, however, the ontological commitments associated with the language primitives are specified explicitly, which restricts the number of possible interpretations. Because of these characteristics, and for the purpose of sharing and reusing knowledge, we propose to utilize ontology in image attribute learning.

Fig. 1 Classification of knowledge representation formalisms

Why Graph-Guided Fused Lasso?

For real-world images, high-dimensional low-level features can be extracted, but only a small fraction of them are associated with any given attribute. Without feature selection, the remaining features increase the computational complexity. Lasso [31] is suitable for sparse feature selection, but it cannot capture structural information among attributes, so a structured-sparsity-inducing penalty should be considered [3]. Unlike the group lasso, which separates attributes into groups, and the fused lasso, which treats attributes as a chain structure, the graph-guided fused lasso admits a general class of structures, and therefore more priors can be included.

The main contributions of our work can be summarized as follows:

  1) Inter-attribute similarity is integrated into the graph-guided fused lasso model. Different from previous works, the WordNet-based metric space is exploited for inter-attribute similarity measurement (Section 3).

  2) The idea that statistical learning is directed with ontology is shared and a principled framework of IAL-OGFL is proposed (Section 4).

  3) Comprehensive experiments are conducted to demonstrate the effectiveness of our approach. Our method has outstanding performance with higher accuracy rate and faster convergence than similar works (Section 5).

2 Related work

Attribute-based methods have received much attention in the area of computer vision. Ferrari et al. [10] and Lampert et al. [19] presented a series of interesting applications which have demonstrated the power of semantic attributes. The probabilistic generative model [10] and the Direct Attribute Prediction (DAP) model [19] both treat each visual attribute as independent and train the attribute classifiers without considering their relationships. For example, the DAP model [19] trains a non-linear support vector machine (SVM) for each binary attribute, and no inter-attribute information is exchanged in this process.

However, in the real world, dependencies between attribute pairs are ubiquitous, which has also been shown by [13] on the Animals with Attributes (AwA) database. For example, “ocean” has a strong correlation with “water” and a weak correlation with “desert” in AwA. Many methods that consider these dependencies have been proposed. Hwang et al. [14] believed that all attributes can rely on some shared structure in the low-level feature space, so a convex multi-task feature learning method with an \(\ell_{1}/\ell_{2}\)-norm is adopted. According to [13], however, some attributes are more likely than others to share common relevant low-level features, and they proposed a method with graph-guided fused lasso which uses a graph to describe the correlations of attributes. Similarly, Yu et al. [37] designed a novel two-layer probabilistic graphical model for finding the relevance of attributes. Wang et al. [35] proposed a discriminative model for jointly modeling object class labels and their attributes. They also assumed that there are certain dependencies between some attribute pairs, and an attribute relation graph is used in their model. Zhang et al. [39] proposed a method to organize the semantic concepts into multiple semantic levels and augment each concept with a set of related attributes. Their method is used for image retrieval and achieves good results.

The common point of the above graph-based methods is that they rely on experts’ experiential knowledge for learning. For example, Han et al. [13] constructed the graph from a manual class-attribute correlation matrix produced by skilled workers. Such a matrix has been shown to be intuitive but possibly not discriminative [16, 22, 26, 36].

Instead, we share the idea that statistical learning can be directed with ontology. An ontology formally represents knowledge as a set of concepts within a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts [11]. Compared with experts’ experiential knowledge, an ontology is a more formal representation of a set of concepts and their relationships, so it is more authoritative for mining inter-attribute similarity information. What is more, inter-attribute similarity information can be pre-learned easily with ontologies no matter how the number of attributes scales. Ontologies have been widely used for designing concept correlations in the area of computer vision, such as image annotation [7, 27, 29, 33], object detection [1, 2, 6, 23, 30], image retrieval [24, 34] and scene understanding [20]. For example, a concept ontology composed of several types of concepts (spatial concepts and relations, color concepts and texture) was combined with machine learning techniques and used for complex object recognition in [6]. The strength of this method is that the visual concept ontology acts as a user-friendly intermediate between the image processing layer and the expert. Li et al. [20] proposed a hierarchical generative model for scene classification, object component segmentation, and image annotation. WordNet was used in order to provide a handful of relatively clean images in which some object regions are marked with their corresponding tags. To address the problem that the results returned by ranking methods for tag-based image search are irrelevant or not diverse, Wang et al. [34] proposed a diverse relevance ranking scheme, in which WordNet is used for word relevance estimation.

3 Ontology guided fused lasso

The Ontology Guided Fused Lasso (OGFL) is a model proposed to compute the relevancy between features and attributes. In this section, we first present a definition of the proposed Ontology Guided Fused Lasso as follows:

Definition 1

Ontology Guided Fused Lasso is defined as OGFL = (I, T, M).

Initial State: \(I = \{X^{S}, Y^{S}, G\}\) stands for the initial state of the model, where \({X^{S}} \in {\mathbb{R}^{N \times P}}\) represents the source image data matrix for N samples with P-dimensional features and \({Y^{S}} \in {\left \{ {0,1} \right \}^{N \times L}}\) is the attribute indicator matrix of source image data for L attributes. G is an inter-attribute similarity graph constructed with ontologies.

Terminal State: T = {B} stands for the terminal state of the model. B is the feature-attribute relevancy graph represented as a matrix, where each column is a P-vector of regression coefficients for one attribute.

Model: The model M used to bridge from the initial state I to the terminal state T is graph-guided fused lasso (\(\min_{B} \parallel {Y^{S}} - {X^{S}}B\parallel + \gamma {{\Omega }_{G}}\left (B \right ) + \lambda \parallel B{\parallel _{1}}\)).

As ontologies can mine the inter-attribute semantic similarity and the graph-guided fused lasso leads similar attributes to be more similar in the low-level feature space, OGFL, which integrates ontologies with the graph-guided fused lasso, can naturally bridge the low-level feature space and the high-level attribute space. Hence, the learned feature-attribute graph is a convenient model for selecting the most valuable features for every attribute and for attribute learning.

In this section, OGFL will be introduced in detail. We will first introduce how to combine ontology with the graph-guided fused lasso model (Section 3.1). Then we will describe the construction of the ontology guided inter-attribute similarity graph, which is built in the WordNet-based metric space (Section 3.2).

3.1 Graph-guided fused lasso with ontology

Assume that we have a set of L attributes for the problem of attribute learning. Lasso tends to solve a set of L independent regressions for each attribute with its own L1 penalty, and it doesn’t provide a mechanism to combine information across multiple attributes such that the similarity can be reflected in the regression coefficients for those correlated attributes. However, several attributes are often highly correlated and they often share some structures in the feature space. That is to say, highly correlated attributes may share more features. So it is difficult for lasso to describe this characteristic.

GFlasso extends the standard lasso; it is a penalized regression method with a pleiotropic effect on correlated attributes. GFlasso regards the correlation structure over the set of L attributes as an edge-weighted graph, and uses this graph to guide the learning process. GFlasso is particularly suitable for attribute learning problems because no attribute is isolated and correlation exists universally between attributes. As mentioned above, the GFlasso model we use in the attribute learning problem is:

$$ \min_{B} \parallel {Y^{S}} - {X^{S}}B\parallel + \gamma {{\Omega}_{G}}\left(B \right) + \lambda \parallel B{\parallel_{1}} $$
(1)

Where \({Y^{S}} \in {\left \{ {0,1} \right \}^{N \times L}}\) is the attribute indicator matrix of the source image data, \({X^{S}} \in {\mathbb{R}^{N \times P}}\) represents the source image data matrix, and B is the mapping from the feature space to the attribute space that we want to learn. γ and λ are regularization parameters that control the complexity of the model. A larger value of γ leads to a greater fusion effect. Considering that the effective features for every attribute are usually sparse, the regular Lasso penalty (\(\parallel B{\parallel _{1}} = \sum \nolimits _{p} \sum \nolimits _{l} |{B_{(p,l)}}|\)) is used. However, Lasso is prone to selecting features for each attribute individually. As described above, attributes share some structures in the feature space. That is to say, highly correlated attributes may share more features, which is useful semantic information for attribute feature selection. In order to encode the structured priors of attribute correlation into the model, the graph penalty \({{\Omega }_{G}}\left (B \right )\) is considered:

$$ {{\Omega}_{G}}\left(B \right) = \sum\limits_{e = \left({m,l} \right) \in E,m < l} \tau \left({{r_{ml}}} \right)|{B_{m}} - sign({r_{ml}}){B_{l}}| $$
(2)

Where \(B_{m}\) and \(B_{l}\) are the m-th and l-th columns of B respectively, and they are the regression coefficients for the m-th and l-th attributes. \(\tau \left (r \right ) = |r|\) weights the fusion penalty for each edge of graph G. \(sign(r_{ml}) = 1\) for two positively correlated attributes and \(sign(r_{ml}) = -1\) for two negatively correlated attributes. \({{\Omega }_{G}}\left (B \right )\) encourages \(B_{m}\) and \(B_{l}\) to take the same value by shrinking the difference between them toward 0.

Assume that we have constructed an ontology guided inter-attribute correlation graph \(G_{o}\) in a preprocessing step, consisting of a set of nodes V, each representing one of the L attributes, and a set of edges E. The weight of each edge (m, l) ∈ E is \(sim_{ml}\), standing for the relevancy of the two attributes. Two attributes m and l that are highly correlated in the feature space (low value of \(|B_{m} - sign(r_{ml})B_{l}|\)) should be very close in the attribute space (high value of \(sim_{ml}\)). Hence, \(\tau \left ({{r_{ml}}} \right )\) can be replaced with \(sim_{ml}\) in (2), which enriches the interpretability and can improve the accuracy of attribute learning, as shown in the experiments.
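To make the roles of the three terms concrete, the following is a minimal pure-Python sketch that evaluates the objective in (1) for a given B once the edge weight in (2) has been replaced by the ontology similarity. It is an illustration under stated assumptions, not the implementation used here: the loss is assumed to be a squared Frobenius norm, the fusion term is read as an element-wise l1 norm over the column difference, and the function name `ogfl_objective` and the edge-list format are hypothetical.

```python
def ogfl_objective(X, Y, B, edges, gamma, lam):
    """Evaluate loss + gamma * graph penalty + lam * l1 penalty.

    X: N x P feature rows, Y: N x L attribute indicators, B: P x L coefficients.
    edges: list of (m, l, sim, sign) tuples, one per graph edge with m < l.
    Assumption: squared Frobenius loss and an l1 norm over the column difference.
    """
    N, P, L = len(X), len(B), len(B[0])
    # Squared residual between the labels and the linear prediction X @ B.
    loss = 0.0
    for n in range(N):
        for l in range(L):
            pred = sum(X[n][p] * B[p][l] for p in range(P))
            loss += (Y[n][l] - pred) ** 2
    # Fusion penalty: similar attributes are pushed toward similar columns of B.
    fusion = 0.0
    for m, l, sim, sign in edges:
        fusion += sim * sum(abs(B[p][m] - sign * B[p][l]) for p in range(P))
    # Sparsity penalty over every entry of B.
    l1 = sum(abs(b) for row in B for b in row)
    return loss + gamma * fusion + lam * l1
```

Because WUP similarities are non-negative, every edge in an ontology-built graph would carry a sign of +1; the sign is kept as a parameter only so that the same function also covers the correlation-based weighting of (2).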

3.2 WordNet-based metric space and attribute relation graph construction

There are several ways to construct the inter-attribute similarity graph for the attribute learning problem. In [13], a class-attribute matrix constructed by skilled workers is clustered in order to get an inter-attribute similarity graph A (see Fig. 2), and [17] computes pairwise Pearson correlation coefficients for all pairs of attributes using the label matrix Y. These methods are statistical, and their time complexity grows with the number of classes in A and the number of samples in Y, so they scale poorly. Besides, experts’ experiential knowledge is required for labeling the attributes of every class in [13], which is less objective and authoritative than knowledge extracted from ontologies. We propose a method that constructs the attribute graph with ontology and without learning, which is simple and effective.
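For comparison, the label-based construction attributed to [17] can be sketched as below. This is an illustrative pure-Python version, with the hypothetical name `pearson_graph`, and it makes visible why the cost of this approach depends on the number of samples N: every pair of attributes requires a pass over all rows of Y.

```python
from math import sqrt

def pearson_graph(Y):
    """Return {(m, l): r_ml} for all attribute pairs m < l in label matrix Y."""
    N, L = len(Y), len(Y[0])
    cols = [[row[l] for row in Y] for l in range(L)]
    means = [sum(c) / N for c in cols]
    graph = {}
    for m in range(L):
        for l in range(m + 1, L):
            cov = sum((a - means[m]) * (b - means[l])
                      for a, b in zip(cols[m], cols[l]))
            var_m = sum((a - means[m]) ** 2 for a in cols[m])
            var_l = sum((b - means[l]) ** 2 for b in cols[l])
            denom = sqrt(var_m * var_l)
            # A constant column has no defined correlation; treat it as 0.
            graph[(m, l)] = cov / denom if denom > 0 else 0.0
    return graph
```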

Fig. 2 Attribute graph construction with learning-based ideas (upper left) and ontology-based ideas (lower left). In order to acquire inter-attribute similarity information, a clustering strategy is applied to a manually labeled class-attribute matrix in the learning-based idea, while human prior knowledge is obtained from an ontology in the ontology-based idea

We construct graphs with WordNet. Information in WordNet is grouped into sets of cognitive synonyms (synsets). Synsets are interlinked by means of conceptual-semantic and lexical relations. We adopt a simple and commonly used approach for building such graphs in this article: we first compute the pairwise WUP similarity (Wu and Palmer, 1994) for all pairs of attributes in WordNet, and then connect every two nodes with an edge to build the graph.

WUP views WordNet as a graph and measures similarity as a function of the path lengths relative to the lowest super-ordinate (LSO) of the two concepts m and l, which is the most specific concept that they share as an ancestor. For example, if m is ‘pest#n#4’ and l is ‘arthropod#n#1’, then \(LSO\left ({m,l} \right )\) is ‘animal#n#1’. The WUP similarity between m and l can be calculated as follows:

$$ si{m_{ml}} = \frac{{2 \times depth\left({LSO\left({m,l} \right)} \right)}}{{len\left({m,LSO\left({m,l} \right)} \right) + len\left({l,LSO\left({m,l} \right)} \right) + 2 \times depth\left({LSO\left({m,l} \right)} \right)}} $$
(3)

Where \(len\left ({m,LSO\left ({m,l} \right )} \right )\) measures the length of the shortest path in WordNet from concept m to concept \(LSO\left ({m,l} \right )\), \(depth\left ({LSO\left ({m,l} \right )} \right )\) means the length of the path to \(LSO\left ({m,l} \right )\) from the global root, i.e. \(depth\left ({LSO\left ({m,l} \right )} \right ) = len\left ({root,LSO\left ({m,l} \right )} \right )\).
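Equation (3) can be sketched on a small tree as follows. This is a simplified illustration, not a WordNet interface: it assumes each concept has a single parent (WordNet synsets can have several hypernyms), and the names `TOY_PARENT`, `path_to_root`, `wup_similarity` and `similarity_graph`, as well as the toy hierarchy itself, are hypothetical.

```python
# Hypothetical toy hierarchy stored as child -> parent.
TOY_PARENT = {
    "animal": "root",
    "invertebrate": "animal",
    "worm": "invertebrate",
    "vertebrate": "animal",
    "mammal": "vertebrate",
}

def path_to_root(node, parent):
    """Return [node, parent, ..., root] by following parent pointers."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def wup_similarity(m, l, parent):
    """Wu-Palmer similarity of (3) on a tree given as a child -> parent dict."""
    path_m, path_l = path_to_root(m, parent), path_to_root(l, parent)
    ancestors_l = set(path_l)
    # The first shared node on the way up from m is the lowest super-ordinate.
    lso = next(node for node in path_m if node in ancestors_l)
    len_m = path_m.index(lso)
    len_l = path_l.index(lso)
    depth = len(path_to_root(lso, parent)) - 1  # edges from the root to the LSO
    denom = len_m + len_l + 2 * depth
    # Only the root compared with itself gives a zero denominator.
    return 2 * depth / denom if denom else 1.0

def similarity_graph(attributes, parent):
    """Connect every pair of attributes with an edge weighted by WUP similarity."""
    return {(a, b): wup_similarity(a, b, parent)
            for i, a in enumerate(attributes) for b in attributes[i + 1:]}
```

Calling `similarity_graph(["worm", "invertebrate", "mammal"], TOY_PARENT)` yields one weighted edge per attribute pair, and the work involved depends only on the hierarchy, not on the number of images or classes.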

The semantic relation between attribute ‘pest#n#4’ and attribute ‘arthropod#n#1’ can be calculated as in Fig. 3. The similarity between them is 0.8421, which means the two concepts are close. The WUP measurement is simple and has low complexity. It relies only on ontology-based depths and path lengths for every pair of attributes, and its complexity does not increase with the number of samples or classes. Besides, since ontologies are built in line with cognitive science, we expect ontology guided learning to give better results.

Fig. 3 WUP measurement of pest#n#4 and arthropod#n#1. The LSO of the two attributes is animal#n#1 and the depth of animal#n#1 is the path from root to itself which is 8. Hence, the similarity between the two attributes is 0.8421
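As a check on the reported value, substituting the depth of 8 into (3) reproduces 0.8421 when the two path lengths to the LSO sum to 3. That sum is inferred here from the reported similarity rather than read from the figure:

$$ si{m_{ml}} = \frac{{2 \times 8}}{{3 + 2 \times 8}} = \frac{{16}}{{19}} \approx 0.8421 $$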

4 Image attribute transfer with ontology

The matrix B learned with OGFL in Section 3 integrates inter-attribute similarities and corresponds to coupling pairs of attributes in the adjacent rows of the same column. Besides, it reflects the correlation of every attribute with its features. A larger value of an element in B means a greater relevancy between the attribute and the corresponding feature. Hence, the matrix B is consistent with the assumption mentioned before that there is a shared structure between the attribute space and the original image descriptor space, and it is very suitable for feature selection for individual attributes. Since the matrix B is learned with ontology, it reflects the intrinsic characteristics of attributes and is relatively easy to transfer to different samples or different databases.

Assume that we have a target image dataset \(T = \{X^{T}\}\) with \({X^{T}} \in {\mathbb{R}^{N \times P}}\) which can be annotated with the L attributes. Then an algorithm for feature selection and attribute transfer can be obtained (Algorithm 1). Every column of matrix B (e.g. \(B_{(:,l)}\), l = 1, ..., L) corresponds to one attribute and reflects how all the features influence the attribute. Hence, the characteristics of matrix B can be exploited to perform feature selection for every attribute on the target images. We rank the elements of the vector \(B_{(:,l)}\) according to the value of \(\left \| {{B_{(p,l)}}} \right \|(p = 1,...,P)\) in descending order, and the top f features are the most beneficial features for \(B_{(:,l)}\). After feature selection, various classifiers can be trained. In this paper, we have tried the kNN classifier and SVM to test the result of feature selection. In this process, the correlated information among attributes is transferred from the source images to the target images in order to get a better representation for the target images.

Algorithm 1 Feature selection and attribute transfer
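The ranking step described above can be sketched in a few lines. This is a minimal illustration rather than a transcription of Algorithm 1: the function names `select_top_features` and `project` are hypothetical, B is held as a list of P rows with L entries each, and ties are broken by feature index.

```python
def select_top_features(B, l, f):
    """Return indices of the f features with the largest |B[p][l]| for attribute l."""
    scores = [abs(row[l]) for row in B]
    # Sort feature indices by score, largest first; ties keep index order.
    ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
    return ranked[:f]

def project(X, feature_idx):
    """Keep only the selected feature columns of each sample in X."""
    return [[row[p] for p in feature_idx] for row in X]
```

A per-attribute classifier would then be trained on `project(X_target, select_top_features(B, l, f))`, so each attribute sees only the features that B marks as most relevant to it.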

The framework of IAL-OGFL is illustrated in Fig. 4. The key points are as follows: (1) constructing a WordNet-path-based metric space and mining the semantic relations of attributes to construct the graph (Section 3.2); (2) using the pre-learned inter-attribute correlation graph and the source samples to solve the graph-guided fused lasso model with the smoothing proximal gradient method proposed in [3], with a multi-task extension that is not described here for reasons of space (Section 3.1); (3) transferring the matrix to select features for every attribute on the target samples; (4) predicting attributes with the selected visual features.

Fig. 4 The framework of IAL-OGFL. For the training set, a projection from the low-level feature space to the ontology-based attribute space is found which encourages highly correlated attributes to share similar features. The projection is represented with a feature-attribute correlation matrix (top center) and can be used for feature selection and attribute prediction for the target image set

5 Experiment and result

5.1 Dataset and image features

Generally speaking, attributes are usually designed by manually picking a set of words. In [8], besides semantic attributes, discriminative attributes (e.g. “cats and dogs have it but sheep and horses don’t”) are designed by experts. In [19], attributes are collected by experts according to the “relative strength of association” between attributes and classes. The common ground of these methods is that additional human effort is involved. To solve this problem, Yu et al. [36] proposed to design “category-level attributes”, which do not have concise names like the manually specified attributes. Unlike these methods, we take advantage of the hierarchy of ImageNet to acquire attributes, which we think is easy and suggestive.

Semantic hierarchies are often used for image annotation [32]. Similarly, in this paper, the hierarchy of ImageNet is exploited to define the image attributes. The hierarchy of ImageNet is built mostly according to hyponymy, which is also called the “is-a” relation. For example, a “human” is an “animal”, and a “worm” is an “invertebrate”. The “is-a” relation is a very important inherent characteristic. Naturally, we treat the father node as an attribute of the son node. For example, “animal” can be exploited as an attribute of “male” and “invertebrate” is an attribute of “worm”.

ImageNet contains over 10 million images and over 15000 synsets (sets of cognitive synonyms) [5]. We run our experiment on the animal branch. 30 classes (see Fig. 5) in 3 layers with 31288 images are selected to build the dataset. The number of images per class varies, ranging from tens of pictures to thousands of pictures.

Fig. 5 Attributes used in our experiment and their hierarchy. All these attributes are from ImageNet

SIFT Bag of Visual Words features are used in our experiment for their robustness to image rotation and stability under viewpoint variation. First, SIFT (Scale-Invariant Feature Transform) points are extracted from every image in the database. Then a randomly selected set of SIFT points is clustered to produce 1,000 centers, which form the visual dictionary. Finally, each image is quantized into a 1,000-dimensional bag-of-visual-words histogram.
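The quantization step can be sketched as below, assuming the visual dictionary has already been computed. The clustering algorithm and any histogram normalization are not specified above, so this sketch takes the centers as given and returns raw counts; the function names `nearest_center` and `bovw_histogram` are hypothetical.

```python
def nearest_center(descriptor, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(descriptor, c))
    return min(range(len(centers)), key=lambda k: sqdist(centers[k]))

def bovw_histogram(descriptors, centers):
    """Count how many descriptors of one image fall on each visual word."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest_center(d, centers)] += 1
    return hist
```

With 1,000 centers, `bovw_histogram` returns the 1,000-dimensional vector that serves as one row of the image data matrix.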

5.2 Parameter tuning

In order to train an optimal regressor with (1), the weights λ and γ need to be determined first. Since B is learned through a regression process, the predicted responses are continuous rather than binary. Thus, the Area Under the ROC Curve (AUC) [9] is used as the evaluation metric. We try different values of λ and γ and select the best. The candidate values for λ and γ are {0.0001, 0.001, 0.01, 0.1, 1, 10, 100}. For every pair of values of λ and γ, we randomly select half of the images for training and use the rest for testing; each experiment is repeated 10 times and the average value is reported in Table 1. The highest AUC value, 0.780219, is obtained when λ = γ = 1.
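Because the predictions are continuous scores, AUC can be computed directly from how the scores rank positives against negatives. The following sketch uses the pairwise (Mann-Whitney) formulation, which is one standard way to compute AUC and is not necessarily the procedure of [9]; the names `auc`, `GRID` and `PAIRS` are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# The candidate values listed above give 7 x 7 = 49 (lambda, gamma) settings.
GRID = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
PAIRS = [(lam, gamma) for lam in GRID for gamma in GRID]
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a sort-based rank computation would be preferable on tens of thousands of images.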

Table 1 AUC value with different parameters λ and γ

5.3 Ontology guided fused lasso performance

A source image set containing randomly selected 22207 images with 30 attributes is used for training and testing the performance of the Ontology Guided Fused Lasso model. We compare our method with CAT-MtG2F (correlated attribute transfer with multi-task graph-guided fusion) [13] and other flat methods.

Table 2 compares the following methods: a no-graph method that only uses the \(\ell_{1}\) norm (\(\ell_{1}\) Method), a flat graph-guided fusion method in which every attribute has a correlation of 1 with every other attribute (FlatMtG2F), a kNN-based CAT-MtG2F method with different k (kNN-MtG2F), the Pearson correlation coefficient based method (PearsonCC) used in [17], and our ontology-based method. For each method, we randomly select half of the images from the source image set for training and use the rest for testing; each experiment is repeated 10 times. The average number of iterations, running time and AUC value of B are reported.

Table 2 Performance of B with different methods

Table 2 shows that our ontology guided method converges most easily, with the fewest iterations, and has the highest AUC value. This suggests that the WordNet-based metric space is better suited to describing inter-attribute similarity, and hence that the ontology guided regressor is more discriminative. Our method takes less time than FlatMtG2F and CAT-MtG2F when k ≠ 1, which also suggests that the graph constructed with ontology is better. Our method is slower than the \(\ell_{1}\) Method and 1NN-MtG2F because those methods are simpler, with fewer constraints (the \(\ell_{1}\) Method has no graph, and 1NN-MtG2F has a simple graph in which attributes have no correlations with the others).

5.4 Result of attribute transfer

We use classifiers to verify the effectiveness of the learned matrix B. The remaining 10000 images with 30 attributes are used as the target image set. We randomly select 90% of the samples as the training set and use the rest for testing. Feature selection is performed for every attribute with 50 features selected, and a classifier is attached to every attribute.

We test the learned model with an SVM classifier. We use libSVM, and the best C and G are selected for every classifier. Our method is compared with CAT-MtG2F, PearsonCC and PCA by accuracy and mean squared error (Table 3). Table 3 shows that our method also has the best performance in terms of accuracy and mean squared error.

Table 3 Accuracy and Mean Squared Error with libSVM

6 Conclusion

We have augmented statistical learning methods with ontology and proposed a novel ontology guided fused lasso method for image attribute learning. Our method has several advantages compared with previous methods. Firstly, we obtain the priors on the interrelationship of attributes from ontology, which is more explicable than pure statistical methods. Secondly, a WordNet-path-based metric is used for designing inter-attribute correlations; it is very flexible and can easily be modified to improve on many different performance measurements. Thirdly, the WordNet-based attribute space makes it easy to scale the process up to a large number of attributes. The experiments show that our method can both accelerate convergence and improve the accuracy rate with an SVM classifier. This suggests that the WordNet-based metric space is better suited to describing inter-attribute similarity and that ontology is beneficial for learning. In addition, the Ontology Guided Fused Lasso transfers well to target data. Our future work is to consider various measurements with different ontologies and to find a more universally feasible metric.