1 Introduction

Automatic Image Annotation (AIA) is one of the most fundamental problems in image retrieval and computer vision. The aim of AIA is to assign suitable annotation words to a given image that reflect its content. In general, AIA consists in learning models from a training set of pre-annotated images in order to generate annotation words for unlabeled images. Because of the gap between visual features and annotation words, AIA remains a difficult issue in computer vision. In this context, machine learning approaches are used to learn the mapping between low-level and semantic features, and then to generate annotation words for a test image. These include classification approaches that classify images into semantic classes based on their visual features.

In addition, several works in the literature dealing with image classification and annotation revolve around the Bag of Visual Words (BoVW) method, which consists of building a visual vocabulary from image features [31], [8]. The image features are quantized into visual words, and the image content is expressed through the distribution of these visual words.

Recently, special attention has shifted to the use of complex, multi-layer architectures. The biologically-inspired HMAX model was first proposed in [21]. It has attracted a great deal of attention in image classification due to its architecture, which alternates layers of feature extraction with layers of maximum pooling. The HMAX model was optimized in [24] in order to add a multi-scale representation as well as more complex visual features.

In order to achieve a finer representation of the semantic content in images, several annotation approaches based on ontologies have been proposed. The use of ontologies is generally motivated by the need to use semantic relations and describe data at a more semantic level for better annotation. However, such methods do not exploit both visual and semantic features during the image annotation process.

In this paper, we propose an ontology-based image annotation driven by classification using HMAX features. Our idea is to train the classifiers with visual features and to build an ontology that can finely represent the semantic content associated with the training images. Both classifiers and ontology are used for annotating testing images.

Thus, the main contributions consist of: 1) integrating the classifiers and the ontology in the training phase; and 2) evaluating a membership value that serves to select annotation words, depending mostly on the relationships detected in the ontology.

The remainder of this paper is organized as follows. Section 2 presents an overview of the related research, along with our motivations and objectives. Section 3 describes the proposed image annotation approach and its components. In Section 4, we report the experimental results of our approach. Finally, Section 5 concludes this paper and proposes directions for future work.

2 Related work, motivation and objectives

In the image retrieval field, two basic approaches have been proposed in the literature: 1) content-based image retrieval (CBIR) and 2) semantic image indexing and retrieval (SIIR). Most works have focused on content-based image retrieval, which helps organize images by their visual content. However, it has been shown that CBIR approaches are unable to automatically describe the semantic content of images. As a result, Automatic Image Annotation (AIA) has attracted increasing attention from researchers in the computer vision and multimedia areas. In the AIA area, several methods and approaches have been introduced and applied. In the next subsection, we give a general overview of the main related works.

2.1 Related work

2.1.1 Approaches based on learning techniques

In the AIA area, a large number of methods based on learning techniques have been applied [10], [19] and [14]. Recently, to annotate images, some researchers have attempted to learn detectors that can localize objects in images. In this context, [10] proposed a weakly supervised part selection method with spatial constraints for fine-grained image classification. The goal of this work is, firstly, to learn a whole-object detector automatically, aiming at localizing the object by jointly using saliency extraction and segmentation; and secondly, to propose spatial constraints that serve to select the distinctive parts. The spatial constraints define the relationship between an object and its parts and the relationships between the object’s parts. The aim is to ensure that the selected parts are located in the object region and are the parts that best distinguish the object from other categories. The results of this work demonstrate the superiority of this method compared with methods that rely on expensive annotations.

In addition, in [36] a fast binary-based HMAX model (B-HMAX) is proposed for object recognition. The goal is to detect corner-based interest points and to extract fewer features with better distinctiveness. The idea is to use binary strings to describe the image patches extracted around the detected corners, and then to use the Hamming distance for matching between two patches.

Moreover, several image annotation approaches based on deep learning models have been proposed. For instance, in [17], two main issues in large-scale image annotation are addressed: 1) how to learn a rich feature representation suitable for predicting a diverse set of visual concepts ranging from objects and scenes to abstract concepts; and 2) how to annotate an image with the optimal number of class labels. For the first issue, a novel multi-scale deep model is proposed, whose aim is to extract rich and discriminative features capable of representing a wide range of visual concepts. The deep model is also made multi-modal by taking noisy user-provided tags as model input to complement the image input. To tackle the second issue, a label quantity prediction auxiliary task is introduced to explicitly estimate the optimal number of labels for a given image. In this work, extensive experiments are carried out on two large-scale image annotation benchmark datasets, and the results show that this method significantly outperforms the state of the art.

In [26], a multi-modal deep learning framework has been introduced whose aim is to optimally integrate multiple deep neural networks pretrained with convolutional neural networks. In particular, the proposed framework explores a unified two-stage learning scheme that consists of learning to fine-tune the parameters of the deep neural network with respect to each individual modality, and learning to find the optimal combination of diverse modalities simultaneously in a coherent process. The results of this work validate the effectiveness of the proposed framework.

In addition, AIA methods are considered efficient schemes for bridging the semantic gap between images and their semantic information. In this context, to address this problem, [16] incorporated CNN features into their proposed model, which is based on AlexNet; the CNN feature is extracted by removing the final layer of the network. Also, building on the experience of traditional KNN models, they proposed a model that simultaneously addresses image tag refinement and assignment while maintaining the simplicity of the KNN model. The proposed model groups images with similar features into a semantic neighbor group. Moreover, using a self-defined Bayesian-based model, [16] distributed the tags belonging to the neighbor group to the test images according to the distance between the test image and its neighbors. The experiments of this work show the effectiveness of the proposed model.

2.1.2 Approaches based on ontologies

To improve image annotation, ontological techniques have been used for AIA [35], [22] and [30]. For instance, in [22], a complete framework to annotate and categorize images has been proposed. This approach is based on multimedia ontologies organized following a formal model to represent knowledge. In this work, ontologies use multimedia data and linguistic properties to bridge the gap between the target semantic classes and the available low-level multimedia descriptors. The multimedia features are automatically extracted using algorithms based on the MPEG-7 standard. The informative image content is annotated with semantic information extracted from the ontologies, and the categories are dynamically built by means of a general knowledge base. Experimental results of this work show the efficiency of this method in the annotation and classification tasks using a combination of textual and visual components.

Moreover, in [20], an ontology-based supervised learning approach for multi-label image annotation has been proposed, where the classifiers’ training is conducted using easily gathered web data. This work takes advantage of both the low-level visual features and the high-level semantic information of given images. The goal is to use ontologies at several phases of supervised learning from large-scale noisy training data. Experimental results show the effectiveness of the proposed framework over existing methods.

In [1], an approach based on semantic hierarchies has also been proposed for hierarchical image classification. The goal is to decompose the annotation problem into several independent classification tasks using two methods for computing a hierarchical decision function that serves to annotate images.

In [18], an approach for automatic image annotation has been proposed in order to automatically and efficiently assign linguistic concepts to visual data, such as digital images, based on both numeric and semantic features. The goal of this approach is to compute multi-layered active contours and to extract visual features within the regions segmented by these active contours in order to map them into semantic notions. The method relies on decision trees trained using these attributes, and the image is semantically annotated using the resulting decision rules.

Other recent works tackle how coarse and fine labels can be used to improve image classification. In this context, [4] addresses the problem of classifying coarse- and fine-grained categories by exploiting semantic relationships. In this work, the idea is to adjust the classification probabilities according to the semantics of the classes or categories. An algorithm for performing such an adjustment is proposed, and it shows an improvement for both coarse- and fine-grained classification.

In [13], a weakly supervised image classification method with coarse and fine labels has been proposed. This work investigates the problem of learning image classification when a subset of the training data is annotated with fine labels, while the rest is annotated with coarse labels. The goal is to use the weakly labeled data to learn a classifier that predicts the fine labels at test time. To this end, a CNN-based approach is proposed, where the commonalities between fine classes in the same coarse class are captured by min-pooling in the CNN architecture. The experimental results of this work show that this method significantly outperforms previous work addressing the same problem.

In addition, [23] addressed the problem of learning subcategory classifiers when only a fraction of the training data is labeled with fine labels while the rest only has labels of coarser categories. In particular, the aim is to adopt the framework of Random Forests [2] and to propose a regularized objective function that takes into account relations between categories and subcategories. The results show that the additional training data with the category-only labels improve the classification of sub-categories.

More closely related is the work of [9], which proposed a joint framework for describing an image by its context. This approach integrates multi-layer semantic element detection and ROI (Region of Interest) identification into one optimization process. The idea is to combine a multi-label regression for hierarchical concept detection and a multi-class SVM for ROI identification in order to better describe the test images. The experimental results demonstrate the effectiveness of the framework, and the output descriptions improve the performance of image retrieval.

To summarize the recent related work, we present in Table 1 a review of the related approaches.

Table 1 Overview of the related image annotation approaches

2.2 Motivation and objectives

Image classification and annotation approaches based on visual features and ontologies were proposed in our previous works [5, 8] and [7]. However, the precision of image annotation still needs to be improved.

In this paper, we propose a novel image annotation method driven by classification and based on HMAX features and ontology.

Our motivation is to exploit both visual and ontological semantic features to improve image annotation.

In particular, we propose an ontology-based image annotation driven by classification using HMAX features. Our method is inspired by the approaches presented above.

Our objective is two-fold; we aim at:

  • (1) Training visual-feature-classifiers and building an ontology from image labels that can finely represent the semantic content associated with the training images;

  • (2) Exploiting classifier outputs and ontology for image annotation. For this purpose, we need to define a membership value based on both classifiers’ confidence value and semantic similarity of words depending on relationships detected in the ontology.

The originality of our proposal lies in the integration of classifiers with an ontology that covers the semantic content of images in order to improve image annotation.

3 The proposed image annotation approach

In this section, we describe the architecture of the proposed image annotation approach and detail the different phases and their components. The proposed image annotation approach is composed of two main phases: (1) training phase and (2) image annotation phase. The different components are detailed below.

3.1 Training phase

The training phase includes three components, namely: feature extraction component, classifiers training component, and word extraction and ontology building component.

Firstly, visual features are extracted from the training set (Fig. 1: feature extraction). Our approach uses HMAX features [11, 27, 28], [12], because they are generic, do not require hand-tuning, and can represent complex features well (a detailed description is given below). Secondly, the HMAX features are used to train the classifiers. We selected a multi-class linear SVM in order to classify images (Fig. 1: classifiers training).

Finally, image labels from the training set are used to extract words and to build the ontology as a final step, which consists in establishing relationships between words using the taxonomic relationships found in WordNet (Fig. 1: word extraction and ontology building).

Fig. 1 Architecture and components of the proposed image annotation approach

3.1.1 Feature extraction component

To extract visual features from training images, we used the HMAX model; in particular, we adopted the HMAX model to provide complex and invariant visual information and to improve the discriminative power of the features. The HMAX model follows a general four-layer architecture. Simple (“S”) layers apply local filters that compute higher-order features, and complex (“C”) layers increase invariance by pooling units. Below we describe the operations of each layer.

  • Layer 1 (S1 Layer): In this layer, each feature map is obtained by convolution of the input image with a set of Gabor filters gs,o with orientations o and scales s. In particular S1 Layer, at orientation o and scale s, is obtained by the absolute value of the convolution product given an image I:

    $$ L1_{s,o} = |{g}_{s,o}*I| $$
    (1)
  • Layer 2 (C1 Layer): The C1 layer consists in selecting the local maximum value of each S1 orientation over two adjacent scales. In particular, this layer divides each L1s,o features into small neighborhoods Ui,j, and then selects the maximum value inside each Ui,j.

    $$ L2_{s,o} = \max_{U_{i,j}\in L1_{s,o}} U_{i,j} $$
    (2)
  • Layer 3 (S2 Layer): S2 layer is obtained by convolving filters αm, which combine low-level Gabor filters of multiple orientations at a given scale.

    $$ L3_{s,m} = {\alpha}_{m} * {L2}_{s} $$
    (3)
  • Layer 4 (C2 Layer): In this layer, L4 features are computed by selecting the maximum output of \({{L3}_{s}^{m}}\) across all positions and scales.

    $$ L4 = \left\{ \max_{(x,y),s} L3_{s}^{1}(x,y),\ \ldots,\ \max_{(x,y),s} L3_{s}^{M}(x,y) \right\} $$
    (4)

The obtained layer 4 vectors define the HMAX features that are the input of the next component.
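
For concreteness, the sketch below illustrates how the four layers could be computed with NumPy/SciPy. It is a minimal, illustrative implementation only: the Gabor parametrization, the pooling neighborhood size and the RBF-style prototype matching used for the S2 layer are assumptions of this sketch, not the exact settings of our model.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size, wavelength, orientation, gamma=0.5):
    """Real part of a Gabor filter (illustrative parametrization)."""
    sigma = 0.56 * wavelength
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(orientation) + y * np.sin(orientation)
    yr = -x * np.sin(orientation) + y * np.cos(orientation)
    return np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * xr / wavelength)

def hmax_features(image, scales=(7, 9, 11, 13), n_orient=4, pool=8, prototypes=None):
    """Toy HMAX pipeline: S1 (Gabor), C1 (max over space and adjacent scales),
    S2 (prototype matching), C2 (global max)."""
    orients = [o * np.pi / n_orient for o in range(n_orient)]
    # S1: absolute value of the Gabor responses, one map per (scale, orientation) -- Eq. (1)
    s1 = {(s, o): np.abs(convolve2d(image, gabor_kernel(s, s, orients[o]), mode='same'))
          for s in scales for o in range(n_orient)}
    # C1: max over two adjacent scales, then max pooling over small neighborhoods -- Eq. (2)
    c1_maps = []
    for s_lo, s_hi in zip(scales[0::2], scales[1::2]):
        for o in range(n_orient):
            m = np.maximum(s1[(s_lo, o)], s1[(s_hi, o)])
            h, w = (m.shape[0] // pool) * pool, (m.shape[1] // pool) * pool
            c1_maps.append(m[:h, :w].reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3)))
    c1 = np.stack(c1_maps)                        # (bands * orientations, H', W')
    # S2: response of each C1 position to prototype patches alpha_m -- Eq. (3), here an RBF match
    if prototypes is None:                        # prototypes are normally sampled from training C1 maps
        prototypes = np.random.default_rng(0).standard_normal((10, c1.shape[0]))
    cols = c1.reshape(c1.shape[0], -1).T          # one C1 column per position
    s2 = np.exp(-np.linalg.norm(cols[:, None, :] - prototypes[None, :, :], axis=2) ** 2)
    # C2: global max over all positions and scales for each prototype -- Eq. (4)
    return s2.max(axis=0)
```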

3.1.2 Classifiers training component

SVMs are mainly designed for the discrimination of two classes, but they can be adapted to multi-class problems, where a multi-class SVM classifier is obtained by training several binary classifiers. In our work, the aim is to learn a discriminative model for each “class” in order to predict the membership of the visual features. To achieve this goal, we focus on linear SVM classifiers, since the diversity of image categories makes nonlinear models impractical.

In particular, given the visual features (HMAX features) of the training images, we train a One-vs-All SVM classifier [3] for each class to discriminate between this class and the other classes.
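
A minimal sketch of this component is given below, assuming scikit-learn is used for the linear SVMs (the paper only specifies one-vs-all linear SVMs [3]); the decision-function outputs are kept as the per-word confidence values CV(w) used later in the annotation phase.

```python
import numpy as np
from sklearn.svm import LinearSVC

class OneVsAllSVM:
    """One binary linear SVM C_w per word/class w, trained on HMAX features."""

    def __init__(self, C=1.0):
        self.C = C
        self.classifiers = {}                      # word -> fitted LinearSVC

    def fit(self, features, labels):
        """features: (n_images, n_hmax_features); labels: list of class words."""
        for word in sorted(set(labels)):
            y = np.where(np.asarray(labels) == word, 1, -1)
            self.classifiers[word] = LinearSVC(C=self.C).fit(features, y)
        return self

    def confidence(self, feature_vec):
        """Return CV(w) for every word w (signed distance to the hyperplane)."""
        x = np.asarray(feature_vec).reshape(1, -1)
        return {w: float(clf.decision_function(x)[0]) for w, clf in self.classifiers.items()}

    def best_word(self, feature_vec):
        """w_t: the word of the best classifier for the given test features."""
        cv = self.confidence(feature_vec)
        return max(cv, key=cv.get)
```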

3.1.3 Word extraction and ontology building component

As depicted in Fig. 2, the ontology building component consists of two main steps, namely, word extraction and ontology building.

Fig. 2 Word extraction and ontology building components

Word extraction

Let us consider that the meta-data of the training images consists of synset IDs (a synset, or “synonym set”, is a concept described by one or multiple words) defined by WordNet.Footnote 1 As depicted in Fig. 2, the word extraction component consists, firstly, in determining the synsets associated with the training images and, secondly, in extracting words from the obtained synsets. Finally, the extracted words are filtered by removing those that are not defined in the WordNet dictionary.
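
The following sketch illustrates this step with NLTK's WordNet interface (assuming a recent NLTK release) and ImageNet-style synset IDs, i.e. a part-of-speech letter followed by a WordNet offset; the example ID in the comment is only illustrative.

```python
from nltk.corpus import wordnet as wn

def extract_words(synset_ids):
    """Extract annotation words from WordNet synset IDs and filter out
    words that WordNet itself cannot resolve."""
    words = set()
    for wnid in synset_ids:                       # e.g. 'n02084071' (pos letter + offset)
        synset = wn.synset_from_pos_and_offset(wnid[0], int(wnid[1:]))
        for lemma in synset.lemma_names():
            if wn.synsets(lemma):                 # keep only words defined in WordNet
                words.add(lemma.replace('_', ' ').lower())
    return sorted(words)
```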

Ontology building

Let us consider an image database DB consisting of a set of pairs (images, synsets) and each synset is composed of words where:

  • I=i1, i2,...,iL is the set of all images in DB,

  • L is the number of images in the DB,

  • S=S1, S2,...,SM is the set of synsets associated with the training images in DB,

  • W=w1, w2,...,wN is the set of words extracted from S,

  • N is the size of the words set,

  • M is the number of the synsets associated with the train images in DB,

  • LD is a lexical database of nouns, verbs, adjectives and adverbs which are grouped into sets of cognitive synonyms (synsets). Synsets are interlinked by means of semantic and lexical relations.

Given the previous parameters, the aim is to build an ontology, consisting of a set of words W dedicated to this specific annotation problem and depending on the annotation vocabulary.

At the beginning, we define a set of main symbols which are necessary for the definition of our ontology.

Definition 1

Ontology

We define an ontology, denoted in the sequel 𝜃, by a set of words W and a set of relations R among words. The ontology 𝜃 relies mainly on the two following concepts: “Thing”, which represents the top concept of the ontology, and “Word”, which represents any word from the annotation vocabulary used to describe the content of images.

Formally, 𝜃 is a triplet defined as follows:

$$ \theta=\left\{ Root, W,{R}_{TS}({W}_{i},{W}_{j})\right\} $$
(5)

where:

  • Root = Thing is the top concept of the ontology;

  • RTS represents the relationships among words Wi and Wj;

  • Wi, Wj ∈ W and i,j = 1,...,N with i ≠ j.

To extract relationships between words, we used LD. We are interested in relationships that are detailed in Table 2.

Table 2 Words relationships used in our ontology

The relationships can be classified into: hyponymy or specialization relationships, generally known as kindOf or isA; hypernymy or generalization relationships, known as hasKind; partitive or meronymy relationships, called partOf, which describe words that are parts of other words; holonymy relationships, known as hasPart, which define the whole-to-part relationships; and synonym relationships.

The ontology is defined by considering only the words extracted from the image database itself. The resulting sets of taxonomic and semantic relationships as well as the resulting set of words are the basis of our ontology.

Once the taxonomic and semantic relationship extraction is carried out, the ontology building is performed. To construct the ontology, we used the OWL API.Footnote 2 Some rules are applied in order to transform the extracted relationships into the OWL language.
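
The ontology itself is serialized with the Java OWL API; purely as an illustration of the same construction in Python, the sketch below uses the owlready2 package to declare the Word concept, the relation types of Table 2 as object properties, and a few example links. The IRI, the helper name and the example relations are assumptions of this sketch.

```python
from owlready2 import get_ontology, Thing, ObjectProperty

onto = get_ontology("http://example.org/image-annotation.owl")   # illustrative IRI

with onto:
    class Word(Thing):                  # the "Word" concept, placed under Thing (Root)
        pass
    class kindOf(ObjectProperty):       # hyponymy (isA / kindOf)
        domain = [Word]; range = [Word]
    class hasKind(ObjectProperty):      # hypernymy
        domain = [Word]; range = [Word]
    class partOf(ObjectProperty):       # meronymy
        domain = [Word]; range = [Word]
    class hasPart(ObjectProperty):      # holonymy
        domain = [Word]; range = [Word]
    class synonymOf(ObjectProperty):    # synonymy
        domain = [Word]; range = [Word]

_individuals = {}
def word(name):
    """Create (or reuse) the Word individual for a given annotation word."""
    key = name.replace(' ', '_')
    if key not in _individuals:
        with onto:
            _individuals[key] = Word(key)
    return _individuals[key]

# Example taxonomic and partitive links extracted from WordNet
word('tree').kindOf.append(word('woody plant'))
word('tree').hasPart.append(word('trunk'))
onto.save(file="image-annotation.owl", format="rdfxml")
```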

3.2 Image annotation phase

The image annotation phase includes three main components: feature extraction, image classification and image annotation (Fig. 1: Image annotation). Firstly, features of the test image are extracted (Fig. 1: feature extraction). Secondly, the image is classified (Fig. 1: image classification). Thirdly, a membership value is computed using both the classifier outputs and the ontology. In particular, the membership value is computed using the confidence values of the classes and the semantic similarity between words derived from the ontology; it depends mostly on the relationships found in the ontology. Annotation words are ranked according to their membership values in order to assign a set of annotation words to the query image (Fig. 1: image annotation). In the following subsections, we formalize our annotation problem and detail the image annotation phase.

3.2.1 Problem formalization

We work on multi-class image classification, so for each “word” in our ontology a related classifier is trained. We consider 𝜃 the ontology which is built, and N the number of classifiers associated with the “words” in 𝜃. Let us present the following definitions to explain the image annotation problem:

  • wt is the obtained class or “word” from the best classifier of the test image I.

  • \(C_{w_{i}}\) is the classifier of the word wi with wi𝜃.

  • AI is the set of annotation words for image I, consisting of the words \(\left \{ w_{j} \in W, j = 1,...,K \right \}\) that will be assigned to the test image I.

Given the ontology 𝜃, wt and the classifiers \(C_{w_{i}}\), the aim is to assign K annotation words to the test image, where \(A_{I}= \left \{ w_{1}, w_{2},...,w_{K} \right \}\); the assignment of the K annotation words depends on the membership value between the test image and the related words in 𝜃.

3.2.2 Ontology-driven image annotation using classification

To annotate images, we propose a membership value for the words closest to the test image. To this end, we first assign a semantic weight to each closest word. The semantic weight depends on the neighborhood degree of the closest word to the target word wt and on the semantic similarity of the pair (wt, closest word). Secondly, according to the relationships related to wt, the confidence values of the closest words and their semantic weights are used to compute the membership values.

Let us present the following definitions to explain the functions that we used for computing the membership degree of the related words in the ontology to the test image according to the ontological relationships:

  • RTS(wi, wj)= (RisA(wi, wj),RhasKind(wi, wj),RpartOf(wi, wj),RhasPart(wi, wj),Rsynonym(wi, wj)) represents the set of relation types existing in 𝜃 with wi and wjW;

  • ClosestWords(wt, LMax)=Clw= \(\left \{ w_{1},..,w_{M} \right \}\) is a function allowing to find the closest words of wt in 𝜃 according to a length LMax, with M as the number of the closest words and LMax as the maximum path length between wt and the words returned by the ClosestWords(wt, LMax);

  • getWordsDirectLink(wt, 𝜃) = \(\beta = \left \{\beta _{1},...,\beta _{b}\right \}\) is a function allowing to return the words that have a direct link with the target word wt;

  • CV(wi) is the confidence value of the word wi that is obtained by its own classifier \(C_{w_{i}}\);

  • L(wt, wi ) is a function that returns the shortest path length between wt and wi;

  • SW(wi) is the semantic weight of wi in the closest semantic space of wt with wiClw;

  • \(MV_{R_{TS}}(I,w_{i})\) is the membership function to compute the membership degree of word wi to the test image I according to the relationship RTS(wt, wi);

In our image annotation method, to compute the membership value, we are interested in the set of the closest words of wt that are returned by the ClosestWords(wt, LMax) function. For this purpose, we propose to assign a semantic weight to each closest word. The semantic weight depends, firstly, on the neighborhood degree of the closest word wi to wt, secondly, on the semantic similarity of the (wt, wi) pair. We assign the maximum semantic weight to wt so SW (wt)= 1.

Thus, we define the following function to compute the semantic weight of each closest word wi:

$$ SW(w_{i}) = Nd(w_{i}) \cdot Sim(w_{t},w_{i}) $$
(6)

Where:

  • Nd(wi) is the neighborhood degree of wi to wt according to the length between wt and wi: in our case, we define the Nd(wi) as follows:

    $$ Nd(w_{i})=\frac{1}{L(w_{t},w_{i})} $$
    (7)

    where L(wt, wi) is the path length between wt and wi.

  • Sim(wt, wi) is the semantic similarity between wt and wi based on the Wu-Palmer metric (WUP) [34]:

    $$ Sim(w_{t},w_{i} )=\frac{2 \cdot depth(LCS(w_{t},w_{i}))}{depth(w_{t})+depth(w_{i})} $$
    (8)

    where LCS is the Lowest Common Subsumer of wt and wi.
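
As a concrete illustration, the sketch below computes SW(wi) with NLTK's WordNet interface; taking the first noun sense of each word and using WordNet's shortest path for L(wt, wi) are assumptions of this sketch.

```python
from nltk.corpus import wordnet as wn

def semantic_weight(w_t, w_i):
    """SW(w_i) = Nd(w_i) * Sim(w_t, w_i), cf. Eqs. (6)-(8)."""
    if w_t == w_i:
        return 1.0                                     # SW(w_t) = 1 by definition
    s_t = wn.synsets(w_t, pos=wn.NOUN)[0]              # first noun sense (assumption)
    s_i = wn.synsets(w_i, pos=wn.NOUN)[0]
    sim = s_t.wup_similarity(s_i) or 0.0               # Wu-Palmer similarity, Eq. (8)
    length = s_t.shortest_path_distance(s_i)           # path length L(w_t, w_i)
    if length is None:
        return 0.0                                     # no path in WordNet
    nd = 1.0 if length == 0 else 1.0 / length          # neighborhood degree, Eq. (7)
    return nd * sim

# e.g. semantic_weight('tree', 'forest'), semantic_weight('automobile', 'taxicab')
```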

Subsequently, for any relation type found between wt and the words returned by the getWordsDirectLink(wt, 𝜃) function, we start from the wt node and then follow the path according to the relation type until reaching a path length equal to LMax. We are generally interested only in the relationship direction that starts from wt. Thus, for each relation type, we propose a method to compute a membership value of the closest words to the test image I.

In the following, we detail how to compute the proposed membership value for each relation type.

Proposed membership value according to isA/hasKind relationships

In the case where wt is linked to a word of the word set β by the isA relation type (wt is a hyponym of βj), we can be semantically certain that wt is a βj.

For example, we can be sure that a “tree” is necessarily a “woody plant”, and that a mammal is necessarily an animal. To reflect this certainty, the minimum membership value of the hypernym words of wt must be equal to the membership value of wt.

Thus, to compute the membership value in this case, we define the following function:

$$ MV_{R_{isA}} (I,w_{i} )= SW (w_{t}) * CV(w_{t}) = CV(w_{t}) $$
(9)

Where wi is a hypernym word of wt with wi ∈ Clw, CV(wt) is the confidence value of the target word wt and SW(wt) is the semantic weight of wt.

The hypernym words of wt that are included in the closest words set Clw are added to the annotation words.

In the case where wt is linked to the word βj by the hasKind relation type (wt is a hypernym of βj), the hasKind relation conveys the specific information from the upper class (the class of wt) to its child classes (the classes of the hyponym words of wt). Moreover, starting from the upper word node (wt) and following the hasKind relationships, this relation type keeps the generic information of the wt class in its hyponym word classes.

For example, as depicted in Fig. 3, if we have wt = “automobile” and it has a hyponym word “taxicab”, following the hasKind relation from the “automobile” to “taxicab”, a specific information allowing to define “taxicab” object, is added to the generic information which presents the “automobile”. Thus, the representation of the object (“taxicab”) can be defined by the generic object representation (“automobile”) and the specific detailed information of the object (“taxicab”).

Fig. 3 Visual and semantic representation of the hasKind relationship

So, the probability of having a taxicab in an image is the probability of having an automobile combined with the probability of having a taxicab in the image. To this end, to estimate the membership value of the hyponym words, we merge the confidence values of the hyponym words and of wt, weighted by their semantic weights. The goal is to improve the accuracy of image annotation.

In this case, the final function to compute the membership value is:

$$ MV_{R_{hasKind}} (I,w_{i} )= \frac {CV (w_{t})+ {\sum}_{j=1}^{|P|}{SW(P_{j})}*CV(P_{j})} { 1+ {\sum}_{j=1}^{|P|} {SW(P_{j})}} $$
(10)

Where:

  • CV(wt) is the confidence value of the wt;

  • P is the set of words in the path \(P_{w_{t},w_{i}}\) from wt to wi following the hyponym relationship of wt;

  • SW(Pj ) is the semantic weight of the hyponym word which exists in P.

For this aim, starting from the wt and following the hasKind relationship until reaching the maximum path length LMax, we compute the membership value of the hyponym words according to the previous function.
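
A small sketch of this computation (Eq. 10) follows; `cv` and `sw` are assumed dictionaries mapping each word to its classifier confidence CV(w) and semantic weight SW(w), and `path_words` is the set P of hyponyms on the path from wt to wi.

```python
def mv_haskind(cv, sw, w_t, path_words):
    """Membership value of a hyponym reached from w_t via hasKind, Eq. (10)."""
    numerator = cv[w_t] + sum(sw[p] * cv[p] for p in path_words)
    denominator = 1.0 + sum(sw[p] for p in path_words)
    return numerator / denominator

# Fig. 3 example: mv_haskind(cv, sw, 'automobile', ['taxicab'])
```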

Proposed membership value according to hasPart/partOf relationships

In the case where wt is linked to the word βj by a holonymy relationship (wt is a holonym of βj, i.e., wt hasPart βj) or a meronymy relationship (wt is a meronym of βj, i.e., wt is a part of βj), the class of wt represents the whole of the information of the meronym word classes, or a part of the whole information of the holonym word classes. These relationships between the classes therefore represent composition relationships.

For example, as depicted in Fig. 4 (a), let us suppose that we have three test images (1), (2) and (3), and that they are assigned the word “tree” as the target word wt, obtained by the best classifier. The three images contain a tree object. The content of the first image is represented by a tree object as well as both the “trunk” and “leaves” sub-objects with a significant appearance. In the second image, only the tree object and the leaves sub-object appear. However, the third image contains only the forest object.

Fig. 4 Visual and semantic representation of hasPart/partOf relations

In the semantic representation, as depicted in Fig. 4 (b), to annotate the test image, we proceed from the holonym word (“tree”) to the meronym nodes and select the word that has the highest confidence value. However, we cannot be sure that the selected meronym word will appear in the top annotation words sorted by confidence value. In the opposite direction, walking from the target word “tree” to the holonym words “forest” and “wood”, even if the word with the greater confidence value is selected, we cannot be sure that the holonym word object appears in the test image.

Thus, the main problem, in this case, is how to predict if the “trunk” object (meronym word) or “forest” object (holonym word) belong to the content of a test image or not.

To overcome this problem, we propose to estimate a visual similarity between holonym and meronym words. The aim is to estimate a distance between the word classes. Thus, the visual similarity between the words is inversely proportional to the distance between their visual classes. The function to estimate the visual similarity between the words is:

$$ VisSim(w_{t},w_{i})= \frac{1}{1+d(C_{w_{t}},C_{w_{i}})} $$
(11)

Where:

d(C(wt),C(wi)) is the Euclidean distance between the classifiers of the words wt and wi, and wi is a holonym or meronym word of wt.

The objective is to combine the visual similarity of the words and their semantic weights in order to improve the annotation accuracy. Thus, to compute the membership value, we propose the following function:

$$ MV_{R_{hasPart/partOf}}(I,w_{i} )=\frac {CV (w_{t})+ {\sum}_{j=1}^{|P|}{SW(P_{j})}*visSim(w_{t},P_{j})} { 1+ {\sum}_{j=1}^{|P|} {SW(P_{j})} } $$
(12)

Where:

  • wi is a meronym word of wt;

  • P is the set of words in the path \(P_{w_{t},w_{i}}\) from wt to wi following the meronym relationship;

  • visSim(wt, Pj) is the visual similarity of the pair (wt, Pj).

For the partOf relation, we compute the membership value in the same manner as in the previous function, but we are interested in the holonym words instead of the meronym words.
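
The sketch below implements Eqs. (11) and (12); interpreting d(Cwt, Cwi) as the Euclidean distance between the weight vectors of the two linear SVMs is an assumption of this sketch, and `cv`, `sw` and `classifiers` are the same assumed structures as before.

```python
import numpy as np

def visual_similarity(clf_t, clf_i):
    """VisSim(w_t, w_i) = 1 / (1 + d(C_wt, C_wi)), Eq. (11)."""
    d = np.linalg.norm(clf_t.coef_.ravel() - clf_i.coef_.ravel())
    return 1.0 / (1.0 + d)

def mv_haspart(cv, sw, classifiers, w_t, path_words):
    """Membership value of meronym/holonym words reached from w_t, Eq. (12)."""
    vis = {p: visual_similarity(classifiers[w_t], classifiers[p]) for p in path_words}
    numerator = cv[w_t] + sum(sw[p] * vis[p] for p in path_words)
    denominator = 1.0 + sum(sw[p] for p in path_words)
    return numerator / denominator
```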

Illustrative example of our image annotation method

Let us suppose that we have the ontology part depicted in Fig. 5 and that the target word of the test image is “tree”. In this case, we suppose that LMax = 4 to select the closest words of wt. Each closest word has a confidence value (the value in blue) and a semantic weight value (the value in green), which is obtained according to function (6). To annotate the test image, the annotation words initially contain only wt = “tree” as the target word. Then, starting from the target word, each relationship with wt is followed until reaching the maximum path length LMax. According to the relation type, the membership value of the words is computed using the functions defined previously. After computing the membership values, all closest words are ranked and the top K (= 10) words are assigned to the test image. The algorithm of our image annotation model is presented in Algorithm 1.

Fig. 5 Detailed example of image annotation
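
For readers who prefer code to pseudo-code, the sketch below outlines the overall procedure of Algorithm 1, reusing the helpers sketched in the previous sections; `ontology.closest_words`, `ontology.relation` and `ontology.path` are assumed graph-traversal helpers over the OWL ontology, not functions of an existing library.

```python
def annotate_image(test_features, svms, ontology, l_max=4, k=10):
    """Classify, walk the ontology from w_t up to L_max, score the closest
    words according to their relation type, and keep the top-K words."""
    cv = svms.confidence(test_features)            # CV(w) for every word
    w_t = max(cv, key=cv.get)                      # word of the best classifier
    scores = {w_t: cv[w_t]}
    for w_i in ontology.closest_words(w_t, l_max):
        relation = ontology.relation(w_t, w_i)     # isA / hasKind / partOf / hasPart / ...
        path = ontology.path(w_t, w_i)             # P: words on the path from w_t to w_i
        sw = {p: semantic_weight(w_t, p) for p in path}
        if relation == 'isA':
            scores[w_i] = cv[w_t]                                          # Eq. (9)
        elif relation == 'hasKind':
            scores[w_i] = mv_haskind(cv, sw, w_t, path)                    # Eq. (10)
        elif relation in ('hasPart', 'partOf'):
            scores[w_i] = mv_haspart(cv, sw, svms.classifiers, w_t, path)  # Eq. (12)
        # other relation types (e.g. synonymy) are omitted in this sketch
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```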

4 Experimental results and discussion

Throughout this section, we illustrate the experimental results of our work. We start with the experimental setup, then, we present the evaluation of our approach by introducing image classification and annotation performance.

4.1 Experimental setup

To evaluate our approach, we used the ImageNet Footnote 3 and OpenImages Footnote 4 datasets:

  • ImageNet dataset: images are organized according to the WordNet hierarchy. This database contains about 1,281,167 images from 1000 synsets. The number of images for each synset (category) ranges from 732 to 1300, and all images are in JPEG format. Images are heterogeneous and represent diverse themes. In our work, we used about 200K images from the 1000 categories (synsets), with 190K images as a training set and 10K images as a testing set.

  • OpenImages dataset: images have been annotated with labels spanning over 600 categories; there are 1,743,042 images in the training set and 125,436 in the testing set. In our work, we used about 200 images for each category.

In order to evaluate our proposed approach, we used, as evaluation metrics, accuracy for image classification, and precision for image annotation. We provide precision at the top K (P@K) of the annotation words.

For the annotation results obtained on the ImageNet dataset, to evaluate the ability of our model to annotate the test images correctly, we studied the capability of our approach to detect the semantic relations between the annotation words assigned to the test images. For this purpose, we proposed a novel metric, called Target Precision, inspired by the method of [33].

In particular, we suppose that we have some ground truth in the form of a matrix S, where Si,j = 1 if wi and wj are equivalent, if there exists a synonym relationship between wi and wj, or if the word wj is a hypernym of wi, and Si,j = 0 otherwise.

In order to build the matrix S, we defined S as follows:

$$ S_{i,j}=\begin{cases} 1, & \text{if i=j $\vee \exists R_{synonym} (w_{i},w_{j}) \vee \exists R_{isA} (w_{i},w_{j})$}.\\ 0, & \text{otherwise}. \end{cases} $$
(13)

Where:

  • Rsynonym(wi, wj) is a synonym relation between the word i and the word j.

  • RisA(wi, wj) is an isA relation between the word i and the word j.

For this purpose, we provide the Target Precision at the top K (TP@K) of the annotation words. In particular, for a ranked list of K annotation words Aw1, Aw2,...,AwK, TP@K is defined as:

$$ TP@K(I)=\frac{{\sum}_{i=1}^{K} { S_{{w_{t}},Aw_{i}} } } {K} $$
(14)

Where:

  • I: is the test image;

  • K: is the number of the annotation words that are assigned to I;
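
A small sketch of this evaluation is given below; building the ground-truth matrix S directly from WordNet synonym and hypernym links (Eq. 13) is an assumption of this sketch.

```python
from nltk.corpus import wordnet as wn

def s_entry(w_i, w_j):
    """S_{i,j} of Eq. (13): 1 if the words are identical, synonyms, or linked
    by an isA (hypernym) relation in WordNet, 0 otherwise."""
    if w_i == w_j:
        return 1
    syn_i, syn_j = set(wn.synsets(w_i)), set(wn.synsets(w_j))
    if syn_i & syn_j:                                            # shared synset: synonyms
        return 1
    hypernyms = {h for s in syn_i for h in s.closure(lambda x: x.hypernyms())}
    return 1 if syn_j & hypernyms else 0

def tp_at_k(w_t, annotation_words, k):
    """Target Precision at K for one test image, Eq. (14)."""
    return sum(s_entry(w_t, a) for a in annotation_words[:k]) / k
```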

Each image data set has its own annotation vocabulary that is used for annotating images. For that, we need to build ontology for each image collection.

Using ImageNet, we started by extracting about 1648 words from the 1000 synsets related to the database images. After filtering, 1400 words were retained to build the ontology.

For OpenImages, using the 600 classes (that are related to this database), we extracted about 1134 related words using WordNet.

For the two cases, relationships between words are extracted using WordNet. To successfully construct each ontology, we used the OWL API.Footnote 5 Some rules are applied in order to transform the extracted relationships in OWL language.

Figure 6 represents a part of each ontology that has been built using ImageNet and OpenImages datasets.

Fig. 6 A part of each ontology built using the ImageNet and OpenImages datasets

4.2 Experimental results

4.2.1 Classification results

In this section, we are interested in showing and analyzing the image classification performance. We use different image classification strategies. We introduce the proposed strategies below:

  1. HMAX-SVM: HMAX features are extracted and classified with SVM.

  2. BoVW-SVM: classical BoVW model is used with SVM.

In the classification method based on the HMAX model (HMAX-SVM), HMAX features are extracted as detailed in Section 3.1.1 and used to train the SVM classifiers. In the BoVW model, the size of the final features is given by the size of the vocabulary, whereas in the HMAX model it is given by the number of C2 features. For the classification method based on the BoVW model (BoVW-SVM), SIFT features are extracted and quantized with k-means, and the histograms of visual words are used to train the SVM classifiers. For both methods, multi-class classification is done using one-versus-all SVMs.
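
A compact sketch of the BoVW-SVM baseline pipeline is given below, assuming OpenCV for SIFT and scikit-learn for k-means; the vocabulary size and the L1 normalization of the histograms are illustrative choices.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bovw_histograms(gray_images, vocab_size=1000):
    """SIFT descriptors -> k-means vocabulary -> one visual-word histogram per
    image; the histograms then feed the same one-versus-all linear SVMs."""
    sift = cv2.SIFT_create()
    descriptors = []
    for img in gray_images:
        _, desc = sift.detectAndCompute(img, None)
        descriptors.append(desc if desc is not None else np.zeros((0, 128), np.float32))
    kmeans = KMeans(n_clusters=vocab_size, n_init=10).fit(np.vstack(descriptors))
    histograms = []
    for desc in descriptors:
        words = kmeans.predict(desc) if len(desc) else np.array([], dtype=int)
        hist = np.bincount(words, minlength=vocab_size).astype(float)
        histograms.append(hist / hist.sum() if hist.sum() else hist)
    return np.array(histograms)
```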

In this study, we first focus on evaluating the HMAX-SVM method; for this purpose, we study the variation of the numbers of orientations (O) and scales (S) used to convolve images with Gabor filters. The goal of this part is to analyze how many orientations and scales are sufficient to improve the classification accuracy. Secondly, we report the image classification performance depending on the number of features for each classification strategy using the ImageNet and OpenImages datasets. Finally, we compare the image classification accuracy obtained by the HMAX-SVM and BoVW-SVM methods.

To evaluate the classification accuracy of the HMAX-SVM method depending on the number of orientations and scales, we tested the HMAX-SVM strategy using six different numbers of scales (2, 4, 6, 8, 10, 12) and five different numbers of orientations (2, 4, 6, 8, 10). The classification results obtained on ImageNet and OpenImages are presented in Fig. 7.

Fig. 7 Accuracy results obtained by HMAX-SVM depending on the variation of scale and orientation numbers using ImageNet and OpenImages datasets

As depicted in Fig. 7, the best accuracy reaches 0.73 with 10 orientations and 10 scales on the ImageNet dataset. We notice that the accuracy increases with the number of scales until reaching 10 scales and 10 orientations. We observe the same behavior on the OpenImages dataset, where the best accuracy reaches 0.77 with 10 orientations and 8 scales.

For both ImageNet and OpenImages, this increase in classification accuracy can be explained by the larger amount of information extracted. However, when the number of scales reaches 12, the accuracy decreases to 0.65 on ImageNet and to 0.68 on OpenImages. This degradation can be explained by the lack of additional information to extract: there is no more data to exploit when computing the Gabor filter responses. Thus, the redundancy between the information extracted with 10 and 12 scales on ImageNet (8 and 10 scales on OpenImages, respectively) can decrease the accuracy.

To compare the classification performance between both HMAX-SVM and BoVW-SVM strategies, we focus on the influence of the number of features on classification accuracy. For this purpose, different vocabulary sizes are applied to experiment and a comparison of classification accuracy is shown in Tables 3 and 4.

Table 3 Accuracy results for HMAX-SVM and BoVW-SVM methods on ImageNet dataset
Table 4 Accuracy results for HMAX-SVM and BoVW-SVM methods on OpenImages dataset

We observe that the classification method based on the HMAX model (HMAX-SVM) provides a better performance than the classification method based on the BoVW model (BoVW-SVM) for both the ImageNet and OpenImages datasets. The best improvement reaches 13.55% for ImageNet (cf. Table 3) and 19.14% for OpenImages (cf. Table 4).

Table 3 shows that the best accuracy for the HMAX-SVM method on the ImageNet dataset is obtained with a dictionary of 3500 features (0.73). With the dictionary size set to 4000 and using the OpenImages dataset, the HMAX-SVM achieves its best performance of 0.79 (cf. Table 4).

The difference in performance between the HMAX-SVM and BoVW-SVM classification strategies can be explained by the fact that the HMAX model builds complex visual features with richer information, using multiple orientations and scales of image structures, whereas the BoVW model selects only the interest points detected by the SIFT detector and represents images only by the distribution of features, as a histogram reflecting the frequency of occurrence of the clusters. We conclude that the classification method based on the HMAX model provides a better performance than the classification method based on the BoVW model on a large image database.

4.2.2 Image annotation results

To show a better performance of the proposed image annotation approach, we define different image annotation scenarios. We introduce the proposed annotation scenarios as follows:

  1. BoVW-SVM: image annotation strategy based on the BoVW-SVM classification strategy: the test image is annotated by keeping the top K words of the best classifiers;

  2. BoVW-SVM-ONTO: image annotation strategy based on the previous strategy and the ontology: the test image is annotated by applying our method;

  3. HMAX-SVM: image annotation strategy based on the HMAX-SVM classification strategy: the test image is annotated by keeping the top K words of the best classifiers;

  4. HMAX-SVM-ONTO: image annotation strategy based on the previous strategy and the ontology: the test image is annotated by applying our method.

The goal, in this experiment, is to show the effect and the advantages of integrating ontology with the output of the training classifiers by exploiting the ontological relationships and the confidence value of classifiers, on the performance of image annotation.

For this purpose, we compare our approach, based on the exploitation of the ontology and the classifier’s confidence value by computing their membership values according to the ontological relationships, with the baseline method that consists of annotating images based only on SVM classification.

In particular, our image annotation model is based on two parameters: 1) the LMax value (which can be equal to 1, 2, 3, 4, 5 or 6) and 2) the K parameter, which represents the number of annotation words.

To achieve the goal of this experiment, we set the LMax value of our image annotation model to 6 and analyze the image annotation results of the strategies presented previously.

Table 5 shows the comparative image annotation results in terms of the Target Precision metric (TP) for the different proposed strategies on both the ImageNet and OpenImages datasets.

Table 5 Image annotation results evaluation with LMax= 6 in the terms of Target Precision TP on ImageNet and Open Images datasets

As depicted in Table 5, using ImageNet, we observe that the annotation results of the strategy based on the BoVW model and the ontology (BoVW-SVM-ONTO) are clearly higher than those of the strategy based on the BoVW model without the ontology (BoVW-SVM). We observe the same behavior when using the OpenImages dataset. The best target precision obtained by BoVW-SVM-ONTO, for both ImageNet and OpenImages, is achieved with TP@3: the best TP@3 reaches 0.68 for ImageNet and 0.69 for OpenImages (cf. Table 5). In addition, we observe the same behavior for HMAX-SVM and HMAX-SVM-ONTO, which highlights a similar increase in Target Precision when the ontology is used (HMAX-SVM-ONTO) for both ImageNet and OpenImages. The best target precision (TP@3) reaches 0.73 for ImageNet and 0.75 for OpenImages (cf. Table 5).

In fact, using BoVW-SVM-ONTO, the best improvement in P@10 reaches 77.77% for ImageNet and 70.96% for OpenImages. Using our HMAX-SVM-ONTO method, the best improvement in P@10 reaches 47.05% for ImageNet and 38.46% for OpenImages (cf. Table 5).

According to these results, we conclude that our ontology-based annotation method increases the annotation results for both HMAX and BoVW features. This shows that exploiting ontological relationships together with the classifier outputs can improve the image annotation results.

To further assess the performance of our proposed image annotation approach, we introduce a comparison of image annotation results in terms of the precision metric. Table 6 shows the comparison of image annotation results in terms of precision for BoVW-SVM vs BoVW-SVM-ONTO and HMAX-SVM vs HMAX-SVM-ONTO on both the ImageNet and OpenImages datasets.

Table 6 Image annotation results evaluation with LMax= 6 in the terms of Precision on ImageNet and Open Images datasets

The image annotation results obtained using the BoVW model and the ontology (BoVW-SVM-ONTO) are clearly higher than those of the strategy based on the BoVW model without the ontology (BoVW-SVM) on both the ImageNet and OpenImages datasets. The best precision improvement reaches 17.14% for P@10 using ImageNet and 35.48% for P@10 using OpenImages (cf. Table 6).

We observe the same evaluation results for the HMAX-SVM and HMAX-SVM-ONTO methods, which highlights a similar increase in precision when the ontology is used (HMAX-SVM-ONTO) for both ImageNet and OpenImages. The best precision (P@3) reaches 0.63 for ImageNet and 0.65 for OpenImages (cf. Table 6). Using HMAX-SVM-ONTO, the best improvement in P@10 reaches 34.21% for ImageNet and 52.94% for OpenImages (cf. Table 6).

These results indicate that our proposed method brings an increase in image annotation precision, independently of the selected features. Moreover, the adoption of ontological relationships together with the classifier outputs improves the image annotation results thanks to the proposed membership value, which combines the classification results and the ontology.

4.2.3 Impact of the variation of the LMax value on the image annotation performance

In this section, we study the impact of the variation of the LMax parameter value. The LMax parameter represents the maximum path length used in the ontology between wt and the closest words returned by the ClosestWords(wt, LMax) function (Section 3.2.2) during the image annotation process.

To this end, we tested the image annotation results by applying the method detailed in Section 3 with six values of the LMax parameter (LMax = 1, 2, 3, 4, 5 and 6).

For this purpose, we report the impact of the LMax value on the image annotation performance in terms of P@3, P@6 and P@10. In particular, to measure the significance of this impact, we carried out 6 different runs on the P@3, P@6 and P@10 of our method (HMAX-SVM-ONTO). The annotation results according to the variation of the LMax parameter value for both ImageNet and OpenImages are shown in Table 7.

Table 7 Annotation results in terms of precision for our method according to the variation of LMax value on ImageNet and OpenImages datasets

As depicted in Table 7, using the ImageNet dataset, the P@3 value of our method is best with LMax = 3 (0.78); when LMax = 6, the P@3 decreases to 0.63. Thus, we observe that increasing the value of LMax to 6 also increases the number of closest words of wt, which influences the precision of the image annotation. Likewise, for P@6, as depicted in Table 7, the image annotation precision is best with LMax = 3, whereas the lowest value of P@6 is obtained with LMax = 6.

In terms of P@10, we observe the same impact of the variation of LMax on improving the image annotation performance.

For OpenImages, we observe that the P@3 value is best with LMax = 4 (0.80) (cf. Table 7); when LMax = 6, the P@3 decreases to 0.65. We observe the same behavior for P@6 and P@10.

According to the analysis of the experimental results, we conclude that increasing the LMax value up to 3 or 4 has an important positive impact on the image annotation performance, but increasing the LMax value to 6 decreases the precision values. This shows that increasing the number of closest words of wt reduces the annotation precision, which can be due to the appearance of irrelevant words in the closest semantic space of wt in the ontology. These words can thus affect the performance of image annotation.

4.3 Comparison of our method with a deep learning model: inception-V3

To better evaluate our proposed approach, we present a comparison of our method with a deep learning model, Inception-V3, which is widely used for image classification and annotation.

For this purpose, we perform an annotation strategy using the Inception-V3 model; in particular, we annotate images using the words that obtain the best scores from this model. We compare the annotation results obtained by this strategy to our proposed strategies introduced in the previous subsection: BoVW-SVM-ONTO and HMAX-SVM-ONTO.

Table 8 shows the comparison of image annotation results in terms of precision between our method and the Inception-V3 method on both the ImageNet and OpenImages datasets.

Table 8 Annotation results comparison in terms of precision for our method to Inception-V3 method on ImageNet and Open Images datasets

We observe that our method outperforms the Inception-V3 method for P@6 and P@10 (cf. Table 8). In particular, the HMAX-SVM-ONTO method leads to an increase in P@6 and P@10. In fact, the best P@6 improvement (+14.28%) and the best P@10 improvement (+36.84%) are obtained when annotating images with our method on the OpenImages dataset (cf. Table 8). However, for P@3, we observe that the difference in annotation performance is much smaller.

We conclude that the adoption of the ontology to annotate images improves the results thanks to the combination of the semantic level introduced by the ontologies with the classifier outputs, through the computation of our proposed membership value.

4.4 Discussion

In this paper, we introduced our ontology-based image annotation approach driven by classification using HMAX features.

Our main contributions concern training visual-feature-classifiers, building an ontology that can finely represent the semantic content of images, and evaluating a membership value for each relation type found in the ontology based on both classifiers’ confidence value and the semantic similarity of words. The membership value serves to rank annotation words that are assigned to a test image. The main goal is to improve the image annotation results. The experimental results show the interest of the proposed approach.

We point out that the built ontologies, which cover about 1000 concepts and represent rich semantic content, may be seen as reusable components for image annotation and retrieval tasks; hence, part of the originality of our work concerns the automatic construction of the ontology.

In addition, in order to measure the significance of the improvement obtained by our approach, we carried out several tests on the image annotation precision of the different proposed scenarios.

The experimental results show that the improvements of image annotation obtained by our approach are statistically significant. Therefore, the results indicate that the gain between our approach and the baseline methods is significant.

Our proposed approach can have a great interest in Automatic Image Annotation and it can contribute to improving the performance of image annotation.

5 Conclusion

This paper describes an ontology-based image annotation approach driven by classification using HMAX features. Our goal is to improve image annotation results.

Our contribution is, firstly, to extract invariant and complex visual features from the training images and to train classifiers, and then to automatically build an ontology that can finely represent the semantic information associated with the training images; and secondly, to combine both the classifiers’ confidence values and the ontology for annotating test images. To this end, we proposed and evaluated a membership value that depends on each relationship found in the ontology. During the image annotation process, the membership value serves to select the K annotation words that are assigned to the test images.

The experiments that have been carried out highlight an improvement in image annotation results compared to the baseline methods. Indeed, our proposal contributes to significantly increasing the relevance of the annotation results by enhancing the annotation precision. This improvement confirms our proposal of using an ontology and visual features by exploiting both the relationships and the classifier outputs.

In a future work, we intend to expand our approach by exploiting other semantic relationships in order to enrich the annotation vocabulary. We also intend to improve our approach by combining a deep learning model with the ontology.