1 Introduction

Automatic image annotation is a challenging problem dealing with the textual description of images. It usually consists in building a computational model that associates a text description (often reduced to a set of semantic keywords) with digital images. A large number of approaches have been proposed to address this concern and to narrow the well-known semantic gap [35]. Most approaches rely on machine learning techniques to provide a mapping function that classifies images into semantic classes using their visual features [5, 9, 27]. However, these approaches face a scalability problem when dealing with broad-content image databases [30], i.e. their performance decreases significantly when the number of concepts is high, and also depends on the targeted datasets [21]. This variability may be explained by the huge intra-concept variability and the wide inter-concept similarities in visual properties, which often lead to conflicting and incoherent annotations. Yet, more and more concept classes are introduced for annotating multimedia content in order to enrich the description of images and to satisfy user expectations in image retrieval systems. Consequently, current techniques struggle to scale up, and machine learning alone seems insufficient to solve the image annotation problem: firstly, because of the lack of a reliable computational model of the correlation between the low-level features of images and the semantic concepts; secondly, because of the lack of coincidence between high-level concepts and low-level features, image semantics not always being correlated with visual appearance. Other alternatives therefore need to be explored in order to improve existing approaches. In particular, some recent work proposed to use explicit semantic structures, such as semantic hierarchies and ontologies, to improve image annotation [2, 12, 17, 43].

Indeed, ontologies, defined as a formal, explicit specification of a shared conceptualization [19], have been shown to be very useful to narrow the semantic gap. They allow identifying, in a formal way, the dependency relationships between the different concepts and therefore provide a valuable information source for many problems. Moreover, ontological reasoning can also be used to formulate image annotation and interpretation tasks. For instance, in [12] the authors proposed a framework for the extraction of enhanced image descriptions based on an initial set of graded annotations generated through generic image analysis techniques. Explicit semantics, represented by ontologies, have also been intensively used in the field of image and video indexing and retrieval [2, 26]. In most of these approaches, only the descriptive part of ontologies is used, as a common multi-level language to describe image content [34], or more recently as semantic concept networks to refine image annotation [17, 43] or to perform image classification [3, 32].

In this paper, we propose to go deeper in the use of ontologies for image annotation. Our objective is twofold. We first propose an approach to automatically build a fuzzy multimedia ontology dedicated to image annotation: given a training database consisting of pairs of images and textual annotations, our approach automatically builds an ontology representative of image semantics by mining these images and their annotations. We then propose a generic approach for image annotation combining machine learning techniques, such as hierarchical classification, with fuzzy ontology reasoning. The rest of this paper is structured as follows. In Section 2, we review some related work. Section 3 presents an overview of the proposed approach for building multimedia ontologies. Section 4 introduces the proposed formalism for our multimedia ontology and the set of axioms and inference rules allowing to perform the reasoning tasks. In Section 5, we introduce the proposed method for building multimedia ontologies suitable for reasoning about image annotation and interpretation. Section 6 introduces the proposed multi-stage reasoning framework for image annotation. Section 7 reports the experimental results obtained on the Pascal VOC dataset. A discussion about the proposed approach and the usefulness of our ontology for computer vision tasks is presented in Section 8. The paper is concluded in Section 9.

2 Related work

Despite the significant progress shown by statistical approaches for image annotation, the semantic gap is still an open issue. In this context, several recent approaches have proposed to improve this task through the use of explicit knowledge models. A first category of approaches proposes to use semantic hierarchies for image annotation and classification [3, 17, 32, 42]. Bannour et al. [3] identified three types of hierarchies used for image annotation: (1) language-based hierarchies, built from textual information (e.g. tags, surrounding context, WordNet, Wikipedia) [14, 32]; (2) visual hierarchies, built from low-level image features [6, 18, 46]; and (3) semantic hierarchies, built from both textual and visual features [3, 17, 29]. However, most of these approaches use semantic hierarchies only to reduce the complexity of the classification problem or as a framework for hierarchical image classification, and they do not exploit the semantic structure of these hierarchies (i.e. the inherent semantic relationships between the concepts within them). Consequently, only a limited improvement in classification results was shown by these approaches.

Other approaches proposed to use multimedia ontologies in order to define a standard for the description of low-level multimedia content [13, 33], to use them as a semantic repository for storing knowledge about the image domain [34], or to allow semantic interpretation and reasoning over the extracted descriptions [12, 22, 24]. Indeed, ontologies allow modeling many important semantic relations between concepts which are missing in semantic hierarchy models, for instance contextual and spatial relationships. These relations have proved to be of prime importance for image annotation [22, 24, 25, 40]. The reasoning power of ontological models has also been used for semantic image interpretation. In [12, 24, 25], formal models of application domain knowledge are used through fuzzy description logics to help and guide semantic image analysis.

However, much remains to be done in order to achieve more expressive ontologies of image semantics. Firstly, almost all existing approaches for building multimedia ontologies start from an existing specification of a domain (defined by an expert or inferred from a generic commonsense ontology). These specifications are not always relevant for modeling image semantics and are often incomplete, subjective and subject to many inconsistencies. Indeed, many assumptions about the concepts, their properties and their relationships must be made in order to produce a given specification, and these assumptions often do not hold in the real world. Secondly, most recent approaches for building multimedia ontologies are based either on a conceptual specification or on a visual one. Consequently, these approaches do not accurately model image semantics. Furthermore, many of these approaches are limited to providing a formalism that allows using ontologies as a repository for storing knowledge about multimedia content; since they have not addressed the problem of reasoning about this knowledge, the effectiveness of the stored knowledge remains to be proved. Finally, ontology modeling in description logics is not an intuitive task. The representation of each single real-world object is split into many axioms about concepts and roles, leading to an overall design that is very difficult to apprehend [36]. This makes the design of a well-defined ontology by humans a big challenge, with no guarantee of success (the scalability problem of ontology building).

Our approach goes further than the aforementioned ones and addresses many of the previously stated limitations. Specifically, we propose in this paper a methodology for building multimedia ontologies as knowledge bases that contain explicit and structured knowledge about image context. To ensure that the structure of our ontology is representative of image semantics, we propose to use a semantico-visual specification (which incorporates the visual and conceptual semantics of image concepts) for designing our ontology. In addition, we build our multimedia ontology automatically, by mining image databases to gather valuable information about image context. We thereby reduce the scalability problem of ontology building and ensure that the depicted knowledge is faithful to image semantics. Finally, the proposed ontology is built using a highly expressive formalism (Fuzzy OWL2-DL), which allows good interaction with it, i.e. good querying and reasoning capabilities. Our belief is that such a formal ontology will allow performing reasoning tasks that support effective decision-making and provide semantically consistent image annotations.

3 Overview of our approach for building multimedia ontologies dedicated to image annotation

This paper proposes an approach for building a fuzzy multimedia ontology dedicated to image annotation. As illustrated in Fig. 1, our ontology incorporates several types of knowledge about image context in order to achieve a relevant representation of image semantics. Moreover, this knowledge is automatically extracted from a training image database using data mining techniques. Therefore, assuming that the considered training dataset is sufficiently representative of current image databases, our approach allows building multimedia ontologies faithful to image semantics.

Fig. 1: From image data to structured knowledge models: architecture of our approach for building multimedia ontologies dedicated to image annotation

Figure 1 depicts the workflow of our approach. As shown in this figure, the knowledge discovery process is performed through the following steps:

  1. Processing the set of images in the training dataset to discover useful knowledge about the image domain (i.e. perceptual semantics), such as the visual similarity between concepts.

  2. Mining the image annotations (provided in the metadata) to gather useful information about image context, namely contextual and spatial knowledge about image concepts.

  3. Querying a commonsense knowledge base to gather precise information about the semantics of image concepts, and to link the initial concepts to their hypernyms using the method proposed in [3].

Thereafter, the building of our multimedia ontology is performed fully automatically, i.e. without any human intervention. This is achieved by converting the previously extracted information about image context into explicit knowledge using the formalism described in Section 4.

Problem formalization

Given:

  • \(\mathcal{DB}\), a training image database consisting of a set of pairs \(\langle\) image/textual annotation \(\rangle\), i.e. \(\mathcal{DB} =\{[i_1,\mathcal{A}_1],[i_2,\mathcal{A}_2],\cdots,[i_\mathcal{L},\mathcal{A}_\mathcal{L}]\}\), where:

    • \(\mathfrak{I}=\langle i_1,i_2,\cdots,i_\mathcal{L}\rangle\) is the set of all images in \(\mathcal{DB}\),

    • \(\mathcal{L}\) is the number of images in the database.

    • \(\mathcal{C}=\langle c_1,c_2,\cdots,c_\mathcal{N}\rangle\) is the annotation vocabulary used for annotating images in \(\mathfrak{I}\),

    • \(\mathcal{N}\) is the size of the annotation vocabulary.

    • \(\mathcal{A}_i\) is a textual annotation consisting of:

      • the set of concepts \(\{c_j \in \mathcal{C}, j=1..n_{i_i}\}\) associated with a given image \(i_i \in \mathcal{DB}\),

      • the spatial location of each concept \(c_j\) in the image \(i_i\), given by its minimum bounding box defined as \((c_{j_{x{\rm min}}}, c_{j_{y{\rm min}}}, c_{j_{x{\rm max}}}, c_{j_{y{\rm max}}})\), where \(c_{j_{x{\rm min}}}\) and \(c_{j_{y{\rm min}}}\) are the coordinates of the lower left corner of the bounding box (and respectively \(c_{j_{x{\rm max}}}\) and \(c_{j_{y{\rm max}}}\) the coordinates of the upper right corner).

  • \(\mathcal{CO}\), a generic commonsense ontology containing \(\mathcal{N'}\) concepts, whose concept set \(\mathcal{C}_{\mathcal{CO}}\) is such that \(\mathcal{C}\subseteq\mathcal{C}_{\mathcal{CO}}\). In this paper, we used WordNet as the commonsense ontology.

Our objective is to build a multimedia ontology consisting of a set of \(|\mathcal{C}|+|\mathcal{C'}|\) concepts (s.t. \(\mathcal{C}\cup\mathcal{C'} \subseteq \mathcal{C}_{\mathcal{CO}}\), where \(\mathcal{C'}\) may be empty), dedicated to this specific annotation problem, i.e. dependent on the initial annotation vocabulary but extensible at any later time. This ontology should incorporate not only the subsumption relationships between the different concepts, but also richer semantic relations, such as contextual and spatial relationships. The overall goal is to apply this ontology to previously unseen images (i.e. \(\forall\ i_x \notin \mathcal{DB}\)) in order to reason on the consistency of their annotations and to provide them with a relevant textual description.
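To make the notation above concrete, the following minimal Python sketch shows one possible in-memory representation of \(\mathcal{DB}\); the class and field names are illustrative assumptions, not part of the original formalization.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ConceptInstance:
    concept: str                              # a concept c_j from the vocabulary C
    bbox: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)

@dataclass
class AnnotatedImage:
    image_id: str                             # identifier of image i_i
    annotation: List[ConceptInstance]         # the textual annotation A_i

# DB is then a list of annotated images; L = len(DB), and the vocabulary C
# is the set of all concept names occurring in the annotations.
DB: List[AnnotatedImage] = []
```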

The design of our multimedia ontology as a well-defined formal knowledge base is achieved through the following main steps, which are detailed in the remainder of this paper:

  • Definition of the DL formalism of the proposed ontology, i.e. the expressiveness of the ontology.

  • Definition of the set of axioms and inference rules allowing to perform the reasoning tasks on the proposed ontology.

  • Definition of the main concepts of the ontology.

  • Definition of the RBox, i.e. definition of the key roles (relationships between concepts) and their properties.

  • Definition of the TBox, i.e. definition of the subsumption hierarchy, and consequently the subsumption relationships between the ontology concepts.

  • Definition of the ABox, i.e. the instances of concepts and the relations between them with respect to the roles defined in the RBox.

4 Formalism of our multimedia ontology

4.1 Preliminaries

The Web Ontology Language (OWL) is the current standard language for representing ontologies. It allows describing a domain in terms of concepts (or classes), roles (or properties), individuals and axioms. Concepts (C) are sets of objects, individuals (I) are instances of concepts in C, roles are binary relationships between individuals in I, whereas axioms describe how these concepts, individuals, roles, etc. should be interpreted. Three sublanguages of OWL can be used: OWL-Full, which is the most expressive but for which reasoning is undecidable; OWL-Lite, which has the lowest complexity but fewer constructs; and OWL-DL, which offers a good trade-off between expressiveness and reasoning complexity [8].

In our approach, in order to ensure high expressiveness with decidable reasoning, we used OWL 2 DL as the language for designing our ontology. Indeed, OWL 2 DL is more expressive than OWL-DL, i.e. it includes more axioms. Concretely, we have implemented a framework using the OWL API [23], which supports OWL 2 since its latest version. The reasoning tasks about concepts, roles and individuals are also performed with our framework, which is based on the FaCT++ reasoner and extends it with the axioms illustrated in Table 1 to support Fuzzy Description Logics (Fuzzy DL). Natively, FaCT++ supports the \(\mathcal {SROIQ}(D)\) logic (i.e. the DL underlying OWL 2 ontologies); thanks to our extension, our framework supports the fuzzy logic \(\mathnormal{f}\text{-}\mathcal{SROIQ}(D)\).

Table 1 Syntax and semantics of the Fuzzy Description Logic \(\mathnormal{f}\text{-}\mathcal{SROIQ}(D)\) used for designing our multimedia ontology

Description Logics (DLs) are a family of logics for representing structured knowledge. Fuzzy DLs extend classical DLs by allowing to deal with fuzzy/imprecise concepts [38]. Indeed, in fuzzy logics a statement is no longer simply true or false, but becomes a fuzzy statement with a degree of truth α ∈ [0,1].

Fuzzy set preliminaries

In a formal way, let X be a set of elements. A fuzzy set A over a countable crisp set X is characterized by a membership function \(\mu_A: X \rightarrow [0,1]\) (or A(x) ∈ [0,1]), assigning a membership degree A(x) to each element x in X. A(x) gives an estimation of the belonging of x to A. In fuzzy logics, the membership degree A(x) is regarded as the degree of truth of the statement “x is A”. Accordingly, a concept C is interpreted in fuzzy DL as a fuzzy set, and thus concepts become imprecise. For instance, the statement a:C (a is an instance of concept C) will have a truth value in [0,1] given by its membership degree, denoted \(C^\mathcal{I}(a)\). A fuzzy relation R over two countable crisp sets X and Y is a function R: X × Y → [0,1]. R is reflexive iff for all x ∈ X, R(x,x) = 1 holds, while R is symmetric iff for all x, y ∈ X, R(x,y) = R(y,x) holds. R is said to be functional iff R is a partial function R: X × Y → {0,1} such that for each x ∈ X there is a unique y ∈ Y for which R(x,y) is defined.

4.2 Expressiveness of our ontology

As aforementioned, for the sake of providing a highly expressive multimedia ontology with a decidable reasoning, we used the fuzzy DL \(\mathnormal{f}\text{-}\mathcal{SROIQ}(D)\) for designing our ontology. Based on the work of [37, 39], we introduce in the following the specific formalism (constructors and axioms) used for defining our multimedia ontology.

\(\mathnormal{f}\text{-}\mathcal{SROIQ}(D)\) is a fuzzy extension of the \(\mathcal{SROIQ}(D)\) DL, which provides a set of constructors allowing the construction of new concepts and roles. \(\mathnormal{f}\text{-}\mathcal{SROIQ}(D)\) includes the standard \(\mathcal{ALC}\) constructors (i.e. negation \(\neg\), conjunction ⊓, disjunction \(\sqcup\), full existential quantification \(\exists\), and value restriction ∀) extended with transitive roles (\(\mathcal{S}\)), complex role axioms (\(\mathcal{R}\)), nominals (\(\mathcal{O}\)), inverse roles (\(\mathcal{I}\)), and qualified number restrictions (\(\mathcal{Q}\)). (\(\mathcal{D}\)) indicates support for (fuzzy) concrete domains, i.e. datatype properties, data values or data types.

Fuzzy concrete domain

A fuzzy concrete domain is a pair \(\langle \Delta_D,\Phi_D\rangle\), where \(\Delta_D\) is an interpretation domain and \(\Phi_D\) is the set of fuzzy domain predicates d with a predefined arity n and an interpretation \(d^D:\Delta_D^n\rightarrow[0,1]\) [41].

In \(\mathnormal{f}\text{-}\mathcal{SROIQ}(D)\), concepts (denoted C or D) and roles (R) can be built inductively from atomic concepts (A), atomic roles (\(R_A\)), the top concept \(\top\), the bottom concept \(\bot\), named individuals (\(o_i\)), simple roles S, and the universal role U. Simple roles are defined inductively: (i) \(R_A\) is simple if it does not occur on the right-hand side of a Role Inclusion Axiom (RIA), (ii) the inverse of a simple role is simple, (iii) if R occurs on the right-hand side of a RIA, R is simple if, for each \(\langle w \sqsubseteq R \rhd \alpha\rangle\), w = S for a simple role S.

Fuzzy concepts

Under \(\mathnormal{f}\text{-}\mathcal{SROIQ}(D)\), a fuzzy concept is built according to the following syntax rules:

$$\begin{array}{lll} C & \rightarrow & \ \top\ |\ \bot\ |\ A\ |\ C_1 \sqcap C_2\ |\ C_1 \sqcup C_2\ |\ \neg C\ |\ \exists R.C\ |\ \exists T.d\ |\ \forall R.C\ |\ \forall T.d\ |\\ & & (\geq m\ S.C)\ |\ (\geq m\ T.d)\ |\ (\leq n\ S.C)\ |\ (\leq n\ T.d)\ |\ \{o_1,\ldots, o_n\}\\ D & \rightarrow & \ d \ |\ \neg d \end{array}$$

For more details about the semantics of these constructors, cf. Table 1, constructors C1–C16.

Fuzzy \(\mathcal{KB}\)

A \(\mathnormal{f}\text{-}\mathcal{SROIQ}(D)\) knowledge base (denoted \(\mathcal{KB}\)) is a triple (\(\mathcal{T}\),\(\mathcal{R}\),\(\mathcal{A}\)), where \(\mathcal{T}\) is a fuzzy Terminological Box (TBox), \(\mathcal{R}\) is a regular fuzzy Role Box (RBox), and \(\mathcal{A}\) is a fuzzy Assertional Box (ABox) containing statements about individuals. The TBox and RBox contain general knowledge about the application domain.

Fuzzy ABox

The fuzzy ABox consists of a finite set of fuzzy concept and fuzzy role assertion axioms. Typically, these assertions include: concept assertion (\(\langle a:C\bowtie \alpha\rangle\)), role assertion (\(\langle(a:b):R \bowtie \alpha\rangle\)), concrete role assertion (\(\langle(a:b):T \bowtie \alpha\rangle\)), equality assertion (\(\langle a = b\rangle\)), and inequality assertion (\(\langle a\neq b\rangle\)). The semantics of these assertions is defined in Table 1, axioms A1–A5.

Fuzzy TBox

The fuzzy TBox is a finite set of General Concept Inclusions (GCI) constrained with a truth-value and of the form \(\langle C \sqsubseteq D \rhd \alpha\rangle\) between two \(\mathnormal{f}\text{-}\mathcal{SROIQ}(D)\) concepts C and D. Concept equivalence \(\langle C\equiv D \rangle\) can be captured by two inclusions \(C \sqsubseteq D\) and \( D\sqsubseteq C\). These assertions and their semantics are defined in Table 1, axioms A6 and A7.

Fuzzy RBox

The fuzzy RBox consists of a finite set of role axioms which are illustrated in Table 1, axioms A8–A14. These include: role inclusion axioms, disjoint role, symmetric role, reflexive role, transitive role, irreflexive role, and asymmetric role.

Owing to the specific motivations discussed in Section 4.3, we have defined the fuzzy operators used in Table 1 as follows (a minimal sketch of these operators is given after the list):

  1. product t-norm: a ⊗ b = a * b,

  2. product t-conorm: a ⊕ b = a + b − a * b,

  3. Łukasiewicz negation: ⊖α = 1 − α,

  4. Gödel implication (for GCIs and RIAs): \(\alpha \rightarrow \beta= 1 \text{ if } \alpha \leq \beta, \beta \text{ otherwise}\),

  5. Kleene-Dienes (KD) implication (for the other constructors): \(\alpha \Rightarrow \beta = \max(1 - \alpha, \beta)\).
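The following minimal Python sketch implements the five operators listed above; it is given only to make their definitions concrete.

```python
def t_norm(a: float, b: float) -> float:
    """Product t-norm: a (x) b."""
    return a * b

def t_conorm(a: float, b: float) -> float:
    """Product t-conorm: a (+) b."""
    return a + b - a * b

def negation(a: float) -> float:
    """Lukasiewicz negation."""
    return 1.0 - a

def goedel_implication(a: float, b: float) -> float:
    """Goedel implication, used for GCIs and RIAs."""
    return 1.0 if a <= b else b

def kd_implication(a: float, b: float) -> float:
    """Kleene-Dienes implication, used for the other constructors."""
    return max(1.0 - a, b)
```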

Fuzzy interpretation

The semantics of the \(\mathnormal{f}\text{-}\mathcal{SROIQ}(D)\) DL is defined in terms of fuzzy interpretations [38]. A fuzzy interpretation is a pair \(\mathcal I= (\Delta^\mathcal I,\cdot^\mathcal I)\) where \(\Delta^\mathcal I\) is a non-empty set of objects (called the domain) and \(\cdot^\mathcal I\) is a fuzzy interpretation function, which maps:

  • a concept name C onto a function \(C^\mathcal I:\Delta^\mathcal I \rightarrow [0,1]\),

  • a role name R onto a function \(R^\mathcal I: \Delta^\mathcal I \times \Delta^\mathcal I \rightarrow [0,1]\),

  • an individual name a onto an element \(a^\mathcal I \in \Delta^\mathcal I\),

  • a concrete individual v onto an element \(v^D \in \Delta_D\),

  • a concrete role T onto a function \(T^\mathcal I: \Delta^\mathcal I \times \Delta_D \rightarrow [0,1]\),

  • a concrete feature t onto a partial function \(t^\mathcal I: \Delta^\mathcal I \times \Delta_D \rightarrow \{0,1\}\)

Satisfiability

Finally, a fuzzy interpretation \(\mathcal{I}\) satisfies an \(\mathnormal{f}\text{-}\mathcal{SROIQ}(D)\) knowledge base \(\mathcal{KB}=(\mathcal{T}\),\(\mathcal{R}\),\(\mathcal{A})\) if it satisfies all axioms of \(\mathcal{T}\), \(\mathcal{R}\) and \(\mathcal{A}\). \(\mathcal{I}\) is then called a model of \(\mathcal{KB}\), written: \(\mathcal{I}\models \mathcal{KB}\).

4.3 Ontology-based reasoning

General automatic reasoning tasks on ontologies include concept consistency, concept subsumption to build inferred concepts taxonomy, instance classification and retrieval, parent and children concept determination, and answering queries over ontology classes and instances [1]. These reasoning tasks are induced by inferring logical consequences from a set of asserted facts or axioms.

Logical consequence

A fuzzy axiom τ is a logical consequence of a knowledge base \(\mathcal{KB}\), denoted \(\mathcal{KB}\models \tau\) iff every witnessed model of \(\mathcal{KB}\) satisfies τ.

Given a \(\mathcal{KB}\) and an axiom τ of the form \(\langle C\sqsubseteq D\rangle\), \(\langle a:C\rangle\) or \(\langle(a,b):R \rangle\), it is possible to compute the best explanation of a given statement (typically, a statement about an image) as the best entailment degree (bed) of τ. The bed problem can be solved by determining the greatest lower bound (glb) [38].

Greatest lower bound

The greatest lower bound of τ with respect to a fuzzy \(\mathcal{KB}\) is:

$$ \label{eq:GLB} glb(\mathcal{KB},\tau)= \text{sup} \{n\ |\ \mathcal{KB}\models\langle \tau \geq n\rangle\}, \ \ \ where\ \ \text{sup} \ \emptyset = 0 $$
(1)

Example 1

(Greatest lower bound) For instance, given \(\mathcal{KB}=\{\langle(a,b):R, 0.5\rangle, \langle b:C, 0.9\rangle\}\), the greatest lower bound that a is an instance of a concept which is in relation R with concept C is:

$$ glb(\mathcal{KB}, a:\exists R.C)=0.45 $$

Best satisfiability degree

The best satisfiability degree (bsd) of a concept C with respect to a fuzzy \(\mathcal{KB}\) is defined as:

$$ \label{eq:BSD} bsd(\mathcal{KB},C) = \text{sup}_{\mathcal{I}\models\mathcal{KB}}\ \text{sup}_{x \in \Delta^\mathcal{I}}\ \left\{C^\mathcal{I}(x)\right\} $$
(2)

The best satisfiability degree consists in determining the maximal degree of truth that the concept C may have over all individuals \(x \in \Delta^\mathcal{I}\), among all models \(\mathcal{I}\) of the \(\mathcal{KB}\).

In our specific context, and in order to achieve efficient reasoning (and subsequently an accurate decision) on the best explanation of a given image, it is important to compute a membership degree for this explanation that reflects the likelihood of the conjunction of all the independent events composing it. Product logic provides this desirable property for the t-norm. This requirement motivated our choice of the product t-norm and the product t-conorm as the fuzzy operators of our ontology—cf. Section 4.2. For instance, let us consider the following example, where we want to compute the membership of an image i to the class BeachImage:

Example 2

(Product semantics and Zadeh semantics)

$$\begin{array}{rll} \mathcal{KB}&=&\{\langle i:Image,1 \rangle,\langle i:\exists depicts.Sea,\alpha_1 \rangle,\langle i:\exists depicts.Sand,\alpha_2 \rangle,\\ &&\langle i:\exists depicts.Sky,\alpha_3 \rangle\} \\ BeachImage &\equiv& Image \sqcap \exists depicts.Sea \sqcap \exists depicts.Sand \sqcap \exists depicts.Sky \end{array}$$
$$ \begin{array}{lll} glb(\mathcal{KB}, i:BeachImage)&=&\alpha_1\otimes\alpha_2\otimes\alpha_3 \\ &=& \left\{ \begin{array}{ll} {\rm min}\{\alpha_1,\alpha_2,\alpha_3\} & \text{ under Zadeh semantics}\\ \alpha_1*\alpha_2*\alpha_3 & \text{ under Product semantics} \end{array} \right. \end{array} $$

Both explanations and membership degrees are meaningful with respect to a given application. However, for our target application, the product semantics yields a more meaningful membership value than the one produced by the Zadeh semantics. For example, suppose that \(\alpha_1\), \(\alpha_2\), and \(\alpha_3\) are produced by an image classification or object detection process. It is then more accurate to compute the membership degree of the image i to the class BeachImage as the product of the confidence values of these classifiers than as their minimum. This property is obtained with the product semantics.
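As an illustration, the following Python snippet reproduces the computation of Example 2 under both semantics, with illustrative scores \(\alpha_1=0.8\), \(\alpha_2=0.6\), \(\alpha_3=0.9\) (assumed values, not taken from the paper's experiments).

```python
from functools import reduce

alphas = [0.8, 0.6, 0.9]   # depicts.Sea, depicts.Sand, depicts.Sky (illustrative)

glb_product = reduce(lambda a, b: a * b, alphas)   # 0.432 under product semantics
glb_zadeh = min(alphas)                            # 0.6   under Zadeh semantics

print(glb_product, glb_zadeh)
```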

5 Building of our multimedia ontology

5.1 Main concepts of our ontology

Proposed concepts

The proposed multimedia ontology relies mainly on the four following concepts, which can recursively involve similar concepts (Fig. 2a):

  • “Thing” represents the top concept (\(\top\)) of the ontology,

  • “Concept” is the generic concept in our ontology to represent a concept from the annotation vocabulary, i.e. any concept \(c_j\in\mathcal{C}\cup\mathcal{C'}\) used to describe the content of an image.

  • “Image” is the generic concept to represent an image, i.e. each image i i of the database will be considered as an instance of the concept “Image” with a satisfiability degree of 1 (\(\langle i_i:Image,1\rangle\)).

  • “Annotation” is a generic concept introduced to represent a given annotation, i.e. a set of concepts considered as a whole. We will come back to this notion later.

Fig. 2: Illustration of the roles used for defining concept relationships in our ontology. Panel (a) shows the main concepts of our ontology and the fuzzy roles (dashed arrows) used for defining the relationships between concepts; panel (b) lists the role names

5.2 Definition of the RBox

As stated previously, our intent is to design an ontology of spatial and contextual information dedicated to reasoning about the consistency of image annotations. To this end, we define in Table 2 the proposed roles and their properties, which constitute the RBox of our multimedia ontology. These roles can be categorized into contextual relationships and spatial relationships, detailed respectively in Section 5.4.1 and Section 5.4.2. The choice of these specific roles is motivated by the reasoning scenarios designed to improve the image annotation task; however, these roles can be further enriched depending on the targeted applications.

Table 2 Roles and functional roles used for defining concept relationships in our ontology

5.3 Building the semantic hierarchy and definition of the TBox

The subsumption hierarchy (and respectively the subsumption relationships) is a fundamental component of ontologies. It acts as a backbone of the produced ontology, where the subsumption roles allow defining the inheritance of properties from the parent (subsuming) concepts to the child (subsumed) concepts. Thus, any statement that is true (with an α degree) for a parent concept is also necessarily true (with at least an α degree) for all of its subsumed concepts. Furthermore, these subsumption relationships allow defining the Terminological Box of ontologies.

In our approach, we propose to automatically build a subsumption hierarchy where leaf nodes are the initial concepts of the considered dataset (\(c_j \in \mathcal{C}\)), and mid-level nodes are the concepts discovered by a variant of the approach proposed in [3]. Indeed, in order to design a representative ontology of the image semantics, we propose in this paper to automatically build the semantic hierarchy using a Semantico-Visual similarity computed between image concepts. The used Semantico-Visual similarity incorporates:

  (i) a visual similarity, which represents the visual distance between concepts, and

  (ii) a conceptual similarity, which defines a relatedness measure between target concepts based on their definitions in WordNet.

Afterwards, the subsumption hierarchy is built bottom-up, based on a set of heuristic rules that link together the concepts that are semantically most related w.r.t. the previously computed similarity. Consequently, building the subsumption hierarchy consists in identifying \(|\mathcal C'|\) new concepts that link all the concepts of \(\mathcal C\) in a hierarchical structure that best represents image semantics. For more information about these (visual and conceptual) similarities and the rules used for linking concepts together, the reader is referred to [3].
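For illustration only, the following Python sketch shows a naive bottom-up grouping driven by a combined semantico-visual similarity; it does not reproduce the exact heuristic rules of [3], and the weighting parameter `lam` is an assumption.

```python
import numpy as np

def build_hierarchy(concepts, visual_sim, conceptual_sim, lam=0.5):
    """Naive agglomerative grouping of concepts (illustrative, not the rules of [3]).

    visual_sim and conceptual_sim are |C| x |C| similarity matrices in [0, 1].
    Returns the list of merges, each creating one new mid-level node of C'.
    """
    sim = lam * np.asarray(visual_sim) + (1 - lam) * np.asarray(conceptual_sim)
    index = {c: k for k, c in enumerate(concepts)}
    nodes = [(c,) for c in concepts]          # leaves: the initial concepts of C
    merges = []
    while len(nodes) > 1:
        best, pair = -1.0, None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                # average-link similarity between the two groups of leaves
                s = np.mean([sim[index[a], index[b]] for a in nodes[i] for b in nodes[j]])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        parent = nodes[i] + nodes[j]          # a new mid-level node (element of C')
        merges.append((parent, nodes[i], nodes[j], best))
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [parent]
    return merges
```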

Subsequent to the building of the semantic hierarchy, the subsumption relationships between all pairs of concepts (\(c_i, c_j \in \mathcal{C} \cup \mathcal{C'}\)) are added to our ontology according to the hierarchy structure. This is achieved automatically using axiom A6 illustrated in Table 1.

Figure 3 illustrates the semantic hierarchy built on the Pascal VOC’2010 dataset. This semantic hierarchy allowed us to define the subsumption relationships between image concepts. We can observe that the produced hierarchy is an N-ary tree-like structure, where leaf nodes are the concepts in \(\mathcal C\). Mid-level concepts are automatically recovered from WordNet based on the previously introduced method. We can also observe that the connected concepts share strong visual and semantic similarity, which justifies the choice of this method in our approach. We therefore concur with the assumption that a suitable semantic hierarchy for representing image semantics should incorporate both visual and conceptual (semantic) modalities during the building process [3].

Fig. 3: The semantic hierarchy built on the Pascal VOC’2010 dataset. Double octagon nodes are original concepts, i.e. concepts of \(\mathcal C\), and the diamond node is the root of the produced hierarchy

5.4 Definition of the ABox

Following the building of the semantic hierarchy that serves as the backbone of our ontology, information about the context of images is added to the ontology in order to design a knowledge base that is more representative of image semantics. This information, mainly consisting of contextual and spatial relationships between image concepts, forms the ABox of our ontology and serves for reasoning about image annotation. Furthermore, our intent is to design a fuzzy multimedia ontology in order to model the inherent uncertainty of concept relationships, which should lead to more efficient decision-making during the image annotation process. Consequently, we introduce in the following how the confidence degree of each of the proposed fuzzy roles (concept relationships) is computed.

5.4.1 Contextual relationships

Contextual information is of great interest to help understand image semantics. A simple form of contextual information is the co-occurrence frequency of a pair of concepts. For example, it is intuitively clear that if two concepts are similar or related, their roles in the world are likely to be similar, and thus their contexts of occurrence will be equivalent (i.e. they tend to occur in similar contexts, for some definition of context). For instance, a photo containing “Television” and “Sofa” usually depicts a “Living-room” scene. Nevertheless, contextual similarity is a corpus-dependent measure, i.e. it depends on the distribution of concepts in the dataset. It is therefore important to normalize the measures based on contextual information.

In our approach, we define three contextual relationships that we consider important for reasoning about image annotation: \(\mathcal {CON}=\) {“hasFrequency”, “hasAppearedWith”, “isAnnotatedBy”}. However, nothing prevents the enrichment of our multimedia ontology with other contextual relationships in order to adapt to other reasoning scenarios. The proposed relations (\(\in \mathcal {CON}\)) are detailed below.

Let us consider an image database \(\mathcal{DB}\), where:

  • \(\mathcal{L}\) is the number of images in the database,

  • \(\mathcal{N}\) is the size of the annotation vocabulary,

  • \(n_i\) is the number of images annotated by \(c_i\) (occurrence frequency of \(c_i\)), and

  • \(n_{ij}\) is the number of images co-annotated by \(c_i\) and \(c_j\).

Our objective is to estimate \(P(c_i)\), the probability of occurrence of a given concept \(c_i\) (and respectively \(P(c_i,c_j)\), the joint probability of \(c_i\) and \(c_j\)) in \(\mathcal{DB}\). These probabilities can be easily estimated by:

$$ \label{aquPci} \widehat{P(c_i)}=\frac{n_i}{\mathcal{L}} $$
(3)
$$\label{Pci} \widehat{P(c_i,c_j)}=\frac{n_{ij}}{\mathcal{L}} $$
(4)

Based on these probabilities, we define the concept frequency relationship as the concrete feature \(hasFrequency:\Delta^\mathcal{I}*\Delta_D\rightarrow\{0,1\}\), where \(\Delta^\mathcal{I}= \mathcal{C}\) and \(\Delta_D = [0,1]\) are the interpretation domains. This concrete feature associates with each concept \(c_i \in \mathcal{C}\) a fuzzy degree corresponding to its occurrence frequency in \(\mathcal{DB}\):

$$\label{has_freq} \mu_{\text{hasFrequency}(c_i)}=P(c_i) $$
(5)

We also define the contextual relationship ‘hasAppearedWith’ as the fuzzy role \(hasAppearedWith: \Delta^\mathcal{I}*\Delta^\mathcal{I}\rightarrow [0,1]\), where \(\Delta^\mathcal{I}= \mathcal{C}\). The membership degree of this relationship is computed using the Normalized Pointwise Mutual Information (NPMI). To this purpose, the Pointwise Mutual Information \(\rho(c_i,c_j)\) is first computed for all pairs of concepts \(c_i, c_j \in \mathcal{C}\) as follows:

$$ \label{Equ:joint_pro} \rho(c_i,c_j)= \log \frac{P(c_i,c_j)}{P(c_i)P(c_j)} = \log \frac{{\mathcal{L}*n_{ij}}}{n_i*n_j} $$
(6)

\(\rho(c_i,c_j)\) quantifies the amount of information shared between the two concepts \(c_i\) and \(c_j\). Thus, if \(c_i\) and \(c_j\) are independent concepts, then \(P(c_i,c_j) = P(c_i)P(c_j)\) and therefore \(\rho(c_i,c_j) = \log 1 = 0\). \(\rho(c_i,c_j)\) can be negative if \(c_i\) and \(c_j\) are negatively correlated. Otherwise, \(\rho(c_i,c_j)\) is positive and quantifies the degree of dependence between these two concepts. In this work, we only want to estimate the positive correlation between each pair of concepts from the annotation vocabulary and therefore we set the negative values of \(\rho(c_i,c_j)\) to 0. Moreover, in order to normalize it into [0,1], the membership degree of the fuzzy role ‘hasAppearedWith’ is computed as follows:

$$\label{Equ:ContextSim} \mu_{\text{hasAppearedWith}(c_i,c_j)}= \frac{\rho(c_i,c_j)}{-\log[\max(P(c_i),P(c_j))]} $$
(7)

Finally, we define the fuzzy role ‘isAnnotatedBy’ as a relationship between instances of the concepts “Image” and “Annotation”, i.e. \(isAnnotatedBy: \Delta^\mathcal{I}*\Delta^\mathcal{I}\rightarrow [0,1]\), where \(\Delta^\mathcal{I}= \{Image,Annotation\}\). This relationship is intended to represent the probability of finding an image in \(\mathcal{DB}\) annotated by a given set of concepts (\(Annotation_j=\langle c_1,c_2,\cdots,c_\Lambda\rangle\)), or inversely, the likelihood that a given annotation \(Annotation_j\) is associated with an image \(i_i \in \mathfrak{I}\). To this end, all the possible annotations in \(\mathcal{DB}\) are extracted and added to our ontology as subconcepts of the concept “Annotation”. The confidence value of this relationship is computed as follows:

$$ \label{eq:is_annotatedby} \mu_{\text{isAnnotatedBy}(Image_1,Annotation_j)}=\frac{n_{Annotation_j}}{\mathcal{L}} $$
(8)

where \(Annotation_j=\langle c_1,c_2,\cdots,c_{\Lambda}\rangle\) is a textual annotation used for annotating a set of images in \(\mathcal{DB}\), \(n_{Annotation_j}\) is the number of images annotated by \(Annotation_j\), and \(\mathcal{L}=|\mathfrak{I}|\) is the total number of images in \(\mathcal{DB}\).
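A minimal Python sketch of how the contextual degrees of (5), (7) and (8) can be estimated from the training annotations is given below; the function and variable names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def contextual_degrees(annotations):
    """annotations: one set of concepts per training image."""
    L = len(annotations)
    n = Counter()          # n_i  : occurrence frequency of concept c_i
    n_pair = Counter()     # n_ij : co-occurrence frequency of (c_i, c_j)
    n_annot = Counter()    # frequency of each complete annotation (for eq. 8)
    for ann in annotations:
        concepts = sorted(set(ann))
        n_annot[tuple(concepts)] += 1
        for c in concepts:
            n[c] += 1
        for ci, cj in combinations(concepts, 2):
            n_pair[(ci, cj)] += 1

    has_frequency = {c: n[c] / L for c in n}                           # eq. (5)

    has_appeared_with = {}
    for (ci, cj), nij in n_pair.items():
        pmi = max(math.log((L * nij) / (n[ci] * n[cj])), 0.0)          # eq. (6), negatives set to 0
        norm = -math.log(max(n[ci], n[cj]) / L)
        has_appeared_with[(ci, cj)] = pmi / norm if norm > 0 else 0.0  # eq. (7)

    is_annotated_by = {a: k / L for a, k in n_annot.items()}           # eq. (8)
    return has_frequency, has_appeared_with, is_annotated_by
```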

For instance, Example 3 illustrates some of the assertions added to our ABox.

Example 3

(Contextual relationship: ‘isAnnotatedBy’)

$$\begin{array}{rll} \langle Annotation_1 & \equiv & Aeroplane \sqcap Car \sqcap Person\rangle\\ \langle Annotation_1 & \sqsubseteq & Annotation\geq1 \rangle\\ \langle Annotation_2 & \equiv & Dining\_Table \sqcap Chair \sqcap Bottle \sqcap Dog\rangle\\ \langle Annotation_2 & \sqsubseteq & Annotation \geq1 \rangle\\ \langle a &:& Image \geq1 \rangle\\ \langle b &:& Annotation_1 \geq1 \rangle\\ \langle (a:b) &:& isAnnotatedBy \geq 0.023064 \rangle\\ &\cdots& \end{array} $$

5.4.2 Spatial relationships

Spatial information is a valuable source for the understanding of image semantics. The spatial arrangement of objects provides important information for recognition and interpretation tasks, and allows solving ambiguities between objects having a similar appearance [7]. For instance, if object detectors have detected in an image that “Sky” appears below “Sea”, it is easy to fix this prediction using spatial information, because any well-defined knowledge base (\(\mathcal{KB}\)) would allow detecting and correcting this inconsistency.

In our approach, eight spatial relationships are used in order to define the directional positions and distances between image concepts. The directional relationships are defined as follows: \(\mathcal {DIR}=\) {“hasAppearedAbove”, “hasAppearedBelow”, “hasAppearedLeftOf”, “hasAppearedRightOf”, “hasAppearedAlignedWith”}, such that \(\forall \mathcal{X} \in \mathcal {DIR}, \mathcal{X}: \Delta^\mathcal{I}*\Delta^\mathcal{I}\rightarrow [0,1]\), with \(\Delta^\mathcal{I}= \mathcal{C}\).

The relationships in \(\mathcal {DIR}\) are derived from the following primitives: ‘left’, ‘right’, ‘above’, ‘below’ and ‘aligned’, which are computed according to the angle between the segment joining two points ‘a’ and ‘b’ (where ‘a’ and ‘b’ are the centroids of two given objects in a given image) and the x-axis of the image—cf. Fig. 4. This angle, denoted θ(a,b), takes values in [−π,π], which constitutes the domain of definition of these primitives. The primitives are then computed using cos²θ and sin²θ, and are functions from [−π,π] into {0,1}. Thus, any of the previous primitives can be computed from an angle α with the x-axis, as illustrated in Fig. 5.

Fig. 4: Spatial primitives are computed according to the angle between the segment joining two points ‘a’ and ‘b’ and the x-axis of the image. ‘a’ and ‘b’ are the centroids of two given objects (here “Cow” and “Person”) in a given image

Fig. 5: Directional relationships are computed according to an angle α with the x-axis

Regarding the primitive ‘aligned’, it takes the value 1 when θ ∈ [−π/6, π/6] or |θ| ≥ 5π/6 (i.e. when the segment is close to the horizontal direction), and 0 otherwise. A comprehensive survey about spatial relationships for image processing can be found in [7].
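The following Python sketch illustrates how the directional primitives can be derived from the angle θ(a,b); the exact angular thresholds (apart from the one given above for ‘aligned’) and the handling of image coordinates are assumptions.

```python
import math

def directional_primitives(centroid_a, centroid_b):
    """Crisp directional primitives between two object centroids (illustrative)."""
    (xa, ya), (xb, yb) = centroid_a, centroid_b
    # Image rows grow downwards, so the y difference is negated to keep the
    # usual trigonometric orientation (assumption).
    theta = math.atan2(-(yb - ya), xb - xa)      # theta(a, b) in [-pi, pi]
    c2, s2 = math.cos(theta) ** 2, math.sin(theta) ** 2
    return {
        "right":   int(c2 > s2 and math.cos(theta) > 0),   # 'b' to the right of 'a'
        "left":    int(c2 > s2 and math.cos(theta) < 0),
        "above":   int(s2 > c2 and math.sin(theta) > 0),   # 'b' above 'a'
        "below":   int(s2 > c2 and math.sin(theta) < 0),
        "aligned": int(abs(theta) <= math.pi / 6 or abs(theta) >= 5 * math.pi / 6),
    }
```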

The confidence value of a given directional relationship is finally computed as follows:

$$\label{equ:Spatial} \mu_{\mathcal{X}(c_i,c_j)}=\frac{\sharp\text{ of instances where }\mathcal{X}(c_i,c_j)}{n_{ij}} $$
(9)

where \(c_i,c_j \in \mathcal{C}\), and \(\mathcal{X}\) is a directional relationship, i.e. \(\mathcal{X}\in \mathcal{DIR}\).

In addition, we define in our approach the distance relationships \(\mathcal {DIS}=\) {“hasAppearedCloseTo”, “hasAppearedFarFrom”}, such that \(\forall \chi \in \mathcal {DIS}, \chi: \Delta^\mathcal{I}*\Delta^\mathcal{I}\rightarrow [0,1]\), with \(\Delta^\mathcal{I}= \mathcal{C}\). These distance relationships are computed according to the Euclidean distance between the considered objects. To this purpose, let us consider in a given image two objects O and P defined by their centroids \((x_1,y_1)\) and \((x_2,y_2)\), and their bounding boxes \((O_{x{\rm min}}, O_{x{\rm max}}, O_{y{\rm min}}, O_{y{\rm max}})\) and \((P_{x{\rm min}}, P_{x{\rm max}}, P_{y{\rm min}}, P_{y{\rm max}})\). We then define the following primitives:

$$ distance(O,P) = \sqrt{(x_1-x_2)^2+(y_1-y_2)^2} \\ $$
(10)
$$ size(O) = \sqrt{(O_{x{\rm max}}-O_{x{\rm min}})^2+(O_{y{\rm max}}-O_{y{\rm min}})^2}\\ $$
(11)
$$ close(O,P)= \left\{ \begin{array}{ll} 1 & \ \ \text{if} \ \ distance(O,P)<2(size(O)+size(P))\\ 0 & \ \ \text{otherwise} \end{array} \right. $$
(12)
$$ farfrom(O,P)= \left\{ \begin{array}{ll} 1 & \ \ \text{if} \ \ distance(O,P)\geq2(size(O)+size(P))\\ 0 & \ \ \text{otherwise} \end{array}\right. $$
(13)

Using the previous primitives, distance relationships can easily be computed by the following equation:

$$\label{equ:Spatial} \mu_{\chi(c_i,c_j)}=\frac{\sharp\text{ of instances where }\chi(c_i,c_j)}{n_{ij}} $$
(14)

where \(c_i,c_j \in \mathcal{C}\), and χ is a distance relationship, i.e. \(\chi\in \mathcal{DIS}\).
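A minimal Python sketch of the distance primitives (10)–(13) and of the aggregation used in (9) and (14) follows; object geometry is represented as assumed in the problem formalization.

```python
import math

def size(bbox):
    """Eq. (11): diagonal length of a bounding box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = bbox
    return math.hypot(x_max - x_min, y_max - y_min)

def close(centroid_o, bbox_o, centroid_p, bbox_p):
    """Eqs. (10) and (12); farfrom (13) is simply the complement."""
    d = math.dist(centroid_o, centroid_p)
    return int(d < 2 * (size(bbox_o) + size(bbox_p)))

def spatial_degree(primitive_hits, n_ij):
    """Eqs. (9)/(14): fraction of co-occurrences of (c_i, c_j) where the primitive holds."""
    return primitive_hits / n_ij if n_ij else 0.0
```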

Example 4

(Spatial relationships)

$$ \begin{array}{rll} \langle a &:& Bottle \geq1 \rangle\\ \langle b &:& Dining\_Table \geq1 \rangle\\ \langle (a:b) &:& hasAppearedAbove \geq 0.76 \rangle\\ \langle (a:b) &:& hasAppearedBelow \geq 0.02 \rangle\\ \langle (a:b) &:& hasAppearedAlignedWith \geq 0.62 \rangle\\ \langle (a:b) &:& hasAppearedCloseTo \geq 0.97 \rangle\\ &\cdots& \end{array} $$

In order to illustrate our approach for building multimedia ontologies, we show in Fig. 6 an extract of the ontology built on the Pascal VOC dataset. This figure depicts the main concepts of the built ontology and the roles used for defining concept relationships. Full arrows represent the subsumption relationships between the ontology concepts. Dashed arrows represent the fuzzy roles used for defining the contextual and spatial relationships between concepts. For clarity of illustration, we restricted the number of \(Annotation_j\) concepts to 4 and did not display the instances (individuals).

Fig. 6: An extract of the multimedia ontology built on the Pascal VOC dataset is illustrated in panel (a). Dashed arrows represent the fuzzy roles used for defining the contextual and spatial relationships between concepts. Panel (b) lists the role names

6 Proposed method: a multi-stage reasoning framework for image annotation

Automatic image annotation is still a challenging problem despite more than a decade of research. Indeed, current approaches struggle to scale up because of the lack of a computational model capable of modeling such a complex system, the uncertainty introduced by statistical learning algorithms, the dependency on the accuracy of the ground truth of the training dataset, and the well-known semantic gap problem. Given a training dataset, automatic image annotation often consists in building a computational model that predicts a set of concepts from the annotation vocabulary for previously unseen images.

Image classification is a widely used technique for image annotation. It consists in applying several binary SVM classifiers to an input image to determine which classes it belongs to. The annotation of an image therefore depends on the classifier outputs, i.e. an image is annotated by a concept \(c_i \in \mathcal{C}\) if the output of the classifier associated with \(c_i\) is positive. Usually, such a process involves considerable uncertainty because of the errors introduced by the machine learning algorithms. However, this uncertainty can be reduced by reasoning over the produced image annotation. For instance, it is most often easy to compute a confidence score (membership value) for the classification of an image into a given class. Such information is valuable and can be of great importance to improve image classification accuracy: one can improve image annotation in a post-classification process based on these confidence scores and an explicit knowledge source, such as an ontology that models image context. In that way, this uncertainty is itself used as a knowledge source in order to achieve better decision-making on the image annotation. Furthermore, the use of an explicit knowledge model can help model, reduce, or even remove this uncertainty by supplying a formal framework to reason about the consistency of the information extracted from images.

Our approach is motivated by the above assumption. Indeed, we propose in the following a multi-stage reasoning framework for image annotation based on the previously built multimedia ontology. The proposed framework allows reasoning on the annotations provided by the image classification algorithm in order to achieve a semantically relevant image annotation. A global overview of the proposed approach is illustrated in Fig. 7.

Fig. 7: Proposed method: a knowledge-based multi-stage reasoning framework for image annotation

Specifically, we consider the following problem. We are given a formal multimedia ontology designed as a fuzzy knowledge base \(\mathcal{KB}=\langle\mathcal{T}\),\(\mathcal{R}\),\(\mathcal{A}\rangle\), where \(\mathcal{T}\) is a fuzzy Terminological Box (TBox), \(\mathcal{R}\) is a regular fuzzy Role Box (RBox), and \(\mathcal{A}\) is a fuzzy Assertional Box (ABox). This fuzzy knowledge base is assumed to contain the following explicit knowledge about ontology concepts: (i) subsumption relationships, (ii) contextual relationships, and (iii) spatial relationships. This multimedia ontology is then used within our framework for annotating previously unseen images. As illustrated in Fig. 7, this is achieved through the following steps:

  • A hierarchical classification is performed on the input image, and the confidence score for each concept \(c_j \in \mathcal{C}\cup \mathcal{C'}\) is recovered.

  • These concepts and their confidence scores are thereafter transformed into fuzzy description logic assertions, and their consistency is checked using the subsumption relationships and our fuzzy DL reasoner. Inconsistent concepts are removed from the candidate annotation of the input image.

  • Thereafter, the consistency of the set of concepts from the candidate annotation is checked with respect to the contextual relationships and our fuzzy DL reasoner. Inconsistent concepts are again removed from the candidate annotation of the input image.

  • Finally, the consistency of the candidate annotation is checked with respect to the spatial information, and the final (candidate) annotation is associated with the input image. This final annotation is supposed to be semantically consistent.

6.1 Hierarchical image classification

Based on the subsumption hierarchy, we propose in the following to train several classifiers that represent the same concept at different levels of abstraction. These classifiers are consistent with each other since they are linked by the subsumption relationship, and thus represent the same information at different levels of detail. Therefore, it is possible to reason on the outputs of these classifiers in order to reach a relevant decision on whether an image belongs to a given class.

Concretely, given a semantic (subsumption) hierarchy, a classifier for each concept node of the hierarchy is trained as a One-Versus-All (OVA) Support Vector Machine [11]. Specifically, for training the classifier of a target concept node, we take as positive samples all images associated with its descendant leaf nodes; negative samples are all the other images of the training database (see the sketch below). The semantic hierarchy is therefore only used to recover the sets of positive and negative sample images for training the classifiers of each concept node at the different layers of the hierarchy. Consequently, the decision function of each classifier is independent of its subsumed (child) and subsuming (parent) concept nodes.
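The following Python sketch illustrates how the positive and negative training sets of the hierarchical OVA classifiers can be assembled from the subsumption hierarchy; the data structures are illustrative, not the authors' implementation.

```python
def leaf_concepts(node, children):
    """Return the descendant leaf concepts of a hierarchy node."""
    if not children.get(node):
        return {node}
    leaves = set()
    for child in children[node]:
        leaves |= leaf_concepts(child, children)
    return leaves

def ova_training_sets(children, image_labels):
    """children: node -> list of child nodes; image_labels: image_id -> set of leaf concepts."""
    sets = {}
    for node in children:
        leaves = leaf_concepts(node, children)
        positives = {i for i, labels in image_labels.items() if labels & leaves}
        negatives = set(image_labels) - positives
        sets[node] = (positives, negatives)
    return sets
```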

Let \(x_i^v\) be a visual representation (feature vector) of an image \(i_i \in \mathfrak{I}\). We train for each concept class \(c_j \in \mathcal{C} \cup\mathcal{C}'\) in the hierarchy a classifier that associates \(c_j\) with its visual features. This is achieved by using \(|\mathcal{C}|+|\mathcal{C'}|\) binary OVA SVMs, with the decision function:

$$\label{Equ:SVM} \mathcal{G}(x_i^v)=\sum\limits_k \alpha_k y_k \mathbf{K}(x_k^v,x_i^v)+b $$
(15)

where \(\mathbf{K}(x_k^v,x_i^v)\) is the value of a kernel function for the training sample \(x_k^v\) and the test sample \(x_i^v\), \(y_k \in \{1,-1\}\) is the class label of \(x_k^v\), \(\alpha_k\) is the learned weight of the training sample \(x_k^v\), and b is a learned threshold parameter.

A Radial Basis Function (RBF) kernel is used for training our SVMs:

$$\label{Equ:Kernel} \mathbf{K}\left(x_k^v,x_i^v\right)=\exp \left(-\frac{\|x_k^v-x_i^v\|^2}{\sigma^2}\right) $$
(16)
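As an illustration, the snippet below trains one binary RBF-kernel classifier and recovers its raw decision scores using scikit-learn; the library choice and the value of σ are assumptions, and any SVM implementation of (15)–(16) would do.

```python
import numpy as np
from sklearn.svm import SVC

sigma = 0.5                                     # illustrative kernel bandwidth
X = np.random.rand(100, 128)                    # dummy visual feature vectors x_i^v
y = np.where(np.random.rand(100) > 0.5, 1, -1)  # {+1, -1} labels for one concept node

clf = SVC(kernel="rbf", gamma=1.0 / sigma**2)   # gamma = 1 / sigma^2, matching (16)
clf.fit(X, y)

scores = clf.decision_function(np.random.rand(5, 128))  # raw confidence scores, cf. (15)
```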

6.2 Reasoning on image annotation using the subsumption hierarchy

Based on the classifier outputs and the subsumption relationships, we propose in the following to check the consistency of candidate concepts. Let us consider a previously unseen image \(i'_i \in \mathfrak{I'}\). Performing a hierarchical image classification on \(i'_i\) produces an output \(\mathcal{P}\) which consists of a set of candidate concepts {\(c_j \in \mathcal C \cup \mathcal C', j=1..n_{i'_i}\)} and their confidence values {\(\alpha_j, j=1..n_{i'_i}\)}, i.e. \(\mathcal{P}=\langle(c_0,\alpha_0), (c_1,\alpha_1), \cdots (c_m,\alpha_m)\rangle\), as illustrated in Fig. 8. Subsequently, these concepts and their confidence scores are transformed into fuzzy description logic assertions. To do so, we first normalize the outputs {\(\alpha_j, j=1..n_{i'_i}\)} of the SVM classifiers into [0,1] by assigning zero to negative values and performing min-max normalization on the positive values. Thereafter, the consistency of each concept \(c_j \in \mathcal{C}\) is checked using the subsumption relationships and our fuzzy DL reasoner. Inconsistent concepts are removed from the candidate annotation.

Fig. 8: Illustrative examples of the proposed method for annotating images

Specifically, our objective is to check the consistency of a candidate concept \(c_j\in \mathcal C\) for a given image \(i'_i\) using the subsumption relationships, and thus the set of its hypernyms {\(c_k\in \mathcal C'\ | \ c_j:C>0 , c_k:D>0 , C\sqsubseteq D>0\)}. The reasoning process can therefore be formulated using conjunctive queries as follows:

$$\begin{array}{lll} valid(c_j) &\leftarrow& \mathcal{P}(c_j)>0 \wedge c_j:C>0 \wedge c_k:D>0 \wedge C \sqsubseteq D>0 \wedge valid(c_k)\\ valid(\top) &=& 1 \end{array}$$

where \(\top\) is the root of the ontology, and \(\mathcal{P}(c_j)\) represents the confidence score of the concept \(c_j\) given by \(\alpha_j\).

In DL, given an abstract individual ‘a’ (an instance of a given candidate concept), the consistency checking of concept inclusions is performed as follows. For \(C \sqsubseteq D\), we compute the greatest lower bound \(glb(\mathcal{KB},C \sqsubseteq D)\) using Axiom A6 in Table 1, i.e. as the minimal value of x such that \(\mathcal{KB} = \langle \mathcal{T},\mathcal{R},\mathcal{A} \cup \{\langle a:C,\alpha_1\rangle\}\cup \{\langle a:D,\alpha_2\rangle\}\rangle\) is satisfiable under the constraints expressing that \(\alpha_1\alpha_2 \leq x\), with \(\alpha_1, \alpha_2 \in [0,1]\). This process is then iterated until the root of the ontology is reached. Thus, we come up with the following hierarchy: \(C_1 \sqsubseteq C_2\geq x_1, C_2 \sqsubseteq C_3\geq x_2,\cdots, C_n \sqsubseteq \top \geq 1\). Thereafter, a confidence score for the considered candidate concept is computed as follows:

$$ \label{eq:ConsistencyCheking} bed(\mathcal{KB},a:ValidCC)= x_1\otimes x_2\otimes \cdots \otimes 1=x_1*x_2*\cdots*1 $$
(17)

where ValidCC stands for a Valid Candidate Concept, which is a concept defined to regroup all the consistent candidate concepts.

Finally, all candidate concepts with a confidence score equal to zero are removed from the annotation of the image \(i'_i\).

In order to illustrate our approach, let us consider the first example in Fig. 8, where evaluations were performed on the Pascal VOC’2010 dataset. The image classification algorithm detected “Motorbike” as a candidate concept (among others) for the considered image. However, according to the subsumption hierarchy (cf. Fig. 3), “Motorbike” \(\sqsubseteq\) “Wheeled_vehicle” \(\sqsubseteq\) “Conveyance”, etc., and therefore the classifiers should also have detected these concepts to stay coherent. The consistency checking of the concept “Motorbike” is performed according to the previously described procedure (cf. Example 5), and this concept is thus removed from the list of candidates since \(bed(\mathcal{KB},Motorbike:ValidCC)=0\).

Example 5

(Consistency checking of concept “Motorbike”)

$$\begin{array}{rll} \mathcal{KB} &=& \langle \mathcal{T},\mathcal{R},\mathcal{A} \cup \{\langle a:Motorbike\geq0.262\rangle\}\\ &&\,\cup \,\{\langle a:Wheeled\_vehicle\geq0\rangle\}\ \\ && \,\cup\,\{\langle a:Conveyance\geq 0\rangle\}\\ &&\,\cup \,\{\langle a:Abstraction\geq0.109\rangle\}\\ &&\,\cup \,\{\langle a:Concept\geq1\rangle\}\rangle\\ bed(\mathcal{KB},Motorbike:ValidCC)&=& 0.262\otimes 0 \otimes 0 \otimes 0.109 \otimes 1= 0 \end{array}$$
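A minimal Python sketch of this consistency check is given below, using the values of Example 5: the best entailment degree of a candidate concept is the product of the normalized scores along its hypernym path, so a zero anywhere on the path invalidates the candidate. The dictionary-based representation of the hierarchy is an assumption.

```python
def bed_valid_cc(concept, parent, scores):
    """parent: child -> parent concept (None at the root); scores: concept -> score in [0, 1]."""
    degree, node = 1.0, concept
    while node is not None:
        degree *= scores.get(node, 0.0)
        node = parent.get(node)
    return degree

# Values taken from Example 5 (first example of Fig. 8):
scores = {"Motorbike": 0.262, "Wheeled_vehicle": 0.0, "Conveyance": 0.0,
          "Abstraction": 0.109, "Concept": 1.0}
parent = {"Motorbike": "Wheeled_vehicle", "Wheeled_vehicle": "Conveyance",
          "Conveyance": "Abstraction", "Abstraction": "Concept", "Concept": None}
print(bed_valid_cc("Motorbike", parent, scores))   # 0.0 -> the candidate is removed
```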

6.3 Reasoning on image annotation using image context

As aforementioned, contextual information can provide valuable clues for understanding image context and for reasoning about the consistency of image annotations. For instance, it is evident that an image containing the set of concepts {“Aeroplane”, “Person”, “Car”} represents an airport tarmac scene, and not that of a flying plane. Conversely, it is obvious that an image containing “Dining_table” and “Sofa” should not contain “Boat” or “Bus”. Thus, contextual information, if exploited, can help check the consistency of image annotations.

Using our multimedia ontology, it is easy to recover contextual information about images. Consequently, we propose in the following to use this information to retrieve from our ontology all annotations that are consistent with respect to contextual information, and to compute the best explanation of a considered image. Specifically, the fuzzy role “isAnnotatedBy” allows predicting a confidence score (based on contextual information) for a given set of candidate concepts. Given a candidate annotation \(CA_j=\langle c_1,c_2,\cdots,c_m\rangle\) and a target image \(i'_i \in \mathfrak{I'}\), a confidence score is computed to estimate the correlation likelihood between \(CA_j\) and \(i'_i\). This confidence score increases with the likelihood of the candidate annotation \(CA_j\), and equals 0 when the annotation is not valid.

Concretely, given an image \(i'_i\) and a set \(\mathcal{P'}:\langle(c_1,\alpha_1), (c_2,\alpha_2), \cdots, (c_m,\alpha_m)\rangle, m=|\mathcal{P'}|\), of valid candidate concepts with respect to the subsumption relationships, we first build the set of candidate annotations \(CA_j, j \in 1..|combinations|\), by taking all possible combinations of the concepts in \(\mathcal{P'}\). A confidence score is then computed for each valid candidate annotation (ValidCA). For instance, assuming a candidate annotation consisting of 3 concepts, its confidence score is computed as follows:

Example 6

(Reasoning using image context)

$$\begin{array}{lll} &&\mathcal{P'}:\langle(c_1,\alpha_1), (c_2,\alpha_2), (c_3,\alpha_3)\rangle,\ \text{(classifier outputs)}\\ &&\langle c_1:C_1\geq\alpha_1\rangle,\ \langle c_2:C_2\geq\alpha_2\rangle, \langle c_3:C_3\geq\alpha_3\rangle\ \\ && \langle CA \equiv C_1 \sqcap C_2 \sqcap C_3\rangle\\ &&\langle b:CA\geq \alpha_b\rangle,\ \ s.t.\ \alpha_b= \alpha_1 \otimes \alpha_2 \otimes \alpha_3\\ &&\mathcal{KB} = \langle \mathcal{T},\mathcal{R}, \mathcal{A} \cup \{\langle a:Image\geq\alpha_a\rangle\}\cup \{\langle b:CA\geq\alpha_b\rangle\}\rangle\\ &&\langle (a,b):isAnnotatedBy\geq\alpha_r\rangle, \text{is already stored in the $\mathcal{KB}$ during the ontology} \\ && \text{building process, where $\alpha_r=\mu_{\text{isAnnotatedBy(a,b)}}$ (cf. (8))}.\\ \end{array} $$

Therefore, according to (1), the correlation likelihood between a candidate annotation CA and a given image \(i'_i\) can be computed as follows:

$$\label{eq:ValidCA} glb(\mathcal{KB},a:\exists\ isAnnotatedBy.CA)=\alpha_b\otimes\alpha_r =(\alpha_1 \otimes \alpha_2 \otimes \alpha_3) \otimes \ \mu_{\text{isAnnotatedBy}(a,b)} $$
(18)

then,

$$\label{eq:ValidCA2} ValidCA\equiv \exists\ isAnnotatedBy.CA $$
(19)

Finally, the best explanation (bex) of \(i'_i\) is retrieved as the ValidCA having the maximum correlation likelihood among all candidate annotations. This explanation is computed as follows:

$$ \label{eq:bex} bex(\mathcal{KB},ValidCA)=\{\langle a,r \rangle | r= bed(\mathcal{KB},a:ValidCA) \} $$
(20)

For instance, let us consider the first example in Fig. 8. We show below some cases of DL reasoning using the contextual information:

Example 7

(DL Reasoning using image context)

$$ \begin{array}{lll} &&\mathcal{P'}:\langle(c_1:Horse,0.391), (c_2:Person,0.805), (c_3:Sheep,0.519), (c_4:Cow,0.310)\rangle\\ &&\langle CA_0 \equiv Horse \sqcap Person \sqcap Sheep\rangle\\ &&\langle CA_1 \equiv Person \sqcap Sheep\rangle\\ &&\langle CA_2 \equiv Cow \sqcap Person\rangle\\ &&\langle b_0:CA_0\geq 0.163\rangle\\ &&\langle b_1:CA_1\geq 0.417\rangle\\ &&\langle b_2:CA_2\geq 0.249\rangle\\ &&\mathcal{KB} = \langle \mathcal{T},\mathcal{R}, \mathcal{A} \cup \{\langle a:Image\geq1\rangle\}\cup \{\langle b_0:CA_0\geq0.163\rangle\}\cup \{\langle b_1:CA_1\geq0.417\rangle\}\\ &&\phantom{\mathcal{KB} = }\;\cup\{\langle b_2:CA_2\geq0.249\rangle\}\rangle\\&& glb(\mathcal{KB},a:\exists\ isAnnotatedBy.CA_0)= (0.391 \otimes 0.805 \otimes 0.519) \otimes 0.003548= 0.00057\\ &&glb(\mathcal{KB},a:\exists\ isAnnotatedBy.CA_1)= (0.805 \otimes 0.519) \otimes 0.027413 = \textbf{0.01145}\\ &&glb(\mathcal{KB},a:\exists\ isAnnotatedBy.CA_2)= (0.310 \otimes 0.805) \otimes 0.025455=0.00635\\ && bex(\mathcal{KB},ValidCA)=0.01145 \end{array} $$

Consequently, with respect to the contextual information, the best explanation for the left image in Fig. 8 is: \(CA_1 \equiv Person \sqcap Sheep\).
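The reasoning of Example 7 can be mimicked with a short script. The sketch below is only illustrative: it assumes the classifier scores and the membership degrees of the fuzzy role “isAnnotatedBy” are already available (here hard-coded with the values of Example 7), builds the candidate annotations as combinations of the valid concepts, and ranks them according to (18)–(20).

```python
# Minimal sketch (under our assumptions, not the authors' code): ranking candidate
# annotations by their correlation likelihood. The membership degrees of the fuzzy
# role "isAnnotatedBy" are taken from Example 7; in the real system they are
# retrieved from the ontology built beforehand.
from itertools import combinations

# valid candidate concepts with their classifier scores (Example 7)
P = {"Horse": 0.391, "Person": 0.805, "Sheep": 0.519, "Cow": 0.310}

# membership degrees mu_isAnnotatedBy(a, CA) stored in the KB (assumed values)
MU = {frozenset({"Horse", "Person", "Sheep"}): 0.003548,
      frozenset({"Person", "Sheep"}): 0.027413,
      frozenset({"Cow", "Person"}): 0.025455}

def correlation_likelihood(ca):
    """glb(KB, a: exists isAnnotatedBy.CA) with the product t-norm, as in (18)."""
    alpha_b = 1.0
    for c in ca:
        alpha_b *= P[c]
    return alpha_b * MU.get(frozenset(ca), 0.0)  # 0 if CA is unknown to the KB

# build all candidate annotations and keep the best explanation (bex)
candidates = [ca for r in range(2, len(P) + 1) for ca in combinations(P, r)]
bex = max(candidates, key=correlation_likelihood)
print(bex, round(correlation_likelihood(bex), 5))  # ('Person', 'Sheep') 0.01145
```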

Please note that, since most images of the Pascal VOC dataset contain only one or two concepts [10], the distribution of multi-labeled images is not uniform. We therefore computed (8) for this dataset as:

$$\label{eq:isAnnotatedBy2} \mu_{\text{isAnnotatedBy}(Photo,Annotation_i)}=\frac{n_{Annotation_i}}{\mathcal{L}}*\exp(\Lambda) $$
(21)

where \(\Lambda = |Annotation_i|\).
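As a small illustration of (21), the following sketch assumes that \(n_{Annotation_i}\) denotes the number of training images carrying \(Annotation_i\) and \(\mathcal{L}\) the total number of training images, as in (8); the numerical values are made up.

```python
import math

def mu_is_annotated_by(n_annotation: int, L: int, annotation_size: int) -> float:
    """Sketch of (21): the frequency of the annotation is re-weighted by
    exp(|Annotation_i|) to compensate for the scarcity of images labeled with
    many concepts. We assume n_annotation = n_{Annotation_i} and L = |training set|."""
    return (n_annotation / L) * math.exp(annotation_size)

# e.g. an annotation {Person, Sheep} seen in 21 of 5,000 training images
print(mu_is_annotated_by(21, 5000, 2))  # ~0.031
```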

6.4 Reasoning using spatial information

Contextual knowledge can help the recognition of objects within a scene by providing predictions about the objects that are most likely to appear in a specific setting, i.e. topological information, along with the locations that are most likely to contain objects in the scene, i.e. spatial information. Specifically, the spatial arrangement of objects provides important information for the recognition and interpretation tasks, and allows solving ambiguities between objects having a similar appearance. As part of this work, we have proposed an approach based on image classification for annotating images. Consequently, we do not have the spatial positions of the detected concepts, and therefore the reasoning capabilities using spatial information are limited in the current approach. However, we propose in the following a simple but effective usage scenario that relies on the spatial arrangement of the currently detected concepts in order to provide a semantically consistent image annotation. In Section 8, we propose some usage scenarios that illustrate the usefulness of spatial information and of reasoning over this kind of knowledge in order to improve image annotation.

Given an image \(i'_i \in \mathfrak{I'}\) and a set \(\mathcal{P''}:\langle(c_1,\alpha_1), (c_2,\alpha_2),\cdots,(c_m,\alpha_m)\rangle, m=|\mathcal{P''}|\), of valid candidate concepts with respect to the subsumption relationships and contextual information, we first query the ontology in order to retrieve all possible spatial arrangements of all pairs of concepts \((c_j,c_k) \in \mathcal{P''}\), together with the confidence score of each of these arrangements. A score can then be computed as the maximum likelihood of all spatial arrangements of these concepts to find the best explanation of \(i'_i\). Algorithm 1 details the different steps of this method.
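Algorithm 1 is not reproduced here, but the following sketch illustrates its underlying idea under our assumptions: for each pair of candidate concepts, the ontology is queried for the confidence scores of their possible spatial arrangements, and the most likely arrangement of each pair contributes to the score of the annotation. The toy dictionary standing in for the knowledge base is purely illustrative.

```python
# Illustrative sketch only: spatial-consistency score of a set of candidate
# concepts, computed from the confidence scores of pairwise spatial arrangements.
# Assumed toy content of the KB: (concept_a, concept_b) -> {relation: confidence}
SPATIAL = {("Person", "Sheep"): {"leftOf": 0.47, "above": 0.08},
           ("Person", "Horse"): {"above": 0.62}}

def spatial_score(concepts):
    score = 1.0
    for i, a in enumerate(concepts):
        for b in concepts[i + 1:]:
            arrangements = SPATIAL.get((a, b)) or SPATIAL.get((b, a)) or {}
            if arrangements:                      # most likely arrangement of this pair
                score *= max(arrangements.values())
    return score

print(spatial_score(["Person", "Sheep"]))  # 0.47
```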

Reasoning on spatial information should also allow providing a good image interpretation. For instance, computing the maximum spatial arrangement likelihood allows retrieving the most likely spatial arrangement of each detected concept in a given image. This would allow, for example, providing a textual description of a given image in the following way:

  • Figure 8, first example: “This picture depicts a person standing on the left of a sheep. They are close to each other.”

  • Figure 8, second example: “This picture depicts a cat sitting on a table in a living room. There is a table, a sofa and a television in the living room.”

It is easy to implement such a system for image interpretation once information about the detected concepts and their spatial locations is available [20]. We will address the implementation of such a system in our future work.

7 Experiments

In this paper, evaluations are performed on the Pascal VOC’2009 dataset [15] and the Pascal VOC’2010 dataset [16]. These datasets contain about 11,000 images and 20 concepts. Each image is annotated with one or more concepts from the annotation vocabulary. In the following, we introduce the method used for the visual representation of images, then we present the results obtained on these datasets and compare our proposal to recent work.

7.1 Visual representation of images

The Bag-of-Features (BoF) representation, also known as Bag-of-Visual-Words (BoVW), is used in this paper to describe image features. The BoF model has shown excellent performance and has become one of the most widely used models for image classification and object recognition [28]. In our approach, image features are described as follows. Lowe’s DoG detector [31] is used to detect a set of salient image regions. A signature of these regions is then computed using the SIFT descriptor [31]. Afterwards, given the collection of regions detected in the training set of all categories, we generate a codebook of size K = 1,000 by running the k-means algorithm. Each detected region in an image is then mapped to the most similar visual word in the codebook through a KD-Tree. Finally, each image is represented by a histogram of K visual words, where each bin in the histogram corresponds to the number of occurrences of a visual word in that image.
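The pipeline can be sketched as follows with off-the-shelf libraries (OpenCV, scikit-learn and SciPy stand in for the authors' implementation; the image paths are placeholders).

```python
# Minimal sketch of the described BoF pipeline (library stand-ins, not the
# authors' code; K matches the paper, the file list is a placeholder).
import cv2
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import cKDTree

K = 1000
sift = cv2.SIFT_create()  # DoG detector + SIFT descriptor [31]
train_image_paths = ["img_0001.jpg", "img_0002.jpg"]  # placeholder paths

def describe(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

# 1) codebook: k-means over all descriptors of the training images
train_desc = np.vstack([describe(p) for p in train_image_paths])
codebook = KMeans(n_clusters=K, n_init=4, random_state=0).fit(train_desc).cluster_centers_
tree = cKDTree(codebook)  # fast nearest-visual-word assignment

# 2) image signature: histogram of visual-word occurrences
def bof_histogram(path):
    _, words = tree.query(describe(path))
    return np.bincount(words, minlength=K).astype(np.float32)
```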

7.2 Evaluation of image annotation

As aforementioned, experiments are performed on the Pascal VOC’2009 and VOC’2010 datasets. Since we do not have the test sets used in these challenges, we used 50 % of each dataset for training the classifiers and the remaining images for evaluating our approach.

In order to emphasize the importance of hierarchical image classification and of ontological reasoning using the subsumption relationships, we illustrate in Fig. 9 the average precision and Precision/Recall (PR) curves obtained for the concepts of each level of the hierarchy. As depicted in this figure, the concepts in the higher levels of the hierarchy have a high average precision, and the classifier accuracy decreases as we go deeper in the hierarchy. These results can be explained as follows. Firstly, the classes in the higher levels of the hierarchy differ widely in their visual appearance, i.e. it is easy to find a boundary that separates them. Secondly, they are more balanced, i.e. these classes have more positive samples for training their classifiers than those in the lower levels of the hierarchy. We can therefore conclude that the subsumption relationships should help improve the image annotation results, as they provide a formal framework for reasoning about concept consistency. Moreover, since the classification accuracy increases as we move to the upper levels of the hierarchy, the overall classification accuracy should increase as well.

Fig. 9
figure 9

Hierarchical classification: Precision/Recall (PR) curves for the concepts of each level of the hierarchy

In Fig. 10, we compare our framework for image annotation to the following methods: a flat classification method, a hierarchical classification method and a baseline method. The baseline method is built by taking the average of the submission results to the Pascal VOC’2010 challenge. The flat classification is performed using |C| One-Versus-All (OVA) SVMs, where the inputs are the BoF representations of images and the outputs are the desired SVM responses for each image (1 or −1). We used cross-validation to overcome the unbalanced data problem, taking at each fold as many positive as negative images. Hierarchical classification is performed by training a set of (|C| + |C′|) hierarchical OVA classifiers consistent with the structure of the hierarchy illustrated in Fig. 3—for more details about hierarchical classification see Section 6.1. Results are evaluated in terms of Average Precision (AP) scores.
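For illustration, the flat OVA baseline could be set up along the following lines (a scikit-learn sketch under our assumptions; the paper additionally relies on cross-validation with balanced folds, which is omitted here for brevity).

```python
# Illustrative sketch of the flat OVA baseline: X is the matrix of BoF histograms,
# Y the {-1, +1} label matrix (one column per concept); both are assumed given.
import numpy as np
from sklearn.svm import SVC

def train_ova(X, Y):
    """One SVM per concept; the negatives are subsampled to balance the classes."""
    classifiers = []
    rng = np.random.default_rng(0)
    for c in range(Y.shape[1]):
        pos = np.flatnonzero(Y[:, c] == 1)
        neg = np.flatnonzero(Y[:, c] == -1)
        neg = rng.choice(neg, size=min(len(pos), len(neg)), replace=False)
        idx = np.concatenate([pos, neg])
        classifiers.append(SVC(kernel="rbf", probability=True).fit(X[idx], Y[idx, c]))
    return classifiers
```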

Fig. 10
figure 10

Comparison of our method for image annotation with: a flat classification method, a hierarchical classification one, and the baseline method. Comparison is performed on VOC’2010 dataset

As illustrated in Fig. 10, our method for image annotation achieves better results than the other ones on the Pascal VOC’2010 dataset, with an average precision of 66.49 % and a gain of +8.6 % compared to the baseline method, +14.8 % compared to the hierarchical classification method and +32.6 % compared to the flat classification method. These results confirm the effectiveness of the proposed approach, and the importance of contextual and spatial information for improving image annotation. These improvements could be even more significant on a dataset containing more multi-labeled images. Indeed, in the Pascal VOC dataset the proportion of images labeled with more than two concepts is small compared with the total number of images [10].

In Fig. 11, we compare our framework for image annotation to the following methods: Bottom-Up Score Fusion (BUSF) [4], Top-Down Classifiers Voting (TDCV) [4] and Hierarchy of SVM (H-SVM) [32]. As can be seen in this figure, our multi-stage reasoning framework for image annotation outperforms the other methods on all classes. Please note that this comparison was performed using the same experimental setup, i.e. the same training/validation sets from the VOC’2010 dataset and the same visual representation of images. It is therefore clear that the proposed multimedia ontology and the proposed framework for reasoning about the consistency of image annotation achieve a significant improvement in image annotation accuracy. These results also highlight the effectiveness of using explicit knowledge models, such as ontologies, for achieving semantically relevant image annotation.

Fig. 11
figure 11

Comparison of our framework for image annotation to previous work on the Pascal VOC’2010 dataset. Our approach outperforms the other methods on all classes

In Table 3, we compare our multi-stage reasoning framework for image annotation to the methods of [47] and [45] on the Pascal VOC’2009 dataset. In [47], the authors proposed a method for image classification using local visual descriptors and their spatial coordinates. Their method first applies a nonlinear feature transformation to the local appearance descriptors, termed super-vector coding, which exploits the residual vector information obtained from vector quantization (VQ). These descriptors are then aggregated to form an image-level feature vector, which is finally fed into a classifier to perform image classification. In [45], an efficient sparse coding algorithm based on a mixture model is proposed, designed to work with the much larger dictionaries that often yield higher classification performance. The mixture model softly partitions the descriptor space into local sub-manifolds, where sparse coding with a much smaller dictionary can quickly fit the data. As illustrated in Table 3, our approach performs better than the other ones and achieves a gain of 3.41 % compared to the method of [47] and of 3.1 % compared to the method of [45]. This result is promising, especially because we used only half of the training set for training our classifiers and the other images for evaluating our approach, since we did not have the testing set. We also wish to recall that we included in our evaluation the images and the concepts marked as difficult, which are ignored in the challenge because they are considered difficult to recognize. For instance, in the third example of Fig. 8, we can easily observe a “Dining_table” in the illustrated image. However, “Dining_table” is marked as difficult in the ground-truth of this image in the VOC’2009 challenge, and thus it does not count when computing the average precision of this concept. In our evaluation, we included these concepts, i.e. if they are not detected they count as false negatives. Furthermore, the scope of our paper is to study the potential of adding contextual and spatial information to the image annotation process through the use of an ontology and ontological reasoning. We have therefore focused our contribution on these points and did not seek to implement a very efficient image descriptor, since this is not the aim of our paper. Accordingly, the obtained results could be further improved, for example by incorporating other image features.

Table 3 Comparison of our method for image annotation with the ones of [47] and [45] on Pascal VOC’2009 dataset

Finally, we want to highlight that some images in the VOC dataset are badly annotated. For instance, in the third example of Fig. 8 we can distinguish a bottle partially hidden by a vase and a potted flower in the background of the image. However, these concepts (i.e. “Bottle” and “Potted_plant”) are missing from the ground-truth of this image. Thus, although our method succeeded in recognizing these concepts, they counted as false positive detections in the evaluation of our method since they are missing from the ground-truth. For the second example of Fig. 8, our method detected the concept “Dining_table”, which is absent from the ground-truth. However, the image does depict a “coffee table” and our prediction is therefore semantically relevant, especially since the annotation vocabulary does not provide concepts such as “Table” or “Coffee_Table”. In Fig. 12, we illustrate another image which is badly annotated in the dataset. Indeed, the ground-truth of this image contains only the concept “Person”. However, the image depicts many more concepts: a bottle, chairs, tables, and screens. Our method detected these concepts, but according to the ground-truth these detections counted as false positives.

Fig. 12
figure 12

An example of a badly annotated image in the VOC’2010 dataset. Ground-truth: Person. Annotation provided by our method: Bottle: 0.982, Chair: 0.281, Dining_table: 0.493, Person: 1.00, Tv_monitor: 0.333

8 Discussion

The proposed methodology for building multimedia ontologies is original, and is useful for modeling and understanding image semantics, i.e. for identifying and formalizing the semantic relationships between image concepts. Indeed, the representations of our concepts and of their semantic relationships are automatically extracted from image datasets, which provides an efficient modeling of image semantics and allows extending our ontology at any time by mining new image datasets. Efficient modeling of image semantics means here: less sensitive to the subjectivity of human perception and less sensitive to the semantic gap.

Regarding the usefulness of our multimedia ontology for computer vision tasks, we propose in the following some usage scenarios. Given a sufficiently large amount of multimedia content, it is possible to extend our approach in order to model (or learn), in a simple way, complex concepts by mining this content. For instance, suppose we have a well-annotated image database that is representative of real-life scenes. It is obvious that when we find a ‘Computer monitor’ in a given image, it is very likely to also find a ‘Mouse’ and a ‘Keyboard’, and these concepts will thus share a high co-occurrence confidence score. One can therefore use our proposed approach to define complex concepts, not previously included in the annotation vocabulary, based on the fuzzy role ‘hasAppearedWith’ and the co-occurrence confidence score. Specifically, if the co-occurrence score of a set of concepts is sufficiently high (greater than a predefined threshold), we can use their definitions in WordNet to find the common concept that connects them, and consequently define this (complex) concept automatically. To illustrate this proposal, here are some examples of concepts defined by the method described above:

Example 8

(Scenario 1: Defining complex concepts)

$$ \begin{array}{lll} && \langle Sitting\_room \equiv Sofa \sqcap Table \sqcap Television\rangle\\ && \langle Beach \equiv Sea \sqcap Sand \sqcap Sky \sqcap \exists hasAppearedAbove(Sea,Sand)\sqcap\\ && \quad\qquad\qquad \exists hasAppearedBelow(Sea,Sky)\rangle\\ && \langle Computer \equiv Screen \sqcap Keyboard \sqcap Mouse \sqcap \exists hasAppearedAbove(Screen,\\ && \quad \quad Keyboard) \sqcap \exists hasAppearedRightOf(Mouse,Keyboard)\rangle \end{array} $$
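A possible realization of this scenario, assuming WordNet is accessed through NLTK and ignoring word-sense disambiguation (the first noun sense of each concept name is used), is sketched below: the common concept is taken as the lowest common hypernym of the co-occurring concepts.

```python
# Illustrative sketch (assumption: WordNet is available via NLTK and concept names
# map directly to noun synsets): naming a complex concept from a set of frequently
# co-occurring concepts via their lowest common hypernym.
from functools import reduce
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def common_concept(concepts):
    """Lowest common hypernym of the first noun sense of each concept name."""
    synsets = [wn.synsets(c, pos=wn.NOUN)[0] for c in concepts]
    return reduce(lambda s, t: s.lowest_common_hypernyms(t)[0], synsets)

# concepts sharing a high co-occurrence score through the role 'hasAppearedWith'
print(common_concept(["screen", "keyboard", "mouse"]))
```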

Another usage scenario consists in a knowledge-driven approach to image annotation using object detection. Indeed, one popular technique for identifying and localizing objects in an image is sliding-window object detection. It consists in defining a fixed-size rectangular window and applying a classifier to the sub-image defined by the window. The classifier extracts image features from within the window and returns the probability that the window bounds a particular object. The process is repeated on successively scaled copies of the image so that objects can be detected at any size.
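For reference, a generic sliding-window detector can be sketched as follows (illustrative only; the classifier is an assumed callable returning the probability that a window bounds the object).

```python
# Generic sliding-window detection sketch (illustrative only).
import cv2

def sliding_window_detect(image, classifier, win=64, stride=16, scales=(1.0, 0.75, 0.5)):
    detections = []
    for s in scales:  # successively scaled copies of the image
        h, w = int(image.shape[0] * s), int(image.shape[1] * s)
        scaled = cv2.resize(image, (w, h))
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                p = classifier(scaled[y:y + win, x:x + win])  # probability for this window
                if p > 0.5:
                    detections.append((x / s, y / s, win / s, p))  # back to original coords
    return detections
```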

Now, suppose that one has a multimedia database annotated with a vocabulary of about 3,000 concepts, as for instance the SUN database [44]. We would then have 3,000 object detectors to run on every image of the database and at different scales, which is computationally very expensive. The complexity of this task can be decreased significantly by using our multimedia ontology and the scenario defined below.

Example 9

(Scenario 2: A knowledge-driven approach for object detection.) Given a previously unseen image, proceed as follows (a sketch of this loop is given after the list):

  1. Apply progressively the detectors of the most frequent concepts (w.r.t. the ’hasFrequency’ concrete feature) in \(\mathcal{KB}\), until a first concept \(c_i \in \mathcal{C}\) is detected.

  2. Query the ontology (\(\mathcal{KB}\)) for the most likely concept \(c_j\in \mathcal{C}\) to appear with \(c_i\), and for its predicted spatial location.

  3. Apply the detector for \(c_j\), delimiting the search space according to the predicted spatial location. If it fails go to 2, else go to 4.

  4. Query the ontology for candidate textual annotations with respect to the already detected concepts and their locations.

  5. In decreasing order of the confidence scores of these annotations, apply the detectors for the concepts of the selected annotation. If all concepts of the considered annotation are detected go to 6, else go to 4 (to select another annotation consistent with the already detected concepts).

  6. Stop the processing and return the object detection result (i.e., the set of detected concepts and their spatial locations) for the input image.
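The sketch below gives our reading of this loop; the kb and detectors objects are assumed interfaces introduced for illustration, not an existing API.

```python
# Illustrative sketch of Scenario 2 (assumed interfaces, not the authors' code):
# detectors are triggered only for concepts the ontology predicts as likely,
# and only within the predicted region. Each detector takes (image, region) and
# returns a detection (region, score) or None.

def knowledge_driven_detection(image, kb, detectors):
    detected = {}                                      # concept -> (region, score)

    # 1. run the detectors of the most frequent concepts until one fires
    for c in kb.concepts_by_frequency():               # ordered w.r.t. 'hasFrequency'
        hit = detectors[c](image, None)
        if hit:
            detected[c] = hit
            break

    # 2-3. look for the concept most likely to co-occur with what was found,
    #      restricting its detector to the spatial location predicted by the KB
    for c_j, region in kb.likely_cooccurring(detected):
        hit = detectors[c_j](image, region)
        if hit:
            detected[c_j] = hit
            break

    # 4-5. walk the candidate annotations (decreasing confidence) and try to
    #      complete one of them with the remaining detectors
    for annotation in kb.candidate_annotations(detected):
        hits = {c: detectors[c](image, None) for c in annotation if c not in detected}
        if all(hits.values()):
            detected.update(hits)
            break                                      # 6. annotation completed: stop

    return detected
```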

This usage scenario allows reducing significantly the complexity of the object detection process: it requires running far fewer detectors than the classical approach and restricts the detection zone according to the already detected concepts. It is thus clear that the proposed ontology is useful to effectively manage image processing tasks and to efficiently perform image annotation. These usage scenarios will be addressed in our future work.

9 Conclusion

In this paper, we proposed a new approach to automatically build a fuzzy multimedia ontology dedicated to image annotation and interpretation. In our approach, visual and conceptual information are used to build a semantic hierarchy faithful to image semantics, which serves as the backbone of our ontology. The ontology is thereafter enriched with contextual and spatial information. Fuzzy description logics are used as a formalism to represent our ontology and to deal with the uncertainty and the imprecision of concept relationships. Some usage scenarios are then proposed to show the usefulness of the proposed ontology.

We subsequently proposed a new method for image annotation based on hierarchical image classification and a multi-stage reasoning framework for reasoning about the consistency of the produced annotation. An empirical evaluation of our approach on the Pascal VOC’2009 and Pascal VOC’2010 datasets has shown a significant improvement in the average precision results.