1 Introduction

Extracting semantic image descriptions is an intricate problem, which has challenged researchers for decades in the quest for generalisable, yet robust, approaches that alleviate the so-called semantic gap and support content management services at a level closer to user needs [14, 26, 41]. A factor partially accountable for these challenges lies in the very nature of the endeavour, namely the fact that it involves handling a significant amount of imprecise and incomplete information. Leaving aside issues related to the subjective interpretations that different users may attribute to conveyed meaning, intention, etc., imprecision is intrinsic to a number of tasks, including segmentation and feature extraction, as well as to intra- and inter-class variability. Inevitably, managing imprecision is critical not only when dealing with the aforementioned tasks, but also for the subsequent processing that realises the extraction of semantic descriptions.

Towards this direction, machine learning approaches have gained growing popularity in recent years, as they provide convenient means for discovering and handling knowledge that is either incomplete or too complex to be handled explicitly. Support Vector Machines (SVMs) and Bayesian networks constitute characteristic examples in this direction, allowing one to learn, in a generic fashion, classifiers for a significant number of concepts (referring to objects, events, scene descriptions, etc.) [1, 25, 42, 51]. Despite the reports of successful applications, the performance remains highly variable and deteriorates rather dramatically as the number of supported concepts increases. Among the principal causes of this variable performance are effects related to similarities between the visual manifestations of semantically distinct concepts and to the variance in the possible manifestations a single concept may have. These effects issue to a large extent from the fundamental assumption underlying learning approaches, i.e. that the addressed semantics is captured to a satisfactory degree by features pertaining to its visual manifestations.

Since in many cases semantics goes beyond the capacity of perceptual features, discrepancies result between the learned associations and the intended ones. Consequently, the learnt classifiers often produce complementary, overlapping, incomplete, and even conflicting descriptions. Evidence can be found not only in individual evaluations of research studies in the relevant literature, but also in large scale benchmarks, such as the TRECVID challenge [40], sponsored by the National Institute of Standards and Technology (NIST). Two main challenges confronted repeatedly in the series of annual TRECVID evaluation activities are the deterioration of performance as the number of addressed concepts increases and the variable correspondence between the semantics of the content retrieved using the learned detectors and the semantics claimed by the detectors themselves [15, 43]. The efforts during the last edition included enhancing robustness, even for a reduced number of concepts, through the utilisation of complementary information beyond visual features, either in the form of taxonomic relations as captured in the LSCOM ontology [27] or through detector combinations [28, 44].

Motivated by the aforementioned, we investigate the utilisation of formal semantics in order to translate the outcome of statistically learned classifiers into a semantically consistent image interpretation. Focusing on approaches deploying perceptual similarity against learned concept models, viz. where the confidence of the extracted descriptions reflects the membership to a concept class, we propose a fuzzy Description Logic (DL) based framework to capture and reason over the extracted descriptions, while handling the underlying vagueness. The input image classifications may be at either scene or object level, and their interpretation is realised in three steps, namely designation of the scene level characterisation, identification and resolution of inconsistencies with respect to possibly conflicting classifications, and enrichment, where additional inferred descriptions are made explicit.

The rest of the paper is structured as follows. Section 2 explicates the reasons that motivated our investigation into a fuzzy DL based reasoning approach and outlines the contribution with respect to the relevant literature. Section 3 provides a brief introduction to fuzzy DLs and delineates features pertaining to reasoning within the context of image interpretation. Section 4 presents the proposed framework architecture and the details of the individual reasoning tasks involved. Evaluation is given in Section 5, where the proposed framework is assessed against two experimental settings that consider input classifications of loose and rich semantic associations. Additionally, initial results are presented with respect to enriching the background knowledge with co-occurrence information in the form of fuzzy implication. Finally, Section 6 summarises the paper and discusses future directions.

2 Motivation and contribution

Learning based approaches provide a number of appealing traits, such as generic learning mechanisms and the capability to capture and utilise associations hidden in the examined input data whose explicit handling might otherwise be too complex and strenuous to be effective. Despite their unquestionable role in semantic image analysis tasks, one cannot overlook the limitations that conduce to the variability of the attained performance. As mentioned previously, the problem lies to a large extent in the rather poor utilisation of the semantics underlying the learned concepts and in the imprecision involved in the constituent tasks. Aiming to enhance performance, we focus on a twofold goal, addressing the utilisation of concept semantics while providing the means to cope with imprecision.

The use of explicit knowledge for the purpose of introducing concept semantics is not a new idea. Going back to the 80s and early 90s, one finds an extensive literature [8, 35] that investigates a wide gamut of knowledge representation schemes, as rigorous as first order logic [36] and as intuitive as the early semantic networks [32]. Among the weaknesses of the early knowledge-directed paradigms was the lack of common representations and reasoning mechanisms, which prohibited interoperability and reuse of knowledge. The Semantic Web (SW) initiative changed the scenery, advocating explicit semantics and corresponding representation languages to capture meaning in a formal and interoperable fashion. Since 2004, the Resource Description Framework Schema (RDFS) [7] and the Web Ontology Language (OWL) [4] constitute formal W3C recommendations, while substantial interest has revived in Description Logics (DLs), as they underpin the semantics of the SW languages.

In analogy to the different expressivity features provided by the various knowledge representation formalisms, a critical factor regarding the handling of imprecision concerns the nature of its semantics. Examining the relevant literature, imprecision may manifest in the extracted descriptions either as uncertainty regarding the presence of a specific entity (e.g. the presence of a sky region), or as vagueness regarding the degree to which the statement about the presence of the entity is true. Uncertainty characterises probabilistic approaches such as Bayesian nets, while vagueness is encountered in approaches that deploy similarity (distance) metrics against feature models acquired through training, such as SVMs. The confidence values of the latter reflect the extent of matching against the learned concept models, and as such each model can be taken as a fuzzy set, with distance metrics serving the role of the membership function.
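
To make this reading concrete, the following sketch (in Python) turns the raw decision value of a binary SVM into a membership degree in [0,1]. The mapping and its parameter are purely illustrative assumptions; the classifiers employed in this work are assumed to output such degrees directly.

    import math

    def membership_degree(decision_value, scale=1.0):
        # Map a raw SVM decision value (signed distance to the separating
        # hyperplane) to a fuzzy membership degree in [0,1]. The logistic
        # squashing used here is one plausible choice; 'scale' controls
        # how quickly the degree saturates towards 0 or 1.
        return 1.0 / (1.0 + math.exp(-scale * decision_value))

    # e.g. a region lying well inside the 'sea' half-space
    print(membership_degree(2.3))   # ~0.91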

Based on the aforementioned, our initial specifications consisted in the selection of a representation formalism with well-defined semantics and the designation of the targeted imprecision semantics. Perceptual similarity-based approaches constitute a fundamental element in statistical learning for image interpretation, and as such the current investigation focuses on vague rather than probabilistic information. In this context, the concrete incentives of the proposed fuzzy DLs based reasoning framework issue from the specific traits characterising the application of SVM-based concept classifiers, namely contradictory descriptions that pertain to semantically different interpretations, and incomplete descriptions, even in the presence of corresponding classifiers.

To put the implications involved into perspective in an intuitive manner, let us consider two example images and the respective descriptions extracted using the SVM based classifiers of [33], shown in Fig. 1. The extracted descriptions are expressed following the fuzzy DL notation. A statement of the form \((im: \exists contains.Concept_{i}) \geq n_{i}\) denotes that the image represented by the individual im contains a region depicting an instance of \(Concept_{i}\) with a degree \(\geq n_{i}\), while a statement \((im: Concept_{j}) \geq n_{j}\) denotes that the image at scene level constitutes an instance of \(Concept_{j}\) to a degree \(\geq n_{j}\). As illustrated, more than one scene level concept may be assigned to a single image, and the same may be the case for the concept classifiers performing at object level. Furthermore, we note that since object level classification is performed per image segment, it is possible to have more than one assertion referring to the same concept with varying degrees.
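
For concreteness, the extracted descriptions for an image of the first kind would read along the following lines, with purely illustrative degrees (the actual values are those reported in Fig. 1): \((im_{1}: Countryside\_buildings) \geq 0.72\), \((im_{1}: \exists contains.Building) \geq 0.85\), \((im_{1}: \exists contains.Sky) \geq 0.91\), \((im_{1}: \exists contains.Sand) \geq 0.43\).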

Fig. 1 Example outdoor images and respective descriptions extracted following SVM-based analysis

In the case of the first image, the scene level concept Countryside_buildings appears as the prevalent one. This is not only because the corresponding scene description is the one with the highest degree among the extracted scene level descriptions, but also because the object level descriptions neither contradict it nor imply preference of any other scene description. To assess the latter, one needs to be aware of the logical associations of the concepts involved. Even in this example, where the utilisation of semantics has the same effect as selecting the description with the highest confidence, the potential for additional enhancement is manifested. Specifically, exploiting the semantics of the Countryside_buildings concept, one can infer that the image contains a building depiction, thus that there is a missing assertion that needs to be added. Analogously, one can determine that the extracted Sand description is semantically unrelated, and hence should be removed.

Moreover, the exploitation of semantics enables the overall image description to be further enriched, allowing the addition of descriptions referring to concepts such as Landscape and Outdoor. The latter, though quite trivial for the considered example, can be of great value in the general case, as it alleviates the need for training and learning detectors for all concepts of interest, as long as they are semantically connected to other concepts. We note that in the absence of confidence degrees, it would not be possible to acquire ranked estimations regarding the plausibility of the different interpretations. As a result, on the one hand, all classifications would be rendered equally plausible, and preference could be established only on the grounds of the descriptions that would have to be removed (respectively added), risking rather artificial interpretations. On the other hand, discarding all information about perceptual similarity would throw away substantial knowledge regarding the validity of the choices concerning the removal or addition of descriptions.

Similar considerations apply to the case of the second image. Taking into account solely the explicit scene level descriptions, the Roadside assertion appears to be the most plausible. Going through the object level descriptions and taking into account the semantics of the concepts involved, the simultaneous presence of Sea and Sand descriptions entails the presence of a Beach scene, an implication that implicitly contradicts the dominant Roadside assertion. The remaining object level classifications, i.e. the Person and Sky ones, do not provide any additional knowledge regarding the plausibility of the scene level descriptions. Hence, since the implied confidence with respect to Beach appears to be larger than the told degree of Roadside, it would be desirable to be able to assess Beach, and by consequence Seaside, as the more plausible image interpretations.

The aforementioned examples, though quite simplistic, outline the potential of and motivation for employing explicit semantics while providing the means to handle imprecision. Furthermore, they designate additional specifications regarding the selection of an appropriate knowledge representation formalism, namely the ability to capture expressive semantic associations (including disjointness, subsumption, conjunction) and to handle the vagueness introduced by the accompanying degrees. Based on these considerations, we chose to investigate a reasoning framework based on fuzzy Description Logics (DLs) [47–49], as they present two very appealing traits. First, they are strongly related to OWL (particularly the Lite and DL species); thus they benefit from the Semantic Web initiative towards knowledge sharing, reuse and interoperability, both regarding the background knowledge comprising the domain semantics and the produced image annotations. Second, extending classical DLs, they provide the means to handle vagueness in a formal way, while providing for well-defined reasoning services. The current state of the relevant literature, presented in the next subsection, elucidates further the motivation underlying the proposed reasoning framework, and outlines its contribution.

2.1 Relevant work

Going through the relevant literature, one notices that despite the significant amount of imprecision involved, a substantial share of the reported logic-based investigations towards the use of explicit knowledge adopts crisp approaches. In the series of works presented in [22, 30], crisp DLs are proposed for inferring complex descriptions that are modelled in the form of aggregates, i.e. conceptual entities modelled by parts satisfying specific constraints. Interpretation is realised as a deductive process, assuming the availability of primitive descriptions that do not contradict each other. Crisp DLs are also considered in [3], in combination with rules, for video annotation; again, possible conflicts or missed descriptions are not addressed. In [38] DLs are used to interpret perceptual descriptions pertaining to colour, texture and background knowledge into semantic objects. To this end a pseudo fuzzy algorithm is presented to reason over the calculated feature values with respect to the prototypical values constituting the definition of semantic objects. In [11], DLs have been combined with rules in order to perform abductive inference over crisp descriptions and acquire plausible interpretations, which are ranked using cost criteria involving the number of assertions that can be explained and the number of assertions that need to be hypothesised.

There also exist ontology-based approaches that focus more on formalising the transition from low-level descriptors to domain concepts; indicative examples include [16, 21] and [34]. Ontology languages are used to represent both domain specific concepts and visual features such as colour, texture and shape, and to link them in a formal fashion. Although such approaches can be very useful for the purposes of sharing and reusing knowledge regarding low and high level relations, reasoning over the logical relations among the high-level concepts is not addressed. This is not a matter of the limited datatype support currently provided by ontology languages, but in principle a consequence of the non logical nature of the problem at hand, i.e. the estimation of the distance between a given data structure that constitutes a feature model and the measurable feature values.

Indicative works addressing probabilistic information include the following. In [31], a probabilistic approach is suggested as a possible solution to the handling of the ambiguity introduced during the analysis stage; however, no description of relevant experimentation or evaluation is provided. Other works that address probabilistic knowledge include Markov logic networks [37], where first order logic is combined with graphical probabilistic models into a uniform representation; although tested on knowledge about the university domain, the experiences drawn are applicable to information extraction tasks such as image interpretation too. In [29], a Bayesian approach is proposed in order to model and integrate probabilistic dependencies among aggregates connected by hierarchical relations, within a DLs scene interpretation framework; however, no evaluation results are available as the implementation is still under way. In [10], a probabilistic framework for robotic applications is presented that tackles uncertainty issuing from noisy sensor data and missing information, employing relational hidden Markov models for the purpose of spatio-temporal reasoning.

We stress again the distinction between the probabilistic and the fuzzy perspective, as this is critical for the motivation and grounding of the reasoning approach proposed in this paper. A probability of 0.5 regarding a positive sea classification denotes our ignorance with respect to its presence or not; it does not imply anything about how blue the sea may be. A degree of 0.5, on the other hand, denotes how close the colour of this sea region is to the specific colour attached to the notion of sea in the given context. For further reading on the two perspectives the reader is referred to [9]. Consequently, approaches such as the aforementioned ones that address probabilistic knowledge are in fact complementary to the aspects considered in this paper.

Finally, there are works utilising fuzzy DLs, and in this sense closer to the proposed fuzzy DLs based framework. Specifically, fuzzy DLs have been proposed in [52] for the purpose of semantic multimedia retrieval; the fuzzy annotations however are assumed to be available. In the context of semantic analysis, fuzzy DLs have been only recently explored in [39], where fuzzy DLs are used for inferring semantic concepts based on part-of relations and merging of the respective image regions, and in [24] for document classification; none of the approaches however addresses the possibility of inconsistency in the initially extracted descriptions.

Given the aforementioned, the contribution of the fuzzy DLs based reasoning framework presented in this paper can be summarised in the following.

  • The uncertainty, in the form of vagueness, which characterises descriptions extracted through learning, is formally handled and integrated in the process of image interpretation. Thus, the valuable information encompassed in the extracted degrees can be utilised for computing and ranking the different plausible interpretations.

  • The inconsistencies, explicit or implicit, which issue from conflicting descriptions, are identified and resolved, contrary to the assumption of consistent initial classifications made in relevant works.

3 Fuzzy DLs

Description Logics (DLs) [2] are a family of knowledge representation formalisms characterised by logically grounded semantics and well-defined inference services. Starting from the basic notions of atomic concepts and atomic roles, arbitrarily complex concepts can be described through the application of constructors (e.g., \(\neg\), ⊓, ∀). Terminological axioms (TBox) capture equivalence and subsumption semantics between concepts and relations, while real world entities are modelled through concept (a:C) and role ((a,b):R) assertions (ABox). The semantics of DLs is formally defined through an interpretation \(\mathcal{I}\). An interpretation consists of a non-empty set \(\Delta^{\mathcal{I}}\) (the domain of interpretation) and an interpretation function \(\cdot^{\mathcal{I}}\), which assigns to every atomic concept A a set \(A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}\) and to every atomic role R a binary relation \(R^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}\). The interpretation of complex concepts follows inductively [2].

In addition to the means for representing knowledge about concepts and assertions, DLs come with a powerful set of inference services that make explicit the knowledge implicit in the TBox and ABox. Satisfiability, subsumption, equivalence and disjointness constitute the core TBox inferences. Satisfiability checks for concepts that necessarily denote the empty set, subsumption and equivalence check whether a concept is more specific than or, respectively, identical to another, while disjointness refers to concepts whose conjunction is the empty set. Regarding the ABox, the main inferences are consistency checking, which assesses whether there exists a model that satisfies the given knowledge base, and entailment, which checks whether an assertion ensues from a given knowledge base.
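
As a minimal illustration using concepts that recur later in the paper, consider the knowledge base \(\mathcal{T} = \{Beach \sqsubseteq Seaside,\ Seaside \sqcap Rockyside \sqsubseteq \bot\}\) and \(\mathcal{A} = \{im: Beach\}\). Entailment makes the implicit assertion im: Seaside explicit, while adding im: Rockyside to the ABox would render the knowledge base inconsistent, since no model could then satisfy it.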

Fuzzy DLs consider the extension of DL languages with fuzzy set theory [18, 53]. More specifically, in the case of a fuzzy DL language, a TBox is defined as a finite set of fuzzy concept inclusion and equality axioms, while the ABox consists of a finite set of fuzzy assertions. A fuzzy assertion [48] is of the form a:C ⋈ n or (a,b):R ⋈ n, where ⋈ stands for ≥, >, ≤, and <. Assertions defined by ≥ and > are called positive assertions, while assertions defined by ≤ and < are called negative.

A fuzzy set \(C \subseteq D\) is defined by its membership function (\(\mu_{C}\)), which, given an object of the universe D, returns the membership degree of that object with respect to the set C. By using membership functions, the notion of the interpretation function is extended to that of a fuzzy interpretation function. In accordance with the crisp DLs case, a fuzzy interpretation is a pair \(\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})\), where \(\Delta^{\mathcal{I}}\) is a non-empty set of objects called the domain of interpretation, and \(\cdot^{\mathcal{I}}\) is a fuzzy, this time, interpretation function which maps: an individual a to an element \(a^{\mathcal{I}} \in \Delta^{\mathcal{I}}\), as in the crisp case; a concept name A to a membership function \(A^{\mathcal{I}}: \Delta^{\mathcal{I}} \rightarrow [0,1]\); and a role name R to a membership function \(R^{\mathcal{I}}: \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}} \rightarrow [0,1]\) [48, 49]. Several fuzzy operators exist in the literature that implement the functions (t-norms, t-conorms, negation and implication) which extend the classical Boolean conjunction, disjunction, negation and implication to the fuzzy case. Table 1 shows the corresponding semantics for the language f-SHIN under Zadeh logic. The reasoning service definitions are adapted analogously. For example, concept satisfiability requires the existence of an interpretation under which there is an individual belonging to the concept with a degree n ∈ (0,1].
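
To make the fuzzy operators concrete, the following sketch (in Python) implements the Zadeh semantics of Table 1 for degrees represented as plain floats in [0,1], together with the Kleene-Dienes implication used later in Section 5.3:

    def t_norm(a, b):           # Zadeh conjunction: minimum
        return min(a, b)

    def t_conorm(a, b):         # Zadeh disjunction: maximum
        return max(a, b)

    def negation(a):            # standard negation
        return 1.0 - a

    def kd_implication(a, b):   # Kleene-Dienes implication: max(1 - a, b)
        return max(1.0 - a, b)

    # e.g. the degree of (contains Sea AND contains Sand) for the example
    # of Section 4.1: min(0.67, 0.74) = 0.67, the degree inferred for Beach
    print(t_norm(0.67, 0.74))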

Table 1 Fuzzy interpretation of DL constructors under Zadeh semantics [47]

Two main efforts currently address formally both the semantics and the corresponding reasoning algorithms. In [45–47], the DL language SHIN has been extended according to fuzzy set theory, leading to the so-called f-SHIN. The fuzzy extensions address the assertion of individuals and the extension of the language semantics. In [50], a fuzzy extension of SHOIN(D) is presented, which constitutes a continuation of earlier works of the authors on extending ALC, SHIF, and SHIF(D) to fuzzy versions [48, 49]. In addition to extending the SHOIN(D) semantics to f-SHOIN(D), the authors present a set of interesting features: concrete domains as fuzzy sets, fuzzy modifiers such as very and slightly, and fuzziness in entailment and subsumption relations.

In addition to the theoretic foundations for the fuzzy extensions, the corresponding reasoning algorithms have been presented and implemented, namely in the Fuzzy Reasoning Engine (FiRE) and the fuzzyDL engine. FiRE [39] supports querying an f-SHIN knowledge base for satisfiability, consistency, subsumption, and entailment under Zadeh's semantics; general concept inclusions, roles and datatype support are among the planned future extensions. fuzzyDL [6] supports fuzzy SHIF semantics extended by the aforementioned features. It supports the Łukasiewicz and Gödel t-norms, t-conorms and fuzzy implications, as well as the Kleene-Dienes implication.

As already implied by the examples presented in Section 2, fuzzy DLs can be straightforwardly used to manage the output of classifiers with respect to background domain knowledge. Using the available constructors and corresponding inclusion and equality axioms, one can construct the terminological knowledge (TBox) that captures the semantics of the examined application domain. The assertional knowledge (ABox), which includes the fuzzy assertions, corresponds to the descriptions extracted by analysis. Typical DL inference services can be used afterwards to check whether the extracted descriptions violate the logical model of the domain, leading to an inconsistency, or to compute the concepts to which a certain image instance belongs.

However, in order to address the issues described in Section 2, namely the determination of the most plausible scene level description, the subsequent removal of incoherent descriptions and the further enrichment of the overall interpretation, a reasoning framework coordinating the evoked DL inferences is needed. Towards this end, handling inconsistency constitutes a central requirement. As illustrated in Section 2, such inconsistency may refer to assertions at the same level of granularity, between scene level descriptions or between object level descriptions, or to assertions pertaining to different levels. Handling inconsistencies in DL knowledge bases usually refers to approaches targeting the revision of terminological axioms [13, 17]. In the examined problem, however, the inconsistencies result from the limitations in associating semantics with visual features; thus it is the ABox that needs to be appropriately managed. The adopted methodology is described in the following section, where the proposed fuzzy DLs based reasoning framework is detailed.

4 A fuzzy DL-based reasoning framework for enhancing semantic analysis

Figure 2 depicts the architecture of the proposed fuzzy DLs-based reasoning framework. The input consists of a set of scene and object level assertions representing the descriptions extracted through the application of learning-based analysis techniques. No assumptions are made with respect to the particular implementation followed. The interpretation of the initial assertions integrates them into a consistent semantic image interpretation, and entails three steps. First, the more plausible scene level descriptions are determined, by utilising the subsumption relations among the considered scene level concepts. Next, the inconsistencies in the initial descriptions with respect to the previously computed scene level interpretation are resolved, resulting in a ranked list of plausible interpretations. Finally, the highest ranked interpretation is passed to the last step, where by means of logical entailment the inferred assertions are made explicit.

Fig. 2 Architecture of the proposed fuzzy DLs based reasoning framework

The first two steps constitute a special case of knowledge integration, in which only assertions, but no axioms, are allowed to be removed. The reason for this, as also discussed in Section 5, lies in the fact that the axioms address domain-specific yet generic knowledge, rather than conceptualisations customised to specific traits of the examined dataset. As knowledge integration and inconsistency resolution for DL ontologies have been extensively treated in the literature [5, 12], the approach followed within the proposed reasoning framework builds on existing results, appropriately adjusting them to meet the peculiarities introduced within the examined application context. In the sequel, the details of the individual tasks are given.

4.1 Scene level interpretation

The possible logical associations between concepts at the object level and concepts at the scene level entail that all assertions need to be taken into account in order to infer the degrees of confidence pertaining to scene level descriptions. To accomplish this, all disjointness axioms are removed. Thereby, we allow for the exploitation of all interrelations implicit in the initial descriptions. The next step consists in the bottom-up traversal of the scene level concept hierarchy for the purpose of determining the more plausible scene descriptions.

More specifically, at each level of the hierarchy, we check whether the concepts for which (told or inferred) assertions exist are disjoint. In the presence of disjointness, the assertion with the highest degree of confidence is preserved as the more plausible, while the rest are marked as inconsistent. Once the examination at a given level is completed, consistency is examined between the set of currently selected assertions \(SA_{i}\) and those computed at the previous hierarchy level, \(SA_{i-1}\). If an assertion a of \(SA_{i-1}\) contradicts the assertions belonging to \(SA_{i}\), then a is marked as inconsistent. The process is repeated until the root of the hierarchy is reached, and results in discriminating the scene level concept assertions into two sets, namely the set of scene descriptions comprising the more plausible interpretation, and the set of scene descriptions marked as inconsistent. Table 2 summarises the described procedure, where S denotes the list of scene level concepts and A the complete list of assertions extracted by means of analysis.

Table 2 Scene level interpretation
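
A minimal sketch of this traversal follows (in Python; the data structures and helper names are illustrative, not the actual notation of Table 2). The hierarchy is given as a list of levels, leaves first; degrees maps each asserted scene concept to its told or inferred degree; disjoint encodes the temporarily removed disjointness axioms:

    def interpret_scene(levels, degrees, disjoint):
        # Returns (selected, inconsistent) lists of (concept, degree) pairs.
        selected, inconsistent = [], []
        previous = []                                   # SA_{i-1}
        for level in levels:
            current = []                                # SA_i
            present = [(c, degrees[c]) for c in level if c in degrees]
            for c, d in sorted(present, key=lambda cd: -cd[1]):
                # among disjoint concepts keep only the highest degree
                if any(disjoint(c, c2) for c2, _ in current):
                    inconsistent.append((c, d))
                else:
                    current.append((c, d))
            for c, d in previous:
                # a lower-level assertion contradicting SA_i is discarded
                if any(disjoint(c, c2) for c2, _ in current):
                    selected.remove((c, d))
                    inconsistent.append((c, d))
            selected.extend(current)
            previous = current
        return selected, inconsistent

On the example walked through in Section 4.3, the sketch selects Rockyside over Seaside at their common level and subsequently discards the lower-level Beach assertion, exactly as described there.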

The reason for prioritising scene level assertions over object level ones lies in the different nature of the semantics involved. Scene level concepts and, to be more precise, sub-trees in the scene level concept hierarchy, comprise mutually exclusive sub-domains. Contrariwise, multiple configurations of assertions at object level may be consistent with a specific sub-domain. To illustrate this in a more intuitive manner, consider an image for which the extracted descriptions are \((im: \exists contains.Sea) \geq 0.67\), \((im: \exists contains.Sand) \geq 0.74\) and \((im: Rockyside) \geq 0.85\), and assume a TBox consisting of the axioms \(Beach \equiv (\exists contains.Sea) \sqcap (\exists contains.Sand)\) and \(Beach \sqcap Rockyside \sqsubseteq \bot\). Both interpretations pertaining to the Rockyside sub-domain, i.e. { \((im: \exists contains.Sea) \geq 0.67\), \((im: Rockyside) \geq 0.85\) } and { \((im: \exists contains.Sand) \geq 0.74\), \((im: Rockyside) \geq 0.85\) }, rank higher in terms of plausibility, as captured by the corresponding degrees, than the one pertaining to the Beach sub-domain, i.e. { \((im: \exists contains.Sea) \geq 0.67\), \((im: \exists contains.Sand) \geq 0.74\), \((im: Beach) \geq 0.67\) }.

4.2 Inconsistency handling

Having computed the most plausible scene level interpretation, the next step is to resolve the possible inconsistencies. In the current implementation, the regions depicting an object level concept are represented implicitly through statements of the form \((im: \exists contains.Concept_{i}) \geq n_{i}\), and thus inconsistencies may arise either among scene level assertions or between scene and object level assertions. The first step towards the identification of inconsistencies is to restore the disjointness axioms that were removed during the scene interpretation task. In order to avoid halting the reasoner, the disjointness axioms are rewritten with respect to the scene level concepts selected at the previous step, in the form of subsumption relations to respective no-concepts. Practically, given a set of selected scene level concepts \(\{S_{i}, S_{j}, \ldots\}\), each axiom of the form \(C_{k} \sqcap S_{j} \sqsubseteq \bot\) is translated into \(C_{k} \sqsubseteq noS_{j}\). Furthermore, for all \(noS_{j}\) concepts we introduce an axiom of the form \(noS_{j} \sqsubseteq NoConcept\). Consequently, resolving the possible inconsistencies amounts to tracking the assertions that led to the inference of NoConcept instances.
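
For instance, if Rockyside is among the selected scene level concepts, the disjointness axiom \(Beach \sqcap Rockyside \sqsubseteq \bot\) is rewritten as \(Beach \sqsubseteq noRockyside\), and the axiom \(noRockyside \sqsubseteq NoConcept\) is added; an inferred NoConcept instance then signals that some Beach-related assertion conflicts with the selected scene interpretation, without rendering the knowledge base itself inconsistent.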

First, we address inconsistencies incurred directly by told descriptions. This translates into checking whether there exist asserted individuals that refer to \(C_{k}\) concepts participating in the introduced \(C_{k} \sqsubseteq noS_{j}\) axioms. The handling of such assertions is rather straightforward and results in their immediate removal. Clearly, addressing asserted individuals first prunes the search space during the subsequent tracking of the inferences that lead to an inconsistency.

Next, we consider assertions referring to complex object- and scene-level concepts. Again we start with object-level concepts in order to account for cases of conflicts at the scene level that are triggered through inference upon object level assertions. Contrary to the previous case, we now need to analyse the involved axioms in order to track the asserted descriptions that cause the inconsistency. Furthermore, these axioms determine which of the descriptions should be removed so as to reach a consistent interpretation. To accomplish this, we build on relevant works presented for resolving unsatisfiable DL ontologies [19], and employ a reversed tableaux expansion procedure. We note that the main difference with respect to the relevant literature is that, in our application framework, we consider solely the removal of assertions, rather than the removal or weakening of terminological axioms. Table 3 summarises the procedure for handling inconsistencies while Table 4 summarises the expansion rules.

Table 3 Inconsistency handling
Table 4 Expansion rules for computing the alternative sets of consistent assertions

The expansion procedure starts with the assertion \((im: NoConcept) \geq d_{i}\) as root node, where \(d_{i}\) is the computed inferred degree, and continues until no expansion rule can be applied. As illustrated, in the case of inconsistencies caused by axioms involving the conjunction of concepts, there is more than one way to resolve the inconsistency and reach a consistent interpretation. Specifically, there are as many alternative interpretations as there are non-empty subsets of the conjuncts, i.e. combinations of size \(k = 1, \ldots, N\), where N is the number of conjuncts. In order to choose among them, we rank the set of solutions according to the number of assertions that need to be removed and the average value of the respective degrees.
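
A sketch of this enumeration and ranking follows (in Python; the tie-breaking direction, preferring the removal of fewer and lower-confidence assertions, is our reading of the criterion, and the degrees shown are illustrative):

    from itertools import combinations

    def candidate_solutions(conjuncts):
        # All non-empty subsets of the N conflicting conjuncts; removing
        # any one of them restores consistency (2^N - 1 alternatives).
        for k in range(1, len(conjuncts) + 1):
            for subset in combinations(conjuncts, k):
                yield list(subset)

    def rank_solutions(solutions):
        # Fewest removed assertions first; ties broken by the lowest
        # average degree of the removed assertions.
        def key(solution):
            degrees = [d for _, d in solution]
            return (len(solution), sum(degrees) / len(degrees))
        return sorted(solutions, key=key)

    # e.g. the Beach axiom of Section 4.3 yields three alternatives
    conflict = [("Seaside", 0.63), ("contains.Sand", 0.74)]
    for solution in rank_solutions(list(candidate_solutions(conflict))):
        print(solution)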

We note that since all role assertions referring to the role contains are assumed to hold with a degree ≥ 1.0, and the regions that depict object level concepts are not explicitly represented, expansion rules are required only for the case of the ⊓ and \(\sqcup\) constructors. Consequently, in the case of a domain TBox that utilises additional constructors or when fuzzy role assertions are also considered, the expansion rules would have to be appropriately extended.

4.3 Enrichment

The final step considers the enrichment of the descriptions by means of entailment and is the most straightforward of the three tasks, as it amounts to typical fuzzy DL reasoning. Once the scene level concepts have been selected and the assertions causing inconsistencies (either directly or through complex definitions) have been resolved, we end up with a semantically consistent set of assertions, whose communication to the fuzzy DLs reasoning engine results in the determination of the final semantic image description. To render the inferred descriptions explicit, corresponding queries are formulated, and the responses constitute the final outcome of the proposed framework.

Figure 3 shows an example application of the proposed reasoning framework. As illustrated, the applied concept detectors assess the image as both a Rockyside and a Seaside scene. The respective object level descriptions include the concepts Sky, Sea and Sand. Removing the disjointness axioms from the TBox, we obtain the inferred assertions for the concepts Beach, Landscape and Outdoor, where the inferred assertions are grouped with respect to the asserted ones that triggered their inference. The bottom left part of Fig. 3 depicts the scene level concept hierarchy that corresponds to the depicted domain TBox extract, as well as the inferred degrees.

Fig. 3 Example application of the proposed fuzzy DLs based reasoning framework

Following the procedure described above, we start from the lowest level of the hierarchy and add the Beach concept assertion to the list of plausible scene descriptions. Moving to the next level, there are assertions referring to the Rockyside and Seaside concepts. Since the two concepts are disjoint, we select the assertion with the highest degree, namely the one referring to the Rockyside concept, and add it to the list of plausible scene descriptions. Checking against the previous level, it is the case that Rockyside and Beach are disjoint. Thus, the Beach assertion is removed from the list of plausible descriptions and marked as inconsistent. At the next step, the top level of the hierarchy is reached, which causes the addition of the Outdoor concept assertion to the scene interpretation.

Hence, upon the completion of the scene level interpretation step, the scene level concept hierarchy has been populated, and the sub-tree with the highest degrees is identified, which in our example amounts to the concepts Rockyside and Outdoor. The inconsistency handling step that follows results in removing the told descriptions referring to the Sand and Seaside concepts, whereby the Sea and Beach descriptions are also removed. For instance, in the case of the \(Beach \equiv Seaside \sqcap \exists contains.Sand\) axiom presented in the example of Fig. 3, there are three alternatives that lead to a consistent interpretation, obtained by removing either the Seaside assertion, or the \(\exists contains.Sand\) assertion, or both. In the example of Fig. 3, enrichment amounts to the addition of the \((image: Outdoor) \geq 0.86\) assertion, and to the update of the \((image: \exists contains.Rock)\) assertion degree to 0.78 from the initial value of 0.67.

5 Experimental results and evaluation

To evaluate the feasibility and potential of the proposed fuzzy DLs based reasoning framework, we carried out an experimental implementation, using fuzzyDL as the core fuzzy DLs inference engine. The specific reasoner was chosen as it meets the expressivity requirements of the representation.

Opting for a generic performance assessment, we experimented in the domain of outdoor images, as it allows for rich semantics and is sufficiently broad to avoid hard-to-generalise observations typical of closed domains such as medical imaging. The data set we considered includes images of forests, beaches, seaside landscapes, mountain sceneries, roadside scenes, countryside buildings, and urban scenes. An extract of the TBox developed to capture the underlying semantics is illustrated in Table 5. The 700 images that constitute the data set were divided into two non overlapping sets of 350 images each, for training and testing purposes. The ground truth has been manually generated for the complete dataset at scene and object level.

Table 5 Example extract of the outdoor image domain TBox constructed for evaluation purposes

Utilising the developed TBox, performance is estimated by comparing the reliability of the image descriptions extracted through machine learning to that of the descriptions resulting after the application of the proposed reasoning framework. To this end, we adopted the precision, recall, and F-measure metrics, using the following definitions (a short computational sketch is given after the list).

  • recall (r): the number of correct assertions extracted/inferred per concept divided by the number of the given concept assertions present in the ground truth image descriptions.

  • precision (p): the number of correct assertions extracted/inferred per concept divided by the number of assertions that were extracted/inferred for the given concept.

  • F-measure: \(2pr/(p + r)\).
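
The sketch below (in Python, with hypothetical counts) transcribes these definitions directly:

    def evaluate(correct, ground_truth, extracted):
        # correct      -- correct assertions extracted/inferred for a concept
        # ground_truth -- assertions for the concept in the ground truth
        # extracted    -- all assertions extracted/inferred for the concept
        r = correct / ground_truth
        p = correct / extracted
        f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        return p, r, f

    print(evaluate(30, 50, 40))   # p = 0.75, r = 0.60, F ~ 0.667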

Two experiments were designed to evaluate the performance of the proposed reasoning framework in the presence of weak and rich semantics. In the first experiment the employed concept classifiers are rather poorly related, in terms of both scene and object level concepts; in the second experiment, the employed classifiers address concepts whose interrelations go beyond simple subsumption axioms, hence allowing for more complex inferences. Furthermore, we present initial results with respect to the use of fuzzy implication as the means to model co-occurrence information.

5.1 Experiment I

In the first experiment, the sets of scene (\(C_{scene}\)) and object (\(C_{object}\)) level concepts addressed by the classifiers are \(C_{scene}\) = {Outdoor, Natural, ManMade, Landscape, Mountainous, Beach} and \(C_{object}\) = {Building, Grass, Vegetation, Rock, Tree, Sea, Sand, Conifers, Boat, Road, Ground, Sky, Trunk, Person}. Table 6 provides the definition of the scene level concepts of the outdoor images considered.

Table 6 Description of scene level concepts for the considered outdoor images dataset

Figure 4 shows example images of these scene level concepts. As described in Table 6, the concept Natural encompasses images belonging to the Landscape, Beach and Mountainous scenes, while the concept ManMade encompasses cityscape images. The only additional concept supported by reasoning is Seaside, which represents coastal images that do not necessarily correspond to beach images. Statistical information regarding the frequency distribution of each concept in the train and test datasets is given in Tables 7 and 8. Since the employed object concept classifiers perform at the segment level, the distribution of object level concepts is given with respect to the overall number of segments.

Fig. 4 Experiment I—Example images of the concepts addressed by the learned scene level classifiers

Table 7 Experiment I—evaluation of analysis and reasoning performance for scene level concepts, and concept distribution in the dataset
Table 8 Experiment I—evaluation of analysis and reasoning performance for object level concepts, and concept distribution in the dataset

Different implementations have been used for the employed classifiers, allowing for overlapping descriptions. Specifically, for scene level classification, the SVM-based approach of [20] and the randomised clustering trees approach of [23] have been used. For segment level (object) classification, the distance based feature matching approach based on prototypical values of [34], and the clustering trees approach of [23] have been followed.

Tables 7 and 8 present the attained performance with respect to scene and object level concepts, when applying solely the learned classifiers and when using the proposed reasoning framework.

As illustrated, in the case of scene level concepts, the utilisation of reasoning mostly matches the performance of the classifiers, while in some cases it introduces a slight improvement. This behaviour is a direct result of the loose semantic relations between the employed scene and object level classifiers. Going through the corresponding TBox (Table 5), one notices that the scene level concepts are in their majority atomic, participating in subsumption axioms with one another. Hence, additional inferences can issue only from relations involving object level concepts. For the considered set of concept classifiers, though, the only type of association between scene level concepts and object level concepts is disjointness. As a result, object level assertions cannot have an effect on the plausibility of scene level assertions.

To acquire a quantitative measure of the effect of reasoning in such a case, we interpreted the response of the scene classifiers according to the respective scene concept semantics. For example, if an image was classified as Landscape, it was also interpreted as a positive classification for the concepts Natural and Outdoor. Thus, as long as a correct classification is attained at some level of the scene concept hierarchy, it is propagated towards the more generic concepts (a sketch of this propagation is given below). Following this scheme we can observe how the application of reasoning compares to this "optimal" scene classification interpretation, which includes subsumption relations. However, we note that this optimistic perspective taken with respect to the acquired scene classifications reflects neither the actual behaviour of the classifiers nor the general case, as it is not uncommon to acquire positive classification results for semantically disjoint concepts, or to get negative responses for concepts that are subsumed by concepts for which the classification was positive.
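
The propagation itself is straightforward; a minimal sketch (in Python, with the Experiment I hierarchy hard-coded for illustration) is the following:

    def propagate(positives, parents):
        # Propagate each positive scene classification to all concepts
        # that subsume it, e.g. Landscape -> Natural -> Outdoor.
        closed = set(positives)
        stack = list(positives)
        while stack:
            concept = stack.pop()
            for parent in parents.get(concept, []):
                if parent not in closed:
                    closed.add(parent)
                    stack.append(parent)
        return closed

    parents = {"Landscape": ["Natural"], "Beach": ["Natural"],
               "Mountainous": ["Natural"], "Natural": ["Outdoor"],
               "ManMade": ["Outdoor"]}
    print(propagate({"Landscape"}, parents))
    # -> {'Landscape', 'Natural', 'Outdoor'} (set order may vary)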

Examining the performance with respect to the object level concepts, the effects induced through reasoning are more interesting. As illustrated in Table 8, four patterns can be observed: i) cases where only the precision is improved, ii) cases where only the recall is improved, iii) cases where both precision and recall improve, and iv) cases where the performance remains unchanged with the application of reasoning. As explained in the following, each pattern derives from the type of axioms in which the respective concept participates.

Improvement in precision, as in the case of the Boat concept, relates strongly to the utilisation of disjointness, since the presented inconsistency handling approach ensures that the final assertions comply with the scene level interpretation. Naturally, this harbours the risk of ending up with significantly distorted final descriptions in the case of false scene level interpretations. The performance deterioration with respect to the concept Road constitutes an example of this, as in this case the majority of images depicting a road were attributed with greater confidence to the Seaside or Forest scenes. However, when considering such cases one needs to bear in mind not the interpretation desired based on visual inspection of the image, but rather the interpretation appearing more plausible on the grounds of the initial descriptions provided by the classifiers.

Figure 5 illustrates such an example. The two most plausible scene level descriptions are misleading, while the detected object level concepts, though accurate, are not adequate to drive the inference of the corresponding scene concept. Additional knowledge, such as the co-occurrence of concepts, could be exploited to assist and either re-adjust the degrees or trigger an inference. This limitation constitutes one of the reasons motivating, as a future direction of research, the exploration of a reasoning framework that combines fuzzy and probabilistic knowledge. Some very preliminary investigations towards this direction are described in the following, where the modelling of co-occurrence information in the form of fuzzy implication is discussed.

Fig. 5 Example image and corresponding extracted descriptions where additional knowledge is needed to perform inference

Object level concepts exhibiting improved recall rates, such as Building, Sand and Grass, correspond to concepts that appear on the right-hand side of axioms. In practice, such concepts may be entailed either by scene level concepts, as is the case for Building and Sand, or by other object level concepts, as in the case of Tree. Respectively, concepts whose recall is reduced reflect cases of wrongly inferred scene level concepts that resulted in their removal due to disjointness axioms.

Cases where both precision and recall are improved correspond to concepts appearing on the right-hand side of general inclusion axioms, when the correct scene level concepts have been inferred. Such concepts are characterised by rich semantics and apparently benefit the most from the application of reasoning. Invariable performance indicates atomic concepts participating solely in the left-hand side of axioms (e.g. the Trunk and Person concepts).

Table 9 summarises the average performance of the classifiers and of the proposed reasoning framework for the experiment. The first line presents the performance when all concepts are taken into account, while the second line presents the performance when only concepts that participate in axioms are considered. As illustrated, the application of reasoning bears a positive impact in terms of the precision, which is increased from 0.49 to 0.63. Recall remains unchanged for the reasons explained above.

Table 9 Overall evaluation for Experiment I

5.2 Experiment II

In the second experiment, the sets of scene (\(C_{scene}\)) and object (\(C_{object}\)) level concepts addressed by the learned classifiers are respectively \(C_{scene}\) = {Countryside_buildings, Roadside, Rockyside, Seaside, Forest} and \(C_{object}\) = {Building, Roof, Grass, Vegetation, Dried-Plant, Sky, Rock, Tree, Sea, Sand, Boat, Road, Ground, Person, Trunk, Wave}. The additional concepts supported by reasoning are Landscape, Mountainous, Beach, and Outdoor. Landscape subsumes Forest, Roadside and Countryside_buildings, while Mountainous subsumes Rockyside. Statistical information regarding the frequency distribution of each concept in the train and test datasets is given in Tables 10 and 11. This time, a single classifier has been employed per concept, following the SVM based approach of [33] (Fig. 6).

Table 10 Experiment II—Evaluation of analysis and reasoning performance for scene level concepts, and concept distribution in the dataset
Table 11 Experiment II—Evaluation of analysis and reasoning performance for object level concepts, and concept distribution in the dataset
Fig. 6 Experiment II—Example images of the concepts addressed by the learned scene level classifiers

Table 10 compares the performance of the classifiers and of the proposed reasoning framework with respect to scene level concepts. Contrary to the previous experiment, only explicitly generated classifications are considered. As illustrated, apart from the expected improvement in terms of scene level concepts that are acquired due to subsumption relations, the application of reasoning shows a more noticeable effect compared to the first experiment. The reasons for this behaviour lie again in the semantics of the concepts involved. Specifically, in this experimental setting, the object level concepts addressed by the classifiers are semantically related to scene level ones not only through disjointness axioms but also through general concept inclusions, thus incurring the inference of scene level concepts from object level ones. For example, the presence of a Building assertion triggers the inference of a corresponding Countryside_buildings assertion, with an equal or greater degree of confidence. Hence, in combination with the axioms that relate the concepts Vegetation and Grass with Landscape scenes, the application of reasoning improves the recall for Countryside_buildings.

Table 11 compares the performance for descriptions at the object level. With the exception of the Boat and Grass concepts, the application of reasoning improves significantly on the performance obtained by the sole application of the classifiers. Again, this is a direct consequence of the fact that the considered object level concepts are characterised by richer semantics with respect to the scene level concepts that constitute their context of appearance. The deterioration in the recall rates of the Boat and Grass concepts is again indicative of the risks entailed by a false scene level interpretation. Going through the images for which Boat assertions were falsely removed, we observed that the prevailing scene level classifications were not in compliance with the depicted scene. Similar observations hold for the concept Grass. A possible way to alleviate such phenomena, apart from the investigation of additional types of knowledge, could be the re-assessment of the terminological axioms describing the domain. However, such an approach risks ending up with solutions customised to specific learning implementations or to specific application domains and datasets.

Similar considerations emerge when examining the not so noticeable effect of reasoning on the recall of scene level concepts such as Rockyside. Going through the images depicting rockyside scenes, but not recognised as such, we observed that in all cases the classifiers had falsely detected another scene level concept instead; and this despite the fact that the instantiations of the Rock concept were generally detected successfully. Adding the axiom \(\exists contains.Rock \sqsubseteq Rockyside\) would seem a reasonable approach to improve recall for the Rockyside concept, especially since, in the examined dataset, the Seaside classifier tends to produce higher confidence values than the Rockyside one for seaside images where rock instances appear too. However, it is easy to see that such an amendment would upset the trade-off between what constitutes domain semantics and what is mere tuning to the peculiarities of a given dataset.

Finally, Table 12 summarises the average performance. As illustrated, the application of reasoning has a stronger influence compared to the first experiment, reflected in both recall and precision, which are significantly improved.

Table 12 Collective evaluation for Experiment II

5.3 Investigating reasoning with additional knowledge

The previously described experimental configurations allowed us to assess the performance of the proposed reasoning framework, and provided useful experiences with respect to the issues and challenges involved. The lessons drawn can be summarised in the two following observations. On one hand, the proposed reasoning framework has the potential to enhance semantic image analysis towards more consistent and complete descriptions, provided that the concepts involved bear adequately rich semantic associations. On the other hand, the inference of the overall description is liable to false interpretations in the presence of heavily distorted classification results, both in terms of the concepts and of their computed degrees. As already mentioned above, a possible way to assist inference in such cases would be the use of additional knowledge. Such knowledge could concern, for example, the percentage of an image assigned to a given description. Thereby, in cases like that of Fig. 5, where normally no scene level concepts can be inferred based on the initial assertions, a complementary mechanism could favour the Landscape concept and its subclasses. Another approach would be the joint utilisation of probabilistic knowledge regarding the concept co-occurrence patterns.

Since our focus is currently on the benefits and weaknesses pertaining to the utilisation of fuzzy reasoning, we performed some additional experiments using fuzzy implication to capture co-occurrence information. We note, though, that due to the early stage of this effort, no conclusive observations can be drawn. In the following, we briefly describe the semantics of fuzzy implication and present the outcomes of this preliminary investigation.

Fuzzy implication is defined by a function of the form \(\textsl{J}:[0,1]\times[0,1]\rightarrow [0,1]\) [18]. In practice this means that the degree of the right hand side expression does not depend only on the degree of the left hand side expression, but is influenced as well by the degree attributed to the implication itself. Thereby, fuzzy implication can serve as a means to introduce data set specific knowledge without impairing the impact of the inferences drawn by the crisp axioms modelling generic semantics.
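
Under the Kleene-Dienes implication used below, \(\textsl{J}(a, b) = \max(1-a, b)\), this effect can be made precise: a fuzzy inclusion axiom \(\langle C \sqsubseteq D \rangle \geq d\) requires \(\max(1 - C^{\mathcal{I}}(x), D^{\mathcal{I}}(x)) \geq d\) for every object x, so from an assertion \((a: C) \geq n\) with \(n > 1 - d\) it follows that \(1 - C^{\mathcal{I}}(a) \leq 1 - n < d\), and hence \(D^{\mathcal{I}}(a) \geq d\). The implication degree d thus acts as a cap on the degree that can be inferred through the axiom, which is precisely what keeps dataset specific axioms from overriding the inferences drawn from the crisp, generic ones.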

To get an initial estimation of whether such an approach could be beneficial, we experimented using the following scheme. First, we performed statistical analysis on the manually constructed ground truth to obtain co-occurrence patterns between scene level concepts and object level concepts. The latter share a significant characteristic: although semantically related to specific scene level concepts, they do not necessarily entail information regarding the presence of the respective scene level concepts. Then, the observed frequency of concept patterns was used to define the degrees of the corresponding fuzzy implication axioms.
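
A minimal sketch of the degree estimation follows (in Python; the exact estimator is not detailed in the text, and the conditional frequency used here is one plausible choice):

    from collections import Counter

    def implication_degrees(annotations, object_concepts, scene_concepts):
        # Estimate a degree for each implication 'object O suggests scene S'
        # as the conditional frequency of S among the images containing O.
        # annotations: list of (set_of_objects, set_of_scenes) per image.
        object_count = Counter()
        pair_count = Counter()
        for objects, scenes in annotations:
            for o in objects & object_concepts:
                object_count[o] += 1
                for s in scenes & scene_concepts:
                    pair_count[(o, s)] += 1
        return {(o, s): pair_count[(o, s)] / object_count[o]
                for (o, s) in pair_count}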

Two different schemes have been used for the calculation of the fuzzy implication degrees. In the first case, concept co-occurrence was measured with respect to the ground truth annotations; thus it embodies, in a way, an ideal model of the dataset specific knowledge. In the second case, co-occurrence was measured taking into account the object level descriptions as obtained by the classifiers and the scene level ground truth annotations. Consequently, in the second case the degrees encompass additional information accounting, to an extent, for the effectiveness of the classifiers, and thus for the errors introduced by them. Table 13 illustrates the effect on the determination of image descriptions at scene level, using the Kleene-Dienes implication [18], with respect to the two aforementioned schemes for determining concept co-occurrence.

Table 13 Evaluation for scene level concepts when using fuzzy implication to model co-occurrence information

Although both schemes used for the determination of degrees are very simplistic, the introduction of fuzzy implication appears to have the potential to provide a further improvement when compared with the values in Table 10. As previously mentioned, these results are only preliminary, and as such do not allow any conclusions to be drawn yet. Further investigation is required with respect to the methodology used to calculate the degrees and the process of selecting which implications to consider. The associations between the concepts further aggravate the difficulty of finding a balance between compensating for classification errors (by opting for higher implication degrees) and avoiding an artificial increase in recall values (i.e. improved recall accompanied by significant deterioration in precision).

6 Conclusions and future directions

In this paper, we presented a fuzzy DLs-based reasoning framework for the purpose of enhancing the extraction of image semantics through the utilisation of formal knowledge. The deployment of fuzzy DLs makes it possible to formally handle the vagueness encountered in classifications acquired through statistical learning, while the formal semantics allow the integration of the initial descriptions into a semantically coherent interpretation. The proposed reasoning framework exploits reasoning not only in order to enrich the image descriptions by means of entailment, but also in order to identify and resolve inconsistencies in the initial classifications. Thereby, and free of assumptions regarding the individual classifiers' implementation, it provides the means to reach a consistent final image description and to alleviate limitations often encountered in learning based approaches due to poor utilisation of semantics. The experiments, though not conclusive, show promising results, while outlining a number of issues and challenges affecting the attained performance.

Future directions include the extension of the framework to handle the representation of individual image regions and spatial relations, so as to allow the utilisation of spatial reasoning. Furthermore, as outlined in the evaluation section, experimentation towards the possibilities of introducing additional knowledge, either in the form of fuzzy implication or in a probabilistic manner, constitutes another direction towards a more complete framework. The latter is of particular interest, not only because both types of imprecision are met in semantic image analysis, but mostly because they serve complementary purposes.