1 Introduction

The growing amount of image and video content available online or in personal collections has attracted the multimedia research community's attention. Substantial efforts are currently investigating methods to automatically organize, analyze, index and retrieve video information. This trend is reinforced by the availability of the MPEG-7 standard, which provides a rich and common description tool for multimedia content, and is further encouraged by the TRECVid evaluation campaigns, which aim at benchmarking progress in video content analysis and retrieval tools.

Retrieving complex semantic concepts such as car, road, face or natural disaster from images and videos requires extracting and finely analyzing a set of low-level features describing the content. In order to generate a global result from the various, potentially multimodal, data, a fusion mechanism may take place at different levels of the classification process. Generally, it is applied either directly on the extracted features (feature fusion) or on the classifier outputs (classifier fusion).

In most systems, concept models are constructed independently [34, 46, 55]. However, binary classification ignores the fact that semantic concepts do not exist in isolation and are interrelated by their semantic interpretations and co-occurrence. For example, the concept car co-occurs with road, while meeting is not likely to appear with road. Therefore, multi-concept relationships can be used to improve individual detection accuracy by taking into account the possible relationships between concepts. Several approaches have been proposed. Wu et al. [55] have reported an ontological multi-classification learning for video concept detection. Naphade et al. [34] have modeled the linkages between various semantic concepts via a Bayesian network offering a semantics ontology. Snoek et al. [46] have proposed a semantic value chain architecture for concept detection including a multi-concept learning layer called context link. In this paper, we propose a generic and robust scheme for video shot indexing based on ontological reasoning. First, each individual concept model is constructed independently. Second, the confidence value of each individual concept is re-computed taking into account the influence of other related concepts.

This paper is organized as follows. Section 2 reviews existing video indexing techniques. Section 3 presents our system architecture. Section 4 describes the proposed concept ontology construction, including three types of similarities. Section 5 reports and discusses the experimental results obtained on the TRECVid collection. Finally, Section 6 concludes the paper.

2 Review of existing video indexing techniques

This section presents some related work from the literature in the context of semantic indexing. The field of indexing and retrieval has been particularly active, especially for content such as text, images and video. Different types of visual content representation, and their application to indexing, retrieval and abstracting, are reviewed in [2, 11, 45, 50, 52].

Early systems work on the basis of query by example, where features are extracted from the query and compared to features in the database. The candidate images are ranked according to their distance from the query, and several distance functions can be used to measure the similarity between the query and the images in the database. In Photobook [39], the user selects among three modules to analyze the query: face, shape or texture. The QBIC system [13] offers the possibility to query on several features: color, texture and shape. VisualSeek [44] goes further by introducing spatial constraints on regions. The Informedia system [53] includes camera motion estimation and speech recognition. Netra-V [58] uses motion information for region segmentation; regions are then indexed with respect to their color, position and motion in key-frames. VideoQ [9] goes further by indexing the trajectory of regions. Several papers touch upon the semantic problem. Naphade et al. [33] built a probabilistic framework for semantic video indexing to map low-level media features to high-level semantic labels. Dimitrova [11] presents the main research topics in automatic methods for high-level description and annotation. Snoek et al. [45] summarize the state of the art on methods aiming at automating this time- and resource-consuming process. Vembu et al. [52] describe a systematic approach to the design of multimedia ontologies based on the MPEG-7 standard and a sport events ontology. Chang et al. [20] exploit the audio and visual information in generic videos by extracting atomic representations over short-term video slices.

However, these models are constructed to classify video shots into semantic classes. None of these approaches satisfies holistic indexing, where a user wants to find high-level semantic concepts such as an office or a meeting, for example. The reason is that there is a semantic gap [52] between low-level features and high-level semantics. While it is difficult to bridge this gap for every high-level concept, multimedia processing under a probabilistic framework and ontological reasoning facilitate bridging this gap for a number of useful concepts.

3 System architecture

The general architecture of our system can be summarized in five steps as depicted in Fig. 1: (1) features extraction, (2) classification, (3) perplexity-based weighted descriptors, (4) classifier fusion and (5) ontological readjustment of the confidence values. Let us detail each of those steps:

Fig. 1  General indexing system architecture

3.1 Features extraction

Temporal video segmentation is the first step toward automatic annotation of digital video for browsing and retrieval. Its goal is to divide the video stream into a set of meaningful segments called shots. A shot is defined as an unbroken sequence of frames taken by a single camera. The MPEG-7 standard defines a comprehensive, standardized set of audiovisual description tools for still images as well as movies. The aim of the standard is to facilitate quality access to content, which implies efficient storage, identification, filtering, searching and retrieval of media [31]. Our system employs five types of MPEG-7 visual descriptors: Color, texture, shape, motion and face descriptors. These descriptors are briefly defined as follows:

3.1.1 Scalable Color Descriptor (SCD)

is defined in the hue-saturation-value (HSV) color space with fixed color space quantization. A Haar transform encoding is used to reduce the original 256-bin histogram to 16, 32, 64, or 128 bins [17].

3.1.2 Color Layout Descriptor (CLD)

is a compact representation of the spatial distribution of colors [21]. The color information of an image is divided into an 8×8 grid of blocks. Each block is represented by a single value using the dominant color descriptor or the average color, yielding the CLD components {Y, Cr, Cb}. The three components are then transformed by an 8×8 DCT (Discrete Cosine Transform) into three sets of DCT coefficients. Finally, a few low-frequency coefficients are extracted using zigzag scanning and quantized to form the CLD of a still image.
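As a rough illustration of this pipeline, the sketch below computes a CLD-like vector; it assumes a Y/Cb/Cr input image, block averaging as the representative color, SciPy's DCT, and a simple anti-diagonal scan standing in for the zigzag order, so it is an approximation rather than the reference MPEG-7 implementation.

```python
# Approximate CLD computation: average each of the 8x8 blocks, apply an 8x8 DCT
# per channel, and keep a few low-frequency coefficients in anti-diagonal
# (zigzag-style) order. Illustrative only, not the normative MPEG-7 procedure.
import numpy as np
from scipy.fftpack import dct

def zigzag_indices(n=8):
    """(row, col) pairs of an n x n grid, ordered by anti-diagonals."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1], rc[0]))

def cld_sketch(image_ycbcr, n_coeffs=(6, 3, 3)):
    """image_ycbcr: H x W x 3 array; returns a compact CLD-like vector."""
    h, w, _ = image_ycbcr.shape
    blocks = np.zeros((8, 8, 3))
    for by in range(8):
        for bx in range(8):
            patch = image_ycbcr[by * h // 8:(by + 1) * h // 8,
                                bx * w // 8:(bx + 1) * w // 8]
            blocks[by, bx] = patch.reshape(-1, 3).mean(axis=0)  # representative color
    zz = zigzag_indices(8)
    features = []
    for channel, keep in enumerate(n_coeffs):                   # Y, Cb, Cr
        coeffs = dct(dct(blocks[:, :, channel], axis=0, norm='ortho'),
                     axis=1, norm='ortho')                      # 2-D DCT of the 8x8 grid
        features.extend(coeffs[r, c] for r, c in zz[:keep])
    return np.asarray(features)
```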

3.1.3 Color Structure Descriptor (CSD)

encodes local color structure in an image using an 8×8 structuring element. The CSD is computed by visiting all locations in the image and summarizing the frequency of color occurrences within the structuring element at each location, using one of four HMMD color space quantizations: a 256-, 128-, 64- or 32-bin histogram [32].

3.1.4 Color Moment Descriptor (CMD)

provides information about color that is not explicitly available in the other color descriptors. It is obtained from the mean and the variance of each channel of the LUV color space of an image or region.

3.1.5 Edge Histogram Descriptor (EHD)

expresses the local edge distribution in the image. An edge histogram in the image space represents the frequency and the directionality of the brightness changes in the image. The EHD represents the distribution of five types of edges in each local area called a sub-image. Specifically, the image is divided into 4×4 non-overlapping sub-images and an edge histogram is generated for each sub-image. Four directional edges (0°, 45°, 90°, 135°) are detected in addition to non-directional ones. This yields an 80-dimensional vector (16 sub-images, 5 types of edges). We make use of the improvement proposed by [38] for this descriptor, which consists in adding global and semi-global levels of localization of an image.
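The sketch below mirrors this construction for a single grayscale keyframe, using the five 2×2 MPEG-7 edge filters; the detection threshold and the per-sub-image normalization are illustrative choices, not the values used in the system described here.

```python
# A rough sketch of an EHD-like descriptor: 4x4 sub-images, five edge categories
# per sub-image, 80-D output.
import numpy as np

# 2x2 filter masks for vertical, horizontal, 45°, 135° and non-directional edges
EDGE_FILTERS = {
    "vertical":        np.array([[ 1, -1], [ 1, -1]]),
    "horizontal":      np.array([[ 1,  1], [-1, -1]]),
    "diag_45":         np.array([[ np.sqrt(2), 0], [0, -np.sqrt(2)]]),
    "diag_135":        np.array([[0,  np.sqrt(2)], [-np.sqrt(2), 0]]),
    "non_directional": np.array([[ 2, -2], [-2,  2]]),
}

def ehd_sketch(gray, threshold=11.0):
    """gray: 2-D luminance array; returns a 16 x 5 = 80 dimensional histogram."""
    h, w = gray.shape
    hist = np.zeros((4, 4, 5))
    for sy in range(4):
        for sx in range(4):
            sub = gray[sy * h // 4:(sy + 1) * h // 4, sx * w // 4:(sx + 1) * w // 4]
            # scan the sub-image with non-overlapping 2x2 image blocks
            for y in range(0, sub.shape[0] - 1, 2):
                for x in range(0, sub.shape[1] - 1, 2):
                    block = sub[y:y + 2, x:x + 2]
                    strengths = [abs((block * f).sum()) for f in EDGE_FILTERS.values()]
                    k = int(np.argmax(strengths))
                    if strengths[k] > threshold:      # ignore homogeneous blocks
                        hist[sy, sx, k] += 1
    # normalise each sub-image histogram by its total number of edge blocks
    totals = hist.sum(axis=2, keepdims=True)
    return (hist / np.maximum(totals, 1)).ravel()
```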

3.1.6 Homogeneous Texture Descriptor (HTD)

characterizes a region's texture using local spatial frequency statistics. The HTD is extracted by a Gabor filter bank (6 frequency bands, 5 orientation channels), resulting in 30 channels in total. The energy and energy deviation are then computed for each channel to obtain a 62-dimensional vector [31, 56].

3.1.7 Statistical Texture Descriptor (STD)

is based on statistical measures of the co-occurrence matrix, such as energy, maximum probability, contrast and entropy [1], to model the relationships between pixels within a region for some grey-level configuration in the texture; this configuration varies rapidly with distance in fine textures and slowly in coarse textures.

3.1.8 Contour-based Shape Descriptor (C-SD)

represents a closed 2D object or region contour in an image. To create the Curvature Scale Space (CSS) description of a contour shape, N equidistant points are selected on the contour, starting from an arbitrary point and following the contour clockwise. The contour is then gradually smoothed by repetitive low-pass filtering of the x and y coordinates of the selected points, until the contour becomes convex (no curvature zero-crossing points are found). The concave parts of the contour are gradually flattened out as a result of smoothing. Points separating concave and convex parts of the contour, and the peaks (maxima of the CSS contour map) in between, are then identified. Finally, eccentricity, circularity and the number of CSS peaks of the original and filtered contours are combined to form a more practical descriptor [31].

3.1.9 Camera Motion Descriptor (CM)

details what kind of global motion parameters are present at what instant in time in a scene, provided directly by the camera. It supports the fixed state and seven camera operations: panning (horizontal rotation), tracking (horizontal transverse movement), tilting (vertical rotation), booming (vertical transverse movement), zooming (change of the focal length), dollying (translation along the optical axis), and rolling (rotation around the optical axis) [31].

3.1.10 Motion Activity Descriptor (MAD)

shows whether a scene is likely to be perceived by a viewer as slow, fast paced, or action paced [48]. Our MAD is based on the intensity of motion: the standard deviations are quantized into five activity values, where a high value indicates high activity and a low value indicates low activity.

3.1.11 Face Descriptor (FD)

detects and localizes frontal faces within the keyframes of a shot and provides some face statistics (e.g., number of faces, biggest face size), using the face detection method implemented in OpenCV. It relies on a Haar cascade classifier, which performs a simple operation: given an image, the face detector examines each image location and classifies it as "face" or "not face" [37].

3.2 Classification

Classification consists in assigning classes to videos given some description of their content. The literature is vast and ever growing [24]. This section summarizes the classification method used in the work presented here: Support Vector Machines (SVMs).

SVMs have become widely employed in classification tasks due to their generalization ability in high-dimensional pattern spaces [51]. The main idea is similar to that of a neuron: separate the classes with a hyperplane. However, samples are indirectly mapped into a high-dimensional space through a kernel function. In this paper, a single SVM is used for each low-level feature and is trained per concept under the "one against all" approach. At the evaluation stage, it returns for every shot a normalized value in the range [0, 1] using (1). This value denotes the degree of confidence with which the corresponding shot is assigned to the concept.

$$ y^{j}_{i}=1/\left(1+\exp{(-\alpha d_{i})}\right) $$
(1)

where (i, j) denotes the ith concept and the jth low-level feature, \(d_{i}\) is the distance between the input vector and the hyperplane, and α is a slope parameter obtained experimentally.
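A minimal sketch of this per-feature, per-concept training and of the normalization in (1) follows; it assumes scikit-learn's SVC, and the value of α below is a placeholder for the experimentally tuned slope.

```python
# Sketch of the "one against all" SVM per (concept, feature) pair and the
# sigmoid normalization of (1), assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC

def train_concept_feature_svm(X, y, kernel="rbf"):
    """X: shots x feature-dim matrix for one low-level feature,
    y: binary labels (1 = concept present, 0 = absent)."""
    clf = SVC(kernel=kernel)
    clf.fit(X, y)
    return clf

def confidence_scores(clf, X, alpha=1.0):
    """Map signed distances to the hyperplane onto [0, 1] as in (1);
    alpha is the experimentally tuned slope parameter (placeholder here)."""
    d = clf.decision_function(X)          # signed distance to the hyperplane
    return 1.0 / (1.0 + np.exp(-alpha * d))
```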

3.3 Perplexity-based weighted descriptors

Each concept is best represented or described by its own set of descriptors. Intuitively, the color descriptors should be quite appropriate to detect certain concepts such as sky, snow, waterscape, and vegetation, while being inappropriate for studio, meeting, car, etc.

To this end, we propose to weight each low-level feature according to the concept at hand, without any feature selection (Fig. 2). The variance, as a simple second-order statistic, can be used to describe the dispersion around the mean between descriptors and concepts. Conversely, the entropy depends on more parameters and measures the quantity of information and the uncertainty in a probability distribution. We propose to map the visual features onto a term weight vector via entropy and perplexity measures. This vector is then combined with the original classifier outputsFootnote 1 to produce the final classifier outputs. As presented in Fig. 2, we shall now define the steps of the proposed approach [6].

Fig. 2  Perplexity-based weighted descriptors structure

3.3.1 K-means clustering

It computes the k cluster centers for each descriptor, in order to create a "visual dictionary" of the shots in the training set. The selection of k is an unresolved problem, and only tests and observation of the average performances can help us make a decision. In Souvannavong et al. [47], a comparative study of the classification results versus the number of clusters used for the quantization of the region descriptors of TRECVid 2005 data shows that the performances are not deteriorated by quantizations of more than 1,000 clusters. Based on this result, our system employs k_r = 2,000 for clustering the MPEG-7 descriptors computed from image regions, and k_g = 100 for the global ones. This presents a good compromise between efficiency and low computation time.

3.3.2 Partitioning

Separating data into positive and negative sets is the first step of the model creation process. Typically, based on the annotation data provided by TRECVid, we select the positive samples for each concept.

3.3.3 Quantization

To obtain a compact video representation, we vector-quantize the features using the vocabulary of size k_r = 2,000 (number of visual words), which has empirically shown good results for a wide range of datasets. All features are assigned to their closest vocabulary word using the Euclidean distance.
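To illustrate the dictionary construction of Section 3.3.1 and the quantization above, here is a sketch assuming scikit-learn's MiniBatchKMeans (an implementation choice not specified in this paper); the resulting histogram P is the quantized vector used in the following steps.

```python
# Sketch of the visual-dictionary construction and of the quantization of a
# shot's region descriptors into a k-bin histogram, assuming scikit-learn.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_dictionary(descriptors, k=2000, seed=0):
    """descriptors: stacked region descriptors from the training shots."""
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=3)
    km.fit(descriptors)
    return km

def quantize_shot(km, shot_descriptors):
    """Assign each descriptor of one shot to its closest visual word
    (Euclidean distance) and return the normalized k-bin histogram P."""
    words = km.predict(shot_descriptors)
    hist = np.bincount(words, minlength=km.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```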

3.3.4 Entropy measure

The entropy H (2) of a feature vector distribution P = (P_0, P_1, ..., P_{k−1}) measures the uniformity of a concept's distribution over the k clusters [27]. Following [22], a good model is one whose distribution is heavily concentrated on only a few clusters, resulting in a low entropy value.

$$ H=-\sum\limits_{i=0} ^{k-1}P_{i}\log(P_{i}) $$
(2)

where P_i is the probability of cluster i in the quantized vector.

3.3.5 Perplexity measure

In [15], perplexity (PPL) or normalized perplexity value (\(\overline{PPL}\)) (3) can be interpreted as the average number of clusters needed for an optimal coding of the data.

$$ \overline{PPL}=\frac{PPL}{PPL_{\rm max}}=\frac{2^{H}}{2^{H_{\rm max}}} $$
(3)

If we assume that the k clusters are equally probable, we obtain the maximal entropy H(P) = log(k); hence \(1 \leq PPL \leq k\) and \(1/k \leq \overline{PPL} \leq 1\).

3.3.6 Weight

In speech recognition, handwriting recognition, and spelling correction [15], it is generally assumed that a lower perplexity/entropy correlates with better performance, or in our case, with a very concentrated distribution. The relative weight of the corresponding feature should therefore be increased. Many formulas can be used to represent the weight, such as sigmoid, softmax, Gaussian, etc. In this paper, we choose Verhulst's evolution model (4). This function is non-exponential; it allows a braking rate α_i to be defined, as well as a reception capacity (upper asymptote) K, while β_i defines the decreasing speed of the weight function.

$$ w_{i}=K \frac{1}{1+\beta_{i} \exp{(-\alpha_{i}(1/\overline{\text{PPL}}_{i}))}} $$
(4)
$$ \beta_{i}= \left\lbrace \begin{tabular}{ll} $K\exp{(-\alpha_{i} ^2)} \ \ \text{if } \ Nb^{+} _{i} < 2*k$ \\ $1 \ \ \ \text{Otherwise} $ \end{tabular} \right. $$
(5)

β_i is introduced to decrease the negative effect of the limited training set, due to the low number of positive samples (\(Nb^{+} _{i} << k\)) of certain concepts such as weather, desert, mountain, etc. (see Table 2). For such concepts we observe a lower perplexity value, which should not be interpreted as a relevant relation between descriptor and concept. We therefore increase β_i (5) to obtain a rapid weight decrease for each concept presenting fewer than 2 * k positive samples.
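The following sketch gathers equations (2) to (5); the values of K and α are illustrative placeholders rather than the tuned parameters, and base-2 logarithms are used so that PPL_max = k.

```python
# Sketch of the perplexity-based weighting of (2)-(5).
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a quantized distribution P over k clusters, (2)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -(nz * np.log2(nz)).sum()

def normalized_perplexity(p):
    """PPL / PPL_max = 2^H / 2^H_max with H_max = log2(k), cf. (3)."""
    k = len(p)
    return 2.0 ** entropy(p) / k

def verhulst_weight(norm_ppl, n_positives, k, K=1.0, alpha=4.0):
    """Weight of one descriptor for one concept, following (4) and (5);
    K and alpha are placeholder values, not the experimentally tuned ones."""
    beta = K * np.exp(-alpha ** 2) if n_positives < 2 * k else 1.0
    return K / (1.0 + beta * np.exp(-alpha * (1.0 / norm_ppl)))
```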

The relevance of the various descriptors for identifying high-level concepts can be assessed through the perplexity distribution (see Fig. 3). The boxplot provides a good visual summary of many important aspects of a distribution: the lower and upper lines express the data range, the lower and upper edges of the box indicate the 25th and 75th percentiles, and the line inside the box indicates the median value of the data. Figure 3 shows the normalized perplexity for each descriptor and its best concept, given by the minimum observation: for example, SCD is more effective at detecting the concept sky "13", EHD at detecting road "12", etc. A first observation concerns the similar median perplexity obtained for SCD, CLD, CMD and CSD, where color is more discriminant. Second, C-SD gives the smallest 25th percentile of normalized perplexity over all data, followed by EHD and SCD. Third, EHD appears very useful in detecting contours, as in the sport and road concepts; a similar observation holds for C-SD. Conversely, MAD presents a large interval of perplexity but gives small values for the concepts walking-running and people-marching, where motion activity can be detected. Finally, FD is a relevant descriptor for detecting the face and person concepts, which was to be expected from the very nature of this descriptor.

Fig. 3  Normalized perplexity boxplot

This approach is proposed to weight each low-level feature per concept, within an adaptive classifier fusion step (Section 3.4). The combination provides a new classification system that we call PENN “Perplexity-based Evidential Neural Network”. We will now present the classifier fusion step.

3.4 Classifier fusion

Classifier fusion is an important step of the classification task. It improves recognition reliability by taking into account the complementarity between classifiers, in particular for multimedia indexing and retrieval. Several schemes have been proposed in the literature, according to the type of information provided by each classifier as well as their training and adaptation capacity. A state of the art and a comparative study of the effectiveness of classifier fusion methods are given in [4].

In [12], Duin et al. distinguish between methods that combine different classifiers and methods that combine weak classifiers. Another kind of grouping, based only on the type of classifier outputs (class, measure), is proposed in [57]. Jain [18] builds a dichotomy according to two criteria of equal importance: the type of classifier outputs and their learning capacity. This last criterion is used by [25, 26] to group the combination methods. Trainable combiners search for and adapt the parameters of the combination, whereas non-trainable combiners use the classifier outputs without integrating any a priori information about the performance of each classifier.

In this part, we describe our proposed neural network based on evidence theory (NNET) [5] to address classifier fusion (Fig. 4).

  1.

    Layer L_1 (input): contains N units. It is identical to the RBF (Radial Basis Function) network input layer, with an exponential activation function ϕ; d^i is a distance computed using the training data, and α^i ∈ [0, 1] is a weakening parameter associated with unit i.

    $$ \left\lbrace \begin{tabular}{ll} $s^{i}=\alpha^{i} \phi{(d^{i})}$ \\ \par $\phi(d^{i})=\exp{(-\gamma^{i} (d^{i})^{2})}$ \end{tabular} \right. $$
    (6)
  2.

    Layer L_2: computes the belief masses m^i (7) associated with each unit. The units of module i are connected to neuron i of the previous layer.

    $$ \left\lbrace \begin{tabular}{ll} $m^{i}(\{w_{q}\})=\alpha^{i} u_{q} ^{i}\phi{(d^{i})}$ \\ $m^{i}(\Omega)=1-\alpha^{i} \phi{(d^{i})} $ \end{tabular} \right. $$
    (7)

    where \(u_{q} ^{i}\) is the membership degree to class w_q, with class index q ∈ {1, ..., M}.

  3.

    Layer L_3: the Dempster–Shafer combination rule combines the N mass functions into a single mass. It is given by the conjunctive combination (8):

    $$ m(A)= \left(m^{1}\oplus...\oplus m^{N}\right)(A) =\sum_{B_{1} \bigcap...\bigcap B_{N} = A } \prod_{i=1} ^{N} m^{i} (B_{i}) $$
    (8)

    The activation vector of module i is denoted \(\stackrel{\rightarrow}{\mu ^{i}}\). It can be computed recursively using:

    $$ \left\lbrace \begin{tabular}{ll} $\mu^{1}=m^{1}$\\ \par $\mu^{i} _{j} = \mu^{i-1} _{j} m^{i} _{j} + \mu^{i-1} _{j} m^{i} _{M+1} + \mu^{i-1} _{M+1} m^{i} _{j}$ \\ \par $\mu^{i} _{M+1} = \mu^{i-1} _{M+1} m^{i} _{M+1}$ \end{tabular} \right. $$
    (9)
  4.

    Layer L_4 (output): in [10], the output is obtained directly as \(O_{j}=\mu^{N} _{j}\). Experiments show that this output is very sensitive to the number of prototypes: a small change in this number can alter the behavior of the classifier fusion. To solve this problem, we use the normalized output (10), which is computed from the activation vectors of all prototypes in order to decrease the effect of a possibly badly behaved prototype in the mass computation.

    $$ O_{j}=\frac{\sum_{i=1} ^{N} \mu_{j} ^{i}}{\sum_{i=1} ^{N} \sum_{j=1} ^{M+1} \mu_{j} ^{i}} $$
    (10)
    $$ P_{q}=O_{q}+O_{M+1} $$
    (11)

    The different parameters (Δu, Δγ, Δα, ΔP, Δs) can be determined by gradient descent on the output error for an input pattern x. Finally, the maximum of the plausibilities P_q over the classes w_q is computed. A minimal sketch of this fusion step is given after this list.
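The sketch below (in Python with NumPy) illustrates the forward pass of layers L_1 to L_4 described above, i.e. equations (6) to (11); the prototypes and the parameters u, γ and α are assumed to be already learned, and their training by gradient descent is not shown.

```python
# Sketch of the NNET forward pass (6)-(11). Prototypes, u, gamma and alpha are
# assumed already learned (in the paper they are tuned by gradient descent).
import numpy as np

def nnet_forward(x, prototypes, u, gamma, alpha):
    """x: input pattern (concatenated classifier outputs); prototypes: N x dim;
    u: N x M membership degrees; gamma, alpha: N-vectors. Returns the M+1
    normalized outputs O of (10) and the plausibilities P_q of (11)."""
    N, M = u.shape
    d = np.linalg.norm(prototypes - x, axis=1)             # distances, layer L_1
    s = alpha * np.exp(-gamma * d ** 2)                    # activations alpha*phi(d), (6)
    # layer L_2: one mass function per prototype over {w_1..w_M, Omega}
    m = np.zeros((N, M + 1))
    m[:, :M] = s[:, None] * u                              # m^i({w_q}), (7)
    m[:, M] = 1.0 - s                                      # m^i(Omega)
    # layer L_3: recursive conjunctive (Dempster) combination, (9)
    mus = [m[0].copy()]
    for i in range(1, N):
        prev = mus[-1]
        new = np.empty(M + 1)
        new[:M] = prev[:M] * m[i, :M] + prev[:M] * m[i, M] + prev[M] * m[i, :M]
        new[M] = prev[M] * m[i, M]
        mus.append(new)
    mus = np.vstack(mus)                                   # N x (M+1) activation vectors
    # layer L_4: normalized output over all prototype activation vectors, (10)
    O = mus.sum(axis=0) / mus.sum()
    P = O[:M] + O[M]                                       # plausibilities, (11)
    return O, P
```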

Fig. 4  NNET classifier fusion structure

Therefore, combining the perplexity-based weighting of the low-level features per concept with the adaptive NNET classifier fusion provides a novel system that we call PENN "Perplexity-based Evidential Neural Network".

4 Concept ontology construction

Ontologies have historically been used to achieve better performance in multimedia retrieval systems [8]. An ontology defines a set of representative concepts and the inter-relationships among them. It is therefore important to introduce some constraints on the development of the similarity measures before proceeding to the presentation of our method. Psychology demonstrates that similarity depends on the context and may be asymmetric [30]. However, when ontologies have been defined for multimedia, they have not been extensively used at the decision-making stage of high-level concept detection.

Most indexing models are based on binary classification, ignoring possible relationships between concepts. However, concepts do not exist in isolation and are interrelated by both their semantic interpretations and their co-occurrence. Wu et al. [55] have reported an ontological multi-classification learning for video concept detection in the NIST TREC-2003 Video Retrieval Benchmark.Footnote 2 Ontology-based multi-classification learning consists of two steps. In the first step, each single concept model is constructed independently based on an SVM (Support Vector Machine). In the second step, ontology-based concept learning improves the accuracy of the individual classifiers by updating the confidence scores of the single concept models. Two kinds of influence have been defined: a confusion factor β and a boosting factor λ. The confusion factor β is the influence between concepts that cannot co-exist. The boosting factor λ is the top-down influence from broader to narrower concepts in the ontology hierarchy. A small example of such influence is presented in Fig. 5. The factors are obtained through a correlation study of the training data. The confidence values are then updated, as shown in (12) and (13).

$$ \left\lbrace \begin{tabular}{ll} $\underline{p}(x / C_i) = p(x / C_i) + \displaystyle\sum\nolimits_{j \in \psi} {\lambda_{j} ^{i} p(x / C_j)}$ \\[6pt] \par $\lambda_{j} ^{i} =\dfrac{A}{B+\exp{(C |p(s / C_i)-p(s / C_j)|)}}$ \end{tabular} \right. $$
(12)
$$ \left\lbrace \begin{tabular}{ll} $\underline{p}(x / C_i) = \dfrac{p(x / C_i)}{\beta}$ \\[6pt] \par $\beta = \dfrac{1}{f(p(x / C_i)- \max_{j \in \theta}(p(x / C_j)))}$ \end{tabular} \right. $$
(13)
Fig. 5  An example of ontology used by Wu et al. [55]

The parameters A, B and C of (12) are obtained empirically, as described in the work of Li et al. [28]. In (13), f(·) is a positive, increasing function.

Naphade et al. [34] have modeled the linkages between various semantic concepts via a Bayesian network offering a semantics ontology. The central theme of this approach is the concept of Multijects, or Multimedia Objects. A Multiject has a semantic label and summarizes a time sequence of low-level features of multiple modalities in the form of a probability. It has three main aspects. The first aspect is the semantic label. The second aspect is that a Multiject summarizes a time sequence of low-level features. Third, the detection of a certain Multiject can increase or decrease the probability of occurrence of other Multijects; for example, if the Multiject beach is detected with a very high probability, then the probability of occurrence of the Multijects yacht and sunset increases. This third aspect is their interaction in a network.

The authors assume that all concepts lie at the same semantic level and are related, through conditional dependence relations, to the associated low-level descriptors.

Fan et al. [14] have proposed a hierarchical classification for image annotation.Footnote 3 This approach introduces the contextual dependencies of the WordNet ontology and the co-occurrence relationship, as expressed by the following equation:

$$ \left\lbrace \begin{tabular}{ll} $\lambda(C_{m},C_{n})=\rho(C_{m},C_{n})\pi(C_{m},C_{n})$ \\[6pt] \par $\text{where} \ \ \ \rho(C_{m},C_{n})=\log\left(\dfrac{P(C_{m},C_{n})}{P(C_{m})P(C_{n})}\right)$ \\[12pt] \par $\text{and} \ \ \ \pi(C_{m},C_{n})=-\log{\left(\dfrac{dist(C_{m},C_{n})}{2D}\right)}$ \end{tabular} \right. $$
(14)

where ρ(C_m, C_n) is the joint probability of the two concepts, obtained by computing the co-occurrence frequency of C_m and C_n, and π(C_m, C_n) is the contextual dependency extracted from the ontology structure (dist is the length of the shortest path between the two concepts, and D is the maximum depth of WordNet).

Hauptmann et al. [16] have presented a comparison between unimodal and multimodal indexing. Their multimodal system learns the dependencies between concepts using the following graphical models: Conditional Random Fields (CRF) and Bayesian networks. The two models provide similar results in terms of precision, and both outperform the unimodal approach. Koskela et al. [22] have exploited the correlations between concepts to build a clustering method.

In another development, Li et al. [29] have proposed a study of various linear and non-linear functions S = f(f_1, f_2, f_3) depending on the shortest path length l, the depth h of the subsumer concept in the hierarchy, and the local semantic density d, as shown in (15).

$$ \left\lbrace \begin{tabular}{ll} $f_1=\exp{(-\alpha l)}$ \\[6pt] \par $f_2=\dfrac{e^{\beta h}-e^{-\beta h}}{e^{\beta h}+e^{-\beta h}}$ \\[12pt] \par $f_3=\dfrac{e^{\delta d}-e^{-\delta d}}{e^{\delta d}+e^{-\delta d}}$ \end{tabular} \right. $$
(15)

where α is a constant, β > 0 is a smoothing factor. \(\delta=max_{c\in CS(c_m,c_n)}(-{\rm{log}} p(c))\) represents the semantic similarity measured by the information content.

Several combinations have been applied and evaluated, such as S_1 = f_1, S_2 = f_1 f_2, S_3 = S_2 f_3, S_4 = S_2 + f_3, etc. The results obtained with different parameters (α and β) indicate that several functions perform satisfactorily, particularly those that use all three influences.

Discussion   The work of Wu et al. [55] updates the confidence values using the correlation of the data and a fixed ontology structure. Naphade et al. [34] have modeled the low-level features and the co-occurrence between concepts. Koskela et al. [22] have included co-occurrence and visual information in the construction of this relationship. Fan et al. [14], like Li et al. [29], have incorporated contextual dependencies of the WordNet ontology and co-occurrences. This paper extends these preceding works in terms of inter-concept similarity construction: we use the co-occurrence in the corpus, the visual information derived from the low-level description, and finally the hybrid semantic similarity obtained from the ontology architecture.

In the LSCOM-lite ontologyFootnote 4 [35], we notice positive relationships such as (road, car) and (vegetation, mountain), and negative relationships such as (building, sports) and (sky, meeting).

Here, we investigate how the relationships between different semantic concepts can be extracted and used. One direct method for similarity calculation is to find the minimum path length connecting two concepts [40]. For example, Fig. 6 illustrates a fragment of the semantic hierarchy of LSCOM-Lite. The shortest path between vegetation and animal is vegetation-outdoor-location-root-objects-animal, with a minimum length of 5, whereas the minimum path length between vegetation and outdoor is 1. Thus, in the LSCOM-lite ontology, we could say that outdoor is semantically more similar to vegetation than animal is; but we should not say that animal is more similar to car. On the other hand, outdoor covers many different concepts such as desert, urban, road, etc., each with different color and texture scene descriptions. Therefore, linking concepts can infer new and more complex concepts, or improve the recognition of concepts previously detected. The presence or absence of certain concepts thus suggests a high or low probability of finding other concepts (e.g., detecting sky and sea increases the probability of the concept beach and reduces the likelihood of desert). For this reason, more information about the relations between concepts is introduced, so that the similarity becomes a function of the attributes co-occurrence, low-level visual descriptors, path length, depth and local density, in order to boost the performance of the indexing system:

$$ \lambda(C_{m},C_{n})=\sum\limits_{i\in\{\text{Cos, Vis, Sem}\}}Sim_{i}(C_{m},C_{n}) $$
(16)
Fig. 6  Fragment of the hierarchical LSCOM-Lite

Below, we explain with more details the similarity forms used in our architecture.

4.1 Co-occurrence

The first similarity is obtained by considering the co-occurrence statistics between concepts, where the presence or absence of certain concepts may predict the presence of others. Intuitively, documents (video shots) that are "close together" in the vector space relate to similar things. Many methods are proposed in the literature to represent this proximity, such as the Euclidean, Hamming, or Dice measures. Here, we use the cosine similarity because it reflects similarity in terms of the relative distributions of components; the cosine is not influenced by one document being small compared to others, as the Euclidean distance tends to be [23]:

$$ Sim_{\rm cos}(P^{m},P^{n})=\frac{\sum_{i=0} ^{k-1}P_{i} ^{m} P_{i} ^{n}}{\sqrt{\sum_{i=0} ^{k-1}(P_{i} ^{m})^{2} \sum_{i=0} ^{k-1}(P_{i} ^{n})^{2}}} $$
(17)
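A direct transcription of (17) is given below; here P^m and P^n are taken to be the quantized distribution vectors associated with concepts m and n, as in the surrounding equations.

```python
# Cosine co-occurrence similarity of (17).
import numpy as np

def cosine_similarity(p_m, p_n):
    """p_m, p_n: k-dimensional distribution vectors of two concepts."""
    p_m, p_n = np.asarray(p_m, dtype=float), np.asarray(p_n, dtype=float)
    denom = np.sqrt((p_m ** 2).sum() * (p_n ** 2).sum())
    return float((p_m * p_n).sum() / denom) if denom > 0 else 0.0
```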

4.2 Visual similarity

The second similarity is based on low-level visual features. In Section 3.3, we used perplexity to build a weighted descriptor per concept. Now, in order to compute the visual similarity, we are interested in mutual information presented as a measure of divergence. Several measures are proposed in the literature for this purpose: Jensen–Shannon (JS), Kullback–Leibler (KL), etc. We decided to use the Jeffrey divergence d_{JD} [23], which is similar to d_{KL} but numerically more stable.

$$ d_{JD}(P^{m},P^{n})=\sum\limits_{i=0} ^{k-1}\left(P_{i} ^{m} \log{\frac{P_{i} ^{m}}{\hat{P_{i}}}}+P_{i} ^{n} \log{\frac{P_{i} ^{n}}{\hat{P_{i}}}} \right) $$
(18)

where \(\hat{P_{i}} = \frac{P_{i}^{m}+P_{i}^{n}}{2}\) is the mean distribution. The visual similarity between two concepts is:

$$ Sim_{\rm vis}(C_{m},C_{n})=\frac{1}{\sum_{i=1} ^{Nb \ features} \frac{1}{2}(w_{i} ^{m}+w_{i} ^{n})d^{i} _{JD}(P^{m},P^{n})} $$
(19)

where \(w_{i} ^{m}\) is the perplexity-based weight of the ith descriptor for concept m.
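A sketch of (18) and (19) follows; dists_m and dists_n are the quantized distributions of the two concepts for each descriptor, the weights are those produced in Section 3.3, and the small ε guarding against empty bins is an implementation detail, not part of the paper.

```python
# Jeffrey divergence per descriptor (18) and perplexity-weighted visual similarity (19).
import numpy as np

def jeffrey_divergence(p_m, p_n, eps=1e-12):
    p_m = np.asarray(p_m, dtype=float) + eps
    p_n = np.asarray(p_n, dtype=float) + eps
    p_hat = 0.5 * (p_m + p_n)                              # mean distribution
    return float((p_m * np.log(p_m / p_hat) + p_n * np.log(p_n / p_hat)).sum())

def visual_similarity(dists_m, dists_n, weights_m, weights_n):
    """dists_*: list of quantized distributions (one per descriptor) for each
    concept; weights_*: the corresponding perplexity-based weights."""
    total = sum(0.5 * (w_m + w_n) * jeffrey_divergence(p_m, p_n)
                for p_m, p_n, w_m, w_n
                in zip(dists_m, dists_n, weights_m, weights_n))
    return 1.0 / total if total > 0 else 0.0
```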

4.3 Semantic similarity

The semantic similarity between concepts has been widely studied in the literature and can be classified into three major approaches [43]:

4.3.1 Distance-based approach

It estimates the distance (edge length) between the nodes corresponding to the concepts being compared. Two concepts C_m and C_n are similar if the path between them is short, i.e., the minimum number of edges separating the two concepts is small. Rada et al. [40] propose the following equation:

$$ Sim_{sem}(C_{m},C_{n})=1/\left(1+dist_{Rada}(C_{m},C_{n})\right) $$
(20)

Wu and Palmer [54] propose a similarity (see Fig. 7) based on the depth of the subsuming concept CS Footnote 5 and the depths of the two concepts (21).

$$ Sim_{sem}(C_{m},C_{n})=\frac{2*depth(CS)}{depth(C_{m})+depth(C_{n})} $$
(21)
Fig. 7  The concept similarity measure

The drawbacks of this approach are its dependence on the position of the concepts in the hierarchy, and the fact that all edges have the same weight, which makes it difficult to define and control the edge distances.

4.3.2 Information content-based approach

It takes into account the information shared by the concepts, in terms of an entropy measure. Two methods exist. The first uses a learning corpus and computes the probability p(C_i) of finding the concept C_i or one of its descendants. For Resnik [41], the semantic similarity can be obtained from the frequency of appearance in the corpus and is defined by:

$$ Sim_{sem}(C_{m},C_{n})=\max{\left(IC(CS(C_m,C_n))\right)} $$
(22)

where IC(C_i) = − log(p(C_i)) is the information content of the concept C_i (i.e., the entropy of class C_i). The probability p(C_i) is computed by dividing the number of instances of C_i by the total number of instances in the corpus. This measure does not seem complete and precise because it depends only on the specific subsuming concept.

The second method computes the information content of nodes from WordNet instead of a corpus. Seco et al. [42] use the descendant hyponyms of the concepts to obtain the information content. This approach can yield a similarity between two neighboring concepts of an ontology that exceeds the similarity between two concepts contained in the same hierarchy, which is inadequate in the context of information retrieval.

4.3.3 Hybrid approach

The hybrid approach combines the two previous approaches. It often reuses the information content of the nodes and the smallest common ancestor, as in the equation of Lin et al. [30], or in the distance of Jiang and Conrath dist_{J&C} [19].

$$ Sim_{sem_{Lin}}(C_{m},C_{n})=\frac{2*\log{P(CS)}}{\log{P(C_{m})}+\log{P(C_{n})}} $$
(23)
$$ \left\lbrace \begin{tabular}{ll} $dist_{J\&C}(C_m,C_n)=IC(C_m)+IC(C_n)-2*IC(CS(C_m,C_n))$ \\ \par $Sim_{sem_{J\&C}}(C_{m},C_{n})=1/\left(dist_{J\&C}(C_m,C_n)\right)$ \end{tabular} \right. $$
(24)

For the ontology presented in Fig. 8, we compare the two preceding hybrid approaches with the novel one presented in (26), which is the combination of Rada [40] and J&C [19].

$$ \left\lbrace \begin{tabular}{ll} $Sim_{sem_{J\&C}}(C_{m},C_{n})=1/d_{J\&C}(C_m,C_n)$ \\ \par $d_{J\&C}(C_m,C_n)=IC(C_m)+IC(C_n)-2*IC(CS(C_m,C_n))$ \end{tabular} \right. $$
(25)
$$ Sim_{sem}(C_{m},C_{n})=1/\left(d_{Rada}(C_{m},C_{n})+ d_{J\&C}(C_m,C_n)\right) $$
(26)

where d_{Rada}(C_m, C_n) is the length of the shortest path between C_m and C_n.
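A sketch of the combined measure (26) is given below, under simple assumptions: the ontology is represented as an undirected graph of parent/child links, the information content IC(C) = −log p(C) is precomputed from the corpus, and the smallest common subsumer CS is supplied by the caller; all function and variable names are illustrative.

```python
# Hybrid semantic similarity (26): shortest-path distance (Rada) plus the
# Jiang & Conrath distance computed from corpus information content.
import math
from collections import deque

def shortest_path_length(graph, a, b):
    """Breadth-first search over the concept hierarchy; graph: {node: [neighbors]}."""
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return math.inf

def hybrid_semantic_similarity(graph, ic, subsumer, c_m, c_n):
    """ic: {concept: -log p(concept)}; subsumer: smallest common subsumer CS."""
    d_rada = shortest_path_length(graph, c_m, c_n)
    d_jc = ic[c_m] + ic[c_n] - 2.0 * ic[subsumer]          # Jiang & Conrath, (25)
    return 1.0 / (d_rada + d_jc)
```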

Fig. 8  Hierarchical ontology model

4.4 Concept-based confidence value readjustment (CCVR)

The proposed framework (Fig. 1) introduces a reranking, or confidence value readjustment, step to refine the PENN results for concept detection [7]; it is computed using:

$$ \underline{P\left(x/C_{i}\right)}=P\left(x/C_{i}\right)+\frac{1}{Z}\sum\limits_{j=1} ^{Nb \ arc} \lambda_{i,j}(1-\zeta_{j})P\left(x/C_{j}\right) $$
(27)

where \(P\left(x/C_{i}\right)\) corresponds to the multi-modal PENN result, \(\underline{P\left(x/C_{i}\right)}\) is the readjusted confidence value, λ_{i,j} is the causal relationship between concepts C_i and C_j, ζ_j is the classifier error on the validation set, and Z is a normalization term.
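A sketch of (27) is given below. The matrix lam holds the inter-concept similarities of (16), zeta the per-concept classifier errors measured on the validation set, and scores the PENN confidence values; taking Z as the sum of the λ weights of the connected concepts is our own assumption, since the paper only states that Z is a normalization term.

```python
# Concept-based confidence value readjustment (27).
import numpy as np

def readjust_confidences(scores, lam, zeta):
    """scores: shots x concepts PENN confidences; lam: concepts x concepts
    similarities from (16); zeta: per-concept validation errors."""
    n_shots, n_concepts = scores.shape
    adjusted = scores.copy()
    for i in range(n_concepts):
        contrib = np.zeros(n_shots)
        for j in range(n_concepts):
            if j != i and lam[i, j] > 0:                   # connected concepts only
                contrib += lam[i, j] * (1.0 - zeta[j]) * scores[:, j]
        z = max(lam[i].sum() - lam[i, i], 1e-12)           # assumed normalization term Z
        adjusted[:, i] = scores[:, i] + contrib / z
    return adjusted
```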

5 Experimentations

The experiments reported here are conducted on the TRECVid 2007 dataset [49], which contains science news, news reports, documentaries, etc. The 100 hours of video are segmented into shots and annotated [3] with semantic concepts from the 36 defined labels; half of the data is used to train the feature extraction system and the other half is used for evaluation purposes. The evaluation is carried out in the context of TRECVid using the mean average precision (MAP), in order to provide a direct comparison of the effectiveness of the proposed approach with other published work using the same dataset. Precision measures the ability of a system to return only relevant sequences.

$$ \text{AP}=\frac{\sum_{k=1} ^{n}\left(\frac{\text{number of relevant video sequences in the top } k}{k}\right) rel(k)}{\text{total number of relevant video sequences}} $$
(28)

where rel(k) equals 1 if the video sequence at rank k is relevant and 0 otherwise, and n is the number of retrieved sequences.

Other metrics are introduced in our evaluation to allow a global comparison: the F-measure, the classification rate CR, and the balanced error rate BER.Footnote 6 The classifier results can be represented in a confusion matrix (Table 1), where a, b, c and d represent the number of examples falling into each possible outcome:

$$ \text{F-measure}=2\frac{P.R}{P+R} $$
(29)
$$ \text{BER} = \frac{1}{2}(\frac{b}{a+b}+\frac{c}{c+d}) $$
(30)
Table 1 Confusion matrix representation
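For completeness, here is a sketch of the metrics used: the average precision computed on a ranked list, and the F-measure and BER of (29) and (30) computed from the confusion-matrix counts, interpreting a as true positives, b as false negatives, c as false positives and d as true negatives (an interpretation consistent with (30), not stated explicitly in Table 1).

```python
# Evaluation metrics: AP over a ranked list, F-measure (29) and BER (30).
import numpy as np

def average_precision(relevance_of_ranked):
    """relevance_of_ranked: 1/0 flags of the retrieved shots, in ranked order."""
    rel = np.asarray(relevance_of_ranked, dtype=float)
    if rel.sum() == 0:
        return 0.0
    cum_hits = np.cumsum(rel)
    precision_at_k = cum_hits / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def f_measure(a, b, c):
    """a = true positives, b = false negatives, c = false positives."""
    precision = a / (a + c) if a + c else 0.0
    recall = a / (a + b) if a + b else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def balanced_error_rate(a, b, c, d):
    """d = true negatives; cf. (30)."""
    return 0.5 * (b / (a + b) + c / (c + d))
```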

Figure 9 shows the variation of the average precision results versus the semantic concepts, for three systems: NNET,Footnote 7 PENN,Footnote 8 and Onto-PENN.Footnote 9 First, we observe that the PENN and Onto-PENN systems have, on average, the same performance for several concepts, and present a significant improvement compared to NNET for the concepts 4, 6, 17, 18, 19, 23, 31 and 32. This is not surprising considering the manner in which the MAP (Mean Average Precision) is computed (using only the first 2,000 returned shots, as in TRECVid) (see Table 2). Furthermore, low performances on several concepts can be observed, due both to numerous conflicting classifications and to limited training data, regardless of the fusion system employed. This also explains the rather low retrieval accuracy obtained for concepts 3, 22, 25, 26, 33 and 34.

Fig. 9  Average precision evaluation

Table 2 Id of the TRECVid 2007 concepts

To evaluate the contribution of the inter-concept similarity to the video shot indexing system, we need to study the results on the whole test set. To this end, the detection performances are compared by thresholding the soft decisions at the shot level, before and after using the inter-concept similarity, via F-meas, CR+ and BER. Note that the MAP is not sensitive to the threshold value τ. Figure 10 compares the three experimental systems as τ varies in [0.1, 0.9] by steps of 0.1. We can clearly see that for any τ value, Onto-PENN dominates and obtains higher F-meas and CR+ as well as a lower BER compared to PENN and NNET. The minimum BER of 40.38% is obtained at τ = 0.2, with F-meas = 16.98% and CR+ = 34.48%. The best results are obtained for τ ∈ [0.2, 0.5]. With τ = 0.40, CR+ is improved by 10.14% to reach 22.07%, and the BER is decreased by 2.91% compared to NNET.

Fig. 10  Evaluation of the metrics (CR+, BER and F-measure) vs the threshold τ ∈ [0.1, 0.9]

Figure 11 presents the performance evolution per concept using τ = 0.4. Several points can be noted. The three systems all fail to detect (F-meas = 0, CR+ = 0) the concepts 2, 3, 9, 11, 25, 26, 28, 29, 33, 34, and 36. NNET additionally cannot detect any of the concepts 1, 5, 6, 20, 21, 22, 31, 32, and 35; similarly, PENN fails on 5, 20, 22, and 35. Finally, Onto-PENN resolves the limitations mentioned above and achieves a large improvement for the concepts 1, 4, 7, 8, 10, 12, 13, 15, 16, 17, 18, 19, 22, 23, 24, and 31, thanks to the strong relationships between the connected concepts, allowing for better, more accurate decision-making.

Fig. 11  F-measure and CR+ evaluation

As an example, to detect the face, person, meeting, or studio concepts, PENN gives more importance to FaceDetector, ContourShape, ColorLayout, ScalableColor and EdgeHistogram than to the other descriptors. For the person concept, the improvement was as high as 11%, making it the best performing run. The Onto-PENN system introduces the relationships between connected concepts (i.e., concepts that are likely to co-occur in video shots), increasing the performance in terms of accuracy (see Fig. 12). The co-occurring concepts constitute a form of contextual information about the content of the shot under consideration.

Fig. 12  Inter-concept connections graphical model for the concept office. We observe that 20 concepts are connected with office, but only 5 are strong and significant (meeting: 6.65%, studio: 5.06%, face: 33.92%, person: 38.52%, and computerTV: 4.77%), representing 88.92% of the global information

Table 3 summarizes the overall performances of the content-based video shot classification systems using a fixed threshold (τ = 0.4). We compute the above-mentioned statistics for all concepts, and for a subset composed of the 10 most frequent concepts in the dataset. All hybrid semantic similarity-based Onto-PENN variants improve the overall system, with a significant increase of F-meas and CR+. They achieve a respectable MAP, and significantly decrease the balanced error rate (BER) compared to NNET and PENN. Finally, the results given by the two equations (25 and 26) are very close, with a slight advantage for (26). However, it can be observed that the MAP declines when using the equations of Rada and Lin compared to the two equations used here, which underlines the importance of the choice of semantic similarity.

Table 3 Performance comparisons between the three experimental systems: NNET, PENN and Onto-PENN

6 Conclusions

In this paper, we have presented a generic and robust ontology-based video shot indexing scheme. One particular aspect of the proposed framework is the use of contextual information during the classification phase. To learn the influence of the relations between concepts, three types of influence are computed: co-occurrence, visual descriptors and hybrid semantic similarity. A comparison of several approaches to automatically construct the semantic similarity has been presented. Based on the newly defined simulated user principle, we evaluate the results of four alternative methodologies. We demonstrate through a statistical study and empirical testing the potential of multimodal fusion for video shot retrieval. On the TRECVid 2007 benchmark, our system obtains a significant improvement: about 18.75% in terms of correct positive recognition rate (CR+), 5.99% for the F-measure, 1.66% for the mean average precision (MAP), and a decrease of the balanced error rate of 2.91% on average. Our proposed Onto-PENN method clearly outperforms both the NNET and PENN methods, which do not use any contextual information. In addition, we have shown that integrating the perplexity-based weighted vector in the indexing pipeline increases the performance of our system.

In future work, we plan to extend the approach to use WordNet instead of a corpus, and to integrate richer semantics and broader knowledge.