1 Introduction

Scene understanding plays an important role in many computer vision applications [21-28]. Within this area, object recognition is a challenging task that has been widely studied over the past decade. For example, some works [10, 18, 19] recognize objects under the single-label setting, i.e., each image is assigned a single label; others [3, 8, 20] consider the multi-label setting, i.e., each image can be assigned one or more labels simultaneously; and the authors of [17] combine the localization and classification tasks. Nevertheless, recognizing objects in natural scenes remains quite difficult, because natural scenes are typically complex and suffer from problems such as ambiguity and occlusion.

To deal with natural scenes, some works pursue more robust recognition algorithms by considering the semantics behind images. They commonly use topic modeling algorithms, e.g., latent Dirichlet allocation (LDA) [1], to uncover the latent semantics, or directly extend the LDA model for object recognition tasks. Representative attempts include a unified framework for total scene understanding [11], topic-supervised LDA (ts-LDA) and class-specific-simplex LDA (css-LDA) [16], and the visual semantic integration model (VSIM) [4]. Different from the other algorithms, VSIM is in fact a hierarchical model of latent semantic contexts and observed features. It consists of two levels: a semantic level and a visual level. In the semantic level, it uses the pachinko allocation model (PAM) [12] to capture scene semantics. In the visual level, it extends a nearest-neighbor-based LDA (nnLDA) model to represent the observed visual context. VSIM derives a joint inference process over the two levels, and empirically shows appreciable recognition performance even for complex natural images.

For VSIM, the “quality” of the semantics extracted by the upstream semantic level clearly matters for the downstream visual level and the final recognition performance. However, the original VSIM uses the simplest four-level PAM to capture the semantics, which may yield poor semantic information and, in turn, worse recognition performance. To boost VSIM, we replace the simplest four-level PAM with an enhanced PAM, i.e., hierarchical PAM (hPAM) [15]. We develop two variations of hPAM: a version described in [15] and an additional version involving private subtopics and public subtopics. Both enhanced PAMs suit VSIM better on image data, and they can better uncover the semantics behind images. We have evaluated the two modifications against the original VSIM algorithm and several other state-of-the-art algorithms, and the experimental results show that our algorithms obtain better performance. For clarity, the important notations used in this paper are summarized in Table 1.

Table 1 Notation descriptions

The rest of this paper is organized as follows: In Section 2, we introduce the VSIM algorithm. In Section 3, we boost VSIM using two variations of hPAM. Section 4 shows the experimental results. Finally, the conclusions are given in Section 5.

2 VSIM

VSIM [4] is a “big” topic model tailored to image content. It consists of two parts: a semantic level and a visual level. The semantic level uses semantic topics, generated by the PAM algorithm, to represent context. The visual level then uses visual topics, generated by the nnLDA model, to describe observed visual features. The full graphical model of VSIM is shown in Fig. 1, and the details are as follows.

Fig. 1 The graphical model of VSIM

The semantic level. In this level, the simplest four-level PAM [12] is used. This PAM contains a root node, a supertopic level, a subtopic level, and a semantic label level, where adjacent levels are connected with each other. Supertopics are Dirichlet-multinomial \((\theta _{d,s}^{(t)}, \alpha ^{(s)})\) distributions over subtopics, used to represent general semantics. Subtopics are Dirichlet-multinomial \((\phi _{t}^{(l)}, \beta ^{(l)})\) distributions over semantic labels, used to represent more specific semantics. Formally, the generative process is given as follows:

1. For each subtopic t:

    (a) Sample a distribution over semantic labels: \(\phi _{t}^{(l)} \sim Dirichlet\left ({{\beta ^{(l)}}} \right )\)

2. For each image d:

    (a) Sample a distribution over supertopics: \(\theta _{d}^{(s)} \sim Dirichlet({\alpha ^{(0)}})\)

    (b) For each supertopic s:

        i. Sample a distribution over subtopics: \(\theta _{d,s}^{(t)} \sim Dirichlet({\alpha ^{(s)}})\)

    (c) For each of the \(N_{d}\) semantic labels \(l_{d,n}\):

        i. Sample a supertopic \(z_{d,n}^{(s)} \sim Multinomial\left ({\theta _{d}^{(s)}} \right )\)

        ii. Sample a subtopic \(z_{d,n}^{(t)} \sim Multinomial\left ({\theta _{d,z_{d,n}^{(s)}}^{(t)}} \right )\)

        iii. Sample a semantic label \({l_{d,n}} \sim Multinomial\left ({\phi _{z_{d,n}^{(t)}}^{(l)}} \right )\)
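To make this generative process concrete, the following minimal Python sketch samples semantic labels from the four-level PAM. The sizes (S supertopics, T subtopics, L labels) and prior values are illustrative assumptions only, not the settings used in VSIM.

```python
import numpy as np

rng = np.random.default_rng(0)
S, T, L = 4, 8, 20          # supertopics, subtopics, semantic labels (illustrative)
alpha0, beta_l = 1.0, 0.1   # symmetric priors (illustrative values)
alpha_s = np.ones((S, T))   # asymmetric subtopic prior, one row per supertopic

# Subtopic-label distributions: phi_t^{(l)} ~ Dirichlet(beta^{(l)})
phi = rng.dirichlet(np.full(L, beta_l), size=T)

def generate_image(n_labels):
    theta_s = rng.dirichlet(np.full(S, alpha0))             # image's supertopic mix
    theta_t = np.array([rng.dirichlet(alpha_s[s]) for s in range(S)])
    labels = []
    for _ in range(n_labels):
        z_s = rng.choice(S, p=theta_s)                      # sample a supertopic
        z_t = rng.choice(T, p=theta_t[z_s])                 # sample a subtopic
        labels.append(rng.choice(L, p=phi[z_t]))            # sample a semantic label
    return labels

print(generate_image(10))
```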

The visual level. In this level, the nnLDA model builds a bridge between the semantic labels obtained by PAM and the observed labels extracted from image features, i.e., a many-to-many bipartite relation via a nearest neighbor rule. In nnLDA, visual topics are Dirichlet-multinomial distributions over observed labels, and they are used to describe observed visual features. Given a semantic label \(l_{d,n}\) generated by PAM, the generative process of nnLDA is as follows:

1. For each visual topic v:

    (a) Sample a distribution over observed labels: \(\phi _{v}^{(w)} \sim Dirichlet\left ({{\beta ^{(w)}}} \right )\)

2. For each semantic label \(l_{d,n}\):

    (a) Sample a distribution over visual topics: \(\theta _{d,n}^{(v)} \sim Dirichlet(\alpha )\)

    (b) For each of the \(M_{d,n}\) observed labels \(l_{d,n,m}^{(v)}\):

        i. Sample a visual topic \(z_{d,n,m}^{(v)} \sim Multinomial\left ({\theta _{d,n}^{(v)}} \right )\)

        ii. Sample an observed label \(l_{d,n,m}^{(v)} \sim Multinomial\left ({\phi _{z_{d,n,m}^{(v)}}^{(w)}} \right )\)
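The visual level admits an analogous sketch. Given one semantic label, the snippet below generates the observed labels attached to it; again, the sizes and priors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
V, W = 50, 200              # visual topics and observed labels (illustrative sizes)
alpha, beta_w = 1.0, 0.1    # symmetric priors (illustrative values)

# Visual-topic distributions over observed labels: phi_v^{(w)} ~ Dirichlet(beta^{(w)})
phi_v = rng.dirichlet(np.full(W, beta_w), size=V)

def generate_observed(n_observed):
    """Observed labels attached to one semantic label l_{d,n}."""
    theta_v = rng.dirichlet(np.full(V, alpha))              # visual-topic mix
    z = rng.choice(V, size=n_observed, p=theta_v)           # visual topic per token
    return [rng.choice(W, p=phi_v[zv]) for zv in z]

print(generate_observed(5))
```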

3 Boosting VSIM

In the semantic level of VSIM, PAM introduces two topic levels, i.e., supertopics and subtopics, to hierarchically uncover the semantics behind images. However, PAM suffers from two problems: (1) supertopics sometimes prefer a direct connection to semantic labels rather than a full topic path; (2) supertopics may focus on a few “private” subtopics instead of all the subtopics. Although the PAM in VSIM induces a sparse DAG structure via an asymmetric subtopic Dirichlet prior, it lacks any notion of “public” subtopics that describe background semantics.

To address these problems in the semantic level of VSIM, we replace the four-level PAM with hPAM [15] to uncover the semantics behind images. hPAM is more flexible than PAM: every node is either associated with a multinomial distribution over semantic labels or connected with a portion of the nodes in the next level. In this work, we use two variations of hPAM to target the two problems above. One is an existing version (termed hPAM1) suggested in [15], and the other is a future version (termed hPAM2) discussed in [15]. Since the semantic level and the visual level of VSIM are conditionally independent of each other, we only provide the estimation of the semantic level.

3.1 hPAM1

The hPAM1 model mainly addresses the first problem mentioned above. For example, suppose a supertopic represents “house environment”, which contains a subtopic “bedroom”, and this subtopic connects with a semantic label “bed”. In PAM, any observed label “bed” must be generated through the topic path (house environment, bedroom). We argue that this assumption is rigid and redundant, because in practice the supertopic “house environment” sometimes prefers to connect with “bed” directly. The hPAM1 model allows all three upper levels, i.e., the root node, supertopics and subtopics, to generate semantic labels (e.g., “house environment” connecting directly with “bed”). This relaxed assumption is more reasonable for natural images and directly addresses the first problem.

To achieve this goal, hPAM1 introduces an additional variable y indicating which level generates the semantic label. Formally, hPAM1 introduces a level distribution \(\zeta _{s,t}^{(l)}\) for each topic path (s,t), drawn from a symmetric Dirichlet prior \(\gamma ^{(l)}\). To generate a semantic label, we first sample a value from this level distribution to determine which level directly generates it. For example, if \(y_{d,n}=1\), we sample the semantic label from the supertopic distribution, e.g., sampling “bed” from “house environment”. Under this assumption, the generative process of hPAM1 (Fig. 2) for images is given as follows:

1. For the root node, each supertopic and each subtopic i:

    (a) Sample a distribution over semantic labels: \(\phi _{0}^{(r/l)}/\phi _{i}^{(s/l)}/\phi _{i}^{(t/l)} \sim Dirichlet\left ({{\beta ^{(l)}}} \right )\)

2. For each topic path (s,t):

    (a) Sample a level distribution: \(\zeta _{s,t}^{(l)} \sim Dirichlet\left ({{\gamma ^{(l)}}} \right )\)

3. For each image d:

    (a) Sample a distribution over supertopics: \(\theta _{d}^{(s)} \sim Dirichlet({\alpha ^{(0)}})\)

    (b) For each supertopic s:

        i. Sample a distribution over subtopics: \(\theta _{d,s}^{(t)} \sim Dirichlet({\alpha ^{(s)}})\)

    (c) For each of the \(N_{d}\) semantic labels \(l_{d,n}\):

        i. Sample a supertopic \(z_{d,n}^{(s)} \sim Multinomial\left ({\theta _{d}^{(s)}} \right )\)

        ii. Sample a subtopic \(z_{d,n}^{(t)} \sim Multinomial\left ({\theta _{d,z_{d,n}^{(s)}}^{(t)}} \right )\)

        iii. Sample a level indicator \({y_{d,n}} \sim Multinomial\left ({\zeta _{z_{d,n}^{(s)},z_{d,n}^{(t)}}^{(l)}} \right )\)

        iv. If \(y_{d,n}=0/1/2\), sample a semantic label \({l_{d,n}}\sim Multinomial\left ({\phi _{0}^{(r/l)}/\phi _{z_{d,n}^{(s)}}^{(s/l)}/\phi _{z_{d,n}^{(t)}}^{(t/l)}} \right )\)

Fig. 2 The graphical model of hPAM1
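As a concrete illustration of the level indicator, the following minimal Python sketch samples one semantic label under hPAM1. All sizes and prior values are illustrative assumptions, and the per-image mixtures are assumed drawn as in the generative process above.

```python
import numpy as np

rng = np.random.default_rng(0)
S, T, L = 4, 8, 20                                   # illustrative sizes
phi_root = rng.dirichlet(np.full(L, 0.1))            # phi_0^{(r/l)}
phi_super = rng.dirichlet(np.full(L, 0.1), size=S)   # phi_s^{(s/l)}
phi_sub = rng.dirichlet(np.full(L, 0.1), size=T)     # phi_t^{(t/l)}
zeta = rng.dirichlet(np.full(3, 10.0), size=(S, T))  # level dists per path (s, t)

def sample_label(theta_s, theta_t):
    z_s = rng.choice(S, p=theta_s)                   # supertopic
    z_t = rng.choice(T, p=theta_t[z_s])              # subtopic
    y = rng.choice(3, p=zeta[z_s, z_t])              # which level emits the label
    phi = (phi_root, phi_super[z_s], phi_sub[z_t])[y]
    return rng.choice(L, p=phi)

theta_s = rng.dirichlet(np.ones(S))                  # per-image mixtures
theta_t = rng.dirichlet(np.ones(T), size=S)
print([sample_label(theta_s, theta_t) for _ in range(10)])
```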

We use a collapsed Gibbs sampler [9] to train hPAM1. This is achieved by sequentially updating the supertopic assignment \(z_{d,n}^{(s)}\), subtopic assignment \(z_{d,n}^{(t)}\), and level indicator \(y_{d,n}\) of each semantic label by the following rule:

$$\begin{array}{l} P\left( z_{d,n}^{(s)}, z_{d,n}^{(t)}, y_{d,n} \mid I, \alpha^{(0)}, \alpha^{(s)}, \beta^{(l)}, \gamma^{(l)} \right) \propto \\ \quad \frac{N_{-n}^{s/d} + \alpha^{(0)}}{N_{-n}^{d} + S\alpha^{(0)}} \times \frac{N_{-n}^{st/d} + \alpha_{t}^{(s)}}{N_{-n}^{s/d} + {\sum}_{i=0}^{T} \alpha_{i}^{(s)}} \times \frac{N_{-n}^{y/st} + \gamma^{(l)}}{N_{-n}^{st} + 3\gamma^{(l)}} \times \frac{N_{-n}^{l/sty} + \beta^{(l)}}{N_{-n}^{sty} + L\beta^{(l)}} \end{array}$$
(1)

where \(N^{st/d}\) and \(N^{s/d}\) are the numbers of times the topic path (s,t) and the supertopic s occur in image d, respectively; \(N^{d}\) is the total number of semantic labels in image d; \(N^{y/st}\) and \(N^{st}\) are the number of times level indicator y occurs for the topic path (s,t) and the total number of times (s,t) occurs; \(N^{l/sty}\) and \(N^{sty}\) are the number of times semantic label l corresponds to the triple (s,t,y) and the total number of times (s,t,y) occurs; the subscript -n denotes a quantity that excludes the token at position n.
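To make the update concrete, here is a minimal Python sketch of the unnormalized scores of Eq. (1) for a single token. The count-array layout (a dict keyed by the statistics named above) is our own assumption, not VSIM's actual implementation.

```python
import numpy as np

def hpam1_token_scores(counts, l, d, alpha0, alpha_s, beta_l, gamma_l):
    """Unnormalized scores of Eq. (1) for one token whose current assignment
    has already been removed from `counts` (the -n convention)."""
    S, T = alpha_s.shape
    L = counts['l_sty'].shape[0]
    scores = np.zeros((S, T, 3))
    for s in range(S):
        term_s = (counts['s_d'][s, d] + alpha0) / (counts['d'][d] + S * alpha0)
        for t in range(T):
            term_t = ((counts['st_d'][s, t, d] + alpha_s[s, t])
                      / (counts['s_d'][s, d] + alpha_s[s].sum()))
            for y in range(3):
                term_y = ((counts['y_st'][y, s, t] + gamma_l)
                          / (counts['st'][s, t] + 3 * gamma_l))
                term_l = ((counts['l_sty'][l, s, t, y] + beta_l)
                          / (counts['sty'][s, t, y] + L * beta_l))
                scores[s, t, y] = term_s * term_t * term_y * term_l
    return scores

# Sampling the new assignment: flatten, normalize, and draw one index.
# flat = scores.ravel()
# idx = np.random.default_rng().choice(flat.size, p=flat / flat.sum())
# z_s, z_t, y = np.unravel_index(idx, scores.shape)
```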

Finally, the distributions \(\zeta _{s,t}^{(l)}\) and \(\phi _{0}^{(r/l)}/\phi _{i}^{(s/l)}/\phi _{i}^{(t/l)}\) can be estimated as follows:

$$ \zeta_{s,t,y}^{(l)} = \frac{N^{y/st} + \gamma^{(l)}}{N^{st} + 3\gamma^{(l)}} $$
(2)
$$ \phi_{0,l}^{(r/l)}/\phi_{s,l}^{(s/l)}/\phi_{t,l}^{(t/l)} = \frac{N^{l/sty} + \beta^{(l)}}{N^{sty} + L\beta^{(l)}}\quad \text{if}\; y = 0/1/2 $$
(3)

Note that the asymmetric Dirichlet prior \(\alpha ^{(s)}\) is used to capture the sparse relations between supertopics and subtopics, so we need to optimize this prior during model training. Following [4], we use the moment matching method to estimate the approximate MLE of \(\alpha ^{(s)}\).
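For reference, below is a minimal sketch of one common moment-matching estimator for an asymmetric Dirichlet prior. It assumes the per-image subtopic proportions are available as a matrix; the estimators used in [4, 12] may differ in details.

```python
import numpy as np

def moment_match_dirichlet(props, eps=1e-8):
    """Approximate MLE of an asymmetric Dirichlet prior by moment matching.

    `props` is an (n_images, K) array of per-image subtopic proportions
    under a given supertopic.
    """
    m = props.mean(axis=0)                       # per-component means
    v = props.var(axis=0) + eps                  # per-component variances
    # per-component precision estimates, combined in log space
    s_k = m * (1.0 - m) / v - 1.0
    s = np.exp(np.log(np.clip(s_k, eps, None)).mean())
    return s * m                                 # alpha_k = s * m_k

rng = np.random.default_rng(0)
print(moment_match_dirichlet(rng.dirichlet(np.ones(8), size=100)))
```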

3.2 hPAM2

The hPAM2 model mainly addresses the second problem mentioned above. On one hand, in PAM the supertopic distribution over subtopics is in fact sparse; that is, for each supertopic, some subtopics typically receive very low probabilities. For example, suppose the supertopic “house environment” connects with two subtopics, “bedroom” and “road”. Obviously, “house environment” strongly prefers “bedroom” over “road”. The hPAM2 model captures this by introducing the concept of a private subtopic, where each supertopic samples only from its own private subtopics, e.g., “bedroom” is a private subtopic of “house environment”. In this setting, supertopics are kept away from less relevant subtopics. On the other hand, some subtopics represent background semantics. In contrast to private subtopics, these subtopics spread across all the supertopics; for example, the subtopic “weather” might be popular for almost all supertopics. Based on this analysis, hPAM2 further introduces the concept of a public subtopic, which is shared by all the supertopics.

Formally, hPAM2 makes two assumptions: (1) each supertopic has P private subtopics, which describe the sparse structure between supertopics and subtopics; (2) all the supertopics share R public subtopics, which describe the background semantics. From the generative perspective, hPAM2 introduces a private/public Bernoulli distribution \(\zeta _{d}^{(p)}\) for each image d, drawn from a Beta prior \(\gamma ^{(p)}\). This distribution determines whether to generate a private subtopic through its parent supertopic or to generate a public subtopic directly. The generative process of hPAM2 (Fig. 3) is given as follows:

1. For each private or public subtopic t:

    (a) Sample a distribution over semantic labels: \(\phi _{t}^{(l)} \sim Dirichlet\left ({{\beta ^{(l)}}} \right )\)

2. For each image d:

    (a) Sample a distribution over supertopics: \(\theta _{d}^{(s)}\sim Dirichlet({\alpha ^{(0)}})\)

    (b) Sample a distribution over public subtopics: \(\theta _{d}^{(g)}\sim Dirichlet({\alpha ^{(g)}})\)

    (c) Sample a private/public distribution: \(\zeta _{d}^{(p)} \sim Dirichlet\left ({{\gamma ^{(p)}}} \right )\)

    (d) For each supertopic s:

        i. Sample a distribution over subtopics: \(\theta _{d,s}^{(p/t)}\sim Dirichlet(\alpha _{s}^{(p/t)})\)

    (e) For each of the \(N_{d}\) semantic labels \(l_{d,n}\):

        i. Sample an indicator \({\delta _{d,n}} \sim {\text {Bernoulli}}\left ({\zeta _{d}^{(p)}} \right )\)

        ii. If \(\delta _{d,n}=1\): sample a supertopic \(z_{d,n}^{(s)} \sim Multinomial\left ({\theta _{d}^{(s)}} \right )\), then sample a subtopic \(z_{d,n}^{(t)} \sim Multinomial\left ({\theta _{d,z_{d,n}^{(s)}}^{(p/t)}} \right )\), and finally sample a semantic label \({l_{d,n}}\sim Multinomial\left ({\phi _{z_{d,n}^{(t)}}^{(l)}} \right )\)

        iii. If \(\delta _{d,n}=0\): sample a public subtopic \(z_{d,n}^{(g)} \sim Multinomial\left ({\theta _{d}^{(g)}} \right )\), and then sample a semantic label \({l_{d,n}}\sim Multinomial\left ({\phi _{z_{d,n}^{(g)}}^{(l)}} \right )\)

Fig. 3 The graphical model of hPAM2
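The private/public switch can be illustrated by the following minimal Python sketch; the sizes and prior values are illustrative assumptions, and the per-image quantities are drawn as in steps 2(a)-(d) above.

```python
import numpy as np

rng = np.random.default_rng(0)
S, P, R, L = 4, 3, 2, 20                                 # illustrative sizes
phi_priv = rng.dirichlet(np.full(L, 0.1), size=(S, P))   # private subtopic-label dists
phi_pub = rng.dirichlet(np.full(L, 0.1), size=R)         # public subtopic-label dists

# per-image quantities, drawn as in steps 2(a)-(d)
theta_s = rng.dirichlet(np.ones(S))
theta_pub = rng.dirichlet(np.full(R, 0.1))
theta_priv = rng.dirichlet(np.ones(P), size=S)
zeta_p = rng.beta(10.0, 10.0)                            # P(delta = 1)

def sample_label():
    if rng.random() < zeta_p:                            # delta = 1: private route
        s = rng.choice(S, p=theta_s)
        t = rng.choice(P, p=theta_priv[s])               # one of s's private subtopics
        return rng.choice(L, p=phi_priv[s, t])
    g = rng.choice(R, p=theta_pub)                       # delta = 0: public (background)
    return rng.choice(L, p=phi_pub[g])

print([sample_label() for _ in range(10)])
```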

A collapsed Gibbs sampler is also used for hPAM2. The updating rule with respect to the supertopic assignment \(z_{d,n}^{(s)}\), private subtopic assignment \(z_{d,n}^{(t)}\), public subtopic assignment \(z_{d,n}^{(g)}\) and private/public indicator \(\delta_{d,n}\) is given as follows:

$$\begin{array}{l} P\left( z_{d,n}^{(s)}, z_{d,n}^{(t)}, z_{d,n}^{(g)}, \delta_{d,n} \mid I, \alpha^{(0)}, \alpha^{(g)}, \alpha^{(p/s)}, \beta^{(l)}, \gamma^{(p)} \right) \propto \\ \quad \frac{N_{-n}^{\delta/d} + \gamma^{(p)}}{N_{-n}^{d} + 2\gamma^{(p)}} \times \frac{N_{-n}^{s/d} + \alpha^{(0)}}{N_{-n}^{pd} + S\alpha^{(0)}} \times \frac{N_{-n}^{st/d} + \alpha_{s,t}^{(p/s)}}{N_{-n}^{s/d} + {\sum}_{i=0}^{P} \alpha_{s,i}^{(p/s)}} \times \frac{N_{-n}^{t/d} + \alpha^{(g)}}{N_{-n}^{gd} + R\alpha^{(g)}} \times \frac{N_{-n}^{l/t} + \beta^{(l)}}{N_{-n}^{t} + L\beta^{(l)}} \end{array}$$
(4)

where \(N^{\delta/d}\) and \(N^{d}\) are the number of semantic labels assigned indicator δ and the total number of semantic labels in image d, respectively; \(N^{pd}\) is the total number of semantic labels generated by private subtopics in image d; \(N^{t/d}\) and \(N^{gd}\) are the number of times public subtopic t occurs and the total number of semantic labels generated by public subtopics in image d.

Similar to hPAM1, the distributions \(\zeta _{d}^{(p)}\) and \(\phi _{t}^{(l)}\) of hPAM2 are obtained as:

$$ \zeta_{d}^{(p)} = \frac{N^{\delta/d} + \gamma^{(p)}}{N^{d} + 2\gamma^{(p)}} $$
(5)
$$ \phi_{t,l}^{(l)} = \frac{N^{l/t} + \beta^{(l)}}{N^{t} + L\beta^{(l)}} $$
(6)

We also use the moment matching method to optimize the asymmetric Dirichlet prior \(\alpha ^{(p/s)}\).

4 Experiment

In this section, we evaluate the performance of the proposed algorithms on two diverse image datasets: Scene-15 [6] and SUN09 [5].

4.1 Experimental setting

Datasets: The Scene-15 dataset contains 4,485 images in total, grouped into fifteen scene classes, of which thirteen were collected by [6] and two by [10]. Each class has 200 to 400 images, and the average image size is about 300×250 pixels.

The SUN09 dataset contains 8,600 natural indoor and outdoor images in total. On average, each image is annotated with seven different objects, and each object covers about 5% of the image area. Following [4], we consider the 200 most frequent classes, and use 4,367 images for training and 4,317 images for testing. During training, we use the ground-truth locations and labels pre-assigned in the dataset. During testing, we use the bounding boxes detected by the DPM detector [7].

For Scene-15, we use two types of features to represent images: texton histograms built on a codebook of 100 textons obtained from a 40-filter bank, and visual word histograms built on a codebook of 200 visual words generated by the dense SIFT descriptor. For SUN09, we use three types of features, as described in [14]: the two histograms above and normalized R, G, B color histograms with their means and variances.

Model parameters: Since the focus of this work is the semantic level of VSIM, we use the same number of visual topics (i.e., A=50) for nnLDA as suggested in [4]. For hPAM1, we define 20 supertopics and 50 subtopics; the three symmetric Dirichlet priors are set as \(\alpha^{(0)}=1\), \(\beta^{(l)}=0.1\) and \(\gamma^{(l)}=10\); the asymmetric subtopic Dirichlet prior \(\alpha^{(s)}\) is learnt during model training. For hPAM2, we assume 20 supertopics, each with 5 private subtopics, plus 10 public subtopics (i.e., 110 subtopics in total); the four symmetric priors are set as \(\alpha^{(0)}=1\), \(\alpha^{(g)}=0.1\), \(\beta^{(l)}=0.1\) and \(\gamma^{(p)}=10\); again, the private subtopic Dirichlet prior \(\alpha^{(p/s)}\) is learnt during model training. For the Gibbs sampler, we run five independent MCMC chains with a burn-in of 500 iterations for each model, and finally report the averaged results.
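For reproducibility, the settings above can be collected in a single configuration, e.g., as a small Python mapping (the key names here are our own, hypothetical ones, not from [4] or [15]):

```python
# Hyperparameters from Section 4.1; key names are illustrative.
CONFIG = {
    "visual_topics": 50,          # A, as in VSIM [4]
    "hpam1": {"supertopics": 20, "subtopics": 50,
              "alpha0": 1.0, "beta_l": 0.1, "gamma_l": 10.0},
    "hpam2": {"supertopics": 20, "private_subtopics": 5, "public_subtopics": 10,
              "alpha0": 1.0, "alpha_g": 0.1, "beta_l": 0.1, "gamma_p": 10.0},
    "gibbs": {"chains": 5, "burn_in": 500},
}
```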

4.2 Performance

We evaluate our algorithms on the same tasks as in [4]. Hereafter, we refer to our algorithms as hVSIM1 and hVSIM2.

4.2.1 Semantic Scene Prediction

We evaluate the scene detection performance with respect to the ground truth on the SUN09 dataset. As described in [4], we estimate the ground-truth distribution by grounding the semantic labels with the ground-truth labels and inferring supertopics and subtopics from the semantic level. We use the symmetric Kullback-Leibler (KL) divergence between the subtopic distributions as the performance metric; a lower KL divergence implies better performance.
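A minimal sketch of the metric is given below, assuming the symmetrized sum KL(p||q)+KL(q||p); [4] may use a variant such as the average of the two terms.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two discrete distributions (lower is better)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

print(symmetric_kl([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```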

We use two state-of-the-art models as baselines, i.e., VSIM and the total scene understanding model (TSU) proposed by [11]. The results are shown in Fig. 4. Clearly, the KL divergences of hVSIM1 and hVSIM2 are lower than those of the original VSIM and TSU. hVSIM1 performs slightly better than hVSIM2, which may be because the public subtopics of hVSIM2 are neglected in this evaluation. In summary, our modifications match the ground-truth labels more closely.

Fig. 4 KL divergence between the estimated distributions and the ground truth on SUN09

4.2.2 Predicting Top-N Labels

In this experiment, we evaluate the proportion of the estimated top-N labels present in the ground truth on the SUN09 dataset. VSIM and hcontext [5] are used as baselines. For this evaluation, a higher proportion indicates better performance.

We run this experiment for \(N \in \left \{ {1,2,3,4,5} \right \}\). The results are shown in Fig. 5. hVSIM2 performs better than the other algorithms in all settings of N. hVSIM1 also outperforms VSIM and hcontext in most settings (4 out of 5 each). Note that hVSIM2 appears more robust than the others: unlike VSIM, it considers private and public subtopics and can therefore capture both local and background semantics, which is more reasonable for natural scenes. The empirical results support this view.
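For concreteness, here is a minimal sketch of how such a top-N presence proportion could be computed; this is our own reading of the metric, and [4, 5] may define it differently.

```python
import numpy as np

def topn_presence(scores, truth, n):
    """Fraction of the n highest-scoring predicted labels found in the ground truth."""
    top = np.argsort(scores)[::-1][:n]
    truth = set(truth)
    return sum(label in truth for label in top) / n

print(topn_presence(np.array([0.1, 0.5, 0.2, 0.9]), truth=[3, 1], n=2))  # -> 1.0
```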

Fig. 5 Performance of top-N prediction on SUN09

4.2.3 Object Detection

Finally, we evaluate the object detection performance of our two algorithms on both the Scene-15 and SUN09 datasets.

For Scene-15, we use 40, 60, 80 and 100 images per class for training, respectively, and in each case the remaining images are used for testing. Apart from VSIM, two other existing algorithms, i.e., sparse coding (SC) [2] and localized soft-assignment coding (LSC) [13], are used as baselines. Figure 6 shows the classification accuracy for the different numbers of training images per class. We observe that our algorithms perform better than the baselines in most cases, and hVSIM2 is even better than hVSIM1; both outperform VSIM in all settings. With 100 training images per class, hVSIM2 is almost on par with SC and LSC, and hVSIM1 is slightly below them. As the number of training images decreases, our algorithms outperform SC and LSC. This indicates that our algorithms are robust and can achieve competitive performance with fewer training images, which is very helpful for small-sample cases in practice.

Fig. 6 The performance of object detection on Scene-15

For SUN09, as in [4], we sort the object classes by size from largest to smallest, and report the accuracy averaged over every 25 object classes. We compare our algorithms with VSIM and the DPM detector proposed in [7]. Figure 7 illustrates the results. Clearly, hVSIM1 and hVSIM2 show similar performance, and both outperform VSIM in most settings. DPM performs a bit better than the three VSIM-based algorithms on large classes, but it seems overly sensitive to the size of the training set: on relatively small, rare classes, our two algorithms significantly outperform DPM. Overall, hVSIM1 and hVSIM2 show a smoother trend and are competitive with VSIM and DPM.

Fig. 7 The performance of object detection on SUN09

5 Conclusion

In this paper, we investigate the problem of naming objects in complex natural images. We attempt to boost the recent VSIM algorithm, which is composed of a semantic level, i.e., a four-level PAM, and a visual level, i.e., nnLDA. Our focus is on the semantic level, where we use two variations of hPAM to replace the simple four-level PAM. The first allows direct connections between supertopics and semantic labels; the second takes local and background semantics into account. Both are more flexible and powerful in capturing the semantics behind natural images.

We compare the proposed algorithms with the original VSIM and several other state-of-the-art algorithms on the Scene-15 and SUN09 datasets. Empirical results indicate that our modifications are more robust and achieve competitive performance on the basic tasks, e.g., top-N prediction and object detection.

In the future, we will explore our algorithms on other popular and challenging image collections and tasks.