1 Introduction

Image classification is the task of assigning images to categories according to the objects they contain. One of the main challenges in designing a classification system is to develop an appropriate method of image representation [2, 14, 16, 28, 29, 34, 36]. Bag-of-words (BoW) has been widely used as the image representation method for image classification [14, 34, 36].

Fig. 1 Spatial layout for image representation. Numbers are indices of regions to be matched in two images. a Example images, b SPM layout, c proposed layout

In the traditional BoW framework, each image is represented as a histogram of word frequency by assigning all local features to visual words. This model is insensitive to scale and illumination change, but lacks spatial information. Hence, pyramid structure representations such as spatial pyramid matching (SPM) [14] have been used to extend the global BoW representation by partitioning images into progressively finer sub-regions. SPM computes a histogram of word frequency within each sub-region, and concatenates all the histograms to form the final image representation. However, SPM suffers from degraded classification accuracy when object locations vary among images of the same class (Fig. 1a). If the object locations in images differ, the spatial partition of SPM (Fig. 1b) may fail to align them. This problem can be solved by partitioning the images separately into object and background areas.

In this paper, we propose a class-specific image representation (Fig. 1c) to match objects and background areas more accurately. For the object area, we define two kinds of region: the class-specific region-of-interest (C-ROI) and the focal region (FR). The C-ROI is defined as a region that can be found in all images of the same class (Fig. 1c, Region 2); it covers nearly the entire object of interest. The FR is defined as the most informative region in the C-ROI, i.e., the most informative part of the object of interest (Fig. 1c, Region 1). For the background area, we match the background scene over the region that remains after the object area is excluded; this region is divided horizontally into two sub-regions (Fig. 1c, Regions 3 and 4) to model the background scene. By concatenating feature vectors of each region extracted in this way, we construct an image representation that is more class-specific to objects of interest than traditional SPM.

To extract the C-ROI and FR, we use multiple region detectors. Classification accuracy usually depends on the region detector used, because different detectors have different characteristics, and the detector most suitable for classification depends on the image class given. In this paper we use four scale-invariant region detectors: DoG [22], Harris-Laplace [24], Hessian-Laplace [25], and salient [12]. These region detectors capture a variety of characteristic information such as blob-like, corner-like, and entropy-based features; hence, combining the detectors' outputs can be helpful in classifying a variety of objects. To extract C-ROIs in images of a class, we measure the similarity of information obtained from the multiple region detectors; this similarity is computed from the spatial distribution and appearance characteristics of the detector responses.

The characteristics of images in various image classes can vary widely, which means that different information is required to describe images of different classes. For example, for faces (Fig. 1), the eye regions may be the most informative regions for describing the class, so capturing blob-like structures is quite useful. Therefore, in this paper, the most informative region of a class, i.e., the FR, is obtained by considering the class-specific importance of each region detector for the class. The class-specific importance of a region detector indicates how strongly the region detector affects extraction of C-ROIs.

The proposed method to construct a class-specific representation consists of four steps (Fig. 2). Given training images of a class, step 1 extracts C-ROIs in them; these C-ROIs are defined by the similarity of information obtained from multiple region detectors. When the C-ROIs are extracted, the class-specific importance of each region detector is also computed to indicate which region detectors dominate the extraction of C-ROIs. Step 2 extracts the spatial distribution of keypoints extracted by each region detector in the C-ROIs. We use nonnegative matrix factorization (NMF) to obtain the semantic spatial distribution for each region detector. Step 3 finds the FR of the class by summing the semantic spatial distributions of the region detectors weighted by the class-specific importance, and by thresholding. Step 4 defines spatial pooling regions, and concatenates encodings of each spatial region to form the final image representation. Here, the encoding of each spatial region is the BoW representation of the features obtained in that region.

2 Related work

Image representation has been a central problem in various tasks such as image classification and retrieval [14, 16, 17, 34, 36]. For image classification, the pyramid structure representation based on Bag-of-words (BoW) [14, 34, 36] has been widely used. This representation extends the global BoW representation and models approximate geometric layout by partitioning the image plane into progressively finer sub-regions; this procedure has become standard in the image classification task. Yang et al. [36] and Wang et al. [34] proposed extensions of the SPM approach [14] that compute a pyramid image representation based on effective coding schemes, instead of the k-means vector quantization in SPM. These extensions obtained better classification accuracy than traditional SPM, and attained state-of-the-art accuracy on some benchmarks. However, if corresponding object locations and scene layout differ among images, these methods also suffer from misalignment.

Fig. 2 Framework of proposed method for class-specific image representation. Steps are described in Sect. 3

The misalignment problem between objects in images can be solved by object-centered representations. The part-based approach [5–8] and the interest region-based approach [1, 9, 11, 26, 31, 35] are two widely used object-centered representations. The part-based approach represents an object as a spatial layout of multiple parts, where the deformable configuration is characterized by spring-like connections between the parts [8] or by a joint Gaussian density of the part locations obtained from a random constellation [6]. Other methods [5, 7] built on spring-like connections introduce many local ambiguities and support only a limited number of parts. The disadvantage of existing part-based models is that they depend heavily on the representations of each part. The interest region-based approaches represent an image by focusing on a specific interest region. Galleguillos et al. [9] found interest regions for image classification by incorporating multiple stable segmentations and a Bag-of-features (BoF) image representation into a multiple instance learning (MIL) framework. Chai et al. [1] proposed segmenting images into foreground and background within a co-segmentation scenario to improve image classification accuracy. Yakhnenko et al. [35] used a latent-SVM model, which uses all regions to score an image and associates each region with a latent variable that indicates whether or not the region represents the object of interest. Nguyen [26] used segment-based support vector machines, which simultaneously localize the most discriminative set of segments and use them to learn an SVM. However, all of these methods are based on segments, and are sensitive to the segmentation result. Recently, some studies [11, 31] presented methods based on a saliency map for the interest region. Sharma [31] proposed a method to learn the discriminative spatial saliency of images while simultaneously learning a max-margin classifier for a given visual classification task, but that work focused mainly on image classes like 'riding horse' in which the spatial relation between a person and an object ('horse') is important information for obtaining the saliency map. Jiang et al. [11] used supervised learning to map a regional feature vector to a saliency score that yielded the saliency map. Because that method was evaluated only in terms of salient object detection, not image classification, the interest region obtained from it has not been shown to be effective for image classification.

The proposed method also aims to find interest regions for an object-centered representation. Unlike most existing region-based approaches, which obtain interest regions from segmentation results or saliency maps, our method exploits scale-invariant region detectors to model the interest regions. Scale-invariant region detectors such as DoG, Harris-Laplace, Hessian-Laplace, and salient can capture important information (e.g., blob-like, corner-like, entropy-based) in images. Traditionally, region detectors have been used to extract keypoints for image matching and object class recognition. Some previous studies [6, 23] proposed frameworks that used scale-invariant region detectors to classify images. Fergus et al. [6] used the salient detector to construct a probabilistic representation. Mikolajczyk et al. [23] compared the classification accuracy of local detectors and descriptors in the context of object class recognition. However, the question of which region detectors are most effective for a specific class is seldom discussed. In this paper, we quantify the effect of each region detector for a specific class and use these effects in class-specific modeling.

Nonnegative matrix factorization (NMF) [15] is an effective factor analysis method. It aims to find two nonnegative matrices whose product provides a good approximation to the original matrix. It is well suited to learning the parts of objects because the nonnegativity constraints allow only additive combinations. Methods based on NMF and its variants have been applied to various tasks such as feature selection and data dimension reduction [18–21]. In this paper, we use NMF to obtain the semantic spatial information of keypoints extracted by each region detector; this semantic spatial information is used to construct the class-specific representation.

3 Proposed method

In this section, we propose a framework that uses multiple scale-invariant region detectors for class-specific image representation. For the class-specific object area, we define two kinds of region, i.e., the C-ROI (Sect. 3.2) and the FR (Sect. 3.3); a class-specific image representation (Sect. 3.4) is obtained by using these two regions.

3.1 Region detectors

In this paper, we use four different scale-invariant region detectors to obtain class-specific spatial layouts: the DoG, Harris-Laplace, Hessian-Laplace, and salient detectors. The region detectors provide the locations and scales of keypoints, and capture different kinds of information: the DoG and Hessian-Laplace detectors are suitable for finding blob-like structures; the Harris-Laplace detector captures corner-like structures; and the salient detector extracts regions that have high entropy (or information). These region detectors have been successfully used in object classification. In the preprocessing step, we use these detectors to extract keypoints and their local information (locations and scales of the local regions) for all training images.
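As a concrete illustration, the sketch below extracts keypoints with their locations and scales using OpenCV; this is a minimal sketch under stated assumptions, not the authors' implementation. OpenCV's SIFT detector is DoG-based, and a Harris-Laplace detector is available in the opencv-contrib package; Hessian-Laplace and the Kadir-Brady salient detector have no stock OpenCV implementation and would have to come from other tools, so only two of the four detectors are shown.

```python
# Minimal sketch of the keypoint-extraction preprocessing step (assumption:
# OpenCV stands in for the detectors; only DoG and Harris-Laplace are shown).
import cv2

def detect_keypoints(gray):
    """Return {detector_name: [(x, y, scale), ...]} for one grayscale image."""
    detectors = {
        "dog": cv2.SIFT_create(),  # SIFT keypoints come from a DoG pyramid
        "harris_laplace": cv2.xfeatures2d.HarrisLaplaceFeatureDetector_create(),
    }
    out = {}
    for name, det in detectors.items():
        kps = det.detect(gray, None)
        # kp.size is the diameter of the keypoint's local region (its scale)
        out[name] = [(kp.pt[0], kp.pt[1], kp.size) for kp in kps]
    return out

# usage:
# gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)
# keypoints = detect_keypoints(gray)
```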

3.2 Class-specific region-of-interest (C-ROI)

The C-ROI is a region that can be commonly found in images of the same class. To find this region, we first obtain candidate C-ROIs at different scales and locations. The spatial distribution and appearance characteristics of the region detectors are used to select the C-ROI among the candidates (Algorithm 1).

Algorithm 1: C-ROI extraction and selection

3.2.1 Candidate C-ROIs

To extract C-ROIs in images of the same class, we must first identify candidate C-ROIs in the images. In many previous studies, candidate regions were obtained from all possible locations and scales, so tens of thousands of candidates may be identified. Although accurate target regions could be obtained by considering all possible candidates, this approach is impractical because it entails huge computational cost. Therefore, we reduce the number of candidate C-ROIs by choosing a limited number. In experiments, we observed that if extracted keypoints are concentrated in a region, the region is worth considering closely. This means that we must compute the density of the 2-dimensional distribution of keypoints and find its local peaks. To do this, we apply the four region detectors to every training image of the same class. For each image, we superimpose the four kinds of detected keypoints on one image, then use the MeanShift algorithm [3] to identify local peaks of the keypoint distribution for use as locations of candidate C-ROIs. To increase the accuracy of locating the C-ROI, we add some extra locations around the local peaks (in our work, at \(\pm\)10 and \(\pm\)20 pixels from the local peaks). To maintain classification accuracy even when the C-ROIs vary in size, we use three different sizes of candidate C-ROI at each candidate location (in our work, 0.5, 0.7, and 0.9 times the image's width and height). Given a set of images I = \(\{ x_{1},x_{2},\ldots ,x_{N} \}\) in a class, we obtain a collection of \(M_{i}\) candidate C-ROIs in each image \(x_{i}\); this collection is denoted as \(P_{i} = \{ P_{i1},P_{i2},\ldots ,P_{iM_{i}} \}\). Even though only a small number of candidate regions is considered, the obtained candidate C-ROIs cover most objects of interest in images (Fig. 3).
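The sketch below illustrates this candidate-generation step, assuming keypoints from all four detectors have been pooled into an (n, 2) coordinate array. The MeanShift bandwidth and the exact offset pattern around each peak are assumptions of the sketch; the text specifies only the \(\pm\)10 and \(\pm\)20 px offsets and the three size factors.

```python
# Minimal sketch of candidate C-ROI generation (Sect. 3.2.1).
import numpy as np
from sklearn.cluster import MeanShift

def candidate_crois(points, img_w, img_h, bandwidth=30):
    """points: (n, 2) pooled keypoint locations; returns (cx, cy, w, h) boxes."""
    peaks = MeanShift(bandwidth=bandwidth).fit(points).cluster_centers_
    # extra candidate locations at +/-10 and +/-20 px around each peak
    # (axis-aligned offsets are an assumption of this sketch)
    offs = [(0, 0)] + [(d, 0) for d in (-20, -10, 10, 20)] \
                    + [(0, d) for d in (-20, -10, 10, 20)]
    boxes = []
    for (px, py) in peaks:
        for (dx, dy) in offs:
            for s in (0.5, 0.7, 0.9):          # three candidate sizes
                boxes.append((px + dx, py + dy, s * img_w, s * img_h))
    return boxes
```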

The next step is to select a C-ROI among the \(M_{i}\) candidate C-ROIs for each image. To do this, we define two features for each candidate C-ROI: a spatial histogram \(S^{P_{ik}}_{f}\) and an appearance histogram \(A^{P_{ik}}_{f}\), where the index f denotes a region detector. Using these two features, we select for each image the C-ROI that gives the best matching score with candidate C-ROIs from other images of the same class. Although finding a C-ROI for each image is nearly exhaustive matching, it is practical because only a small number of candidate C-ROIs is considered in this process.

Fig. 3 Obtained candidate C-ROIs (green rectangles) in some images of Caltech-4

3.2.2 Spatial histogram \(S^{P_{ik}}_{f}\)

C-ROIs in images of a class should show similar distributions of keypoints extracted by region detectors. To describe this similarity, we compute spatial histogram \(S^{P_{ik}}_{f}\) (Fig. 4) of keypoints detected using region detector f within each candidate C-ROI \(P_{ik}\).

Given a candidate C-ROI \(P_{ik}\), \(k \in M_i\), in an image, \(P_{ik}\) is decomposed into \(N_R\) regular small sub-regions (in our work, \(N_R\) is set to 100, i.e., a \(10 \times 10\) grid). For a region detector f in \(P_{ik}\), a spatial histogram \(S^{P_{ik}}_{f}\) with \(N_R\) bins is constructed by counting the number of keypoints detected using f in each bin. The value of bin \(r \in N_{R}\) is computed by

$$\begin{aligned} S^{P_{ik}}_{f}(r) = \sum _{ e \in P_{ik} } \delta (e \in R_r), \end{aligned}$$
(1)

where e is a keypoint extracted using f, \(R_r\) is the sub-region that corresponds to bin r, and \(\delta (P)\) returns 1 if P is true and 0 otherwise. The value of each bin is normalized by the maximum value of the spatial histogram.
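The sketch below computes Eq. (1) for one detector's keypoints inside a candidate C-ROI given as a top-left-anchored box; the \(10 \times 10\) grid and max-normalization follow the text, while the box convention is an assumption of the sketch.

```python
# Minimal sketch of the spatial histogram of Eq. (1).
import numpy as np

def spatial_histogram(kps, box, grid=10):
    """kps: (n, 2) keypoint locations of one detector; box: (x0, y0, w, h)."""
    x0, y0, w, h = box
    hist = np.zeros((grid, grid))
    for (x, y) in kps:
        if x0 <= x < x0 + w and y0 <= y < y0 + h:
            c = min(int((x - x0) / w * grid), grid - 1)   # column bin
            r = min(int((y - y0) / h * grid), grid - 1)   # row bin
            hist[r, c] += 1
    # normalize by the maximum bin value, as in the text
    return (hist / hist.max() if hist.max() > 0 else hist).ravel()
```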

Fig. 4 Conceptual illustration of spatial histogram and appearance histogram for a given candidate C-ROI

3.2.3 Appearance histogram \(A^{P_{ik}}_{f}\)

To obtain the appearance histogram from each candidate C-ROI \(P_{ik}\), we use the Bag-of-words (BoW) representation, for which denseSIFT [14] features are extracted from subsampled training images and a codebook is constructed from them using k-means clustering as a preprocessing step. Only one codebook is constructed for all classes. Once the codebook is prepared, all that is required to compute the appearance histogram is to collect the denseSIFT features in \(P_{ik}\) and use the codebook to generate a histogram. To extract the individual characteristics of each of the four region detectors used in this paper, we modify \(P_{ik}\) in two aspects. First, instead of using the whole region of \(P_{ik}\) to obtain the histogram, we exclude sub-regions in which the number of keypoints detected using f is very small. Here, the sub-regions are the same grid cells used for spatial histogram computation; i.e., we use only the sub-regions that are informative enough for each f. Second, we modify the size s and location l of each remaining sub-region used for histogram computation. Instead of using a fixed-size sub-region at the regular location, we relocate each remaining sub-region to the average location of all the keypoints detected in it. We also set its size to \(s = k_s \overline{\sigma _s}\), where \(\overline{\sigma _s}\) is the average scale of all detected keypoints in the sub-region. Here, \(k_s\) is set to 8.

We thus obtain four histograms \(A^{P_{ik}}_{f}\) for each \(P_{ik}\) (Fig. 4). Notice that we use denseSIFT features for histogram computation; the four region detectors are used only to define the modified sub-regions in each \(P_{ik}\).
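A minimal sketch of this computation follows, assuming the dense SIFT features have already been assigned to codebook words (a list of (x, y, word_id) triples) and that each detector's keypoints carry (x, y, scale). The minimum keypoint count for an informative sub-region and the final normalization are not specified in the text and are assumptions here.

```python
# Minimal sketch of the appearance histogram A^{P_ik}_f (Sect. 3.2.3).
import numpy as np

def appearance_histogram(words, kps, box, grid=10, k_s=8, min_kps=3, K=1024):
    """words: [(x, y, word_id)] dense-SIFT word assignments for the image;
    kps: (n, 3) rows of (x, y, scale) for one detector; box: (x0, y0, w, h)."""
    x0, y0, w, h = box
    hist = np.zeros(K)
    for r in range(grid):
        for c in range(grid):
            gx0, gy0 = x0 + c * w / grid, y0 + r * h / grid
            inside = [(x, y, s) for (x, y, s) in kps
                      if gx0 <= x < gx0 + w / grid and gy0 <= y < gy0 + h / grid]
            if len(inside) < min_kps:       # skip uninformative sub-regions
                continue
            # relocate the sub-region to the mean keypoint location and set
            # its side length to k_s times the mean keypoint scale
            mx, my, ms = np.mean(np.array(inside), axis=0)
            half = k_s * ms / 2.0
            for (x, y, wid) in words:
                if mx - half <= x < mx + half and my - half <= y < my + half:
                    hist[int(wid)] += 1
    return hist / max(hist.sum(), 1.0)      # L1 normalization (assumption)
```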

3.2.4 C-ROI selection

The C-ROI is a region that can be commonly found in images of the same class. To select one region among the candidate C-ROIs \(P_{ik}\), \(k \in M_i\), in an image \(x_i\), the similarity of each \(P_{ik}\) over the class must be measured. For a candidate C-ROI \(P_{ik}\), the closest distance to the candidate C-ROIs in another image \(x_{j}\) is defined as

$$\begin{aligned} Z_{ik}^{j} = \min _{l \in M_{j}} Z(P_{ik},P_{jl}). \end{aligned}$$
(2)

The distance \(Z(P_{ik},P_{jl})\) between two regions is

$$\begin{aligned} Z(P_{ik},P_{jl}) = \min _{f \in F} \left[ \alpha D(S^{P_{ik}}_{f},S^{P_{jl}}_{f}) + (1-\alpha ) D(A^{P_{ik}}_{f},A^{P_{jl}}_{f}) \right], \end{aligned}$$
(3)

where F is the set of region detectors, \(\alpha\) is the trade-off parameter between the two types of information (in our case, \(\alpha = 0.5\)), and D computes the \(\chi ^2\) distance between two histograms.

To measure the similarity of a candidate C-ROI over a class, we use the Multi-ranking Amalgamation Strategy [10]. For region k in \(x_{i}\), we sort \(Z_{ik}^{j}\) over all j and keep the W smallest values, denoted \(B_{ik} = \{ B_{ik}^1,B_{ik}^2,\ldots ,B_{ik}^W \}\) with \(B_{ik}^m < B_{ik}^n\) for \(m < n\). The score of region k as a C-ROI increases as the values in \(B_{ik}\) decrease. Therefore, the score of \(P_{ik}\) is defined as

$$\begin{aligned} T_{ik} = \frac{1}{W}\sum _{w=1}^{W} \frac{1}{\log (1+B_{ik}^w)}. \end{aligned}$$
(4)

The C-ROI \(P_{ik^{*}}\) in \(x_i\) is selected by

$$\begin{aligned} k^{*} = \arg \max _{k} ~T_{ik}. \end{aligned}$$
(5)
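Putting Eqs. (2)–(5) together, the sketch below selects the C-ROI of one image given precomputed spatial and appearance histograms for every candidate in every image. The value of W and the guard against zero distances are assumptions; the paper does not state them.

```python
# Minimal sketch of C-ROI selection (Eqs. 2-5). S[i][k][f] and A[i][k][f]
# are the spatial and appearance histograms of candidate k in image i for
# detector f (numpy arrays).
import numpy as np

def chi2(p, q, eps=1e-10):
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def Z(S, A, i, k, j, l, detectors, alpha=0.5):
    # Eq. (3): best detector-wise combination of the two distances
    return min(alpha * chi2(S[i][k][f], S[j][l][f])
               + (1 - alpha) * chi2(A[i][k][f], A[j][l][f]) for f in detectors)

def select_croi(S, A, i, images, detectors, W=5):
    best_k, best_T = None, -np.inf
    for k in range(len(S[i])):
        # Eq. (2): closest candidate in every other image, then keep W smallest
        z = sorted(min(Z(S, A, i, k, j, l, detectors) for l in range(len(S[j])))
                   for j in images if j != i)
        B = z[:W]
        T = np.mean([1.0 / np.log(1.0 + b + 1e-10) for b in B])   # Eq. (4)
        if T > best_T:                                            # Eq. (5)
            best_k, best_T = k, T
    return best_k
```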

As examples, we present extracted C-ROIs (Fig. 5) for several classes of the Caltech-101 dataset.

Fig. 5 Examples of extracted C-ROIs for some classes of Caltech-101: a pagoda, b starfish, c hedgehog and d stop-sign classes

3.2.5 Class-specific weights of region detectors for C-ROIs

We measure class-specific weights of the region detectors, which represent the relative importance of each detector for detecting C-ROIs in a given class of images (Algorithm 2).

Algorithm 2: Class-specific weights of region detectors

Let \(P_{ik^{*}}\) be a C-ROI in an image \({x_{i}}\) and \(Q_{ik^{*}} = \{Q_{ik^{*}}^{1},Q_{ik^{*}}^{2},\ldots ,Q_{ik^{*}}^{W}\}\) be the set of W candidate C-ROIs in other images with the smallest distances from \(P_{ik^{*}}\). Then we can find the region detector that gives the minimum distance between \(P_{ik^{*}}\) and \(Q_{ik^{*}}^{w}\), which we denote as \(f_{ik^{*}}^{w}\), using the equation:

$$\begin{aligned} f_{ik^{*}}^{w} = \arg \min _{f \in F} \left[ \alpha D(S^{P_{ik^{*}}}_{f},S^{Q_{ik^{*}}^{w}}_{f}) + (1-\alpha ) D(A^{P_{ik^{*}}}_{f},A^{Q_{ik^{*}}^{w}}_{f}) \right]. \end{aligned}$$
(6)

For the set \(Q_{ik^{*}}\), we obtain an index set \(F_{i} = \{ f_{ik^{*}}^{1},f_{ik^{*}}^{2},\ldots ,f_{ik^{*}}^{W}\}\). For a given image class c, we then obtain a collection of index sets \(F^{c} = \{ F_{1},F_{2},\ldots ,F_{N} \}\), where N is the number of images in c; hence \(|F^{c}| = WN\).

The relative importance, or weight, \(w^{c}_{f}\) of region detector f for c can be measured by counting the number \(N^{c}_{f}\) of occurrences of each region detector in \(F^{c}\):

$$\begin{aligned} N^{c}_{f} = \sum _{ a \in F^{c} } \delta (a = f), \qquad w^{c}_{f} = \frac{N^{c}_{f}}{\max _{f' \in F}N^{c}_{f'}}, \end{aligned}$$
(7)

where \(\delta (P)\) returns 1 if P is true, and 0 otherwise. The relative importance of the region detectors varies considerably across image classes (Table 1).
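Given the flat list of best-matching detector indices \(F^{c}\), Eq. (7) reduces to counting and normalizing, as the short sketch below illustrates (the detector names are hypothetical).

```python
# Minimal sketch of Eq. (7): class-specific detector weights from F^c.
from collections import Counter

def detector_weights(F_c):
    """F_c: flat list of detector names, W entries per training image."""
    counts = Counter(F_c)                    # N^c_f for every detector f
    n_max = max(counts.values())
    return {f: n / n_max for f, n in counts.items()}

# usage (hypothetical detector names):
# detector_weights(["dog", "dog", "harris", "salient", "dog"])
# -> {"dog": 1.0, "harris": 0.333..., "salient": 0.333...}
```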

Table 1 Relative importance of detectors for image classes in Fig. 5

3.3 Focal region (FR)

The C-ROI extracted in Sect. 3.2 generally contains the whole structure of the object of interest. In this section, we aim to find the most informative region in the C-ROI, called the FR (Algorithm 3).

Algorithm 3: Focal region (FR) extraction

Toward this goal, the algorithm follows four steps. Given N training images of a class c, the first step is to compute the \(N_R\)-dimensional spatial histograms \(S^{P_{ik^{*}}}_{f}\), \(i=1~\ldots ~N\), for region detector f from the C-ROIs in the N images (Sect. 3.2.2), and to obtain an \(N_R \times N\) matrix \(X_{f}^{c}\) by stacking them as columns (Fig. 6), that is, \(X_{f}^{c} = [S^{P_{1k^{*}}}_{f} S^{P_{2k^{*}}}_{f} \ldots ~ S^{P_{Nk^{*}}}_{f}]\).

Fig. 6 Illustration of constructing spatial distributions to apply NMF for FR

The second step is to obtain the semantic spatial distribution by applying nonnegative matrix factorization (NMF) [15] to \(X_{f}^{c}\). NMF is known to learn parts-based, semantic structure. It determines the two-factor decomposition

$$\begin{aligned} X_{f}^{c} \approx U_{f}^{c}V_{f}^{c}, \end{aligned}$$
(8)

where \(U_{f}^{c}\) is an \(N_R \times K\) matrix that contains K bases (in our case, \(K = 1\)), and \(V_{f}^{c}\) is a \(K \times N\) matrix that contains the weights of the K bases for each image.

The third step is to get the activation map \(A^{c}\) by summing \(U_{f}^{c}\) weighted with the class-specific weights \(w^{c}_{f}\):

$$\begin{aligned} A^{c} = \sum _{f=1}^{N_f} w^{c}_{f}U_{f}^{c}. \end{aligned}$$
(9)
Fig. 7 Results of activation map and FRs obtained from C-ROIs of face and cup classes. a C-ROIs, b activation map, c FRs

The final step is to obtain the FR from the activation map \(A^{c}\) by thresholding. The FR is fixed relative to the C-ROI for all images of the same class. As examples, we present the FRs detected in C-ROIs for two classes of images (Fig. 7). The detected FRs closely match the regions that a human would intuitively designate.
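The sketch below condenses steps two to four, assuming X[f] is the \(N_R \times N\) matrix of a detector's spatial histograms and w[f] its class-specific weight from Eq. (7); sklearn's NMF stands in for the factorization, and the threshold value is an assumption (the text does not state it).

```python
# Minimal sketch of FR extraction (Eqs. 8-9 plus thresholding).
import numpy as np
from sklearn.decomposition import NMF

def focal_region(X, w, grid=10, thresh=0.5):
    """X: {detector: (N_R, N) matrix}; w: {detector: class-specific weight}."""
    A = np.zeros(grid * grid)
    for f in X:
        U = NMF(n_components=1, init="nndsvda").fit_transform(X[f])  # Eq. (8)
        A += w[f] * U[:, 0]                                          # Eq. (9)
    A /= max(A.max(), 1e-12)
    # FR = thresholded activation map, reshaped back to the sub-region grid
    return (A >= thresh).reshape(grid, grid)
```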

3.4 Class-specific image representation

The C-ROI and the FR described so far can be used to construct the class-specific image representation. To compute an image-level descriptor for an image, we define spatial pooling regions, then concatenate the encodings of each spatial region. Here, the encoding of each spatial region is the BoW representation with denseSIFT features of the spatial region. As in traditional SPM, where spatial pooling is done in a spatial pyramid fashion (\(1 \times 1\), \(2 \times 2\) and \(4 \times 4\) grids), we construct a 3-level spatial pooling structure, as follows (a minimal code sketch is given after the list):

  • Level 1: The whole image is used as a spatial pooling region (\(1 \times 1\) grid).

  • Level 2: Class-specific spatial pooling is designed. Unlike SPM pooling, the proposed method uses four regions divided differently (Fig. 1), i.e., Region 1 for the FR, Region 2 for the C-ROI, and Regions 3 and 4 on the remaining area that corresponds to background. Here, Regions 3 and 4 are partitioned horizontally as shown in Fig. 1c, because the properties of a background scene generally remain similar along the horizontal direction.

  • Level 3: A regular \(4 \times 4\) grid is constructed on the C-ROI of the class so that we can obtain more detailed information on the object of interest.
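Below is a minimal sketch of the pooled representation under simplifying assumptions: dense SIFT words are given as (x, y, word_id) triples, all regions are axis-aligned boxes (the FR mask is assumed to have been converted to its bounding box), and the two background bands simply split the image height without carving out the C-ROI.

```python
# Minimal sketch of the class-specific pooled representation (Sect. 3.4).
import numpy as np

def bow(words, box, K=1024):
    """Normalized BoW histogram of the dense-SIFT words inside a box."""
    x0, y0, w, h = box
    hist = np.zeros(K)
    for (x, y, wid) in words:
        if x0 <= x < x0 + w and y0 <= y < y0 + h:
            hist[int(wid)] += 1
    return hist / max(hist.sum(), 1.0)

def class_specific_representation(words, img_w, img_h, croi, fr_box):
    x0, y0, w, h = croi
    level1 = [(0, 0, img_w, img_h)]                      # whole image
    level2 = [fr_box, croi,                              # Regions 1 and 2
              (0, 0, img_w, img_h / 2),                  # Region 3 (simplified)
              (0, img_h / 2, img_w, img_h / 2)]          # Region 4 (simplified)
    level3 = [(x0 + c * w / 4, y0 + r * h / 4, w / 4, h / 4)
              for r in range(4) for c in range(4)]       # 4x4 grid on C-ROI
    return np.concatenate([bow(words, b) for b in level1 + level2 + level3])
```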

3.4.1 Classifier learning

For learning a classifier, we use the PEGASOS SVM [30] as a linear SVM solver. To use non-linear additive kernels instead of the linear kernel, we use the \(\chi ^{2}\) explicit feature map [33]. The regularization-loss trade-off parameter C of the SVM is set to 10. For a specific class, the training images are divided into two groups: positive (training images of the specific class) and negative (all training images of the other classes). Then, a 1-vs-rest classifier is trained with the training data for the specific class. For multi-class image classification, a 1-vs-rest classifier has to be prepared for each class, so the number of classifiers equals the number of classes.
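A minimal training sketch with off-the-shelf substitutes: sklearn's AdditiveChi2Sampler implements an explicit \(\chi^{2}\) feature map in the spirit of [33], and LinearSVC stands in for the PEGASOS solver (a closer match would be SGDClassifier with hinge loss). Since each class has its own class-specific representation, one 1-vs-rest classifier is trained per class; C = 10 follows the text.

```python
# Minimal sketch of 1-vs-rest training with an explicit chi^2 feature map.
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_classifiers(reps, labels, classes):
    """reps[c]: (n_images, dim) class-c-specific representations (nonnegative);
    labels: (n_images,) ground-truth class labels."""
    classifiers = {}
    for c in classes:
        y = (labels == c).astype(int)                  # positive vs. rest
        clf = make_pipeline(AdditiveChi2Sampler(sample_steps=2),
                            LinearSVC(C=10))
        classifiers[c] = clf.fit(reps[c], y)
    return classifiers
```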

3.4.2 Testing

To test an image i, candidate C-ROIs are extracted first. To select a C-ROI for a class c among them, Eq. (2) is modified to

$$\begin{aligned} Z_{ik}^{c} = Z(P_{ik},P_{jl^{*}}^{c}), \end{aligned}$$
(10)

where \(P_{jl^{*}}^{c}\) denotes the C-ROI (not candidate C-ROI) of training image j of class c, which was obtained in the training stage. Equations (3)–(5) can be used without modification. Notice that the C-ROI is extracted differently depending on the class to test. The FR in the extracted C-ROI is obtained using the activation map \(A^{c}\) of the class to test, which is fixed for each class. Using these two regions yields a class-specific image representation for the class, which we evaluate with the classifier for c. Finally, after all classes have been evaluated, the test image is classified into the class \(c^{*}\) that has the maximum score:

$$\begin{aligned} c^{*} = \arg \max _{c \in C} ~S_{i}^{c}, \end{aligned}$$
(11)

where \(S_{i}^{c}\) is the score obtained from the classifier for c on the test image i, and C is the set of classes to test.
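Continuing the training sketch above, testing reduces to scoring each class's representation of the test image with that class's classifier and taking the argmax of Eq. (11); reps_i is assumed to hold the per-class representations of the test image.

```python
# Minimal sketch of the testing stage (Eq. 11).
def classify(reps_i, classifiers):
    """reps_i[c]: class-c-specific representation of the test image (numpy)."""
    scores = {c: clf.decision_function(reps_i[c].reshape(1, -1))[0]
              for c, clf in classifiers.items()}       # S_i^c for every class
    return max(scores, key=scores.get)                 # c* = argmax_c S_i^c
```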

4 Experiments

4.1 Datasets

We evaluated our proposed method on the Caltech-4, Caltech-101, CMU Faces, and Scene 15 benchmark datasets (Table 2). The Caltech-4 contains four classes of images with large variation in object size and location; the Caltech-101 contains 101 classes of images with even larger variation in object size and location than the Caltech-4, and with additional large variation in object pose. Among the 101 classes in the Caltech-101 dataset, we selected only the 36 classes that do not have large variation in object pose, because our approach is based on the assumption that the spatial distribution of region detector responses is similar over images of the same class. The CMU Faces dataset was of special interest: the goal was to classify images according to whether or not the faces wore sunglasses, a task that seemed suitable to demonstrate the power of the FR proposed in this paper. The Scene 15 dataset was included to show the ability of our algorithm to capture similar parts of scenes as C-ROIs even though, unlike the other datasets, these scenes do not contain obvious objects for classification.

Table 2 Descriptions of the four datasets

4.2 Implementation details

We used a single descriptor, denseSIFT [14]. The SIFT descriptors, extracted from \(16 \times 16\) pixel patches, were densely sampled from each image on a grid with a step size of 8 pixels. The images were all processed in grayscale. We used k-means to learn a codebook of size 1024, and assigned the SIFT features to the nearest codebook vector (hard assignment). We used the VLFeat library [32] for SIFT and k-means computation. For all datasets, we set \(N_R\) to a \(10 \times 10\) grid for spatial histogram computation. For the Caltech-4, Caltech-101, and CMU Faces datasets, only two levels (1 and 2) were used for spatial pooling; for the Scene 15 dataset, all three levels were used. For comparison of classification accuracy, we used SPM with a linear SVM as the baseline; the SVM was obtained from the Liblinear [4] library.
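For reference, the sketch below reproduces this dense SIFT setup with OpenCV and sklearn as stand-ins for VLFeat (outputs will not be byte-identical to VLFeat's dense SIFT): descriptors on an 8-px grid from \(16 \times 16\) patches, a 1024-word k-means codebook, and hard assignment.

```python
# Minimal sketch of the denseSIFT + codebook preprocessing (Sect. 4.2).
import cv2
import numpy as np
from sklearn.cluster import KMeans

def dense_sift(gray, step=8, patch=16):
    """SIFT descriptors on a regular grid (16x16 patches, 8-px step)."""
    h, w = gray.shape
    kps = [cv2.KeyPoint(float(x), float(y), float(patch))
           for y in range(patch // 2, h - patch // 2, step)
           for x in range(patch // 2, w - patch // 2, step)]
    _, desc = cv2.SIFT_create().compute(gray, kps)
    return desc                                  # (n_patches, 128)

# codebook learning on subsampled training descriptors, then hard assignment:
# codebook = KMeans(n_clusters=1024, n_init=1).fit(stacked_descriptors)
# words = codebook.predict(dense_sift(gray))
```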

4.3 Effects of class-specific combination of region detectors

The relative importance (or weight) \(w^{c}_{f}\) of the region detectors, computed for some classes of the four datasets (Fig. 8; Table 3), shows which region detectors contribute most to extracting C-ROIs for each class. For example, the Hessian-Laplace detector is known to be good at extracting blob-like structure, and it showed the highest weights for the classes of leopards and faces with sunglasses, which have such blob-like structures. The high weights for these patterns seemed to lead to the extraction of FRs that include them.

Fig. 8 Example images for some classes of four datasets

Table 3 Relative importance of detectors for some classes (in Fig. 8) in the datasets of (a) Caltech-4, (b) Caltech-101, (c) CMU Faces and (d) Scene 15
Table 4 Classification rate (%) on (a) Caltech-4, (b) Caltech-101, (c) CMU Faces and (d) Scene 15 datasets

In this section, we show the effects of the class-specific combination of region detectors on image classification. To do this, we ran our method with only one region detector at a time on the four datasets (Table 4). Combining the four detectors always gave better results than any single region detector. This result is important because, if only one region detector is allowed, we must choose the best one; this choice depends on the data to classify, and is neither easy nor intuitive.

We computed average images of the C-ROIs extracted from some classes of the Caltech-101 and Scene 15 datasets (Fig. 9). The average images obtained using the combination of region detectors showed clearer object boundaries than the images obtained using any single region detector. This means that the combination of region detectors localizes the class-specific region-of-interest in images better than does a single detector.

Fig. 9 Average images of selected C-ROIs for some classes from Caltech-101 and Scene 15 datasets; each row represents average images of obtained C-ROIs using a only DoG region detector, b only Harris-Laplace region detector, c only Hessian-Laplace region detector, d only salient region detector, and e multiple region detectors (in proposed method)

4.4 C-ROI and FR

For qualitative results, we extracted C-ROIs (Fig. 10) and FRs (Fig. 11) on the four datasets of Sect. 4.1. If objects in a given class had similar shapes, the C-ROIs were well extracted regardless of the size and location of the objects. The activated FRs in the C-ROIs of some classes contained their key components, i.e., the spines on the hedgehog, the spot pattern on the leopard, the highway sign, and the sunglasses on the face.

Fig. 10 Examples of extracted C-ROIs for some classes of a Caltech-4, b Caltech-101, c CMU Faces and d Scene 15 datasets

Fig. 11 Activated FRs on C-ROIs for some classes of a Caltech-4, b Caltech-101, c CMU Faces and d Scene 15 datasets

To check the effect of the two proposed class-specific regions on image classification, we evaluated classification accuracy using only these two regions on the CMU Faces and Scene 15 datasets; these two datasets are designed to classify specific conditions or scenes without obvious objects for classification, and are therefore suitable for evaluating the power of the C-ROI and the FR. For feature extraction, instead of the 3-level spatial pooling structure of Sect. 3.4, we defined four different spatial pooling configurations: the whole image (baseline), only the C-ROI, only the FR, and both C-ROI and FR. Classification accuracy for the four configurations is listed in Table 5. For the CMU Faces, the proposed C-ROI or FR (Figs. 10, 11c) was significantly better than the whole image region. This result seems reasonable because the C-ROI or FR captures the region that indicates the presence of sunglasses. Similarly, for the Scene 15, the proposed C-ROI or FR showed much better results than the whole image region. To identify classes for which the proposed C-ROI or FR worked particularly well, we evaluated the classification accuracy separately for each class in the Scene 15, and list the classes with the largest improvements over the whole-image baseline (Fig. 12). The characteristics of some scenes were better described using the C-ROI or FR than using the baseline. For both datasets, using both C-ROI and FR for classification gave better results than using either alone; this means that both proposed regions play an important role in classification.

Fig. 12 Examples of classes with the largest gains obtained by a C-ROI and b activated FR on C-ROI for the Scene 15 dataset; the classification accuracies are listed below the sample images, with gains over the baseline

Table 5 Classification rate (%) on (a) CMU Faces and (b) Scene 15 datasets

C-ROI and FR extraction sometimes failed (Fig. 13). Failures usually occurred in classes whose images lack a common structure (e.g., beds seen from various viewpoints in the bedroom class). Because we assumed that images in the same class share similar structures or objects, classification accuracy degrades when this assumption is not satisfied. Relaxing this assumption is left as future work.

Fig. 13 Examples of failure cases for C-ROI (left) and FR (right) extraction; a failure case 1 (industrial class) and b failure case 2 (bedroom class)

4.5 Comparison with existing methods

Image classification results were obtained on the four datasets (Tables 6, 7, 8, 9). In these experiments, we compared our results, both with and without the class-specific weights of region detectors, with SPM as the baseline and with other existing methods.

For the Caltech-4 dataset (Table 6), our method showed almost 100 % classification accuracy. Simple SPM showed better accuracy than other methods [6, 9, 26] that were developed to solve the problems of classifying images in which object location and size vary.

For the Caltech-101 dataset (Table 7), our method showed classification accuracy comparable to the extended versions of SPM [34, 36], which adopt efficient encoding techniques, although our method uses hard assignment for encoding. Our method showed a much larger improvement over SPM on this dataset than on the Caltech-4 dataset because variation in object size and location is greater in the Caltech-101 dataset.

For the CMU Faces dataset (Table 8), our result achieved as much as 8.7 % higher classification accuracy than the SPM result; this improvement was larger than those on the Caltech-4 and Caltech-101 datasets, and seems to stem mainly from our method's ability to detect the most discriminative region (i.e., the FR) in the C-ROI. In fact, we obtained the best accuracy (92.97 %, Table 5) when we used only the C-ROI and FR without background information. This means that the background information of this dataset disturbs rather than assists image classification. Our method gave better results than Nguyen's method [26], which is similar to ours in that it automatically localizes the subwindows that are most discriminative for classification.

For the Scene 15 dataset (Table 9), our method achieved higher classification accuracy than SPM and the extended version of SPM [34], and achieved accuracy comparable to the state-of-the-art method [31], which uses discriminative spatial saliency as the interest region for the classification task.

For all datasets, using even a single region detector (Table 4) usually gave better results than SPM; this observation implies that the proposed representation using the C-ROI and FR contributed to solving the problems caused by varying size and location of objects in images.

Table 6 Classification rate (%) on Caltech-4 dataset
Table 7 Classification rate (%) on 36 classes from Caltech-101
Table 8 Classification rate (%) on CMU Faces dataset
Table 9 Classification rate (%) on Scene 15 dataset

5 Conclusion

We proposed a new method to construct a class-specific representation that outperforms SPM for classification of images with large variation in object size and location. To obtain good classification accuracy despite these variations, we proposed two kinds of region, called the class-specific region-of-interest (C-ROI) and the focal region (FR). The C-ROI is the region that is common to images of the same class; the FR is the most discriminative region in the C-ROI. To extract these two regions, we used four scale-invariant region detectors: DoG, Harris-Laplace, Hessian-Laplace, and salient. Image representation using these two regions gave better classification results on several well-known datasets than did SPM. In future work, this concept could be extended to find the best combination of macro-features to describe object classes.