1 Introduction

Representation of high dimensional data has been one of the most important research topics across a wide variety of information processing fields, including pattern recognition, machine learning, signal processing, data compression, and database navigation, because good data representation usually results in better performance on the problems in those fields. Generally, a good representation preserves as much essential and meaningful information about the data as possible while expressing it in a simpler way. Two common methods have been adopted to achieve this goal: dimensionality reduction and structure-retaining removal of redundant features. Dimensionality reduction attempts to transform high dimensional data into a lower dimensional form, simplifying and cleaning the data for pattern recognition and machine learning tasks. Redundant feature removal that retains the main structure tries to find an optimized set of key features.

The data representation problem has become more challenging as very large high dimensional datasets have become available. This is because the number of possible distinct configurations of a set of variables grows exponentially with the number of variables. Many algorithms have been proposed and developed to tackle this problem, such as principal component analysis (PCA) [24], factor analysis (FA) [2], projection pursuit (PP) [20], independent component analysis (ICA) [29], etc. These algorithms come from different fields and take different approaches, but they can be considered to share the same goal: finding a better representation. Hence, these algorithms focus on how to reduce dimensions or build a proper dictionary, and show the resultant benefits.

PCA is one of the most popularly adopted methods for dimensionality reduction; it finds orthogonal linear combinations of the original variables that effectively represent the data. However, each PCA factor is still affected by all original data variables, which means the factors retain unwanted features when representing the data. In addition, the factors’ meanings are often difficult to interpret. Several alternatives have been proposed to overcome these limitations of PCA, such as non-negative matrix factorization (NMF) [30] and sparse PCA (SPCA) [50]. In particular, the underlying motivation for sparse representation-based dimension reduction is that even though a signal lies in a high-dimensional space, it can actually be recovered in some lower-dimensional subspace due to its sparsity. However, the main problem with typical sparse-representation dimension reduction is that it does not consider the data structure, which can result in structural information loss. Structured sparsity [19] was proposed to solve this problem by optimizing the selection over structures such as groups of input features. In this paper, we present a novel visual object recognition methodology that applies structured sparse PCA (SSPCA) to effectively reduce the dimension of high-dimensional input features while retaining the meaningful features and structures. The benefit of exploiting the structure of data was proved by Jenatton et al. [21]: given any intersection-closed set of group patterns of variables, one can build a sparsity-inducing regularization norm Ω such that, when solving classification and regression tasks regularized by Ω, the support of the solution is enforced to belong to the given group patterns.

Figure 1 shows the proposed approach for object recognition. Using SSPCA to find the optimal dimensionality of very high dimensional data allows the recognition model to be trained on the reduced features while retaining recognition performance comparable to previous state-of-the-art approaches. The main contributions of this paper are summarized as follows:

  • A method to find the optimal dimensionality reduction for high dimensional data when representing features.

  • An analysis of the relationships among the optimal size of the reduced dimension, the number of training images, and the non-optimized high dimensional features.

  • A structure-based similarity measure and classification instead of the traditional energy-based measure and classification.

Fig. 1

Our structure preserving dimensionality reduction system for robust visual object recognition. This system is separated into three parts: SSPCA-based dimension reduction, decomposition of the coefficients and the dictionaries, and similarity measure and classification

The remainder of this paper is organized as follows. Section 2 reviews related work on dimensionality reduction, sparse representation and learning, and visual object recognition, and Section 3 provides the technical details of how we adopt SSPCA to solve the object recognition problem. Section 4 presents quantitative and qualitative evaluations of our approach and analyses of the experimental results. Section 5 summarizes and concludes the paper.

2 Related works

2.1 Dimensionality reduction

As data dimensionality grows, the volume of the data space increases exponentially. This is called the curse of dimensionality [4], and it must be tackled in many pattern recognition and machine learning tasks. Reducing the dimensionality enables lower computational cost and improved performance by removing noise and less informative features, and by finding more general regions or rules applicable to new data in a variety of tasks. Several dimensionality reduction techniques have been introduced: PCA [24], multidimensional scaling (MDS) [10], ICA [29], kernel PCA [17], semidefinite embedding (SDE) [44], Isomap [41], and locally linear embedding (LLE) [38].

PCA [24] is one of the most classical and popular methods for dimension reduction. It finds a linear subspace of lower dimensionality that preserves as much of the variability of the original data as possible. However, its computational cost grows with the complexity of the source data, and it is limited by linearity. MDS [10] is another classical dimensionality reduction method; it maps the original space to a lower dimensional space based on proximities indicating the similarities between variables. In MDS, the variables are represented as points in a lower dimensional space, and the distances between the points correspond to the similarities. However, like PCA, MDS is limited by linearity. ICA [29] is a statistical method that can process the non-Gaussian data distributions PCA cannot deal with, but it still finds a linear representation. Kernel PCA [17] is a nonlinear dimensionality reduction method that introduces a kernel into PCA. Kernel PCA is applicable to many high dimensional nonlinear datasets, but reconstruction of training data and test samples is difficult. A variation of kernel PCA is SDE [44], also known as maximum variance unfolding (MVU) [43]. SDE differs from kernel PCA in that it does not use a pre-defined kernel function, but learns a kernel matrix with the assistance of semidefinite programming. Isomap [41] is a nonlinear dimensionality reduction algorithm based on MDS. It finds the k nearest neighbors of each data point under the assumption that pair-wise distances are known only between neighboring points, computes the geodesic pair-wise distances between all other data points, and performs low dimensional embedding via MDS based on these geodesic distances. However, determining the proper value of k for Isomap's neighborhood graph construction is not trivial. LLE [38] was also proposed for nonlinear dimensionality reduction; it identifies the neighbors of each data point, computes the weights that best describe each data point as a linear combination of its neighbors, and finds the low dimensional embedding of the points described by the previously determined weights. LLE achieves faster optimization than Isomap using an eigenvector-based optimization technique. Marjan et al. represented high dimensional data with a chaos theory-based appearance model for the object tracking task [1]. Other approaches address the high dimensional data problem from the perspective of subspace clustering [14, 49].
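To make this survey concrete, the short sketch below (our illustration, not code from any of the cited works) applies three of the methods above to a toy nonlinear dataset using scikit-learn; all parameter values are arbitrary.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# 3-D points sampled from an intrinsically 2-D nonlinear manifold.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Linear method: preserves maximal variance but cannot "unroll" the manifold.
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear methods: both build a k-nearest-neighbor graph first, so the
# choice of n_neighbors (the k discussed above) matters.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
```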

2.2 Sparse representation and learning

Sparse representation and learning have been widely used in various research fields following the theoretical proof that a sufficiently sparse representation in a general dictionary is unique and can be found using ℓ1 minimization [48]. Solving a sparse representation and learning problem involves seeking the sparsest linear combination of basis functions from an overcomplete dictionary. There are many approaches to obtaining sparse solutions, but they can be grouped into three broad categories: convex relaxation [3], greedy pursuits [8], and combinatorial algorithms. Convex relaxation approaches include basis pursuit (BP) [12], interior point approaches, projected gradient methods, and iterative thresholding. To emphasize the original data structure while retaining sparseness, Huang et al. [18] proposed an efficient dynamic group sparsity concept that adaptively learns the dynamic group structure in practical applications. They also provided theoretical proofs of lower measurement requirements and lower computational complexity for group sparsity. Inspired by conventional group sparsity, Zhang et al. [47] proposed leveraging feature sparsity and clustering properties to regularize feature selection. Yang et al. [23] extended the robust encoding ability of group sparse coding with spatial correlations among training regions to better depict and index image content. Group sparse representation usually focuses on more specific structures, whereas structured sparse representation works on general coding structures. De Pierrefeu et al. extended the PCA framework by adding a structural constraint, a TV-elastic net penalty, to sparse PCA [11].
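As a concrete illustration of ℓ1-based sparse coding (our sketch, not code from the cited works), the following recovers a sparse coefficient vector over a random overcomplete dictionary via ℓ1-regularized least squares, a Lasso-style relaxation of basis pursuit; the dictionary size and sparsity level are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 64, 256                      # overcomplete: 256 atoms in 64 dimensions
D = rng.standard_normal((n, p))
D /= np.linalg.norm(D, axis=0)      # unit-norm dictionary atoms

c_true = np.zeros(p)
c_true[rng.choice(p, size=5, replace=False)] = rng.standard_normal(5)
x = D @ c_true                      # signal with a 5-sparse representation

# l1-regularized least squares approximates the sparsest solution.
lasso = Lasso(alpha=1e-3, max_iter=10000, fit_intercept=False).fit(D, x)
print("nonzeros recovered:", np.sum(np.abs(lasso.coef_) > 1e-6))
```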

2.3 Object recognition

Breakthroughs in deep convolutional neural networks have dramatically improved object recognition and detection performance in recent years [9, 28, 31, 37]. However, other approaches that exploit the sparseness of image features have also been promising in realizing state-of-the-art object recognition performance. Initially, to imitate the human visual system, biologically motivated approaches emulated quantitative models of the visual cortex to recognize objects within images. Serre et al. presented a model that mimicked visual cortical processing to realize position- and scale-tolerant edge detection [39]. The features used in Serre et al.’s system were more flexible than template-based approaches, and more selective than histogram-based descriptors.

Other biologically inspired systems have attempted to model the three properties of the mammalian primary visual cortex, i.e., spatially localized, oriented, and selective to structure at different scales [36]. Field proposed a coding strategy that maximized the sparseness of the statistical structure of natural images [13] to model these three properties mathematically. Olshausen and Field subsequently proposed a learning algorithm that searched for a sparse code by introducing two global objectives, one that preserved information and another that represented the sparseness of the features [36]. They demonstrated that the three properties emerge when only these two global objectives are placed on a linear coding of natural images. Mutch and Lowe modified Serre et al.’s model by eliminating weaker responses that disagreed with the locally dominant responses [33], and matched only the dominant orientation within a feature, rather than comparing all orientations. They found that increasing the sparsity of the features helped improve generalization performance.

As the usefulness of sparse image features was tested and confirmed, more studies exploiting sparse image representation followed. For example, Kavukcuoglu et al. proposed a predictive sparse decomposition (PSD) algorithm that simultaneously learned an overcomplete linear basis set and produced a smooth, easily computed approximator to predict the optimal sparse representation [25]. Naikal et al. proposed a sparse PCA algorithm that selected a subset of informative features based on the sparse coefficients in the first few principal vectors [34]. They also introduced an algorithm to speed up sparse PCA using the augmented Lagrangian method. Sohn et al. studied an efficient training method for sparse and conventional restricted Boltzmann machines (RBMs) through the connections between mixture models and RBMs, and proposed a mid-level feature extraction method using convolutional RBMs [40]. Oliveira et al. proposed a sparse spatial coding (SSC) algorithm that combined sparse coding dictionary learning, a spatial constraint coding stage, and online classification to improve object recognition [35].

Studies utilizing spatial pyramid matching (SPM), some incorporating sparse coding, have also contributed to improving object recognition performance. For example, Lazebnik et al. adapted the pyramid matching scheme of Grauman and Darrell [15], which finds an approximate correspondence between two sets of vectors, and pioneered the SPM method [26]. Their SPM method repeatedly subdivides the image using a hierarchy of rectangular windows and computes histograms of local features at increasingly fine resolutions. Bosch et al., rather than focusing on the entire scene of each image, automatically learned the region of interest and applied random forests to the spatial pyramid representation [6]. Yang et al. extended the SPM method by computing a spatial pyramid image representation based on sparse codes of SIFT [32] features, rather than the K-means vector quantization of traditional SPM [23]. Boureau et al. investigated the performance of different pairings of coding (vector quantization vs. sparse coding) and pooling (average vs. max) modules within the spatial pyramid framework, and showed that sparse coding systematically outperformed vector quantization irrespective of the pooling module, and that max pooling improved linear classification performance over average pooling [7]. He et al. [16] added a spatial pyramid pooling layer on top of the last convolutional layer of a CNN [27] to eliminate the fixed input image size required by current CNNs. This allowed training with variable-size images, possibly increasing scale invariance and reducing overfitting. Other object recognition methods include fast approximated locality-constrained linear coding, which utilizes a locality constraint to project each descriptor onto its local coordinate system [42].
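To illustrate the pooling step these SPM variants share, here is a toy sketch (our reconstruction under simplified assumptions, not code from the cited papers) of max pooling sparse codes over a three-level spatial pyramid; the grid levels and dictionary size are arbitrary.

```python
import numpy as np

def spatial_pyramid_max_pool(codes, xy, levels=(1, 2, 4)):
    """codes: (n_descriptors, dict_size) sparse codes;
    xy: (n_descriptors, 2) positions normalized to [0, 1)."""
    pooled = []
    for g in levels:
        cells = (xy * g).astype(int)           # grid cell of each descriptor
        for cx in range(g):
            for cy in range(g):
                mask = (cells[:, 0] == cx) & (cells[:, 1] == cy)
                cell = codes[mask]
                pooled.append(cell.max(axis=0) if len(cell) else
                              np.zeros(codes.shape[1]))
    return np.concatenate(pooled)              # length = dict_size * (1+4+16)

rng = np.random.default_rng(0)
feat = spatial_pyramid_max_pool(rng.random((500, 128)), rng.random((500, 2)))
print(feat.shape)  # (2688,) = 128 * 21 pyramid cells
```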

3 Structure preserving dimensionality reduction for effective object recognition

We consider a method for recognizing objects by extracting appearance features that can be represented in a lower dimension while retaining the structure of the high dimensional features, so that similarity measurement and classification can be performed efficiently. To effectively and efficiently recognize the desired objects, we propose an object recognition method that classifies the coefficients of the image’s appearance model.

To use a smaller number of dictionary vectors than the descriptor dimensions, the number of dictionary vectors needs to be sufficient to reasonably reflect the similarity that objects of a class share with each other, and to distinguish the mutually distinct structures of different classes as well. Although PCA is one of the most popularly adopted dimensionality reduction methods, the PCA factors are still affected by all original data variables, and PCA lacks the means to express the individual characteristics that objects of the same type have in common. Thus, it is difficult for PCA to deliver the common features of objects of the same type. However, SSPCA [22] relaxes these PCA constraints and is an excellent candidate for representing the data, because it allows a dictionary to be learned by exploiting a priori structural constraints as well as sparsity while reducing dimensionality [21].

The current paper applies SSPCA to generate dictionary vectors from appearance features. The proposed object recognition procedure comprises two steps: training the dictionary using SSPCA, and recognition by measuring similarity.

3.1 Symbols and notations

Before proceeding to the technical details, we introduce the notation and symbols to aid understanding. We use lower case letters for indices, parameters, and dimensions, and upper case letters for the constants used in the recognition process. We use boldface to denote multi-dimensional quantities such as images, vectors, and matrices. Vectors are column vectors. Table 1 lists the symbols and notation in detail.

Table 1 Symbols and notations used in this paper

3.2 Dictionary learning and dimensionality reduction

We collected a large training dataset \(\mathcal {B}\) of 2D images and categorized the images into M groups \(\{{\mathbb B_{i}}\}_{i = 1}^{M}\) where each \(\mathbb {B}_{i}\) consists of images containing the same (relevant) type of objects.

To create feature descriptors, we randomly choose K images \(\{\mathbf {b}_{i, j}\}_{j = 1}^{K}\) from each group \(\mathbb {B}_{i}\). By resizing, we may assume that all the images have the same size and are divided into S subimages of the same size. From each subimage of \(\mathbf{b}_{i,j}\), we create a feature descriptor [32] of dimension P, and arrange the descriptors in a column vector \(\mathbf {x}_{i,j}\in \mathbb {R}^{P\cdot S}\). Thus, we create the descriptor \(\mathbf{x}_{i,j}\) corresponding to the image \(\mathbf{b}_{i,j}\) for i = 1,…,M, j = 1,…,K.

Normally, P⋅S is large, e.g., P = 128 and S ≥ 100, and we propose to reduce the dimensionality. That is, we need to find dictionaries such that every vector \(\mathbf{x}_{i,j}\) has a well-fitted sparse representation. Before finding such dictionaries, we note that for each i = 1,…,M, the feature vectors \(\{\mathbf {x}_{i,j}\}_{j = 1}^{K}\) come from images containing the same kind of objects, so they are likely to share similar structures.

Therefore, we apply the SSPCA proposed by Jenatton et al. [22] to find such dictionaries. Let X be the P⋅S × K⋅M matrix having the \(\mathbf{x}_{i,j}\) as its column vectors. To apply the SSPCA, we first divide the index set \(\mathbb {G}=\{1, 2, \ldots , P\cdot S\}\) into M subsets \(\mathbb {G}_{1}, \mathbb {G}_{2}, \ldots , \mathbb {G}_{M}\) of size \(t=\lfloor \frac {P\cdot S}{M}\rfloor \), where \(\mathbb {G}_{i}=\{(i-1)t + 1,(i-1)t + 2, \ldots , i \cdot t\}\) for i = 1,…,M − 1; \(\mathbb {G}_{M}=\{(M-1)t + 1,(M-1)t + 2, \ldots , P\cdot S\}\); and ⌊x⌋ denotes the largest integer smaller than or equal to x. Then we define the sparsity-inducing mixed ℓ1/ℓ2 norm Ω over a vector \(\mathbf { u}\in \mathbb {R}^{P\cdot S}\),

$$ {\Omega}(\mathbf{u})=\sum\limits_{j = 1}^{M}\left( \sum\limits_{k\in \mathbb{G}_{j}} {u_{k}^{2}}\right)^{\frac12} \quad \text{for}\, \mathbf{u}=(u_{1}, u_{2}, \ldots, u_{P\cdot S}). $$
(1)
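For concreteness, a direct numpy transcription of (1) might look as follows (our sketch; the group boundaries follow the definition above, with the last group absorbing the remainder when M does not divide P⋅S).

```python
import numpy as np

def omega(u, M):
    """Mixed l1/l2 norm of (1): sum of Euclidean norms over M groups."""
    t = len(u) // M                                 # t = floor(P*S / M)
    bounds = [i * t for i in range(M)] + [len(u)]   # last group takes the rest
    return sum(np.linalg.norm(u[bounds[i]:bounds[i + 1]]) for i in range(M))
```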

With this setting, we apply the SSPCA dictionary search algorithm to the descriptors \(\{\mathbf {x}_{i,j}\}_{i = 1, j = 1}^{M, K}\) to find dictionaries \(\{\mathbf {u}_{i}\}_{i = 1}^{r}\subset \mathbb {R}^{P\cdot S}\),

$$\begin{array}{@{}rcl@{}} (\mathbf{U}, \mathbf{C}) &=& \underset{{\begin{array}{l} \mathbf{U}\in \mathbb{R}^{P\cdot S\times r}\\ \mathbf{C}\in\mathbb{R}^{r\times K\cdot M} \end{array}}}{\arg\!\min} \frac{1}{2(P\cdot S)\cdot (K\cdot M)}\|\mathbf{X}-\mathbf{U}\mathbf{C}\|_{F}^{2}\\ &&+\lambda\sum\limits_{j = 1}^{r}{\Omega}(\mathbf{u}_{j})~~\text{subject to}~~\|\mathbf{c}_{j}\|_{2}\le 1, \quad j = 1,\ldots, K\cdot M \end{array} $$
(2)

where \(\mathbf {U}=(\mathbf {u}_{k})_{k = 1}^{r}\) is the P⋅S × r dictionary matrix; \(\mathbf {C}=(\mathbf {c}_{j})_{j = 1}^{K\cdot M}\in \mathbb {R}^{r\times K\cdot M}\) is the coefficient matrix; ∥⋅∥F is the Frobenius matrix norm, \(\|\mathbf {A}\|_{F}^{2}=trace(\mathbf {A}\mathbf {A}^{T})=\sum \limits _{i = 1}^{n}\sum \limits _{j = 1}^{m} a_{ij}^{2}\) for \(\mathbf {A}=(a_{ij})\in \mathbb {R}^{n\times m}\); ∥⋅∥2 is the Euclidean norm; and Ω is the mixed norm defined in (1), which controls the sparsity and the structure of the support of \(\mathbf{u}_{j}\) (for details on the norm, see [22] and references therein).
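Problem (2) is convex in U for fixed C and vice versa, and can be attacked by alternating proximal-gradient steps. The simplified numpy sketch below is our illustration, not the solver of [22]; the initialization, step sizes, and iteration count are arbitrary choices. The key ingredients are that the proximal operator of λΩ is group-wise soft-thresholding, and that the column constraint on C is a projection onto the unit ℓ2 ball.

```python
import numpy as np

def group_prox(U, M, tau):
    """Block soft-thresholding: prox of tau * Omega, applied to each column."""
    U = U.copy()
    t = U.shape[0] // M
    for i in range(M):
        lo, hi = i * t, (i + 1) * t if i < M - 1 else U.shape[0]
        blk = U[lo:hi]
        nrm = np.linalg.norm(blk, axis=0, keepdims=True)
        U[lo:hi] = blk * np.maximum(0.0, 1.0 - tau / np.maximum(nrm, 1e-12))
    return U

def sspca(X, M, r, lam=0.1, iters=200):
    d, n = X.shape                       # d = P*S, n = K*M
    scale = 1.0 / (d * n)                # data-fit scaling from (2)
    rng = np.random.default_rng(0)
    U = rng.standard_normal((d, r))
    C = rng.standard_normal((r, n))
    for _ in range(iters):
        # C-step: gradient descent, then project columns onto ||c_j||_2 <= 1.
        sC = 1.0 / (scale * np.linalg.norm(U, 2) ** 2 + 1e-12)
        C -= sC * scale * (U.T @ (U @ C - X))
        C /= np.maximum(np.linalg.norm(C, axis=0, keepdims=True), 1.0)
        # U-step: gradient descent, then prox of lam * Omega (group shrinkage).
        sU = 1.0 / (scale * np.linalg.norm(C, 2) ** 2 + 1e-12)
        U = group_prox(U - sU * scale * ((U @ C - X) @ C.T), M, sU * lam)
    return U, C

# Usage: U, C = sspca(X, M=101, r=1500)  # X is the (P*S) x (K*M) matrix
```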

Then, for \(\ell = (i-1)\cdot K + j\) (i = 1,…,M, j = 1,…,K), the ℓ-th column vector \(\mathbf{x}_{i,j}\) of X is (approximately or exactly) expressed by a linear combination of the r dictionary vectors \(\mathbf {u}_{k}\) with coefficients \((c_{k\ell })_{k = 1}^{r}\),

$$ \mathbf{x}_{i,j} \approx \sum\limits_{k = 1}^{r} c_{k\ell} \mathbf{u}_{k} = \mathbf{U} \mathbf{c_{\ell}}, \quad \mathbf{c}_{\ell}=(c_{1\ell}, \ldots, c_{r\ell})^{T}. $$
(3)

Thus, each descriptor vector \(\mathbf {x}_{i,j}\in \mathbb {R}^{P\cdot S}\) is represented (approximately) in terms of the r dictionary vectors, i.e., we reduce the dimensionality from P⋅S to r. In the following section, we train support vector machines using the dictionary vectors \(\mathbf{u}_{k}\).
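Given the learned U, reducing a new descriptor to r dimensions per (3) amounts to fitting the dictionary to the descriptor; a minimal sketch with toy sizes is shown below (our illustration; a plain least-squares solve stands in for a constrained or sparse solve).

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((1280, 150))       # learned dictionary (P*S x r), toy sizes
x = rng.standard_normal(1280)              # descriptor of a new image
c, *_ = np.linalg.lstsq(U, x, rcond=None)  # c in R^r is the reduced feature
```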

We end this section with the following observation. Theorem 1 shows that enlarging the dictionary always reduces the reconstruction error but makes the representations denser, so a larger dictionary set cannot be expected to deliver better recognition performance. Empirically, there exists an optimal choice of r that depends on the number of samples (K⋅M) and is shown to be \(\lfloor \frac {K\cdot M}{2}\rfloor \).

Theorem 1

For a vector z and a finite set {w1,w2,…,wm} in \(\mathbb {R}^{n}\), let \(\{\mathbf {y}_{k}\}_{k = 1}^{m}\) be the orthogonal projections in \(\mathbb {R}^{n}\) of z onto the vector spaces \(\mathbb {W}_{k}\) spanned by {w1,w2,…,wk}, i.e.,

$$ \mathbf{y}_{k} = \underset{\mathbf{y}\in \mathbb{W}_{k}}{\arg\!\min}\|\mathbf{z} -\mathbf{y}\|_{2}, \quad\text{for}\, k = 1, 2, \ldots, m. $$
(4)

Then,

$$ \|\mathbf{z} -\mathbf{y}_{1}\|_{2}\ge \|\mathbf{z} -\mathbf{y}_{2}\|_{2}\ge {\cdots} \ge \|\mathbf{z} -\mathbf{y}_{m}\|_{2} $$
(5)

and

$$ \|\mathbf{y}_{1}\|_{0}\le \|\mathbf{y}_{2}\|_{0}\le {\cdots} \le \|\mathbf{y}_{m}\|_{0}, $$
(6)

where \(\|\mathbf{z}\|_{0}\) is the ℓ0 pseudo-norm of a vector z, defined to be the number of non-zero entries in z.

Proof

For k = 1,…,m, let nk be the dimension of the vector space \(\mathbb {W}_{k}\) spanned by {w1,w2,…,wk}. Since the subspaces \(\mathbb {W}_{1}\subseteq \mathbb {W}_{2}\subseteq {\ldots } \subseteq \mathbb {W}_{m}\) are nested, we have n1 ≤ n2 ≤ ⋯ ≤ nm. Gram–Schmidt orthogonalization implies that there exists an orthonormal basis \(\{\mathbf {e}_{i}\}_{i = 1}^{n}\) for \(\mathbb {R}^{n}\) such that, for k = 1,…,m, \(\{\mathbf {e}_{i}\}_{i = 1}^{n_{k}}\) is a basis for \(\mathbb {W}_{k}\). Thus, every vector \(\mathbf {w}\in \mathbb {W}_{k}\) is uniquely represented as

$$ \mathbf{w} = \sum\limits_{i = 1}^{n_{k}} \langle \mathbf{w}, \mathbf{e}_{i}\rangle \mathbf{e}_{i} \quad \text{and} \quad\|\mathbf{w}\|_{2}^{2}=\sum\limits_{i = 1}^{n_{k}} \langle \mathbf{w}, \mathbf{e}_{i}\rangle^{2}, $$
(7)

where 〈⋅,⋅〉 denotes the Euclidean inner product in \(\mathbb {R}^{n}\).

Let \(\mathbf {z}\in \mathbb {R}^{n}\) be given arbitrarily. Then for k = 1,…,m, the vector yk in (4) is the orthogonal projection of z onto \(\mathbb {W}_{k}\), i.e.,

$$\mathbf{y}_{k} = \underset{\mathbf{y}\in \mathbb{W}_{k}}{\arg\!\min}\|\mathbf{z} -\mathbf{y}\|_{2}=\sum\limits_{i = 1}^{n_{k}} \langle \mathbf{z}, \mathbf{e}_{i}\rangle \mathbf{e}_{i} $$

and

$$\|\mathbf{z} -\mathbf{y}_{k}\|_{2}^{2}=\sum\limits_{i=n_{k}+ 1}^{n} |\langle \mathbf{z}, \mathbf{e}_{i}\rangle|^{2}. $$

Thus, the inequalities (5) and (6) hold, which completes the proof. □
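Theorem 1 is also easy to check numerically; the snippet below (our sketch, with arbitrary dimensions) verifies the monotone decrease of the residuals in (5) by projecting a random vector onto the nested spans.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 10
W = rng.standard_normal((n, m))            # columns w_1, ..., w_m
z = rng.standard_normal(n)

residuals = []
for k in range(1, m + 1):
    Q, _ = np.linalg.qr(W[:, :k])          # orthonormal basis of span{w_1..w_k}
    y_k = Q @ (Q.T @ z)                    # orthogonal projection of z
    residuals.append(np.linalg.norm(z - y_k))

# The residuals are non-increasing, as inequality (5) asserts.
assert all(a >= b - 1e-10 for a, b in zip(residuals, residuals[1:]))
```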

4 Experimental results

We demonstrate the existence of an optimal number of dictionary elements and evaluate the performance of our method. We describe the experimental setup and evaluation for SSPCA-based object recognition and report quantitative and qualitative results in two parts: Section 4.1 covers the experimental setup, and Section 4.2 presents object recognition results on a public dataset together with analyses of the proposed approach based on the experimental results and comparisons with previous techniques.

4.1 Experimental setup

The experiments were conducted on the Caltech 101 dataset, which consists of 101 object classes and one background class. Each class contains 31 to 800 images of approximately 300 × 200 pixels. The images in each class vary in color, pose, and illumination, making the classification task challenging. We use the 101 object classes, excluding the background class. For pre-processing, we extract features from each image using SIFT [32]. All experiments were run on a standard 4 GHz machine with 2 GB RAM.
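A hedged reconstruction of this pre-processing stage is sketched below (our assumptions, not the authors' code: a 10 × 10 grid, i.e., S = 100 cells with one 128-D SIFT descriptor each, computed with OpenCV at fixed keypoints; the grid and cell sizes are arbitrary).

```python
import cv2

def image_to_descriptor(path, grid=10, cell=30):
    """Resize an image, split it into grid x grid cells, and concatenate one
    SIFT descriptor per cell into a P*S-dimensional vector."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (grid * cell, grid * cell))
    sift = cv2.SIFT_create()
    # One keypoint at the center of each cell, sized to cover the cell.
    kps = [cv2.KeyPoint((i + 0.5) * cell, (j + 0.5) * cell, float(cell))
           for i in range(grid) for j in range(grid)]
    _, desc = sift.compute(img, kps)        # shape (S, 128)
    return desc.reshape(-1)                 # P*S = 12800-dimensional vector
```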

4.2 Quantitative and qualitative evaluation of visual object detection

For fair performance evaluation and comparison, we start by focusing on the effectiveness and efficiency of the proposed approach in reducing the dimensionality of visual object features and the number of training images.

Figure 2 shows the confusion matrix of the proposed algorithm learned with 30 training samples per class, where each row denotes the true class and each column denotes the assigned class. Among the most misclassified classes are the llama (29.2%) and water lily (25%) classes. Figure 3 displays some examples of misclassified results from these classes: images of the llama and water lily classes are assigned to the kangaroo and lotus classes, respectively. The image features of the target classes and the assigned classes are considerably similar, and this is the main reason for the low recognition rate of these two classes. The class with the lowest accuracy (16.7%) is the ant class, which has twelve test images. As shown in Fig. 4, only two test images are recognized as an ant; the ten other images are classified into ten different classes. This is because the features and structures of an ant itself are uncomplicated; i.e., an ant is a relatively simple object consisting of a few lines and circle-like shapes. As a result, factors like illumination, texture, pose, and background have a stronger influence on the classification than they do for objects composed of more complicated features and structures.

Fig. 2

Confusion matrix of our SSPCA-based visual object recognition using Caltech 101 dataset. The performance is evaluated using the classifiers learned with 30 training images per class. Each row indicates the true class and each column represents the assigned class

Fig. 3

Examples of misclassification. The testing images with the target class are on the left column; the assigned class of each testing image and the assigned class’s training images are on the right column

Fig. 4

The object recognition result of the ant class. Only two test images are recognized as an ant. Ten other images are classified into ten different classes

We then conducted experiments to evaluate the robustness of the proposed approach when reducing the number of training images, because traditional learning and recognition approaches depend heavily on the number of training samples; here we seek to validate Theorem 1 experimentally. The number of training images was varied: 15, 20, 25, and 30 images were randomly chosen from each class for training, and the remaining images were used for testing. The dictionary pertaining to the chosen training images was learned using SSPCA, which let us decide how many feature dimensions to use, i.e., the number of dictionary elements. Table 2 displays the results of the proposed method applied to the object recognition task on the Caltech-101 dataset. We trained the classification model with different numbers of samples and dictionary elements. Generally, increasing the number of dictionary elements or training samples increases performance. However, when using 25 or 30 training samples, the optimal number of dictionary elements was 1500, not 2000. When the classification model was trained with 30 samples per class and 1500 dictionary elements, the best accuracy of 83.92% was achieved. This implies that overfitting occurred when more than the optimal number of dictionary elements was used. On the other hand, the combined use of 15 or 20 training samples and 1500 dictionary elements did not produce the best results: given 15 or 20 training samples, 2000 dictionary elements performed better than 1500. This means that, as the number of training samples decreases, more dictionary elements are needed to build an accurate classification model. Thus, the numbers of dictionary elements and training samples jointly influence the optimal dimensionality. Table 2 shows that, when using 30 training samples, the optimal number of dictionary elements is 1500, i.e., the optimal dimensionality is 1500 rather than the full dimension of the given features. As the number of feature dimensions increases beyond this optimum, overfitting occurs; conversely, too few dimensions cause underfitting. Increasing the feature dimensions does not guarantee better performance; hence, the optimal number of feature dimensions should be investigated for best performance.

Table 2 Recognition rate of different dictionary element size (r) and training sample combination

Next, we investigated the influence of the classifier type. The learned dictionary was used to create a classification model utilizing either an SVM or the softmax function, and numerical experiments were conducted to identify the optimal number of dictionary elements for each classifier; comparisons with other significant previous approaches follow in Section 4.3. As shown in Table 3, the SVM showed better performance than softmax, but both classifiers improved with an increased number of training samples and achieved their best accuracy with 30 training samples. Thus, the classifier type does not affect the dimensional optimality.

Table 3 Recognition rate of different classifier and training sample combination
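A minimal sketch (our illustration, with random data standing in for the real coefficient vectors) of the two classifiers compared in Table 3, using scikit-learn: a linear SVM and softmax implemented as multinomial logistic regression.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
C_train = rng.standard_normal((101 * 30, 1500))   # 30 samples/class, r = 1500
y_train = np.repeat(np.arange(101), 30)           # class labels

svm = LinearSVC(C=1.0).fit(C_train, y_train)
softmax = LogisticRegression(max_iter=1000).fit(C_train, y_train)
```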

4.3 Comparison with previous approaches

Lastly, we compared our SSPCA-based object recognition results with notable previous approaches that use various learning techniques without exploiting pretrained models. Table 4 compares the recognition performance of our approach with previous approaches on the Caltech 101 dataset. When using fifteen training samples, Yang et al. [45] achieve recognition accuracy below 75%; in comparison, our approach shows a relatively good result with fifteen training samples. Our approach successfully removes useless information via dimensionality reduction while preserving the structures and features important for the task. Zeiler and Fergus [46], a deep learning method, reported recognition accuracies of 22.8% and 46.5% with 15 and 30 training images when not using an ImageNet-pretrained model.

Table 4 Comparison of SSPCA-object recognition approach to the previous approaches, which do not use pretrained models

Nowadays, approaches adopting deep learning methodologies have shown considerably good performance on the object recognition task. Zeiler and Fergus [46] reported recognition accuracies of 83.8% and 86.5% on the Caltech 101 classification task when using 15 and 30 training samples respectively with an ImageNet-pretrained model. He et al. [16], which uses a convolutional neural network (CNN), also reported a recognition accuracy of 91.9% when training with 30 samples. However, both methods pre-trained the classification model on the ImageNet 2012 dataset, which contains about 1.2 million training images, and they benefit significantly from their pre-trained models.

5 Conclusion

We proposed an object recognition algorithm using optimal dimensionality reduction and verified the effectiveness of the proposed technique experimentally. The proposed object recognition method employs SSPCA and is effective at preserving crucial features while removing noisy and unimportant ones, by leveraging the strengths of both sparse representation and PCA.

We also showed that an optimal number of feature dimensions exists, which simplifies feature representation and improves performance. The core concept of finding the optimal dimensionality reduction for high dimensional features can be applied to various applications, such as image abstraction, image manipulation, and image composition.

Future work includes the following issues: improving the classification model to recognize objects with simple features and structures, such as ants, and adaptively discovering the optimal number of dimensions given the number of training samples. We will also work on SSPCA-based deep learning by combining SSPCA, which accounts for structure and sparsity, with various deep learning algorithms, which learn features effectively, in order to handle machine learning tasks including object detection, image annotation, and scene parsing on large-scale datasets such as MS COCO and ImageNet.