1 Introduction

Representation of high dimensional data has been one of the most important research topics across a wide variety of information processing fields, including pattern recognition, machine learning, signal processing, data compression, and database navigation, because good data representation usually results in better performance on the problems in those fields. Generally, a good representation preserves as much essential and meaningful information about the data as possible while expressing it in a simpler way. Two common methods have been adopted to achieve this goal: dimensionality reduction and structure-retaining removal of redundant features. Dimensionality reduction attempts to transform high dimensional data into a lower dimensional form, simplifying and cleaning the data for pattern recognition and machine learning tasks. Redundant feature removal that retains the main structure tries to find an optimized set of key features.

The data representation problem has become more challenging as very large high dimensional datasets have become available. This is because the number of possible distinct configurations of a set of variables grows exponentially with the number of variables. Many algorithms have been proposed and developed to tackle this problem, such as principal component analysis (PCA) [24], factor analysis (FA) [2], projection pursuit (PP) [20], independent component analysis (ICA) [29], etc. These algorithms come from different fields and take different approaches, but they can be considered to share the same goal: finding a better representation. Hence, these algorithms focus on how to reduce dimensions or build a proper dictionary, and show the resultant benefits.

PCA is one of the most popularly adopted methods for dimensionality reduction; it finds orthogonal linear combinations of the original variables that effectively represent the data. However, each PCA factor is still affected by all original data variables, which means the factors retain unwanted features when representing the data. In addition, the factors’ meanings are often difficult to interpret. Several alternatives have been proposed to overcome these limitations of PCA, such as non-negative matrix factorization (NMF) [30] and sparse PCA (SPCA) [50]. In particular, the underlying motivation for sparse representation-based dimension reduction is that even though a signal lies in a high-dimensional space, it can actually be recovered in some lower-dimensional subspace due to its sparsity. However, the main problem with typical sparse-representation dimension reduction is that it does not consider the data structure, which can result in structural information loss. Structured sparsity [19] was proposed to solve this problem by optimizing the selection over structures such as groups of input features. In this paper, we present a novel visual object recognition methodology that applies structured sparse PCA (SSPCA) to effectively reduce the dimension of high-dimensional input features while retaining the meaningful features and structures. The benefit of exploiting the structure of data was proved by Jenatton et al. [21]: given any intersection-closed set of group patterns of variables, one can build a sparsity-inducing regularization norm Ω such that, when solving classification and regression tasks regularized by Ω, the support of the solution is enforced to belong to the given group patterns.

Figure 1 shows the proposed approach for object recognition. Using SSPCA to find the optimal dimensionality of very high dimensional data allows the recognition model to be trained on the reduced features while retaining recognition performance comparable to previous state-of-the-art approaches. The main contributions of this paper are summarized as follows:

  • A method to find the optimal dimensionality reduction for high dimensional data when representing features.

  • An analysis of the relationships among the optimal size of the reduced dimension, the number of training images, and the non-optimized high dimensional features.

  • A structure-based similarity measure and classification instead of the traditional energy-based measure and classification.

Fig. 1

Our structure preserving dimensionality reduction system for robust visual object recognition. This system is separated into three parts: SSPCA-based dimension reduction, decomposition of the coefficients and the dictionaries, and similarity measure and classification

The remainder of this paper is organized as follows. Section 2 reviews related work on dimensionality reduction, sparse representation and learning, and visual object recognition, and Section 3 provides the technical details of how we adopt SSPCA to solve the object recognition problem. Section 4 presents quantitative and qualitative evaluations of our approach and analyses of the experimental results. Section 5 summarizes and concludes the paper.

2 Related works

2.1 Dimensionality reduction

As data dimensionality grows, the volume of the data space increases exponentially. This is called the curse of dimensionality [4], and it must be tackled in many pattern recognition and machine learning tasks. Reducing the dimensionality enables lower computational cost and improved performance by removing noise and less informative features, and by finding more general regions or rules applicable to new data in a variety of tasks. Several dimensionality reduction techniques have been introduced: PCA [24], multidimensional scaling (MDS) [10], ICA [29], kernel PCA [17], semidefinite embedding (SDE) [44], Isomap [41], and locally linear embedding (LLE) [38].

PCA [24] is one of the most classical and popular methods for dimension reduction. It finds a linear subspace of lower dimensionality that preserves as much of the variability of the original data as possible. However, its computational cost grows with the complexity of the source data, and it is limited by linearity. MDS [10] is another classical dimensionality reduction method; it maps the original space to a lower dimensional space based on proximities indicating the similarities between variables. In MDS, the variables are represented as points in a lower dimensional space, and the distances between the points correspond to the similarities. However, like PCA, MDS is limited by linearity. ICA [29] is a statistical method that can process the non-Gaussian data distributions PCA cannot deal with, but it still finds a linear representation. Kernel PCA [17] is a nonlinear dimensionality reduction method that introduces a kernel into PCA. Kernel PCA is applicable to many high dimensional nonlinear datasets, but reconstruction of training data and test samples is difficult. A variation of kernel PCA is SDE [44], also known as maximum variance unfolding (MVU) [43]. SDE differs from kernel PCA in that it does not use a pre-defined kernel function, but learns a kernel matrix with the assistance of semidefinite programming. Isomap [41] is a nonlinear dimensionality reduction algorithm based on MDS. It finds the k nearest neighbors of each data point under the assumption that pair-wise distances are known only between neighboring points, computes the geodesic pair-wise distances between all other data points, and performs low dimensional embedding via MDS based on these geodesic distances. However, determining the proper value of k for Isomap's neighborhood graph construction is not trivial. LLE [38] was also proposed for nonlinear dimensionality reduction; it identifies the neighbors of each data point, computes the weights that best describe each data point as a linear combination of its neighbors, and finds the low dimensional embedding of the points described by the previously determined weights. LLE achieves faster optimization than Isomap using an eigenvector-based optimization technique. Marjan et al. represented high dimensional data with a chaos theory-based appearance model for the object tracking task [1]. Other approaches address the high dimensional data problem from the perspective of subspace clustering [14, 49].
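To make this survey concrete, the short sketch below (our illustration, not code from any of the cited works) applies three of the methods above to a toy nonlinear dataset using scikit-learn; all parameter values are arbitrary.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# 3-D points sampled from an intrinsically 2-D nonlinear manifold.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Linear method: preserves maximal variance but cannot "unroll" the manifold.
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear methods: both build a k-nearest-neighbor graph first, so the
# choice of n_neighbors (the k discussed above) matters.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
```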

2.2 Sparse representation and learning

Sparse representation and learning have been widely used in various research fields following the theoretical proof that a sufficiently sparse representation in a general dictionary is unique and can be found using ℓ1 minimization [48]. Solving a sparse representation and learning problem involves seeking the sparsest linear combination of basis functions from an overcomplete dictionary. There are many approaches to obtaining sparse solutions, but they can be grouped into three broad categories: convex relaxation [3], greedy pursuits [8], and combinatorial algorithms. Convex relaxation approaches include basis pursuit (BP) [12], interior point approaches, projected gradient methods, and iterative thresholding. To emphasize the original data structure while retaining sparseness, Huang et al. [18] proposed an efficient dynamic group sparsity concept that adaptively learns the dynamic group structure in practical applications. They also provided theoretical proofs of lower measurement requirements and lower computational complexity for group sparsity. Inspired by conventional group sparsity, Zhang et al. [47] proposed leveraging feature sparsity and clustering properties to regularize feature selection. Yang et al. [23] extended the robust encoding ability of group sparse coding with spatial correlations among training regions to better depict and index image content. Group sparse representation usually focuses on more specific structures, whereas structured sparse representation works on general coding structures. De Pierrefeu et al. extended the PCA framework by adding a structural constraint, a TV-elastic net penalty, to sparse PCA [11].
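As a concrete illustration of ℓ1-based sparse coding (our sketch, not code from the cited works), the following recovers a sparse coefficient vector over a random overcomplete dictionary via ℓ1-regularized least squares, a Lasso-style relaxation of basis pursuit; the dictionary size and sparsity level are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 64, 256                      # overcomplete: 256 atoms in 64 dimensions
D = rng.standard_normal((n, p))
D /= np.linalg.norm(D, axis=0)      # unit-norm dictionary atoms

c_true = np.zeros(p)
c_true[rng.choice(p, size=5, replace=False)] = rng.standard_normal(5)
x = D @ c_true                      # signal with a 5-sparse representation

# l1-regularized least squares approximates the sparsest solution.
lasso = Lasso(alpha=1e-3, max_iter=10000, fit_intercept=False).fit(D, x)
print("nonzeros recovered:", np.sum(np.abs(lasso.coef_) > 1e-6))
```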

2.3 Object recognition

Breakthroughs in deep convolutional neural networks have dramatically improved object recognition and detection performance in recent years [9, 28, 31, 37]. However, other approaches that exploit the sparseness of image features have also been promising in realizing state-of-the-art object recognition performance. Initially, to imitate the human visual system, biologically motivated approaches emulated quantitative models of the visual cortex to recognize objects within images. Serre et al. presented a model that mimicked visual cortical processing to realize position- and scale-tolerant edge detection [39]. The features used in Serre et al.’s system were more flexible than template-based approaches, and more selective than histogram-based descriptors.

Other biologically inspired systems have attempted to model the three properties of the mammalian primary visual cortex, i.e., spatially localized, oriented, and selective to structure at different scales [36]. Field proposed a coding strategy that maximized the sparseness of the statistical structure of natural images [13] to model these three properties mathematically. Olshausen and Field subsequently proposed a learning algorithm that searched for a sparse code by introducing two global objectives, one that preserved information and another that represented the sparseness of the features [36]. They demonstrated that the three properties emerge when only these two global objectives are placed on a linear coding of natural images. Mutch and Lowe modified Serre et al.’s model by eliminating weaker responses that disagreed with the locally dominant responses [33], and matched only the dominant orientation within a feature, rather than comparing all orientations. They found that increasing the sparsity of the features helped improve generalization performance.

As the usefulness of sparse image features was tested and confirmed, more studies exploiting sparse image representation followed. For example, Kavukcuoglu et al. proposed a predictive sparse decomposition (PSD) algorithm that simultaneously learned an overcomplete linear basis set and produced a smooth, easily computed approximator to predict the optimal sparse representation [25]. Naikal et al. proposed a sparse PCA algorithm that selected a subset of informative features based on the sparse coefficients in the first few principal vectors [34]. They also introduced an algorithm to speed up sparse PCA using the augmented Lagrangian method. Sohn et al. studied an efficient training method for sparse and conventional restricted Boltzmann machines (RBMs) through the connections between mixture models and RBMs, and proposed a mid-level feature extraction method using convolutional RBMs [40]. Oliveira et al. proposed a sparse spatial coding (SSC) algorithm that combined sparse coding dictionary learning, a spatial constraint coding stage, and online classification to improve object recognition [35].

Studies utilizing spatial pyramid matching (SPM), some incorporating sparse coding, have also contributed to improving object recognition performance. For example, Lazebnik et al. adapted the pyramid matching scheme of Grauman and Darrell [15], which finds an approximate correspondence between two sets of vectors, and pioneered the SPM method [26]. Their SPM method repeatedly subdivides the image using a hierarchy of rectangular windows and computes histograms of local features at increasingly fine resolutions. Bosch et al., rather than focusing on the entire scene of each image, automatically learned the region of interest and applied random forests to the spatial pyramid representation [6]. Yang et al. extended the SPM method by computing a spatial pyramid image representation based on sparse codes of SIFT [32] features, rather than the K-means vector quantization of traditional SPM [23]. Boureau et al. investigated the performance of different pairings of coding (vector quantization vs. sparse coding) and pooling (average vs. max) modules within the spatial pyramid framework, and showed that sparse coding systematically outperformed vector quantization irrespective of the pooling module, and that max pooling improved linear classification performance over average pooling [7]. He et al. [16] added a spatial pyramid pooling layer on top of the last convolutional layer of a CNN [27] to eliminate the fixed input image size required by current CNNs. This allowed training with variable-size images, possibly increasing scale invariance and reducing overfitting. Other object recognition methods include fast approximated locality-constrained linear coding, which utilizes a locality constraint to project each descriptor onto its local coordinate system [42].
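To illustrate the pooling step these SPM variants share, here is a toy sketch (our reconstruction under simplified assumptions, not code from the cited papers) of max pooling sparse codes over a three-level spatial pyramid; the grid levels and dictionary size are arbitrary.

```python
import numpy as np

def spatial_pyramid_max_pool(codes, xy, levels=(1, 2, 4)):
    """codes: (n_descriptors, dict_size) sparse codes;
    xy: (n_descriptors, 2) positions normalized to [0, 1)."""
    pooled = []
    for g in levels:
        cells = (xy * g).astype(int)           # grid cell of each descriptor
        for cx in range(g):
            for cy in range(g):
                mask = (cells[:, 0] == cx) & (cells[:, 1] == cy)
                cell = codes[mask]
                pooled.append(cell.max(axis=0) if len(cell) else
                              np.zeros(codes.shape[1]))
    return np.concatenate(pooled)              # length = dict_size * (1+4+16)

rng = np.random.default_rng(0)
feat = spatial_pyramid_max_pool(rng.random((500, 128)), rng.random((500, 2)))
print(feat.shape)  # (2688,) = 128 * 21 pyramid cells
```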

3 Structure preserving dimensionality reduction for effective object recognition

We consider a method for recognizing objects by extracting appearance features that can be represented in a lower dimension while retaining the structure of the high dimensional features, so that similarity measurement and classification can be performed efficiently. To effectively and efficiently recognize the desired objects, we propose an object recognition method that classifies the coefficients of the image’s appearance model.

To use a smaller number of dictionary vectors than the descriptor dimensions, the number of dictionary vectors needs to be sufficient to reasonably reflect the similarity that objects of a class share with each other, and to distinguish the mutually distinct structures of different classes as well. Although PCA is one of the most popularly adopted dimensionality reduction methods, the PCA factors are still affected by all original data variables, and PCA lacks the means to express the individual characteristics that objects of the same type have in common. Thus, it is difficult for PCA to deliver the common features of objects of the same type. However, SSPCA [22] relaxes these PCA constraints and is an excellent candidate for representing the data, because it allows a dictionary to be learned by exploiting a priori structural constraints as well as sparsity while reducing dimensionality [21].

The current paper applies SSPCA to generate dictionary vectors from appearance features. The proposed object recognition procedure comprises two steps: training the dictionary using SSPCA, and recognition by measuring similarity.

3.1 Symbols and notations

Before proceeding to the technical details, we introduce the notation and symbols to aid understanding. We use lower case letters for indices, parameters, and dimensions, and upper case letters for the constants used in the recognition process. We use boldface to denote multi-dimensional quantities such as images, vectors, and matrices. Vectors are column vectors. Table 1 lists the symbols and notation in detail.

Table 1 Symbols and notations used in this paper

3.2 Dictionary learning and dimensionality reduction

We collected a large training dataset \(\mathcal {B}\) of 2D images and categorized the images into M groups \(\{{\mathbb B_{i}}\}_{i = 1}^{M}\) where each \(\mathbb {B}_{i}\) consists of images containing the same (relevant) type of objects.

To create feature descriptors, we randomly choose K images \(\{\mathbf {b}_{i, j}\}_{j = 1}^{K}\) from each group \(\mathbb {B}_{i}\). By resizing, we may assume that all the images have the same size and are divided into S subimages of the same size. From each subimage of \(\mathbf{b}_{i,j}\), we create a feature descriptor [32] of dimension P, and arrange the descriptors in a column vector \(\mathbf {x}_{i,j}\in \mathbb {R}^{P\cdot S}\). Thus, we create the descriptor \(\mathbf{x}_{i,j}\) corresponding to the image \(\mathbf{b}_{i,j}\) for i = 1,…,M, j = 1,…,K.

Normally, P⋅S is large, e.g., P = 128 and S ≥ 100, and we propose to reduce the dimensionality. That is, we need to find dictionaries such that every vector \(\mathbf{x}_{i,j}\) has a well-fitted sparse representation. Before finding such dictionaries, we note that for each i = 1,…,M, the feature vectors \(\{\mathbf {x}_{i,j}\}_{j = 1}^{K}\) come from images containing the same kind of objects, so they are likely to share similar structures.

Therefore, we apply the SSPCA proposed by Jenatton et al. [22] to find such dictionaries. Let X be the P⋅S × K⋅M matrix having the \(\mathbf{x}_{i,j}\) as its column vectors. To apply the SSPCA, we first divide the index set \(\mathbb {G}=\{1, 2, \ldots , P\cdot S\}\) into M subsets \(\mathbb {G}_{1}, \mathbb {G}_{2}, \ldots , \mathbb {G}_{M}\) of size \(t=\lfloor \frac {P\cdot S}{M}\rfloor \), where \(\mathbb {G}_{i}=\{(i-1)t + 1,(i-1)t + 2, \ldots , i \cdot t\}\) for i = 1,…,M − 1; \(\mathbb {G}_{M}=\{(M-1)t + 1,(M-1)t + 2, \ldots , P\cdot S\}\); and ⌊x⌋ denotes the largest integer smaller than or equal to x. Then we define the sparsity-inducing mixed ℓ1/ℓ2 norm Ω over a vector \(\mathbf { u}\in \mathbb {R}^{P\cdot S}\),

$$ {\Omega}(\mathbf{u})=\sum\limits_{j = 1}^{M}\left( \sum\limits_{k\in \mathbb{G}_{j}} {u_{k}^{2}}\right)^{\frac12} \quad \text{for}\, \mathbf{u}=(u_{1}, u_{2}, \ldots, u_{P\cdot S}). $$
(1)
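For concreteness, a direct numpy transcription of (1) might look as follows (our sketch; the group boundaries follow the definition above, with the last group absorbing the remainder when M does not divide P⋅S).

```python
import numpy as np

def omega(u, M):
    """Mixed l1/l2 norm of (1): sum of Euclidean norms over M groups."""
    t = len(u) // M                                 # t = floor(P*S / M)
    bounds = [i * t for i in range(M)] + [len(u)]   # last group takes the rest
    return sum(np.linalg.norm(u[bounds[i]:bounds[i + 1]]) for i in range(M))
```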

With this setting, we apply the SSPCA dictionary search algorithm to the descriptors \(\{\mathbf {x}_{i,j}\}_{i = 1, j = 1}^{M, K}\) to find dictionaries \(\{\mathbf {u}_{i}\}_{i = 1}^{r}\subset \mathbb {R}^{P\cdot S}\),

$$\begin{array}{@{}rcl@{}} (\mathbf{U}, \mathbf{C}) &=& \underset{{\begin{array}{l} \mathbf{U}\in \mathbb{R}^{P\cdot S\times r}\\ \mathbf{C}\in\mathbb{R}^{r\times K\cdot M} \end{array}}}{\arg\!\min} \frac{1}{2(P\cdot S)\cdot (K\cdot M)}\|\mathbf{X}-\mathbf{U}\mathbf{C}\|_{F}^{2}\\ &&+\lambda\sum\limits_{j = 1}^{r}{\Omega}(\mathbf{u}_{j})~~\text{subject to}~~\|\mathbf{c}_{j}\|_{2}\le 1, \quad j = 1,\ldots, K\cdot M \end{array} $$
(2)

where \(\mathbf {U}=(\mathbf {u}_{k})_{k = 1}^{r}\) is the P⋅S × r dictionary matrix; \(\mathbf {C}=(\mathbf {c}_{j})_{j = 1}^{K\cdot M}\in \mathbb {R}^{r\times K\cdot M}\) is the coefficient matrix; ∥⋅∥F is the Frobenius matrix norm, \(\|\mathbf {A}\|_{F}^{2}=trace(\mathbf {A}\mathbf {A}^{T})=\sum \limits _{i = 1}^{n}\sum \limits _{j = 1}^{m} a_{ij}^{2}\) for \(\mathbf {A}=(a_{ij})\in \mathbb {R}^{n\times m}\); ∥⋅∥2 is the Euclidean norm; and Ω is the mixed norm defined in (1), which controls the sparsity and the structure of the support of \(\mathbf{u}_{j}\) (for details on the norm, see [22] and references therein).
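Problem (2) is convex in U for fixed C and vice versa, and can be attacked by alternating proximal-gradient steps. The simplified numpy sketch below is our illustration, not the solver of [22]; the initialization, step sizes, and iteration count are arbitrary choices. The key ingredients are that the proximal operator of λΩ is group-wise soft-thresholding, and that the column constraint on C is a projection onto the unit ℓ2 ball.

```python
import numpy as np

def group_prox(U, M, tau):
    """Block soft-thresholding: prox of tau * Omega, applied to each column."""
    U = U.copy()
    t = U.shape[0] // M
    for i in range(M):
        lo, hi = i * t, (i + 1) * t if i < M - 1 else U.shape[0]
        blk = U[lo:hi]
        nrm = np.linalg.norm(blk, axis=0, keepdims=True)
        U[lo:hi] = blk * np.maximum(0.0, 1.0 - tau / np.maximum(nrm, 1e-12))
    return U

def sspca(X, M, r, lam=0.1, iters=200):
    d, n = X.shape                       # d = P*S, n = K*M
    scale = 1.0 / (d * n)                # data-fit scaling from (2)
    rng = np.random.default_rng(0)
    U = rng.standard_normal((d, r))
    C = rng.standard_normal((r, n))
    for _ in range(iters):
        # C-step: gradient descent, then project columns onto ||c_j||_2 <= 1.
        sC = 1.0 / (scale * np.linalg.norm(U, 2) ** 2 + 1e-12)
        C -= sC * scale * (U.T @ (U @ C - X))
        C /= np.maximum(np.linalg.norm(C, axis=0, keepdims=True), 1.0)
        # U-step: gradient descent, then prox of lam * Omega (group shrinkage).
        sU = 1.0 / (scale * np.linalg.norm(C, 2) ** 2 + 1e-12)
        U = group_prox(U - sU * scale * ((U @ C - X) @ C.T), M, sU * lam)
    return U, C

# Usage: U, C = sspca(X, M=101, r=1500)  # X is the (P*S) x (K*M) matrix
```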

Then, for \(\ell = (i-1)\cdot K + j\) (i = 1,…,M, j = 1,…,K), the ℓ-th column vector \(\mathbf{x}_{i,j}\) of X is (approximately or exactly) expressed by a linear combination of the r dictionary vectors \(\mathbf {u}_{k}\) with coefficients \((c_{k\ell })_{k = 1}^{r}\),

$$ \mathbf{x}_{i,j} \approx \sum\limits_{k = 1}^{r} c_{k\ell} \mathbf{u}_{k} = \mathbf{U} \mathbf{c_{\ell}}, \quad \mathbf{c}_{\ell}=(c_{1\ell}, \ldots, c_{r\ell})^{T}. $$
(3)

Thus, each descriptor vector \(\mathbf {x}_{i,j}\in \mathbb {R}^{P\cdot S}\) is represented (approximately) in terms of the r dictionary vectors, i.e., we reduce the dimensionality from P⋅S to r. In the following section, we train support vector machines using the dictionary vectors \(\mathbf{u}_{k}\).
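Given the learned U, reducing a new descriptor to r dimensions per (3) amounts to fitting the dictionary to the descriptor; a minimal sketch with toy sizes is shown below (our illustration; a plain least-squares solve stands in for a constrained or sparse solve).

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((1280, 150))       # learned dictionary (P*S x r), toy sizes
x = rng.standard_normal(1280)              # descriptor of a new image
c, *_ = np.linalg.lstsq(U, x, rcond=None)  # c in R^r is the reduced feature
```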

We end this section with the following observation. Theorem 1 shows that enlarging the dictionary always reduces the reconstruction error but makes the representations denser, so a larger dictionary set cannot be expected to deliver better recognition performance. Empirically, there exists an optimal choice of r that depends on the number of samples (K⋅M) and is shown to be \(\lfloor \frac {K\cdot M}{2}\rfloor \).

Theorem 1

For a vector z and a finite set {w1,w2,…,wm} in \(\mathbb {R}^{n}\), let \(\{\mathbf {y}_{k}\}_{k = 1}^{m}\) be the orthogonal projections in \(\mathbb {R}^{n}\) of z onto the vector spaces \(\mathbb {W}_{k}\) spanned by {w1,w2,…,wk}, i.e.,

$$ \mathbf{y}_{k} = \underset{\mathbf{y}\in \mathbb{W}_{k}}{\arg\!\min}\|\mathbf{z} -\mathbf{y}\|_{2}, \quad\text{for}\, k = 1, 2, \ldots, m. $$
(4)

Then,

$$ \|\mathbf{z} -\mathbf{y}_{1}\|_{2}\ge \|\mathbf{z} -\mathbf{y}_{2}\|_{2}\ge {\cdots} \ge \|\mathbf{z} -\mathbf{y}_{m}\|_{2} $$
(5)

and

$$ \|\mathbf{y}_{1}\|_{0}\le \|\mathbf{y}_{2}\|_{0}\le {\cdots} \le \|\mathbf{y}_{m}\|_{0}, $$
(6)

where \(\|\mathbf{z}\|_{0}\) is the ℓ0 pseudo-norm of a vector z, defined to be the number of non-zero entries in z.

Proof

For k = 1,…,m, let nk be the dimension of the vector space \(\mathbb {W}_{k}\) spanned by {w1,w2,…,wk}. Since the subspaces \(\mathbb {W}_{1}\subseteq \mathbb {W}_{2}\subseteq {\ldots } \subseteq \mathbb {W}_{m}\) are nested, we have n1 ≤ n2 ≤ ⋯ ≤ nm. Gram–Schmidt orthogonalization implies that there exists an orthonormal basis \(\{\mathbf {e}_{i}\}_{i = 1}^{n}\) for \(\mathbb {R}^{n}\) such that, for k = 1,…,m, \(\{\mathbf {e}_{i}\}_{i = 1}^{n_{k}}\) is a basis for \(\mathbb {W}_{k}\). Thus, every vector \(\mathbf {w}\in \mathbb {W}_{k}\) is uniquely represented as

$$ \mathbf{w} = \sum\limits_{i = 1}^{n_{k}} \langle \mathbf{w}, \mathbf{e}_{i}\rangle \mathbf{e}_{i} \quad \text{and} \quad\|\mathbf{w}\|_{2}^{2}=\sum\limits_{i = 1}^{n_{k}} \langle \mathbf{w}, \mathbf{e}_{i}\rangle^{2}, $$
(7)

where 〈⋅,⋅〉 denotes the Euclidean inner product in \(\mathbb {R}^{n}\).

Let \(\mathbf {z}\in \mathbb {R}^{n}\) be given arbitrarily. Then for k = 1,…,m, the vector yk in (4) is the orthogonal projection of z onto \(\mathbb {W}_{k}\), i.e.,

$$\mathbf{y}_{k} = \underset{\mathbf{y}\in \mathbb{W}_{k}}{\arg\!\min}\|\mathbf{z} -\mathbf{y}\|_{2}=\sum\limits_{i = 1}^{n_{k}} \langle \mathbf{z}, \mathbf{e}_{i}\rangle \mathbf{e}_{i} $$

and

$$\|\mathbf{z} -\mathbf{y}_{k}\|_{2}^{2}=\sum\limits_{i=n_{k}+ 1}^{n} |\langle \mathbf{z}, \mathbf{e}_{i}\rangle|^{2}. $$

Thus, the inequalities (5) and (6) hold, which completes the proof. □
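Theorem 1 is also easy to check numerically; the snippet below (our sketch, with arbitrary dimensions) verifies the monotone decrease of the residuals in (5) by projecting a random vector onto the nested spans.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 10
W = rng.standard_normal((n, m))            # columns w_1, ..., w_m
z = rng.standard_normal(n)

residuals = []
for k in range(1, m + 1):
    Q, _ = np.linalg.qr(W[:, :k])          # orthonormal basis of span{w_1..w_k}
    y_k = Q @ (Q.T @ z)                    # orthogonal projection of z
    residuals.append(np.linalg.norm(z - y_k))

# The residuals are non-increasing, as inequality (5) asserts.
assert all(a >= b - 1e-10 for a, b in zip(residuals, residuals[1:]))
```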

4 Experimental results

We demonstrate the existence of an optimal number of dictionary elements and evaluate the performance of our method. We describe the experimental setup and evaluation for SSPCA-based object recognition and report quantitative and qualitative results in two parts: Section 4.1 covers the experimental setup, and Section 4.2 presents object recognition results on a public dataset together with analyses of the proposed approach based on the experimental results and comparisons with previous techniques.

4.1 Experimental setup

The experiments were conducted on the Caltech 101 dataset, which consists of 101 object classes and one background class. Each class contains 31 to 800 images of approximately 300 × 200 pixels. The images in each class vary in color, pose, and illumination, making the classification task challenging. We use the 101 object classes, excluding the background class. For pre-processing, we extract features from each image using SIFT [32]. All experiments were run on a standard 4 GHz machine with 2 GB RAM.
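A hedged reconstruction of this pre-processing stage is sketched below (our assumptions, not the authors' code: a 10 × 10 grid, i.e., S = 100 cells with one 128-D SIFT descriptor each, computed with OpenCV at fixed keypoints; the grid and cell sizes are arbitrary).

```python
import cv2

def image_to_descriptor(path, grid=10, cell=30):
    """Resize an image, split it into grid x grid cells, and concatenate one
    SIFT descriptor per cell into a P*S-dimensional vector."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (grid * cell, grid * cell))
    sift = cv2.SIFT_create()
    # One keypoint at the center of each cell, sized to cover the cell.
    kps = [cv2.KeyPoint((i + 0.5) * cell, (j + 0.5) * cell, float(cell))
           for i in range(grid) for j in range(grid)]
    _, desc = sift.compute(img, kps)        # shape (S, 128)
    return desc.reshape(-1)                 # P*S = 12800-dimensional vector
```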

4.2 Quantitative and qualitative evaluation of visual object detection

For fair performance evaluation and comparison, we start by focusing on the effectiveness and efficiency of the proposed approach in reducing the dimensionality of visual object features and the number of training images.

Figure 2 shows the confusion matrix of the proposed algorithm learned with 30 training samples per class, where each row denotes the true class and each column denotes the assigned class. Among the most misclassified classes are the llama (29.2%) and water lily (25%) classes. Figure 3 displays some examples of misclassified results from these classes: images of the llama and water lily classes are assigned to the kangaroo and lotus classes, respectively. The image features of the target classes and the assigned classes are considerably similar, and this is the main reason for the low recognition rate of these two classes. The class with the lowest accuracy (16.7%) is the ant class, which has twelve test images. As shown in Fig. 4, only two test images are recognized as an ant; the ten other images are classified into ten different classes. This is because the features and structures of an ant itself are uncomplicated; i.e., an ant is a relatively simple object consisting of a few lines and circle-like shapes. As a result, factors like illumination, texture, pose, and background have a stronger influence on the classification than they do for objects composed of more complicated features and structures.

Fig. 2

Confusion matrix of our SSPCA-based visual object recognition using Caltech 101 dataset. The performance is evaluated using the classifiers learned with 30 training images per class. Each row indicates the true class and each column represents the assigned class

Fig. 3

Examples of misclassification. The testing images with the target class are on the left column; the assigned class of each testing image and the assigned class’s training images are on the right column

Fig. 4

The object recognition result of the ant class. Only two test images are recognized as an ant. Ten other images are classified into ten different classes

We then conducted experiments to evaluate the robustness of the proposed approach when reducing the number of training images, because traditional learning and recognition approaches depend heavily on the number of training samples; here we seek to validate Theorem 1 experimentally. The number of training images was varied: 15, 20, 25, and 30 images were randomly chosen from each class for training, and the remaining images were used for testing. The dictionary pertaining to the chosen training images was learned using SSPCA, which let us decide how many feature dimensions to use, i.e., the number of dictionary elements. Table 2 displays the results of the proposed method applied to the object recognition task on the Caltech-101 dataset. We trained the classification model with different numbers of samples and dictionary elements. Generally, increasing the number of dictionary elements or training samples increases performance. However, when using 25 or 30 training samples, the optimal number of dictionary elements was 1500, not 2000. When the classification model was trained with 30 samples per class and 1500 dictionary elements, the best accuracy of 83.92% was achieved. This implies that overfitting occurred when more than the optimal number of dictionary elements was used. On the other hand, the combined use of 15 or 20 training samples and 1500 dictionary elements did not produce the best results: given 15 or 20 training samples, 2000 dictionary elements performed better than 1500. This means that, as the number of training samples decreases, more dictionary elements are needed to build an accurate classification model. Thus, the numbers of dictionary elements and training samples jointly influence the optimal dimensionality. Table 2 shows that, when using 30 training samples, the optimal number of dictionary elements is 1500, i.e., the optimal dimensionality is 1500 rather than the full dimension of the given features. As the number of feature dimensions increases beyond this optimum, overfitting occurs; conversely, too few dimensions cause underfitting. Increasing the feature dimensions does not guarantee better performance; hence, the optimal number of feature dimensions should be investigated for best performance.

Table 2 Recognition rate of different dictionary element size (r) and training sample combination

Next, we investigated the influence of the classifier type. The learned dictionary was used to create a classification model utilizing either an SVM or the softmax function, and numerical experiments were conducted to identify the optimal number of dictionary elements for each classifier; comparisons with other significant previous approaches follow in Section 4.3. As shown in Table 3, the SVM showed better performance than softmax, but both classifiers improved with an increased number of training samples and achieved their best accuracy with 30 training samples. Thus, the classifier type does not affect the dimensional optimality.

Table 3 Recognition rate of different classifier and training sample combination
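A minimal sketch (our illustration, with random data standing in for the real coefficient vectors) of the two classifiers compared in Table 3, using scikit-learn: a linear SVM and softmax implemented as multinomial logistic regression.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
C_train = rng.standard_normal((101 * 30, 1500))   # 30 samples/class, r = 1500
y_train = np.repeat(np.arange(101), 30)           # class labels

svm = LinearSVC(C=1.0).fit(C_train, y_train)
softmax = LogisticRegression(max_iter=1000).fit(C_train, y_train)
```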

4.3 Comparison with previous approaches

Lastly, we compared our SSPCA-based object recognition results with notable previous approaches that use various learning techniques without exploiting pretrained models. Table 4 compares the recognition performance of our approach with previous approaches on the Caltech 101 dataset. When using fifteen training samples, Yang et al. [45] achieve recognition accuracy below 75%; in comparison, our approach shows a relatively good result with fifteen training samples. Our approach successfully removes useless information via dimensionality reduction while preserving the structures and features important for the task. Zeiler and Fergus [46], a deep learning method, reported recognition accuracies of 22.8% and 46.5% with 15 and 30 training images when not using an ImageNet-pretrained model.

Table 4 Comparison of SSPCA-object recognition approach to the previous approaches, which do not use pretrained models

Nowadays, approaches adopting deep learning methodologies have shown considerably good performance on the object recognition task. Zeiler and Fergus [46] reported recognition accuracies of 83.8% and 86.5% on the Caltech 101 classification task when using 15 and 30 training samples respectively with an ImageNet-pretrained model. He et al. [16], which uses a convolutional neural network (CNN), also reported a recognition accuracy of 91.9% when training with 30 samples. However, both methods pre-trained the classification model on the ImageNet 2012 dataset, which contains about 1.2 million training images, and they benefit significantly from their pre-trained models.

5 Conclusion

We proposed an object recognition algorithm using optimal dimensionality reduction and verified the effectiveness of the proposed technique experimentally. The proposed object recognition method employs SSPCA and is effective at preserving crucial features while removing noisy and unimportant ones, by leveraging the strengths of both sparse representation and PCA.

We also showed that an optimal number of feature dimensions exists, which simplifies feature representation and improves performance. The core concept of finding the optimal dimensionality reduction for high dimensional features can be applied to various applications, such as image abstraction, image manipulation, and image composition.

Future work includes the following issues: improving the classification model to recognize objects with simple features and structures, such as ants, and adaptively discovering the optimal number of dimensions given the number of training samples. We will also work on SSPCA-based deep learning by combining SSPCA, which accounts for structure and sparsity, with various deep learning algorithms, which learn features effectively, in order to handle machine learning tasks including object detection, image annotation, and scene parsing on large-scale datasets such as MS COCO and ImageNet.