1 INTRODUCTION

Images are very high-dimensional data and therefore pose many challenges in tasks such as machine learning and computer vision. Although an image appears to be high-dimensional, it has been shown that it lies on a comparatively low-dimensional linear or nonlinear manifold [7, 10]. Thus, dimensionality reduction techniques are widely applicable in these fields and remain an active area of research. The fundamental idea is to seek a linear or nonlinear transformation that maps the data from the high-dimensional data space to a lower-dimensional subspace, which makes data of the same class more compact for applications such as recognition and classification and, in addition, reduces the computational burden. Such manifold learning-based dimensionality reduction techniques have drawn considerable interest in recent years. Examples include principal component analysis (PCA) [21], linear discriminant analysis (LDA) [17], locality-preserving projection (LPP) [6, 20] and neighbourhood preserving embedding (NPE) [7, 11]; some of their 2D variants are discussed in [14, 23, 24]. Techniques such as PCA and LDA preserve the global geometry of the data in the lower-dimensional space, whereas techniques such as LPP and NPE aim to preserve the global geometry as well as the local geometry through a graph structure built from neighbourhood information.

Orthogonal neighbourhood preserving projection (ONPP), a linear dimensionality reduction technique proposed in [10], preserves the global geometry of the data and also captures the local relationships within neighbourhoods. A modified version that deals with the nonlinearity present in the local neighbourhood is given in [11]. ONPP is a linear extension of locally linear embedding (LLE) proposed in [19]. LLE assumes that the data lie on or near a low-dimensional manifold and that each data point can be approximated as a linear combination of its neighbours. LLE represents this relation using a weighted neighbourhood graph and finds embeddings that preserve this linear relationship in the lower-dimensional space. Because of its nonlinear nature, LLE cannot be used as a tool for finding embeddings of out-of-sample data. ONPP projects the sample data onto a linear subspace using the same philosophy of locally linear patches and also accommodates out-of-sample data in the lower-dimensional space.

All these dimensionality reduction techniques optimize error functions based on criteria imposed either on the original data points in the higher-dimensional space and/or on their embeddings in the lower-dimensional space. Most of these error functions are formulated using the L2-norm, which is not robust to outliers [16]. The L1-norm, on the other hand, is known for its robustness to outliers [8]. In recent times, many dimensionality reduction techniques have involved L1-norm optimization [12, 22, 25]. This article adopts the algorithm used in PCA-L1 [12] to achieve L1-norm-based ONPP (henceforth denoted L1-ONPP). This work first documents experiments performed on synthetic data using L2-ONPP, showing its susceptibility to outliers. Second, the relationship between ONPP and PCA is established theoretically, and experiments performed on synthetic as well as real data support the claim that the ONPP basis can be obtained using PCA. PCA-L1 is then used to obtain L1-ONPP, and the performance of L2-ONPP and L1-ONPP in the presence of outliers is compared. The experimental results show that L1-ONPP outperforms L2-ONPP on data containing outliers.

In Sect. 2, L1-norm-based PCA is explained in detail. In Sect. 3, a relation between ONPP and PCA is established theoretically. Supporting experiments and results are documented in Sect. 4, followed by the conclusion in Sect. 5.

2 L1-NORM FOR DIMENSIONALITY REDUCTION

As discussed in Sect. 1, all conventional dimensionality reduction techniques employ optimization of a cost function expressed using the L2-norm. Conventional ONPP proposed in [10] is also based on L2-norm optimization. Although it has been employed successfully in many problems such as face recognition, it is sensitive to the presence of outliers, because the effect of outliers with a large norm is magnified by the use of the L2-norm. To mitigate this problem and achieve robustness against outliers, research has been carried out on dimensionality reduction techniques based on the L1-norm. Considerable work has been done on L1-norm-based PCA and LDA [1, 3, 9, 12, 22, 25], but little effort has gone into L1-norm-based versions of more recently proposed dimensionality reduction techniques such as LPP and ONPP.

In [1, 9], instead of assuming that each component of the error between an original data point and its projection follows a Gaussian distribution, the error is assumed to follow a Laplacian distribution, and maximum likelihood estimation is used to formulate the L1-norm PCA (L1-PCA) basis for the given data. In [1], a heuristic estimation approach for the general L1-norm problem was applied to solve the L1-PCA optimization, whereas in [9], convex programming methods and the weighted median method were proposed for L1-norm PCA. Despite being robust, L1-PCA has several disadvantages: it is computationally expensive because it is based on linear or quadratic programming. 2D variants of L1-norm PCA are discussed in [15]. R1-PCA, proposed in [3], combines the merits of L2-PCA and L1-PCA: it is rotationally invariant like L2-PCA, and it also suppresses the effect of outliers as L1-PCA does. However, these methods are highly dependent on the dimension d of the subspace to be found; for example, the projection vector obtained when \(d = 1\) may not lie in the subspace obtained when \(d = 2\). Moreover, as it is an iterative algorithm, it takes a long time to converge for a high-dimensional input space. Let us now discuss the work done on L1-norm-based PCA.

2.1 L2-PCA and L1-PCA

Let each data point \(\mathbf {x_i}\) be a column of the given data matrix \(\mathbf {X=[x_{1},x_{2},\ldots ,x_{N}]}\in {\mathbf {\mathcal {R}}^{m\times N}}\), where m denotes the dimension of the original input space and N the number of data samples. Without loss of generality, the data are assumed to be centred at the origin, i.e. \(\mathbf {\bar{x}=0}\). L2-PCA seeks a \(d\,(< m)\)-dimensional linear subspace whose basis vectors capture the directions of maximum variance, by maximizing the dispersion expressed in terms of the L2-norm:

$$\begin{aligned} \arg \max \mathcal {E}(\mathbf {y})&= {} \underset{\mathbf {y}}{\arg \max }{\sum _{i=1}^{N}}\parallel \mathbf {y_{i}-\bar{y}}\parallel _{2}\nonumber \\&\quad{\hbox {where}},\ \mathbf {y_{i}=V^\mathrm{{T}}x_{i}} \end{aligned}$$
(1)
$$\begin{aligned} \arg \max \mathcal {E}(\mathbf {V})&= {} \underset{\mathbf {V}}{\arg \max }{\sum _{i=1}^{N}}\parallel \mathbf {V^\mathrm{{T}}x_{i}-V^\mathrm{{T}}\bar{x}}\parallel _{2}\nonumber \\&= {} \underset{\mathbf {V}}{\arg \max }{\sum _{i=1}^{N}}\parallel \mathbf {V^\mathrm{{T}}x_{i}}\parallel _{2}\nonumber \\ \arg \max \mathcal {E}(\mathbf {V})&= {} \underset{\mathbf {V}}{\arg \max }\parallel \mathbf {V^TX}\parallel _{2}\nonumber \\&s.\ t.\ \mathbf {V^\mathrm{{T}}V=I_{d}} \end{aligned}$$
(2)

where \(\mathbf {V\in \mathcal {R}^{m\times d}}\) is the projection matrix and its \(\textit{d}\) columns are the bases of the d-dimensional linear subspace.
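For reference, the following is a minimal numpy sketch of the L2-PCA solution of Eq. (2), obtained from the eigendecomposition of the scatter matrix; the function and variable names are ours, not taken from the original work.

```python
import numpy as np

def pca_l2(X, d):
    """L2-PCA: the d strongest eigenvectors of the scatter matrix X X^T.

    X : (m, N) data matrix whose columns are mean-centred data points
    Returns V of shape (m, d); the embeddings are Y = V.T @ X.
    """
    S = X @ X.T                              # scatter matrix (unnormalized covariance)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :d]           # d directions of maximum variance
```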

In PCA-L1, proposed in [12], instead of finding bases in the original data space that capture the directions of maximum variance in the L2-norm sense, a method that maximizes the dispersion in terms of the L1-norm in the feature space is presented to achieve a robust and rotation-invariant PCA. The approach presented in [12] for the L1-norm optimization is iterative and is proven to find a locally maximal solution.

Maximizing dispersion in the feature space using L1-norm can be formulated as

$$\begin{aligned} \arg \max \mathcal {E}(\mathbf {V})=\underset{\mathbf {V}}{\arg \max }\parallel \mathbf {V^\mathrm{{T}}X}\parallel _{1} \end{aligned}$$
(3)

Since a closed-form solution of problems involving the L1-norm is not available, the bases are sought iteratively as follows:

For \(d=1\)

$$\begin{aligned} v_{1}&= {} \underset{\mathbf {v}}{\arg \max }\parallel \mathbf {v^\mathrm{{T}}X}\parallel _{1} = \underset{\mathbf {v}}{\arg \max }{\sum _{i=1}^{N}}\vert \mathbf {v^\mathrm{{T}}x_{i}} \vert \nonumber \\&\qquad s.\ t.\ \mathbf {\parallel {v}\parallel _{2}}=1 \end{aligned}$$
(4)

For \(d > 1\):

Once the first basis vector \(\mathbf {v_1}\) is obtained by solving Eq. (4), the contribution of the data along this newly found basis vector is removed. For each remaining basis vector \(\mathbf {v_j}\ (2\le j\le d)\), the same maximization problem given in Eq. (4) is solved on the deflated data \(\mathbf {X_{j}=X_{j-1}-v_{j-1}(v_{j-1}^\mathrm{{T}}X_{j-1})}\). In every iteration, the direction of maximum L1 dispersion of the remaining data is sought, until the desired \(d\,(d<m)\)-dimensional space is obtained, as sketched below.

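The following is a minimal numpy sketch of the greedy PCA-L1 procedure described above (Eq. (4) together with the deflation step). The function name, the random initialization and the iteration cap are implementation assumptions made here; the algorithm itself follows [12].

```python
import numpy as np

def pca_l1(X, d, n_iter=100, rng=None):
    """Greedy PCA-L1: successively maximize sum_i |v^T x_i| subject to ||v||_2 = 1.

    X : (m, N) mean-centred data matrix, columns are data points
    d : number of basis vectors to extract
    Returns V of shape (m, d), one basis vector per column, strongest first.
    """
    rng = np.random.default_rng(rng)
    m = X.shape[0]
    Xj = X.copy()
    V = np.zeros((m, d))
    for j in range(d):
        v = rng.standard_normal(m)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            # polarity check: sign of the projection of every point onto v
            s = np.sign(Xj.T @ v)
            s[s == 0] = 1.0                    # simple tie-breaking assumption
            # flipping and maximization: v becomes the normalized sum of re-signed points
            v_new = Xj @ s
            norm = np.linalg.norm(v_new)
            if norm < 1e-12:                   # nothing left along the remaining directions
                break
            v_new /= norm
            if np.allclose(v_new, v):          # converged to a locally maximal solution
                break
            v = v_new
        V[:, j] = v
        Xj = Xj - np.outer(v, v @ Xj)          # deflation: X_j = X_{j-1} - v v^T X_{j-1}
    return V
```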

3 L1-ONPP USING PCA-L1

As stated in Sect. 1, ONPP [10] is a linear extension of LLE and thus inherits the sensitivity of LLE towards outliers. The degradation of manifold learning when the data contain outliers motivated the use of L1-norm minimization in ONPP to tackle outliers. In order to use the PCA-L1 method explained in Sect. 2 to achieve L1-ONPP, a relationship between PCA and ONPP is established in this section. First, ONPP and its modified variant MONPP are explained in detail in Sect. 3.1, followed by a theoretical explanation of the relation between PCA and ONPP in Sect. 3.2. Section 3.3 explains how PCA-L1 can be used to compute the L1-ONPP bases.

3.1 L2-ONPP and L2-MONPP

ONPP is a linear extension of LLE, a nonlinear dimensionality reduction technique that finds lower-dimensional embeddings of high-dimensional data samples. The disadvantage of LLE is that its mapping is not explicit: the embedding is data dependent, and the inclusion or exclusion of any data point results in an entirely different learned manifold. Hence, LLE fails in tasks such as recognition or classification of out-of-sample data points. ONPP solves the out-of-sample problem and provides an explicit mapping of the data points to the lower-dimensional subspace through a linear orthogonal projection matrix, which can embed a new test data point into the lower-dimensional subspace, making recognition or classification of out-of-sample data possible. Another variant of ONPP, modified ONPP (MONPP) [11], suggests the use of piecewise nonlinear weights to reconstruct the data and thereby represent the local nonlinearity present in a neighbourhood patch more effectively. ONPP and MONPP both use L2-norm optimization to find the projection matrix, but differ in the mechanism used to assign weights to the neighbours of a data point, as explained below. Consequently, both inherit the susceptibility to the presence of outliers.

Let \(\mathbf {x_{1},x_{2},\ldots ,x_{N}}\) be N data points from an m-dimensional space and \(\mathbf {X=[x_{1},x_{2},\ldots ,x_{N}]}\in {\mathbf {\mathcal {R}}^{m\times N}}\) be the data matrix. The basic task of dimensionality reduction techniques is to find an orthogonal or non-orthogonal projection matrix \(\mathbf {V} \in {\mathbf {\mathcal {R}}^{m\times d}}\) that projects a data point \(\mathbf {x_i}\in {\mathbf {\mathcal {R}}^{m}}\) in the higher-dimensional space to its embedding \(\mathbf {y_i}\in {\mathbf {\mathcal {R}}^{d}}\) in the lower-dimensional space (with d assumed to be less than m) such that \(\mathbf {y_i}=\mathbf {V}^\mathrm{{T}}\mathbf {x_i}\).

The ONPP algorithm obtains the projection matrix in three basic steps. The first step finds the neighbours of each data point \(\mathbf {x_i}\): in the unsupervised mode, neighbours are decided either by the k-NN method or by the \(\varepsilon \)-NN method, whereas in the supervised mode, neighbours are decided based on knowledge of the class labels. The second step considers the local patch of a data point, where a linear relationship between the data point and its neighbours is expressed using reconstruction weights. In the third step, ONPP achieves compactness in the lower-dimensional space through a minimization problem such that this linear relationship in the high-dimensional neighbourhood is preserved.

Let \(\mathcal {N}_{x_{i}}\) denote the set of \(k\) neighbours of data point \(\mathbf {x_i}\). First, the data point \(\mathbf {x_{i}}\) is approximated as a linear combination of its neighbours, \(\sum _{j=1}^{k} w_{ij}\mathbf {x_{j}}\) with \(\mathbf {x_{j}}\in \mathcal {N}_{x_{i}}\), where the weight \(w_{ij}\) indicates the contribution of \(\mathbf {x_j}\) in reconstructing \(\mathbf {x_i}\). The optimum weights \(w_{ij}\) are computed by minimizing the sum of the reconstruction errors, i.e. the sum over all \(\mathbf {x_{i}}\) of the error between \(\mathbf {x_{i}}\) and the linear combination of its neighbours \(\mathbf {x_{j}}\in \mathcal {N}_{x_{i}}\). The minimization problem can be posed as:

$$\begin{aligned} \arg \min \mathcal {E}(\mathbf {W})&= {} \underset{W}{\arg \min }\sum _{i=1}^{N}\parallel \mathbf {x_{i}}-\sum _{j=1}^{k}w_{ij}\mathbf {x_{j}}\parallel _{2} \nonumber \\&\qquad s.\ t.\ \sum _{j=1}^{k} w_{ij}=1 \end{aligned}$$
(5)

The problem corresponding to each data point \(\mathbf {x_{i}}\) can be solved individually as a least squares problem. Let \(\mathbf {X_{N_{i}}}\) be the neighbourhood matrix whose columns are the neighbours \(\mathbf {x_{j}}\in {\mathcal {N}_{x_{i}}}\). Note that \(\mathbf {x_{i}}\) is also included in \(\mathbf {X_{N_{i}}}\) as one of its own neighbours, making the dimension of \(\mathbf {X_{N_{i}}}\) equal to \(m\times (k+1)\). Now, for each \(\mathbf {x_i}\in \mathbf {X}\), Eq. (5) can be written as an individual least squares problem \(\mathbf {(X_{N_{i}}-x_{i}e^\mathrm{{T}})w_{i}}=\mathbf {0}\) with the constraint \(\mathbf {e^\mathrm{{T}}w_{i}}=\mathbf {1}\), which yields a closed-form solution for \(\mathbf {w_i}\) as shown in Eq. (6). Here, \(\mathbf {w_i}\) is the reconstruction weight vector of dimension \(k\times 1\) and \(\mathbf {e}\) is a vector of ones of dimension \(k\times 1\).

$$\begin{aligned} \mathbf {w_{i}}=\frac{\mathbf {G_i}^{-1}\mathbf {e}}{{\mathbf {e}^\mathrm{{T}}\mathbf {G_i}^{-1}\mathbf {e}}} \end{aligned}$$
(6)

where \(\mathbf {G_i}\) is the Gram matrix of dimension \(k\times k\). Each element of \(\mathbf {G_i}\) is calculated as \(\mathbf {g_{pl}}=(\mathbf {x_{i}}-\mathbf {x_{p}})^\mathrm{{T}}(\mathbf {x_{i}}-\mathbf {x_{l}})\) for all \(\mathbf {x_{p}},\mathbf {x_{l}}\in \mathcal {N}_{x_{i}}\). A detailed discussion of the reconstruction weights can be found in [10, 19].
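A minimal numpy sketch of the closed-form weight computation of Eq. (6) is given below; the small regularization added to the Gram matrix is an assumption made here for numerical stability (e.g. when \(k > m\)) and is not part of the formulation above.

```python
import numpy as np

def reconstruction_weights(x_i, neighbours, reg=1e-3):
    """Reconstruction weights w_i of Eq. (6) for one data point.

    x_i        : (m,) data point
    neighbours : (m, k) matrix whose columns are the k neighbours of x_i
    """
    m, k = neighbours.shape
    D = x_i[:, None] - neighbours            # columns are the differences x_i - x_j
    G = D.T @ D                              # Gram matrix, g_pl = (x_i - x_p)^T (x_i - x_l)
    G = G + reg * np.trace(G) * np.eye(k)    # regularization (our assumption)
    e = np.ones(k)
    w = np.linalg.solve(G, e)                # G^{-1} e
    return w / (e @ w)                       # normalize so that the weights sum to 1
```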

On the other hand, a variant of ONPP, modified orthogonal neighbourhood preserving projections (MONPP), stresses the fact that the local neighbourhood patch, assumed to lie on or near a linear manifold, may have some inherent nonlinearity. To take this nonlinearity into account while approximating a data point from its neighbours, MONPP uses nonlinear weights based on a Z-shaped function [11] in place of the linear weights obtained from the least squares solution. Equation (7) assigns a weight to each neighbour \(\mathbf {x_{j}}\in \mathcal {N}_\mathbf {x_i}\) using the Z-shaped function of the distance \(d_{ij}\) between the data points \(\mathbf {x_i}\) and \(\mathbf {x_j}\). Note that this equation is the same as Eq. (6) with \(\mathbf {G_i}^{-1}\) replaced by \(\mathbf {Z_i}\). The new weights are

$$\begin{aligned} \mathbf {w_{i}}=\frac{\mathbf {Z_ie}}{\mathbf {e}^\mathrm{{T}}\mathbf {Z_ie}} \end{aligned}$$
(7)
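For illustration only, the sketch below shows one possible realization of the nonlinear weight assignment of Eq. (7). The exact construction of \(\mathbf{Z_i}\) follows [11] and is not reproduced here; as a simplification, \(\mathbf{Z_i}\) is assumed to be diagonal with entries given by the standard Z-shaped membership function of the distances \(d_{ij}\), and the shape parameters a and b are hypothetical.

```python
import numpy as np

def zmf(d, a, b):
    """Standard Z-shaped membership function: 1 for d <= a, 0 for d >= b."""
    d = np.asarray(d, dtype=float)
    z = np.zeros_like(d)
    z[d <= a] = 1.0
    mid = (d > a) & (d <= (a + b) / 2)
    z[mid] = 1.0 - 2.0 * ((d[mid] - a) / (b - a)) ** 2
    upper = (d > (a + b) / 2) & (d < b)
    z[upper] = 2.0 * ((d[upper] - b) / (b - a)) ** 2
    return z

def monpp_weights(x_i, neighbours, a=0.0, b=None):
    """Nonlinear weights of Eq. (7), assuming a diagonal Z_i built from zmf."""
    d = np.linalg.norm(neighbours - x_i[:, None], axis=0)   # distances d_ij
    if b is None:
        b = d.max() + 1e-12                                  # hypothetical parameter choice
    z = zmf(d, a, b)                                         # assumed diagonal of Z_i
    return z / z.sum()                                       # w_i = Z_i e / (e^T Z_i e)
```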

The third and last step finds the projection matrix \(\mathbf {V}\in \mathbf {\mathcal {R}^{m\times d}}\) used to reduce the dimensionality: using \(\mathbf {V}\), the data point \(\mathbf {x_{i}} \in \mathbf {\mathcal {R}}^{m}\) is projected onto the lower-dimensional space as \(\mathbf {y_{i}}\in \mathbf {\mathcal {R}}^{d}\) (\(d\ll m\)). The assumption is that the neighbours \(\mathbf {x_j}\) and reconstruction weights \(w_{ij}\) used to approximate the data point \(\mathbf {x_i}\) can also be used to reconstruct its embedding \(\mathbf {y_i}\) in the lower-dimensional space from the embeddings \(\mathbf {y_j}\) of the neighbours. The problem of finding such embeddings can be posed as a minimization problem with cost function

$$\begin{aligned} \arg \min \mathcal {E}(\mathbf {Y})&= {} \underset{Y}{\arg \min }\sum _{i=1}^{N}\parallel \mathbf {y_{i}}-\sum _{j=1}^{N}w_{ij}\mathbf {y_{j}}\parallel _{2}\nonumber \\ \arg \min \mathcal {E}(\mathbf {V})&= {} \underset{V}{\arg \min }\sum _{i=1}^{N}\parallel \mathbf {V^\mathrm{{T}}x_{i}}-\sum _{j=1}^{N}w_{ij}\mathbf {V^\mathrm{{T}}x_{j}}\parallel _{2}\nonumber \\&\qquad s.\ t.\ \mathbf {V}^\mathrm{{T}}\mathbf {V}=\mathbf {I_{d}} \end{aligned}$$
(8)

This optimization problem results in an eigenvalue problem with a closed-form solution. The eigenvectors corresponding to the d smallest eigenvalues of the matrix \( \mathbf {M}=\mathbf {X}(\mathbf {I}-\mathbf {W})(\mathbf {I}-\mathbf {W}^\mathrm{{T}})\mathbf {X}^\mathrm{{T}}\) constitute the basis of the low-dimensional ONPP space. ONPP explicitly maps \(\mathbf {X}\) to \(\mathbf {Y}\) in the form \(\mathbf {Y}=\mathbf {V}^\mathrm{{T}}\mathbf {X}\), where each column of \(\mathbf {V}\) is an eigenvector of \(\mathbf {M}\). Once the ONPP projection space is obtained, any test data point \(\mathbf {x_l}\) can be embedded into the space by a simple matrix-vector product.
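A minimal sketch of this step is given below, assuming the weight matrix \(\mathbf{W}\) (with entry \(W_{ji}=w_{ij}\) when \(\mathbf{x_j}\in \mathcal{N}_{x_i}\) and zero otherwise) has already been assembled from the weights computed above; the function name is ours.

```python
import numpy as np

def onpp_bases(X, W, d):
    """L2-ONPP projection matrix: eigenvectors of M = X(I - W)(I - W)^T X^T
    corresponding to the d smallest eigenvalues.

    X : (m, N) mean-centred data matrix, columns are data points
    W : (N, N) weight matrix, W[j, i] = w_ij for x_j in the neighbourhood of x_i
    """
    N = X.shape[1]
    I = np.eye(N)
    M = X @ (I - W) @ (I - W).T @ X.T        # m x m matrix
    eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return eigvecs[:, :d]                    # d weakest directions; embeddings Y = V.T @ X
```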

3.2 ONPP as a PCA on reconstruction error

To establish the relationship between ONPP and PCA, Eq. (8) is rewritten in matrix form:

$$\begin{aligned} \arg \min \mathcal {E}(\mathbf {Y})&= {} \underset{\mathbf {Y}}{\arg \min }\parallel \mathbf {Y}-\mathbf {YW}\parallel _{2}\nonumber \\ \arg \min \mathcal {E}(\mathbf {V})&= {} \underset{\mathbf {V}}{\arg \min }\parallel \mathbf {V^\mathrm{{T}}X-V^{T}XW}\parallel _{2}\nonumber \\&= {} \underset{\mathbf {V}}{\arg \min }\parallel \mathbf {V^\mathrm{{T}}(X-XW)}\parallel _{2}\nonumber \\ \arg \min \mathcal {E}(\mathbf {V})&= {} \underset{\mathbf {V}}{\arg \min }\parallel \mathbf {V^\mathrm{{T}}Er}\parallel _{2}\nonumber \\&\qquad s.\ t.\ \mathbf {V}^\mathrm{{T}}\mathbf {V}=\mathbf {I} \end{aligned}$$
(9)
Fig. 1 Illustration of data point \(\mathbf {x_{i}}\) (blue square), its reconstruction \(\mathbf {x_{i}^{'}}\) using the neighbours \(\mathcal {N}_{x_{i}}\) (red circle) and the \(i\)th reconstruction error vector \(\mathbf {er_{i}}\) (green diamond)

Comparing the optimization problem of PCA (Eq. 2) with that of ONPP (Eq. 9), both result in eigenvalue problems and have closed-form solutions in terms of eigenvectors. Equation (2) is a maximization problem, so the basis vectors of PCA are the eigenvectors corresponding to the d largest eigenvalues, whereas Eq. (9) is a minimization problem, so the desired ONPP bases are the eigenvectors corresponding to the d smallest eigenvalues.

In other words, ONPP can be stated as a PCA of the reconstruction errors: the conventional ONPP algorithm essentially finds basis vectors \(\mathbf {V}\) that capture the directions of minimum variance of the reconstruction error. Thus, finding the strongest ONPP basis is the same as finding the weakest basis of PCA performed on the reconstruction errors \(\mathbf {Er}\). The \(i\)th column of \(\mathbf {Er}\) is the reconstruction error of the \(i\)th data point, computed from its neighbours and the corresponding weights found using Eq. (6) or (7) as

$$\begin{aligned} \mathbf {er_i} = \mathbf {x_i} - \sum _{j=1}^{k}w_{ij}\mathbf {x_j} \end{aligned}$$
(10)

This relationship between the PCA and ONPP bases is verified in experiments performed on synthetic data, where it is observed that the ONPP bases obtained using the conventional L2-ONPP algorithm and those obtained using L2-PCA on the reconstruction errors are the same. Details of this experiment are documented in Sect. 4.
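The equivalence can be checked with a short sketch that reuses the hypothetical onpp_bases helper above: the same bases (up to sign) are obtained as the weakest principal directions of the reconstruction errors.

```python
import numpy as np

def onpp_bases_via_pca(X, W, d):
    """ONPP bases as the weakest PCA directions of the reconstruction errors (Eq. 10)."""
    Er = X - X @ W                           # columns are er_i = x_i - sum_j w_ij x_j
    C = Er @ Er.T                            # scatter of the reconstruction errors
    eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    return eigvecs[:, :d]                    # directions of minimum error variance

# Since Er = X(I - W), we have Er Er^T = X(I - W)(I - W)^T X^T = M, so the columns
# of onpp_bases(X, W, d) and onpp_bases_via_pca(X, W, d) span the same subspace
# (individual vectors may differ in sign).
```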

3.3 L1-ONPP using PCA on reconstruction error

Once the relationship between the L2-PCA bases and the L2-ONPP bases is in place, it is evident that a PCA algorithm can also be used to find the ONPP bases. This allows existing L1-norm-based PCA algorithms to be used to solve the L1-ONPP optimization problem. Rewriting the L2-ONPP optimization problem of Eq. (8) as an L1-norm minimization, we have

$$\begin{aligned} \arg \min \mathcal {F}(\mathbf {Y})&= {} \underset{\mathbf {Y}}{\arg \min }\sum _{i=1}^{N}\parallel \mathbf {y_{i}}-\sum _{j=1}^{N}w_{ij}\mathbf {y_{j}}\parallel _{1} \nonumber \\&\qquad s.\ t.,\ \mathbf {V}^\mathrm{{T}}\mathbf {V}={I} \end{aligned}$$
(11)

In matrix form,

$$\begin{aligned} \arg \min \mathcal {F}(\mathbf {Y})&= {} \underset{\mathbf {Y}}{\arg \min }\parallel \mathbf {Y}-\mathbf {YW}\parallel _{1}\nonumber \\ \arg \min \mathcal {F}(\mathbf {V})&= {} \underset{\mathbf {V}}{\arg \min }\parallel \mathbf {V^\mathrm{{T}}X-V^{T}XW}\parallel _{1}\nonumber \\&= {} \underset{\mathbf {V}}{\arg \min }\parallel \mathbf {V^\mathrm{{T}}(X-XW)}\parallel _{1}\nonumber \\&= {} \underset{\mathbf {V}}{\arg \min }\parallel \mathbf {V^\mathrm{{T}}Er}\parallel _{1} \end{aligned}$$
(12)

The problem stated in Eq. (12) is similar to the problem stated in PCA-L1 (Eq. 4), and thus the solution of Eq. (4) can be used to solve Eq. (12): L1-ONPP bases can be found by applying any L1-norm-based PCA algorithm to the reconstruction error matrix \(\mathbf {Er}\). As discussed in Sect. 1, many L1-norm-based PCA methods find the basis vectors through linear or quadratic programming and are computationally expensive. The PCA-L1 algorithm [12] used here is a robust as well as fast L1-norm-based method: it converts the L1-norm dispersion into a direct sum of signed training points in the projection space, and the basis vectors are updated by the sum of the re-signed training points. As a result, the convergence procedure is fast. Refer to [12] for the proof.

L2-ONPP has a closed-form solution that involves an eigenvalue problem of an \(m\times m\) matrix. L1-ONPP is computationally more costly because it is an iterative procedure: each basis vector \(v_k\) starts from a random m-dimensional vector, and polarity checking, flipping and maximization are performed iteratively until \(v_k\) converges.

Comparing Eq. (12) of L1-ONPP with Eq. (4) of L1-PCA, we can intuitively state that the component in the direction of minimum dispersion gives the strongest ONPP basis. Considering the reconstruction error between a data point \(\mathbf {x_i}\) and its approximation \(\mathbf {x_{i}^{'}}\) as a vector \(\mathbf {er_{i}}\), which is also a point in m-dimensional space (as shown in Fig. 1), L1-PCA can be performed on the reconstruction errors to search for the d-dimensional space whose basis vectors lie in the directions of minimum variance of these reconstruction errors. Such bases can be computed using the PCA-L1 algorithm; the detailed algorithm is given in Table 1, and a code sketch follows it.

Table 1 L1-ONPP Algorithm
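Since Table 1 itself is not reproduced here, the following end-to-end sketch reflects only our hedged reading of the procedure: it computes the reconstruction errors, extracts all m PCA-L1 directions with the pca_l1 sketch of Sect. 2, and keeps the d weakest of them as the L1-ONPP bases.

```python
import numpy as np

def l1_onpp(X, W, d, n_iter=100):
    """Sketch of L1-ONPP: PCA-L1 on the reconstruction errors, keeping the d
    weakest directions (our assumed realization of the minimization in Eq. 12)."""
    Er = X - X @ W                           # reconstruction errors of Eq. (10)
    m = Er.shape[0]
    V = pca_l1(Er, d=m, n_iter=n_iter)       # all m L1 bases, strongest first
    return V[:, -d:][:, ::-1]                # d weakest L1 directions = strongest ONPP bases
```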

4 EXPERIMENTS

To validate the theoretical conclusion on the relationship between ONPP and PCA, experiments were performed on synthetic as well as real data, as documented in this section.

4.1 A small problem with Swiss-roll data

Fig. 2 L2-ONPP performed on Swiss-roll data. a Continuous manifold (left), sampled 3D data (middle) and its 2D representation using the 2 strongest ONPP bases (right). b Continuous manifold (left), sampled 3D data corrupted with additional outliers from a uniform distribution (middle) and its 2D representation using the 2 strongest ONPP bases (right)

In the literature, an outlier is defined as a data point that appears to be drawn from an entirely different distribution. To observe the effect of outliers on the L2-ONPP algorithm, an experiment was performed on Swiss-roll data. Over 2000 three-dimensional data points were randomly sampled from a continuous Swiss-roll manifold. Two-dimensional embeddings of the clean data were found using L2-ONPP (Fig. 2a, right). Next, 50 data points (nearly 2.5% of the clean data) drawn from a normal distribution were added to these 2000 clean data points as outliers, and two-dimensional embeddings of this corrupted data were again found using L2-ONPP (Fig. 2b, right). Comparing the embeddings of the clean data (Fig. 2a) with those of the data containing outliers (Fig. 2b), it can be observed that the global structure as well as the local geometry is well preserved for the clean data, whereas for the noisy data (Fig. 2b) the two-dimensional representation is distorted. The reason is that, in the presence of outliers, the neighbours of a clean data point may not all lie on a locally linear patch of the manifold, which leads to biased reconstruction. On the other hand, the neighbourhood patch of an outlier is comparatively very large and thus does not capture the local geometry well, as the effect of large distances is exaggerated by the use of the L2-norm. It is well known that L2-norm-based techniques are not robust, in the sense that the presence of outliers can arbitrarily skew the solution away from the desired one.

Fig. 3 Manifold learning on S-curve data. a Continuous manifold. b 1000 sampled 3D clean data points. c–g 2D representations using the 2 strongest L2-ONPP bases for clean data and for data with 100, 200, 300 and 400 outliers, respectively. h–l 2D representations using the 2 strongest L1-ONPP bases for clean data and for data with 100, 200, 300 and 400 outliers, respectively

Another experiment was performed to analyse the effectiveness of the proposed algorithm in the presence of different amounts of outliers. As shown in Fig. 3b, 1000 3D clean data points were sampled from the S-shaped continuous manifold shown in Fig. 3a. Figure 3c, h shows the 2D representations of the clean data using L2-ONPP and L1-ONPP, respectively. The clean data were then corrupted with 100, 200, 300 and 400 outliers sampled from a uniform distribution. Figure 3d–g shows the 2D representations of the noisy data using L2-ONPP, and Fig. 3i–l shows those using L1-ONPP. As can be seen from this experiment, L1-ONPP handles outliers very well, preserving the intrinsic neighbourhood relations as well as the global geometry of the data. In contrast, for L2-ONPP an increasing density of outliers distorts the learned manifold more and more; the presence of outliers even affects the orientation of the data, as can be seen in Fig. 3d, e.

4.2 Comparing bases of L2-PCA, L2-ONPP and L1-ONPP

To validate the relationship between PCA and ONPP described in Sect. 3.2, an experiment was performed on synthetically generated data. 2D data were randomly generated to form 7 clusters of 100 data points each, giving 700 data points. The clusters are closely placed and slightly overlapping, with 2 of the 7 slightly separated, as shown in Fig. 4a. L2-ONPP bases were found using the conventional algorithm, and another set of basis vectors was computed by performing L2-PCA on the reconstruction errors. The bases found by the two methods are the same.

L2-norm ONPP basis [Fig. 4a]

1st basis : \([0.6361, 0.7716]^\mathrm{{T}}\)       2nd basis: \([-\,0.7716, 0.6361]^\mathrm{{T}}\)

PCA basis on reconstruction errors [Fig. 4b]

1st basis : \([0.6360, 0.7717]^\mathrm{{T}}\)       2nd basis: \([-\,0.7717, 0.6360]^\mathrm{{T}}\)

Figure 4c shows that the L2-ONPP bases essentially point in the direction in which the variance of the reconstruction error is minimum. Note that the reconstruction errors of all data points are centred at the origin (analogous to the assumption in PCA that the data points are mean centred). For these data, L1-ONPP bases were also computed using the L1-PCA algorithm; as can be seen from Fig. 4d, the projection bases are tilted towards the outlying data.

L1-norm ONPP basis [Fig. 4d]

1st basis : \([0.4741, 0.8805]^\mathrm{{T}}\)       2nd basis: \([-\,0.8805, 0.4741]^\mathrm{{T}}\)

Fig. 4 A toy example with 700 data samples from 7 clusters. The solid line represents the first projection basis and the dotted line the second. a Projection bases using conventional L2-ONPP. b Projection bases using L2-PCA on the reconstruction errors. c Projection bases overlaid on the reconstruction errors. d Projection bases using the proposed L1-ONPP

In this experiment, the residual error was also observed for both L2-ONPP and L1-ONPP. The residual error measures how well the information is preserved when the data are projected onto the lower-dimensional space using a few of the strongest bases while discarding the other dimensions. In this case, the data were projected onto a single dimension using the strongest basis vector. The average residual error was calculated as

$$\begin{aligned} \mathbf {e_{avg}}=\frac{1}{N}\sum _{i=1}^{N}\parallel \mathbf {x_i}-\mathbf {v_1}\left( \mathbf {v_1}^\mathrm{{T}}\mathbf {x_i}\right) \parallel \end{aligned}$$
(13)

The average residual errors of L2-ONPP and L1-ONPP are 2.3221 and 0.7894, respectively. Thus, it can be concluded that L1-ONPP is less susceptible to outliers than L2-ONPP. The same behaviour of the residual error is observed with real data as well, as described in the following experiment.
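A minimal sketch of the residual-error measure of Eq. (13) is given below; the Euclidean norm of each residual is an assumption made here, and v1 is the strongest basis vector produced by either method.

```python
import numpy as np

def average_residual_error(X, v1):
    """Average residual of Eq. (13) after projecting onto the strongest basis v1."""
    v1 = v1 / np.linalg.norm(v1)
    residuals = X - np.outer(v1, v1 @ X)               # x_i - v1 (v1^T x_i) per column
    return np.mean(np.linalg.norm(residuals, axis=0))  # mean Euclidean norm (assumption)
```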

4.3 Experiment with IRIS dataset

To further compare the behaviour of L2-ONPP and L1-ONPP in terms of residual error and performance on a classification task, the Iris data from the UCI Machine Learning Repository [4] are used. The dataset contains 150 instances of 4-dimensional data belonging to three different classes. The residual error obtained while reconstructing the data using varying numbers of dimensions is shown in Fig. 5. Table 2 lists the residual errors for different numbers of dimensions; as can be seen, the residual error is lower for L1-ONPP than for L2-ONPP, which significantly improves classification accuracy at lower dimensions. When all 4 dimensions are used in the projection space, the projection spans the entire original space, and thus the residual error drops to almost zero for both L2-ONPP and L1-ONPP. A similar behaviour might be expected for classification at 4 dimensions; however, as can be seen from Table 2, the classification error at 4 dimensions is larger than for the lower-dimensional representations, because the projection retains the redundant details present in the higher dimensions. The same behaviour at higher dimensions can be observed in all dimensionality reduction techniques. Here, nearest neighbour (NN) is used as the classifier.
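As an illustration of the evaluation pipeline only (not the exact protocol used here), the sketch below assembles unsupervised k-NN neighbourhoods, learns a 2-dimensional L1-ONPP projection with the hypothetical helpers above and classifies the projected IRIS data with a 1-NN classifier; the train/test split, the neighbourhood size k and the target dimension are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def assemble_weight_matrix(X, k):
    """N x N weight matrix with W[j, i] = w_ij for the k nearest neighbours of x_i."""
    N = X.shape[1]
    W = np.zeros((N, N))
    for i in range(N):
        dists = np.linalg.norm(X - X[:, [i]], axis=0)
        idx = np.argsort(dists)[1:k + 1]               # k nearest neighbours, excluding x_i
        W[idx, i] = reconstruction_weights(X[:, i], X[:, idx])
    return W

iris = load_iris()
Xtr, Xte, ytr, yte = train_test_split(iris.data, iris.target,
                                      test_size=0.3, random_state=0)
mean = Xtr.mean(axis=0)
Xtr_c, Xte_c = (Xtr - mean).T, (Xte - mean).T          # columns are data points

W = assemble_weight_matrix(Xtr_c, k=5)                 # k = 5 is an assumption
V = l1_onpp(Xtr_c, W, d=2)                             # hypothetical helper from Sect. 3.3
clf = KNeighborsClassifier(n_neighbors=1).fit((V.T @ Xtr_c).T, ytr)
print("1-NN accuracy:", clf.score((V.T @ Xte_c).T, yte))
```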

Fig. 5 Performance comparison of L2-ONPP and L1-ONPP with varying numbers of dimensions used to reconstruct the IRIS data, in terms of residual error (left) and classification error (right)

Table 2 Comparison of performance in terms of residual error and classification error (in %) of L2-ONPP and L1-ONPP with varying number of dimensions on IRIS data

4.4 Experiment with handwritten numerals

Handwritten text and numerals show huge variations in shape, stroke width, orientation and pattern, making them ideal data for testing the capacity of L1-ONPP to handle outliers. To compare the performance of L1-ONPP and L2-ONPP on real-world data containing outliers, three handwritten numeral databases are used: one English and two Indian languages, Gujarati and Devnagari. The images in each database are resized to \(30\times 30\) to maintain uniformity across all three databases.

Fig. 6 a First 100 examples of digit 2 in the MNIST database. b First 100 examples of digit 7 in the Gujarati database. c First 100 examples of digit 4 in the Devnagari numerals database. Notice the very diverse shapes, stroke widths, orientations and patterns of the digits

English numerals

The MNIST database [13] is a large database of handwritten English digits containing nearly 70,000 images in total. Some examples of digit \('2'\) are shown in Fig. 6a. 1000 samples were selected randomly such that each digit is equally present in the training set, while the remaining samples were used for testing. Average recognition accuracies over 20 randomizations are reported here. The performance of L2-ONPP and L2-MONPP is compared with that of L1-ONPP and L1-MONPP for varying numbers of dimensions, as shown in Fig. 7. As can be seen, L1-ONPP and L1-MONPP outperform their L2-norm counterparts by a large margin. The best average recognition accuracies of L1-ONPP and L1-MONPP are almost the same, nearly 87.52%, achieved at 210 dimensions. As Fig. 7 shows, the performance of L2-ONPP and L2-MONPP is poor compared to their L1-norm counterparts.

Fig. 7 Performance comparison in terms of recognition accuracy for L2-ONPP, L1-ONPP, L2-MONPP and L1-MONPP on the MNIST handwritten numerals database

Gujarati numerals

The Gujarati handwritten numeral dataset [5, 18] has nearly 1300 samples of each digit. Some examples of digit \('7'\) are given in Fig. 6b to show the large variations in the dataset. 1000 samples were randomly selected such that each digit is well represented in the training data, and the remaining samples were used as testing data. The recognition accuracies of L2-ONPP, L2-MONPP, L1-ONPP and L1-MONPP for varying numbers of dimensions are compared in Fig. 8. The average recognition accuracies achieved by L1-ONPP and L1-MONPP are 75.20% and 67.32% at around 20 dimensions, whereas those of L2-ONPP and L2-MONPP are 67.16% and 61.28% at around 290 dimensions, respectively. As can be seen, the performance of L1-ONPP and L1-MONPP stabilizes at around 30 dimensions, whereas the performance of L2-ONPP and L2-MONPP deteriorates after 300 dimensions owing to the redundant information present at higher dimensions; using all dimensions, however, again leads to recognition similar to that of the L1-norm counterparts, as observed with the IRIS data.

Fig. 8 Performance comparison in terms of recognition accuracy for L2-ONPP, L1-ONPP, L2-MONPP and L1-MONPP on the Gujarati handwritten numerals database

Devnagari numerals

The Devnagari handwritten database [2] has approximately 1800 samples of each digit. Randomly selected samples of digit ‘4’ are shown in Fig. 6c. 900 randomly selected samples are used for training, and the remaining samples are used for testing. As can be seen from Fig. 9, L1-ONPP and L1-MONPP achieve nearly 50% and 60% recognition accuracy at 30 dimensions, respectively, whereas the performance of L2-ONPP and L2-MONPP is consistently poor compared to their L1-norm counterparts.

Fig. 9 Performance comparison in terms of recognition accuracy for L2-ONPP, L1-ONPP, L2-MONPP and L1-MONPP on the Devnagari handwritten numerals database

The purpose of this experiment is not to report the recognition performance of L1-ONPP against other state-of-the-art OCR techniques, but to compare L2-ONPP and L1-ONPP when the data are very diverse and have large variability, and to show the capacity of L1-ONPP and L1-MONPP to handle such diverse data. With a larger amount of training data, L1-ONPP proves to be better at recognizing digits than L2-ONPP. L2-MONPP and L1-MONPP also perform on par with L2-ONPP and L1-ONPP, respectively. Here, nearest neighbour (NN) is used for classification; the use of a more sophisticated classifier such as SVM could lead to improved recognition accuracy. The proposed L1-ONPP algorithm converges in about 16 iterations per basis vector when nearly 1000 training images of size \(30\times 30\) are used, and learning the L1-ONPP bases took on average 3 min. On the other hand, having a closed-form solution, L2-ONPP takes on average 7 s to learn the bases. The configuration of the machine used is as follows: Intel Xeon® E5-2620 @ 2.40 GHz, 24 cores, 64-bit, with 2 GB RAM allocated.

5 CONCLUSION

Linear dimensionality reduction techniques such as PCA, LDA, LPP and ONPP solve an optimization problem based on some criterion, usually defined using the L2-norm. However, the use of the L2-norm makes these techniques susceptible to outliers present in the data. The present work is the first attempt to compute basis vectors for ONPP using the L1-norm. In particular, a relation is established showing that the ONPP bases can be obtained by performing PCA on the reconstruction errors. This relationship is established both theoretically and experimentally. An existing algorithm for finding PCA bases using L1-norm optimization is applied to compute the L1-ONPP bases. It has also been shown experimentally that the residual error, calculated after discarding some dimensions of the projection space and reconstructing the data with fewer dimensions, is lower for L1-ONPP than for L2-ONPP. Experiments are performed on synthetic as well as real data, and the same conclusion is observed. The performance of L1-ONPP is compared with that of L2-ONPP on a numeral recognition task, and it is observed that with a larger amount of training data, L1-ONPP outperforms L2-ONPP by a large margin; however, being an iterative method, L1-ONPP is computationally more expensive than L2-ONPP.