1 Introduction

In machine learning and data mining, researchers are confronted with data in many forms, such as video, audio, image and text, often in very high dimensions, which leads to the curse of dimensionality [17]. It is therefore crucial to mitigate this problem through dimensionality reduction (DR) [24] techniques in order to improve the performance of subsequent processing such as classification and clustering. DR techniques can be broadly classified into feature selection [9] and subspace learning [33]. Feature selection techniques select a subset of the most representative or discriminative features from the input feature set, while subspace learning methods transform the original input features into a lower-dimensional subspace.

Principal Component Analysis (PCA) [4], Linear Discriminant Analysis (LDA) [4], Locality Preserving Projection (LPP) [11] and Correlation Analysis (CA) [12] are perhaps the most popular DR methods. Despite their different motivations, they can all be interpreted within a unified graph embedding framework [33]. One major disadvantage of the above methods is that the projections are linear combinations of all the original features, so the results are often difficult to interpret. Sparse subspace learning methods attempt to solve this problem. For example, Zou et al. proposed a sparse PCA algorithm based on L2-norm and L1-norm regularizations [36]. Mohammad et al. [20] proposed both exact and greedy algorithms for binary-class sparse LDA as well as its spectral bound. Cai et al. proposed a unified sparse subspace learning (SSL) framework based on L1-norm regularized Spectral Regression [5].

Among these DR algorithms, correlation analysis is widely used for modeling the relationship between two datasets. Several variants of correlation analysis (CA) exist. For instance, Magnus et al. proposed a unified approach to PCA, PLS, MLR and Canonical Correlation Analysis (CCA) [2]. Discriminant CCA (DCCA) and local discriminant CCA (LDCCA) [29] were presented for fusing multi-feature information. Sun et al. [34] combined CCA with uncorrelated linear discriminant analysis and proposed multi-view uncorrelated linear discriminant analysis (MULDA), which simultaneously seeks discriminative correlations among inter-view and intra-view data points via new linear weighted combination schemes for sparse ensembles.

Although all the above methods attain good performance on clean datasets, their performance degrades seriously when noisy data points are present. This is because existing techniques concentrate only on selecting useful features and therefore fail to learn the correlation structure of the data effectively in the presence of noise, which in turn degrades classification and DR performance. Unfortunately, with the upsurge of social media, such noisy or corrupt data points are now prevalent. To address this problem, a Correlation Analysis Framework via Joint Sample and Feature Selection (CAF-JSFS) is proposed in this paper. In the proposed model, in order to discriminate between noisy and relevant data points and suppress the impact of the former when pursuing projections, we introduce sample factors that impose penalties on each data point. To effectively suppress the effect of outliers, two sample selection strategies, cosine similarity and total distance metrics, are used geometrically to iteratively learn the relationship between each sample and the principal projections in the feature space. In addition, feature selection is incorporated into the proposed sample selection methods to obtain joint sample and feature selection, ensuring that the proposed framework can classify data more accurately.

The main contributions of this paper are as follows:

  1) We propose a novel framework that introduces sample factors into traditional correlation analysis (CA) models to suppress the impact of outliers and thereby obtain better correlation structures.

  2) We further propose two sample selection strategies based on cosine similarity and total distance metrics. These metrics iteratively evaluate the importance of each sample in pursuing projections by learning the relationship between each sample and the principal projections in the feature space, so as to discriminate between authentic and corrupt data samples.

  3) Finally, we introduce the structured sparse L2,1-norm to eliminate feature redundancy during sample selection, yielding a joint sample and feature selection framework (CAF-JSFS). CAF-JSFS can therefore learn a compact subspace that produces better correlation structures on noisy datasets. Extensive experiments on several image datasets demonstrate the superiority of our method over state-of-the-art methods such as ALPCCA [31] and SPCA [18].

The rest of this paper is organized as follows. Section 2 presents related work. Section 3 presents the formulation of the proposed CAF-JSFS. Experiments and result analyses are presented in Section 4, and conclusions and future work are given in Section 5.

2 Related work

2.1 Dimension reduction through feature selection

DR has gained much attention in recent years due to the vital role it plays in machine learning. Many DR methods have therefore been proposed, all focused on mapping high-dimensional data to low-dimensional spaces. In other words, given a classification problem as in [33] with a training sample set $X = [x_1, x_2, \cdots, x_N]$, $x_i \in \mathbb{R}^m$, consisting of N samples with m dimensions each, DR methods seek a mapping function that transforms the original data $x_i$ into a low-dimensional representation $y_i \in \mathbb{R}^d$, where $m \gg d$.
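As a minimal illustration of such a mapping (a sketch of ours, not taken from the cited works), the snippet below learns a linear projection with PCA and maps samples from $\mathbb{R}^m$ to $\mathbb{R}^d$; the data and dimensionalities are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: N = 200 samples in m = 1024 dimensions (placeholder values).
X = np.random.rand(200, 1024)

# Learn a linear mapping to a d = 50 dimensional subspace.
pca = PCA(n_components=50)
Y = pca.fit_transform(X)          # y_i in R^d for each x_i in R^m

print(X.shape, "->", Y.shape)     # (200, 1024) -> (200, 50)
```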

Compared with subspace learning techniques, which create new features, feature selection does not change the original representations of the data variables. Consequently, many feature selection techniques have been proposed in the past few years. These methods fall mainly into two categories: supervised and unsupervised. Since no label information is available in unsupervised feature selection, it is more challenging than its supervised counterpart, and relatively fewer investigations have been dedicated to unsupervised techniques. Most unsupervised feature selection approaches are based on filters [22], wrappers [26] or embeddings [8]. Although traditional unsupervised feature selection approaches perform well in many cases, their effectiveness can still be improved for two reasons: (1) from the viewpoint of manifold learning [6], high-dimensional data naturally lie on a low-dimensional manifold, yet traditional methods do not take full account of the manifold structure of the data; (2) unlike feature learning, traditional feature selection approaches essentially rank features using only statistical characteristics of the data and lack the kind of learning mechanism that has proved powerful and widely applicable [23].

2.2 Correlation analysis

Correlation analysis is a well-known family of statistical tools for analyzing associations between variables or sets of variables. Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), Partial Least Squares (PLS) and Multiple Linear Regression (MLR) are four efficient correlation analysis methods. Based on the least squares framework, the objective functions of these four methods are presented in Table 1.

Table 1 Four models of correlation analysis

Among these methods, PCA is the most popular correlation analysis technique. It can assist in understanding underlying data structures, clustering analysis, regression analysis, and many other tasks. Hu et al. [14] presented methodological, theoretical and numerical studies of PCA in high-dimensional settings. In many practical studies, only a small subset of variables is relevant, while the rest are noise. To identify the relevant variables and generate more interpretable results, sparse PCA (SPCA) [18], which applies regularized estimation to generate sparse loadings, was developed. New PCA algorithms for graph embedding, which incorporate the data distribution and multiple penalty factors into the least squares framework regularized with multiple local graphs, have also been proposed for multiview dimension reduction [3, 27]. Nie et al. [21] proposed to maximize an L2,1-norm based robust PCA objective, which is theoretically connected to the minimization of reconstruction error; they also proposed efficient non-greedy optimization algorithms to solve this objective and the more general L2,1-norm maximization problem with theoretically guaranteed convergence.

Proposed by H. Hotelling in 1936 [10], CCA finds basis vectors for two sets of variables such that the correlation between the projections of the variables onto these basis vectors is mutually maximized. In an attempt to increase the flexibility of feature selection, kernelization of CCA (KCCA) has been applied to map the hypotheses to a higher-dimensional feature space. KCCA was applied in preliminary work by Fyfe and Lai [15] and Akaho [1], and more recently by Vinokourov et al. [32] with improved results. Ping [25] proposed label-wise orthogonal canonical correlation analysis (LOCCA), which constrains label-based relationships and orthogonalizes the correlation projection directions; the discriminative structures constrained by class labels are effectively preserved, and the orthogonality criterion reduces the information redundancy of the correlation projection directions as much as possible. Chen [7] introduced four deep neural network (DNN) models suitable for combination with CCA and gave the general form of DNN-CCA in detail; these methods were compared experimentally on three cases to analyze the characteristics and distinctions of CCA aided by each DNN model, and suggestions on method selection, open issues in current DNN-CCA forms and future directions were discussed.
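For concreteness, the short sketch below (our own example using scikit-learn, not code from the cited works) fits linear CCA on two toy views that share a latent signal and extracts maximally correlated projections.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# Two views of the same 100 samples sharing a common 2-dimensional latent signal.
latent = rng.normal(size=(100, 2))
X = np.hstack([latent, rng.normal(size=(100, 8))])   # view 1: 10 features
Y = np.hstack([latent, rng.normal(size=(100, 6))])   # view 2: 8 features

cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)                   # canonical variables

# Correlation of the first pair of canonical variables.
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])
```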

PLS is a multivariate technique that delivers an optimal basis in the x-space for the regression of y on x. Reduction to a subset of the basis introduces a bias but reduces the variance. In general, PLS is based on maximizing the covariance between ⟨v,x⟩ and ⟨w,y⟩, which are successive linear combinations in the x and y spaces, respectively. L. Hoegaerts et al. proposed kernel Partial Least Squares (KPLS), which fits naturally into a primal-dual optimization class of kernel machines [30]. To model non-linear relationships among videos in NDVR, KPLS maps the original video data into a Reproducing Kernel Hilbert Space (RKHS) and can therefore efficiently handle high-dimensional videos in NFVs.

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. Its goal is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.
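To make the contrast between PLS and MLR concrete, the brief sketch below (our own example on placeholder data) fits both to the same toy regression problem: PLS regresses through a few covariance-maximizing latent components, whereas MLR regresses on all explanatory variables directly.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 20))                                            # explanatory variables
Y = X[:, :3] @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(150, 2))  # two response variables

pls = PLSRegression(n_components=3).fit(X, Y)  # regression through 3 latent components
mlr = LinearRegression().fit(X, Y)             # ordinary multiple linear regression

print(pls.score(X, Y), mlr.score(X, Y))        # R^2 of each model on the training data
```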

3 Proposed method

3.1 Sample selection induced correlation analysis

In this section, we present the framework of the proposed correlation analysis based on sample selection (CAF-JSS). From Table 1, the four correlation analysis (CA) algorithms can be summarized using the objective function shown in (1), of which PCA is a special case:

$$ \arg\min\sum\limits_{i,j}\left \| v^{T}x_{i}-w^{T}y_{j} \right \| $$
(1)

It can be seen that the CA algorithms use the least squares framework to minimize the summed distance between the original data set X and the reconstructed data set $v^{T}X$, and between the original data set Y and the reconstructed data set $w^{T}Y$.

This geometrical characteristic forces the projection vectors v and w to pass through the densest data points in order to minimize the summed distance, as illustrated in Fig. 1, where v and w are the principal projection vectors. We consider the relationship between the projection vectors and the data samples. This geometrical relationship motivates us to evaluate the importance of each data sample in pursuing projections. Therefore, we reformulate (1) by introducing sample factors that impose penalties on the sample spaces to minimize the impact of corrupt data samples, as shown in (2):

$$ \arg \min \sum\limits_{i,j} {\left\| {{v^{T}}{x_{i}}{d_{{x_{i}}}} - {w^{T}}{y_{j}}{d_{{y_{j}}}}} \right\|}_{2}^{2} $$
(2)
Fig. 1 Illustration of importance evaluation of data samples

where \( d_{x_{i}} \) and \( d_{y_{j}} \) are sample factors that account for the contributions of the data samples to the projections.

Similarly, by introducing sample factors into the four traditional CA models presented in Table 1, we obtain our four proposed D-CA models, presented in Table 2. In the new models, the data are represented as \( \hat{X}=XD_{X} \) and \( \hat{Y}=YD_{Y} \), so \( \hat{X} \) and \( \hat{Y} \) are obtained with the effect of corrupt data samples suppressed. $D_{X}$ and $D_{Y}$ are diagonal sample factor matrices, where \( D_{X}=\mathrm{diag}\left(d_{x_{1}},{\cdots},d_{x_{n}}\right) \) and \( D_{Y}=\mathrm{diag}\left(d_{y_{1}},{\cdots},d_{y_{n}}\right) \). By introducing the Lagrange multiplier (λ) into the D-CA models and taking partial derivatives w.r.t. v and w, we obtain the eigenvector solutions presented in Table 3.

Table 2 The Four proposed canonical analysis methods
Table 3 Standard eigenvector solutions of the proposed D-CA models

To demonstrate the effectiveness of our proposed models, we give a mathematical singular value decomposition (SVD) explanation of the CCA model. Mathematically, there is a direct relationship between CCA and SVD when the CCA components are calculated from the covariance matrices [19]. The following demonstrates the SVD of X and Y. First, we introduce some notation: let $C_{xx} = X^{T}X$, $C_{yy} = Y^{T}Y$ and $C_{xy} = X^{T}Y$. For simplicity, assume $C_{xx}$ and $C_{yy}$ are full rank, and let

$$ {\tilde C_{xy}} = C_{xx}^{- \frac{1}{2}}{C_{xy}}C_{yy}^{- \frac{1}{2}} $$
(3)

Let \(\tilde C_{xy} = V {\Sigma} W^{T}\) be the SVD of \(\tilde C_{xy}\), where $v_{i}$ and $w_{j}$ denote the left and right singular vectors and $\tau_{i}$ denotes the singular values. Then \( XC_{xx}^{-\frac{1}{2}}{v_{i}} \) and \( YC_{yy}^{-\frac{1}{2}}{w_{j}} \) are the canonical variables of the X and Y spaces, respectively. In our proposed model, since \( \hat{X} = X{D_{X}} \) and \( \hat{Y} = Y{D_{Y}} \), the canonical variables become \( X{D_{X}}C_{xx}^{-\frac{1}{2}}{v_{i}} \) and \( Y{D_{Y}}C_{yy}^{-\frac{1}{2}}{w_{j}} \), respectively. In this way, with the sample factors $D_{X}$ and $D_{Y}$ imposed on the sample spaces, our proposed model can learn a better low-dimensional subspace from corrupted data sets.
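The following numpy sketch (our reconstruction of the derivation above, not the authors' code) computes the canonical directions from the SVD of the whitened cross-covariance and applies the sample factors by scaling the rows of X and Y (samples stored as rows); the small regularization term is our own addition for numerical stability.

```python
import numpy as np

def d_cca(X, Y, dx, dy, k=2, reg=1e-6):
    """X: (n, p), Y: (n, q); dx, dy: length-n sample factors; k: number of components."""
    Xh, Yh = X * dx[:, None], Y * dy[:, None]        # rows scaled by sample factors (X_hat, Y_hat)
    Cxx = Xh.T @ Xh + reg * np.eye(X.shape[1])
    Cyy = Yh.T @ Yh + reg * np.eye(Y.shape[1])
    Cxy = Xh.T @ Yh

    def inv_sqrt(C):                                 # symmetric inverse square root via eigendecomposition
        w, U = np.linalg.eigh(C)
        return U @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ U.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)          # SVD of the whitened cross-covariance, cf. Eq. (3)
    v = Wx @ U[:, :k]                                # directions C_xx^{-1/2} v_i
    w = Wy @ Vt.T[:, :k]                             # directions C_yy^{-1/2} w_j
    return Xh @ v, Yh @ w                            # canonical variables of the X and Y spaces
```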

3.2 Obtaining the sample factors $d_{x}$ and $d_{y}$

In this subsection, we discuss how to model the relationship between data samples and the principal projections. Intuitively, the closer a sample is to the projection vector v or w, the more important that sample is for calculating the projections. Based on this observation, we iteratively learn the relationship between data samples and the principal projections v and w using two strategies: total distance and cosine similarity metrics. This allows us to effectively distinguish between authentic and corrupt data samples according to how each data sample relates to the principal projections. Both metrics can be obtained geometrically, as shown in Fig. 1.

The first strategy uses the total distance metric to iteratively learn the relationship between each sample and the principal projection. The total distance of an instance is the squared sum of the distances between the coordinate of that instance and the coordinates of every other instance in the training set along the projection v or w. From Fig. 1, the coordinate $s_{i}$ of data sample $x_{i}$ on the projection v and the coordinate $t_{i}$ of data sample $y_{i}$ on the projection w are obtained as in (4):

$$ {s_{i}} = {v^{T}}{x_{i}}, {t_{i}} = {w^{T}}{y_{i}} $$
(4)

We then compute the total distance of data samples as follows:

$$ {d_{{x_{i}}}} = \sum\limits_{j = 1}^{n} {{{({s_{i}} - {s_{j}})}^{2}}}, \qquad {d_{{y_{i}}}} = \sum\limits_{j = 1}^{n} {{{({t_{i}} - {t_{j}})}^{2}}} $$
(5)

The total distance of a data sample is a natural way to evaluate its importance within the dataset for constructing projections. From Fig. 1, we can observe that the total distances of samples $x_{i}$ and $y_{i}$, which lie outside the clusters, will be relatively larger than those of samples $x_{j}$ and $y_{j}$ within the clusters. Therefore, samples $x_{i}$ and $y_{i}$ are more likely to be outliers than samples $x_{j}$ and $y_{j}$. Thus, the larger $d_{x_{i}}$ or $d_{y_{i}}$, the more likely $x_{i}$ and $y_{i}$ are noisy data samples, and their relevance is scaled accordingly to suppress their effect on the projections.
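As a concrete sketch (our own illustration, not the authors' code), the total-distance factors of Eqs. (4)-(5) can be computed as follows; how the resulting distances are rescaled before entering $D_{X}$ and $D_{Y}$ is not specified above, so that step is left open.

```python
import numpy as np

def total_distance_factors(X, Y, v, w):
    """X: (n, p), Y: (n, q); v, w: current principal projection vectors."""
    s = X @ v                                            # coordinates s_i = v^T x_i, shape (n,)
    t = Y @ w                                            # coordinates t_i = w^T y_i
    dx = ((s[:, None] - s[None, :]) ** 2).sum(axis=1)    # d_{x_i} = sum_j (s_i - s_j)^2
    dy = ((t[:, None] - t[None, :]) ** 2).sum(axis=1)    # d_{y_i} = sum_j (t_i - t_j)^2
    return dx, dy                                        # larger value -> more likely an outlier
```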

The second strategy uses the cosine similarity metric to build the sample factors $d_{x_{i}}$ and $d_{y_{j}}$. It iteratively learns the angular relationship between each data sample in the training set and the principal projections v and w. In Fig. 1, the angle between data sample $x_{i}$ and the projection v is $\alpha_{i}$, and the angle between data sample $y_{j}$ and the projection w is $\theta_{j}$. The proposed sample penalty factors are obtained by normalizing (6):

$$ d_{x_{i}}=\cos {\alpha_{i}} = \frac{{{v^{T}}{x_{i}}}}{{\left\| v \right\|\left\| {{x_{i}}} \right\|}},\qquad d_{y_{j}}=\cos {\theta_{j}} = \frac{{{w^{T}}{y_{j}}}}{{\left\| w \right\|\left\| {{y_{j}}} \right\|}} $$
(6)

In formula (6), a bigger $\cos\alpha_{i}$ implies a smaller angle $\alpha_{i}$ between sample $x_{i}$ and the principal projection v, and vice versa. Similarly, a bigger $\cos\theta_{j}$ implies a smaller angle $\theta_{j}$ between sample $y_{j}$ and the principal projection w. From Fig. 1, it can be seen that the angle β between sample $x_{j}$ and the principal projection v is smaller than the angle α between sample $x_{i}$ and the principal projection v. Thus, $x_{i}$ is considered less important for finding the best projections than $x_{j}$; likewise, $y_{i}$ is considered less important than $y_{j}$.

Furthermore, \( d_{x_{i}} \) and \( d_{y_{j}} \) can now be obtained as follows:

$$ d_{x_{i}}=\cos {\alpha_{i}}+\eta, \qquad d_{y_{j}}=\cos {\theta_{j}}+\eta $$
(7)

where η is an adjustment parameter that prevents \( d_{x_{i}} \) and \( d_{y_{j}} \) from approaching infinity.
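A minimal sketch of the cosine-similarity factors of Eqs. (6)-(7) is given below (our own illustration; the value of η and the small constant guarding against division by zero are assumptions).

```python
import numpy as np

def cosine_factors(X, Y, v, w, eta=0.1):
    """Cosine-similarity sample factors: d = cos(angle to the principal projection) + eta."""
    dx = (X @ v) / (np.linalg.norm(v) * np.linalg.norm(X, axis=1) + 1e-12) + eta
    dy = (Y @ w) / (np.linalg.norm(w) * np.linalg.norm(Y, axis=1) + 1e-12) + eta
    return dx, dy   # larger value -> sample better aligned with the projection
```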

3.3 CAF-JSFS

We further extend the proposed four models to feature selection in this subsection. After adding feature selection [13], the proposed D-CA model in (2) can now be written as:

$$ \arg \min {\sum}_{i,j} {\left\| {{v^{T}}{x_{i}}{d_{{x_{i}}}} - {w^{T}}{y_{j}}{d_{{y_{j}}}}} \right\|}_{2}^{2}+\lambda_{1}\left \| v \right \|_{2,1}+\lambda_{2}\left \| w \right \|_{2,1} $$
(8)
Table 4 Standard eigenvector solutions of the four proposed DQ-CA models

Taking the D-CCA model as an example, the derivation of our proposed feature selection DQ-CCA model is as follows:

$$ \begin{array}{@{}rcl@{}} L &=& 2{w^{T}}Y{D_{Y}}{D_{X}}{X^{T}}v - {\lambda_{1}}{v^{T}}{Q_{x}}v - {\lambda_{2}}{w^{T}}{Q_{y}}w \\ && - {\lambda_{3}}{v^{T}}X{D_{X}^{2}}{X^{T}}v - {\lambda_{4}}{w^{T}}Y{D_{Y}^{2}}{Y^{T}}w \end{array} $$
(9)

where λ1, λ2, λ3 and λ4 are regularization parameters, and $Q_{x} \in \mathbb{R}^{d \times d}$ is a diagonal matrix whose (i,i)-th element is \( Q_{ii}^{x} = \gamma / \big(2(\left\| {v_{i}} \right\|_{2}^{2} + \varepsilon)^{1/2}\big) \) with $\varepsilon \to 0$, where $v_{i}$ denotes the i-th row of \( v = \left[\begin{array}{c} v_{1}\\ \vdots \\ v_{d} \end{array}\right] \in \mathbb{R}^{d \times m} \); $Q_{y}$ is defined analogously. The terms \( \left\| v \right\|_{2,1} \) and \( \left\| w \right\|_{2,1} \) are based on L2-norm and L1-norm regularizations [35].
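For illustration, the diagonal reweighting matrix $Q_{x}$ can be formed from the row norms of v as in the sketch below (our own code; the values of γ and ε are placeholders):

```python
import numpy as np

def l21_reweight(V, gamma=1.0, eps=1e-8):
    """Q_ii = gamma / (2 * sqrt(||v_i||_2^2 + eps)), with v_i the i-th row of V."""
    row_norms_sq = np.sum(V ** 2, axis=1)                      # ||v_i||_2^2 for each row
    return np.diag(gamma / (2.0 * np.sqrt(row_norms_sq + eps)))
```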

By incorporating feature selection into the D-CA models and taking partial derivatives w.r.t. v and w, according to Table 3, we obtain the eigenvector solutions of the proposed four DQ-CA models, which are presented in Table 4.

The algorithm for the proposed CAF-JSFS is shown in Algorithm 1 as follows:

Algorithm 1 CAF-JSFS
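Since Algorithm 1 appears only as a figure, the sketch below is our reconstruction of one plausible CAF-JSFS iteration for the DQ-CCA instance, assembled from Sections 3.1-3.3; the initialization, the way the L2,1 penalty enters the constraint matrices, the stopping rule and all parameter values are assumptions, not the authors' exact procedure.

```python
import numpy as np

def caf_jsfs_dqcca(X, Y, n_components=2, eta=0.1, lam=1e-2, gamma=1.0,
                   n_iter=20, eps=1e-8, reg=1e-6):
    """X: (n, p), Y: (n, q); returns projection matrices V, W and sample factors dx, dy."""
    n, p = X.shape
    q = Y.shape[1]
    dx = np.ones(n)                      # sample factors, initialized uniformly
    dy = np.ones(n)
    Qx = np.eye(p)                       # L2,1-norm reweighting matrices
    Qy = np.eye(q)

    def inv_sqrt(C):                     # symmetric inverse square root
        w, U = np.linalg.eigh(C)
        return U @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ U.T

    for _ in range(n_iter):
        Xw = X * dx[:, None]             # X_hat = X D_X (rows scaled by sample factors)
        Yw = Y * dy[:, None]
        # Regularized constraint matrices with the feature-selection penalty folded in (assumption)
        Cxx = Xw.T @ Xw + lam * Qx + reg * np.eye(p)
        Cyy = Yw.T @ Yw + lam * Qy + reg * np.eye(q)
        Cxy = Xw.T @ Yw
        # CCA directions via SVD of the whitened cross-covariance (cf. Eq. (3))
        Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
        U, _, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
        V = Wx @ U[:, :n_components]     # projection matrix v
        W = Wy @ Vt.T[:, :n_components]  # projection matrix w
        # Update sample factors with the cosine-similarity strategy (Eqs. (6)-(7)),
        # using the first canonical direction of each view
        v1, w1 = V[:, 0], W[:, 0]
        dx = (X @ v1) / (np.linalg.norm(v1) * np.linalg.norm(X, axis=1) + eps) + eta
        dy = (Y @ w1) / (np.linalg.norm(w1) * np.linalg.norm(Y, axis=1) + eps) + eta
        # Update the L2,1-norm reweighting matrices from the row norms of V and W
        Qx = np.diag(gamma / (2.0 * np.sqrt(np.sum(V ** 2, axis=1) + eps)))
        Qy = np.diag(gamma / (2.0 * np.sqrt(np.sum(W ** 2, axis=1) + eps)))
    return V, W, dx, dy
```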

4 Experimental results

In this section, we first evaluate the performance of the proposed correlation analysis and feature selection methods against classical methods such as CCA, PCA, PLS and MLR. We further evaluate the performance of the proposed methods against state-of-the-art methods ALPCCA and SPCA.

4.1 Parameter settings and datasets description

For each dataset, we randomly sampled 60% of the samples for training and 40% for testing. We set the nearest-neighbor parameter K to 5 in the proposed D-CA and DQ-CA methods and in all comparative methods to ensure a fair comparison. The parameters of ALPCCA, SPCA and LDA were set according to their original papers. We use the K-nearest-neighbor (KNN) classifier for classification. Each experiment is repeated 20 times, and we record the average classification accuracies and standard deviations of the various methods. To fairly compare the KNN-graph models, the graphs were constructed with the same neighborhood matrix N in CAF-JSFS, corresponding to (12):

$$ \underset{v}{\min} {\sum\limits_{i = 1}^{n} {\left( {v^{T}}{x_{i}} - \sum\limits_{j = 1}^{n} {{N_{ij}}{v^{T}}{x_{j}}} \right)}^{2}} $$
(12)

For these unsupervised constructions, v in CAF-JSFS corresponds to Table 4. To evaluate the performance of CAF-JSFS, we carry out a series of experiments on face and handwritten digit datasets.
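As an illustration of the graph construction in (12) (our own sketch; the simple row-normalized connectivity weighting is an assumption), the neighborhood matrix N and the locality objective can be computed as follows:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_weights(X, K=5):
    """Binary K-nearest-neighbor graph with rows normalized to sum to one."""
    A = kneighbors_graph(X, n_neighbors=K, mode='connectivity').toarray()
    return A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)

def locality_objective(X, v, N):
    """Residual of Eq. (12): sum_i (v^T x_i - sum_j N_ij v^T x_j)^2."""
    proj = X @ v
    return float(np.sum((proj - N @ proj) ** 2))
```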

4.2 Experiments on image datasets

Aside from the seven UCI datasets, we use several image datasets to test the proposed method’s performance in this subsection. The image datasets include:

  • ORL face dataset contains 400 face images taken from 40 subjects, each with 10 face images. The face images per subject were taken with varying lighting, facial expressions, and facial details at different times [28].

  • AR face database was created by Aleix Martinez and Robert Benavente at the Computer Vision Center (CVC), U.A.B. It contains over 4,000 color images of 126 people's faces (70 men and 56 women).

  • Extended YaleB dataset contains 165 face images from 15 subjects, each of which has 11 face images. The face images were taken under varying lighting conditions and facial expressions. In our experiments, each image is cropped and resized to 32 × 32, and the gray-level values of each image are rescaled to \(\left(0,1\right)\); that is, the dimensionality of each image sample is 1024 [28].

  • CMU-PIE face database contains more than 750,000 images of 337 people recorded in up to four sessions over the span of five months. Subjects were imaged under 15 viewpoints and 19 illumination conditions while displaying a range of facial expressions; high-resolution frontal images were also acquired. In total, the database contains more than 305 GB of face data.

  • USPS dataset contains a total of 9298 digit images of 0 through 9, each of size 16 × 16 pixels with 256 gray levels per pixel. In the experiments, each image is represented by a 256-dimensional vector [16].

  • MNIST dataset is constructed from the larger NIST Special Databases 3 and 1, which consist of binary images of handwritten digits. The images of each class (digit) are of size 28 × 28; thus, each digit image is represented by a 784-dimensional vector [16].

Further detailed descriptions of the ORL, Extended YaleB, MNIST, USPS, COIL20 and CIFAR-10 datasets are presented in Table 5.

Table 5 Summary of the six datasets

4.3 Experimental analysis in noise-free scenarios

In this section we analyse and discuss the results obtained by each method on the different datasets used in our experiments.

4.3.1 Face recognition

In this section, we first demonstrate that our proposed DQ-CCA, DQ-PCA, DQ-PLS and DQ-MLR have superior performance to the traditional CCA, PCA, PLS and MLR on face recognition. We further compare them experimentally with two state-of-the-art algorithms, ALPCCA and SPCA. Results for each method on the ORL, AR, Extended YaleB and CMU-PIE datasets are shown in Table 6, with the best results in bold in each case.

Table 6 Mean Classification Accuracies ± standard deviations (%) of the various methods on the ORL, AR and extended YaleB data sets

From Table 6, we can see that DQ-CCA, DQ-PCA, DQ-PLS and DQ-MLR all outperform the comparative methods on the ORL, AR, Extended YaleB and CMU-PIE face datasets. On the ORL face dataset, with a recognition accuracy of 95.01%, DQ-CCA outperforms the traditional CCA by a significant margin of 3.01% and D-CCA by a small margin of 0.82%. The results show that both the proposed D-CCA and DQ-CCA improve on the traditional CCA because they suppress the effect of corrupt data samples more effectively. The recognition accuracy of DQ-PCA is 87.70%, exceeding the traditional PCA by 1.93% and D-PCA by 0.08%. Again, the proposed DQ-PCA and D-PCA show significant improvements in face recognition over the traditional PCA due to their ability to distinguish between authentic and corrupt data samples. DQ-PLS also performs 2.56% better than the traditional PLS and 0.97% better than D-PLS, and DQ-MLR outperforms the traditional MLR by 6.23% and D-MLR by 0.35%.

Also, on the AR dataset, DQ-CCA outperforms D-CCA by 0.70% and the traditional CCA by a significant margin of 8.54%, while DQ-PLS outperforms D-PLS by 1.00% and the traditional PLS by 2.48%. On the Extended YaleB dataset, DQ-PCA proves to be 1.03% better than D-PCA and 3.01% better than the traditional PCA, and DQ-MLR outperforms D-MLR by 1.22% and the traditional MLR by 3.20%. On the CMU-PIE dataset, the accuracies of ALPCCA and SPCA are not as good as those of all the CA algorithms proposed in this paper.

4.3.2 Handwriting recognition

To further evaluate the effectiveness of the proposed methods on handwritten digit recognition, we run experiments on the USPS and MNIST datasets. The results of the various methods on these datasets are presented in Table 7.

Table 7 Mean Classification Accuracies ± standard deviations (%) of the various methods on USPS and MNIST data sets

It is apparent from Table 7 that the proposed methods once again demonstrate their superiority over the traditional techniques in handwritten digit recognition. On the USPS dataset, with a digit recognition accuracy of 72.56%, DQ-CCA outperforms D-CCA by 1.54% and the traditional CCA by 3.56%. DQ-PCA also achieves a recognition accuracy 1.69% higher than D-PCA and 3.58% higher than the standard PCA. DQ-PLS outperforms D-PLS and the traditional PLS by 0.96% and 3.00%, respectively. The proposed methods outperform the traditional methods because they are able to discover the intrinsic data structure and suppress the impact of corrupt data samples. On the MNIST dataset, our method is also superior to the traditional CA methods, ALPCCA and SPCA.

4.4 Experimental analysis in noisy scenarios

In this section, we add different degrees of salt-and-pepper noise to the ORL and USPS datasets to verify whether the proposed algorithms perform better than the traditional CA methods in noisy scenarios and thus demonstrate their superior noise suppression abilities. The noise added to the experimental datasets is divided into three levels: 0%, 5% and 15%. The best results are bolded in each case.

4.4.1 Face image denoising analysis

The average classification accuracies of the various methods on a noisy ORL dataset are recorded in Table 8.

Table 8 Mean Classification Accuracies (%) of the various methods on noisy ORL dataset

It can be seen from Table 8 that the proposed correlation methods have higher classification accuracies than the traditional methods in the absence of noise. When the noise level is 5%, the classification accuracy of DQ-PCA is 1 to 2% higher than those of DQ-CCA, DQ-PLS and DQ-MLR. As the noise level increases, the proposed CAF-JSFS framework shows robust classification performance: when the noise level is 15%, DQ-PCA has the best classification accuracy among all the correlation analysis methods. In addition, with increasing noise levels, the CCA model has the worst performance. Unlike the other methods, the accuracies of the proposed methods decrease at a slower rate as the noise level rises. Overall, the proposed methods show significant improvements in face recognition compared with the traditional techniques due to their ability to distinguish between authentic and corrupt data samples. The proposed methods also prove consistent, obtaining lower variances in all cases than the traditional techniques. Figure 2 shows the experimental box plots.

Fig. 2 Box diagram of the four correlation analysis models on the noisy ORL dataset (80% training)

4.4.2 Experiments on noisy USPS dataset

Table 9 shows the classification results on the noisy USPS dataset. Similar to the results in Table 8, the proposed CAF-JSFS framework has the best recognition accuracy among all methods.

Table 9 Mean Classification Accuracies (%) of the various methods on noisy USPS dataset

Also, from Fig. 3 it is evident that the proposed methods are stable in performance, as they obtain lower variances in digit recognition.

Fig. 3 Box diagram of the four correlation analysis models on the noisy USPS dataset (80% training)

4.5 Dimensionality analysis

In this subsection, we test our proposed method on object recognition using the Columbia Object Image Library (COIL-20) dataset, with 70% of the samples used for training and 30% for testing. To see intuitively whether the proposed method is effective at expressing data features in low-dimensional spaces, we plot classification accuracy against dimensionality in Fig. 4. It can be seen that the proposed CAF-JSFS framework achieves better classification accuracy and more stable low-dimensional subspace representations of the data. From Fig. 4, it can also be observed that the DQ-CA models lead in performance and that their classification accuracies stabilize at lower dimensions.

Fig. 4 Classification results of the four CA models on the COIL20 dataset (70% training)

5 Conclusion

In this paper, a correlation analysis framework via joint sample and feature selection (CAF-JSFS) is proposed. Different from other variants of correlation analysis, sample factors that impose penalties on the sample spaces are introduced to suppress the impact of noise in pursuing projections. Two strategies, cosine similarity and total distance metrics, are used geometrically to iteratively learn the relationships between each sample and the principal projections. This enables our framework to discriminate between authentic and corrupt data samples and to suppress the impact of the latter. We further combine the sample selection idea with feature selection to obtain joint sample and feature selection methods; with these ideas combined, better correlation projections are achieved through joint sample and feature selection. CAF-JSFS can thus learn better correlation projections in noisy scenarios, with the effect of noisy data points suppressed. Extensive experiments on the ORL, AR, Extended YaleB and USPS datasets demonstrate that CAF-JSFS achieves superior classification performance over state-of-the-art correlation analysis methods, because the proposed methods effectively distinguish between authentic and corrupt data samples and thereby minimize the impact of the latter. In the future, we will extend the proposed framework to low-rank representation and graph embedding.