
1 Introduction

With the rise of artificial intelligence and big-data technology, research on face recognition and its applications has advanced rapidly. In recent years, scholars at home and abroad have proposed a number of feasible face recognition algorithms, such as the eigenface method, neural networks, wavelet transforms, support vector machines, and hidden Markov models [1]. Most current algorithms demand substantial hardware and long training times. Commonly, face recognition methods use the wavelet transform for feature extraction and dimensionality reduction, or PCA for feature dimensionality reduction, and then perform classification with a neural network. In this paper, we use the wavelet transform and PCA for image feature extraction and dimensionality reduction. The wavelet transform is a major breakthrough that followed Fourier analysis. Its multi-resolution analysis has good localization properties in both space and frequency, and by applying progressively finer step sizes to the high-frequency part of a signal it enables detailed analysis of an object. The wavelet transform is therefore especially suitable for processing non-stationary signals such as images [2]. PCA is a commonly used method for feature dimensionality reduction; its basic idea is to extract the main information in the data and discard the redundant information, thereby compressing the data. SVM is a common classification algorithm in machine learning. For multi-class problems, SVM maps data from a low-dimensional space to a high-dimensional space by introducing kernel functions, making the data linearly separable and enabling multi-class classification. In this paper, we use the ORL face database, an international standard face recognition benchmark. It consists of face images of 40 subjects of different ages, genders, and races.
Each subject has 10 images, for a total of 400 grayscale images; each image is 92 × 112 pixels with a black background, as shown in Fig. 1. In this paper, we introduce feature extraction using PCA and the wavelet transform, respectively, and use SVM for classification and recognition. The remainder of this paper is organized as follows. Section 2 introduces the feature extraction methods, the wavelet transform and PCA. Section 3 introduces the classification algorithm, SVM. Section 4 details the experimental procedure and results. Finally, Sect. 5 draws conclusions.

Fig. 1.

ORL face recognition data set

2 Feature Extraction Method

2.1 Wavelet Transform

The wavelet transform is applied to face recognition mainly because, after wavelet decomposition of the image, the resolution of the sub-images in the different directions is reduced, and the computational complexity decreases accordingly [3]. At the same time, it provides good local information in both the spatial and frequency domains. The low-frequency part of a face image describes its overall shape, while the high-frequency part describes its details. The low-frequency face information obtained by the wavelet transform better describes the facial features that are useful for classification [4].

In our study, we applied a multi-layer (4-layer) stationary wavelet transform (SWT) decomposition to the image set. The SWT consists of two opposite processes, decomposition and reconstruction. Decomposition transforms the time-series data \( {\text{y }} \) (denoted \( a^{0} \)) through J iterations into a set of wavelet coefficients distributed over J + 1 wavelet scales

$$ c_{y} = \left[ {a^{J} ,b^{J} ,b^{J - 1} , \cdots ,b^{1} } \right] $$
(1)
$$ \left\{ {\begin{array}{*{20}l} {a^{j} = H^{{\left[ {j - 1} \right]}} a^{{\left[ {j - 1} \right]}} ,b^{j} = G^{{\left[ {j - 1} \right]}} a^{{\left[ {j - 1} \right]}} } \hfill \\ {H^{\left[ j \right]} = U_{o} H^{{\left[ {j - 1} \right]}} ,G^{\left[ j \right]} = U_{o} G^{{\left[ {j - 1} \right]}} } \hfill \\ \end{array} } \right. $$
(2)

In the above formulas, the decomposition level is \( {\text{j}} = 1,2, \cdots ,{\text{J}} \); \( H^{\left[ 0 \right]} \) and \( G^{\left[ 0 \right]} \) are the wavelet low-pass and high-pass decomposition filters. \( U_{o} \) inserts a zero after each coefficient of the filter, doubling the filter length. \( a^{j} \) and \( b^{j} \) are called the level-j low-frequency and high-frequency scales, respectively. Through wavelet decomposition, the different frequency-band components of y are separated into different scales of \( c_{y} \); the above formulas therefore constitute a multi-scale analysis.

Unlike the DWT, each wavelet scale of the SWT has the same length as y, so the SWT is a redundant transform [5]. The decomposition of \( a^{j - 1} \) into \( \left( {a^{j} ,b^{j} } \right) \) by \( H^{{\left[ {j - 1} \right]}} \) and \( G^{{\left[ {j - 1} \right]}} \) is non-orthogonal, so its inverse is not unique. However, let \( D_{0 } \) and \( D_{1 } \) denote the downsampling operators that keep the even and odd terms, respectively. Then the transforms \( \left( {D_{o} H^{{\left[ {j - 1} \right]}} ,D_{o} G^{{\left[ {j - 1} \right]}} } \right) \) and \( \left( {D_{1} H^{{\left[ {j - 1} \right]}} ,D_{1} G^{{\left[ {j - 1} \right]}} } \right) \) are orthogonal and yield the even and odd terms of \( a^{j} \) and \( b^{j } \), respectively. Denoting their inverse transforms by \( R_{0}^{{\left[ {j - 1} \right]}} \) and \( R_{1}^{{\left[ {j - 1} \right]}} \), the SWT reconstruction is

$$ a^{j - 1} = \frac{1}{2}\left( {R_{0}^{{\left[ {j - 1} \right]}} + R_{1}^{{\left[ {j - 1} \right]}} } \right)\left( {a^{j} ,b^{j} } \right) $$
(3)

Applying this for \( {\text{j}} = 1,2, \cdots ,{\text{J}} \), the sequence data y can be recovered from \( c_{y} \). If the coefficients of some scales in \( c_{y} \) are kept unchanged while the coefficients of the remaining (discarded) scales are set to zero, the result is a partial-scale reconstruction.

Because it is a redundant transform, the SWT retains more information than the DWT and is translation invariant, which is more conducive to time-series analysis [6]. However, if the length of y is \( 2^{J} \), the complexity of a J-level SWT decomposition is \( {\text{O}}\left( {J2^{J} } \right) \), larger than the \( {\text{O}}\left( {2^{J} } \right) \) of the DWT.

For brevity, let W denote the J-level SWT decomposition, i.e. \( c_{y} = W_{y} \), and let \( W^{ - } \) denote the corresponding reconstruction, i.e. \( {\text{y}} = W^{ - } c_{y} \). \( S_{0} \) denotes partial-scale reconstruction, in which the coefficients of the discarded scales are set to zero.

We replace the DWT with the SWT and use the sym2 wavelet for feature extraction [7]. Sym2 is an approximately symmetric wavelet function; its better symmetry reduces, to a certain extent, the phase distortion when analyzing and reconstructing signals.
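As a concrete illustration, the SWT decomposition and reconstruction described above can be sketched with the pywt package used later in this paper. The signal length, decomposition level, and random data below are illustrative choices rather than the paper's actual settings; note that pywt.swt requires the signal length to be divisible by 2**level.

```python
import numpy as np
import pywt

# Illustrative 1-D signal standing in for one row of a face image;
# pywt.swt requires the length to be divisible by 2**level.
rng = np.random.default_rng(0)
signal = rng.standard_normal(64)

level = 2
# J-level SWT decomposition c_y = W y, using the sym2 wavelet.
coeffs = pywt.swt(signal, "sym2", level=level)  # [(cA2, cD2), (cA1, cD1)]

# SWT is redundant: every scale has the same length as the input signal.
for approx, detail in coeffs:
    assert approx.shape == signal.shape
    assert detail.shape == signal.shape

# Reconstruction y = W^- c_y is exact.
reconstructed = pywt.iswt(coeffs, "sym2")
assert np.allclose(reconstructed, signal)
```

Zeroing the detail coefficients before calling pywt.iswt would give the partial-scale reconstruction \( S_{0} \) described above.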

2.2 Principal Component Analysis

Principal Component Analysis (PCA) is the most commonly used linear mapping method in pattern recognition [8]. Based on the positional distribution of the sample points in a multi-dimensional space, it takes the direction of largest variance among the sample points as the discriminant vector to achieve feature extraction [9].

The low-frequency face maps serve as the raw data for PCA [10]. Form each known low-frequency sub-image into a column vector of dimension \( {\text{D}} = {\text{M}} \times {\text{N}} \). Let n be the number of training samples and \( X_{i} \) the face vector formed from the i-th low-frequency face sub-image; then the covariance matrix of the samples is

$$ S_{r} = \mathop \sum \nolimits_{i = 1}^{n} \left( {X_{i} - \mu } \right)\left( {X_{i} - \mu } \right)^{T} $$
(4)

where \( \mu \) is the average image vector of the training samples:

$$ \mu = \frac{1}{n}\mathop \sum \nolimits_{i = 1}^{n} x_{i} $$
(5)

Let \( {\text{A = }}\left[ {X_{1} - \mu ,X_{2} - \mu , \cdots ,X_{n} - \mu } \right] \); then \( S_{r} = AA^{T} \), with dimension \( {\text{D}} \times {\text{D}} \). According to the K-L transform principle [11], the required new coordinate system is composed of the eigenvectors corresponding to the non-zero eigenvalues of the matrix \( AA^{T} \).

Computing these directly is expensive, so the eigenvalues and eigenvectors of \( AA^{T} \) are obtained from those of the much smaller matrix \( A^{T} A \) via the singular value decomposition (SVD) theorem.

According to the SVD theorem, let \( \lambda_{i} \left( {i = 1,2, \cdots ,r} \right) \) be the r non-zero eigenvalues of the matrix \( A^{T} A \) and \( \upsilon_{i } \) the eigenvector of \( A^{T} A \) corresponding to \( \lambda_{i} \); then the orthonormal eigenvector \( \mu_{i} \) of \( AA^{T} \) is

$$ \mu_{i} = \frac{1}{{\sqrt {\lambda_{i} } }}A\upsilon_{i} $$
(6)

The eigenface subspace is then \( \upomega = \left( {\mu_{1} ,\mu_{2} , \cdots ,\mu_{r} } \right) \). The training samples are projected into the eigenface subspace, yielding a set of projection vectors \( {\text{W}} = \upomega^{T} \left( {x - \mu } \right) \) that form the database for face recognition [12]. At recognition time, each image of the face to be recognized is first projected into the eigenface subspace, and the projection serves as the input to the SVM classifier.
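A minimal numpy sketch may make the \( A^{T} A \) trick of Eqs. (4)–(6) concrete. The sample count, dimension, and random data below are illustrative stand-ins, not the ORL settings:

```python
import numpy as np

# Illustrative stand-in for n flattened low-frequency face sub-images,
# each a column vector of dimension D = M * N (with n << D).
rng = np.random.default_rng(0)
n, D = 20, 200
X = rng.standard_normal((D, n))

mu = X.mean(axis=1, keepdims=True)   # Eq. (5): mean face vector
A = X - mu                           # A = [X_1 - mu, ..., X_n - mu]

# Eigen-decompose the small n x n matrix A^T A instead of the D x D matrix A A^T.
eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# Eq. (6): lift each eigenvector v_i of A^T A to a unit eigenvector u_i of A A^T.
keep = eigvals > 1e-10
U = (A @ V[:, keep]) / np.sqrt(eigvals[keep])

# The columns of U (the eigenfaces) are orthonormal.
assert np.allclose(U.T @ U, np.eye(U.shape[1]), atol=1e-8)

# Project the training samples into the eigenface subspace (the SVM inputs).
projections = U.T @ A
```

Since \( AA^{T} \left( {A\upsilon_{i} } \right) = A\left( {A^{T} A\upsilon_{i} } \right) = \lambda_{i} A\upsilon_{i} \), each lifted vector is indeed an eigenvector of \( AA^{T} \), and dividing by \( \sqrt {\lambda_{i} } \) normalizes it.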

3 Classification and Recognition Based on SVM

Support Vector Machine (SVM) is a learning method based on statistical learning theory, developed in the 1990s [13]. By seeking the minimum of the structural risk, it minimizes both the empirical risk and the confidence interval, improving the generalization ability of machine learning. Even with a small number of statistical samples, good statistical regularities can be obtained.

In the training data, each sample has n attributes and a binary class label, so we can view the data as points in an n-dimensional space. Our goal is to find an (n − 1)-dimensional hyperplane that divides the data into two parts, each belonging to one category. In fact, many such hyperplanes exist, and we must find the best one [14]. We therefore add a constraint: the distance from the hyperplane to the nearest data point of each class must be as large as possible.

Consider a linearly separable data set \( \left\{ {\left( {\overrightarrow {{x_{1} }} ,y_{1} } \right),\left( {\overrightarrow {{x_{2} }} ,y_{2} } \right), \cdots ,\left( {\overrightarrow {{x_{N} }} ,y_{N} } \right)} \right\} \), where each sample feature vector \( \vec{x} \) is a vector in the D-dimensional real space. The class label \( y \in \left\{ { - 1, + 1} \right\} \), i.e. there are only two classes of samples: a sample with label \( + 1 \) is a positive example, and a sample with label \( - 1 \) is a negative example. We now classify these two types of samples [15]. The goal is to find the optimal separating hyperplane, i.e. the separating hyperplane with the largest classification margin determined from the training samples. We write the equation of the optimal hyperplane as \( \vec{w}^{T} \vec{x} + b = 0 \). By the point-to-plane distance formula, the distance between a sample \( \vec{x} \) and the hyperplane \( \left( {\vec{w},b} \right) \) is \( \frac{{\left| {\vec{w}^{T} \vec{x} + b} \right|}}{{\left\| {\vec{w}} \right\|}} \). Since scaling \( \vec{w} \) and the bias \( b \) proportionally leaves the hyperplane unchanged, the optimal hyperplane has many representations. We normalize it by choosing \( \vec{w} \) and \( b \) so that the sample \( \vec{x}_{k} \) closest to the hyperplane satisfies \( \left| {\vec{w}^{T} \vec{x}_{k} + b} \right| = 1 \). The distance from the nearest sample to the hyperplane is then

$$ \frac{{\left| {\vec{w}^{T} \vec{x}_{k} + b} \right|}}{{\left\| {\vec{w}} \right\|}} = \frac{1}{{\left\| {\vec{w}} \right\|}} $$
(7)

And the classification interval becomes

$$ {\text{m}} = \frac{2}{{\left\| {\vec{w}} \right\|}} $$
(8)
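As a quick sanity check on Eqs. (7) and (8), the toy numpy example below computes the point-to-plane distance and the margin. The hyperplane and the sample points are made-up values, chosen so that the closest samples satisfy \( \left| {\vec{w}^{T} \vec{x} + b} \right| = 1 \):

```python
import numpy as np

# Toy normalized hyperplane w^T x + b = 0, with the closest samples of each
# class chosen so that |w^T x + b| = 1, as in the text.
w = np.array([3.0, 4.0])          # ||w|| = 5
b = -2.0

x_pos = np.array([1.0, 0.0])      # w^T x + b = +1  (closest positive sample)
x_neg = np.array([1.0, -0.5])     # w^T x + b = -1  (closest negative sample)

norm_w = np.linalg.norm(w)
d_pos = abs(w @ x_pos + b) / norm_w   # Eq. (7): equals 1/||w||
d_neg = abs(w @ x_neg + b) / norm_w
margin = d_pos + d_neg                # Eq. (8): equals 2/||w||

assert np.isclose(d_pos, 1 / norm_w)
assert np.isclose(margin, 2 / norm_w)
```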

In fact, it can be seen from the above formula that the key to finding the support vectors is to find the normal vector \( \vec{w} \) that maximizes the margin \( 2/\left\| {\vec{w}} \right\| \); \( b \) is then obtained from the relation \( \vec{w}^{T} \vec{x} + b = 1 \) at a support vector. The key to the entire support vector machine is the following objective function:

$$ \mathop {\arg \hbox{max} }\nolimits_{{\vec{w},b}} \left\{ {\frac{1}{{\left\| {\vec{w}} \right\|}}\mathop {\hbox{min} }\nolimits_{i} \left[ {y_{i} \left( {\vec{w}^{T} \vec{x}_{i} + b} \right)} \right]} \right\} $$
(9)

That is, find the \( \vec{w} \) and \( b \) that maximize the distance to the nearest support-vector points of the two classes. This formula is subject to the constraint:

$$ y_{i} \left( {\vec{w}^{T} \vec{x}_{i} + b} \right) \ge 1 $$
(10)

At this point, the problem of finding the support vectors has been transformed into a constrained extremum problem. Such problems can be solved by the Lagrange multiplier method, which gives:

$$ {\text{L}}\left( {\vec{w},b,\alpha } \right) = \frac{1}{2}\left\| {\vec{w}} \right\|^{2} - \mathop \sum \nolimits_{i = 1}^{N} \alpha_{i} \left[ {y_{i} \left( {\vec{w}^{T} \vec{x}_{i} + b} \right) - 1} \right],\quad \alpha_{i} \ge 0 $$
(11)

Taking the partial derivatives of L with respect to \( \vec{w} \) and \( b \) and setting them to zero:

$$ \frac{{\partial {\text{L}}\left( {\vec{w},b,\alpha } \right)}}{{\partial \vec{w}}} = 0 \Rightarrow \vec{w} = \mathop \sum \nolimits_{i = 1}^{N} \alpha_{i} y_{i} \vec{x}_{i} $$
(12)
$$ \frac{{\partial {\text{L}}\left( {\vec{w},b,\alpha } \right)}}{\partial b} = 0 \Rightarrow \mathop \sum \nolimits_{i = 1}^{N} \alpha_{i} y_{i} = 0 $$
(13)

Substituting the two expressions above into the Lagrangian gives:

$$ {\text{L}}\left( {\vec{w},b,\alpha } \right) = \frac{1}{2}\vec{w}^{T} \vec{w} - \mathop \sum \nolimits_{i = 1}^{N} \alpha_{i} y_{i} \vec{w}^{T} \vec{x}_{i} - b\mathop \sum \nolimits_{i = 1}^{N} \alpha_{i} y_{i} + \mathop \sum \nolimits_{i = 1}^{N} \alpha_{i} $$
(14)

The support vectors of the two data sets are found by solving for the \( \vec{w} \) and b that maximize the above formula.

The commonly used kernel functions are the linear kernel, the polynomial kernel, the Gaussian (RBF) kernel, and the sigmoid kernel [16]. The choice of kernel function has a great influence on the classifier. We compare and analyze the following kernel functions [17].

The mapping function corresponding to the RBF kernel projects the samples into an infinite-dimensional space, as can be seen by expanding the RBF kernel as a polynomial series. After the mapping, all sample points lie on a portion of the sphere of radius 1 centered at the origin, since \( {\text{K}}\left( {x,x} \right) = 1 \). The RBF kernel is:

$$ {\text{K}}\left( {x_{i} ,x_{j} } \right) = { \exp }\left( { - \frac{{\left\| {x_{i} - x_{j}} \right\|^{2}}}{{2\upsigma^{2} }}} \right) $$
(15)

The linear kernel is mainly used for linearly separable data; in this case the dimension of the feature space equals that of the input space. It has few parameters and is fast, and for linearly separable data the classification effect is very good, so we also try the linear kernel for classification:

$$ {\text{K}}\left( {x,x_{i} } \right) = x \cdot x_{i} $$
(16)

The polynomial kernel can map a low-dimensional input space to a high-dimensional feature space, but it has many parameters, and when the polynomial order is high the elements of the kernel matrix tend toward infinity or toward zero, making the computation expensive:

$$ {\text{K}}\left( {x,x_{i} } \right) = \left( {\left( {x \cdot x_{i} } \right) + 1} \right)^{d} $$
(17)
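The three kernels of Eqs. (15)–(17) can be evaluated directly in numpy. The vectors and parameter values (sigma, the polynomial degree d) below are illustrative choices:

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    # Eq. (15): Gaussian (RBF) kernel
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

def linear_kernel(x, xi):
    # Eq. (16): linear kernel
    return float(np.dot(x, xi))

def poly_kernel(x, xi, d=2):
    # Eq. (17): polynomial kernel of (illustrative) degree d
    return (float(np.dot(x, xi)) + 1) ** d

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

k_rbf = rbf_kernel(x, y)      # ||x - y||^2 = 2, sigma = 1 -> exp(-1)
k_lin = linear_kernel(x, y)   # orthogonal vectors -> 0
k_poly = poly_kernel(x, y)    # (0 + 1)^2 = 1

# K(x, x) = 1 for the RBF kernel: every mapped sample lies on the unit sphere.
assert np.isclose(rbf_kernel(x, x), 1.0)
```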

4 Experimental Results and Analysis

We first used the wavelet transform to extract features; the wavelet transform reduces the dimension while extracting features, and the extracted feature data were classified with an SVM. In this experiment, we wrote the code in Python: the wavelet transform uses the third-party package pywt, and the SVM uses the popular third-party machine learning library scikit-learn. The hardware environment was an Intel(R) Core(TM) i7-8700K CPU @ 3.70 GHz, 32 GB RAM, and a 1080 Ti GPU.

In this study, we used the standard ORL face recognition data set, in which each of 40 subjects has 10 face images in different poses, for 400 photos in total. During the experiments, we selected one, two, three, four, five, six, or seven face poses of each person as the training set and used the rest as the test set.

In the experiments, we use the sym2 wavelet basis from the pywt library. The Symlets wavelet family consists of compactly supported orthogonal wavelets with strong localization in both the time and frequency domains, which correspond to concrete digital filters in the wavelet decomposition of a signal; we therefore chose the sym2 basis from the Symlets family to perform the wavelet transform on the image features. In the experiment, we decompose the original image to the fourth level, as shown in Fig. 2, and use the feature data after wavelet decomposition; the images before and after decomposition are shown in Fig. 3. For the feature data after the wavelet transform, we use SVM for multi-class classification; the predicted and true results on the SVM test set are shown in Fig. 4. In the SVM, we set the parameter C to 1000. C is the penalty factor for the error term: the larger C is, the more heavily misclassified samples are penalized, so accuracy on the training samples is higher but the generalization ability is weaker, i.e. classification accuracy on the test data drops. Conversely, reducing C tolerates some misclassification in the training samples and yields stronger generalization; when the training samples are noisy, the latter is generally preferred, treating the misclassified training samples as noise. The available kernel functions include 'linear', 'poly', and 'rbf'. We set the kernel coefficient gamma to 0.001.

Fig. 2.

(a) The original image; (b) the original image decomposed to the fourth level with the sym2 wavelet

Fig. 3.

(a) The spatial distribution of the features of the first original face (the first one on the left in Fig. 2); (b) the spatial distribution of the features of the first original face after four levels of wavelet transform (the image on the right in Fig. 2)

Fig. 4.

(a) Predicted classification results for the test samples (240 training samples, 160 test samples); (b) correct classification results for the same test samples


Table 1. Recognition rates of applied techniques according to increasing pose count

In the experiments, we also tried PCA for image feature extraction, using the PCA implementation in scikit-learn for dimensionality reduction; the feature data before and after reduction are shown in Fig. 5. We set n_components in sklearn.decomposition.PCA to 0.9, i.e. we retain enough principal components to explain 90% of the variance, and then use SVM for classification. We also combined wavelet-transform feature extraction with PCA dimensionality reduction (the same reduction as above), followed by SVM classification. Testing the three different methods yielded the experimental results shown in Table 1.
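A sketch of the PCA + SVM stage with scikit-learn is shown below. Since the ORL images are not bundled with scikit-learn, the bundled digits data set stands in here and the train/test split ratio is an illustrative choice, while n_components=0.9, C=1000, and gamma=0.001 follow the settings described in the text:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# The bundled digits set stands in for ORL (which scikit-learn does not ship).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

model = make_pipeline(
    PCA(n_components=0.9),                  # keep 90% of the variance
    SVC(kernel="rbf", C=1000, gamma=0.001), # settings from the text
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
assert accuracy > 0.9  # well above chance on this easy stand-in data set
```

Swapping kernel="rbf" for "linear" or "poly" reproduces the kernel comparison discussed above.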

Fig. 5.

(a) The spatial distribution of the training set before dimensionality reduction; (b) the spatial distribution of the training set after dimensionality reduction

From Table 1, we can see that four-level sym2 wavelet decomposition with SVM performs better than the other two methods for face recognition. As the number of training samples increases, when the number of training poses reaches 7, the recognition rate of the method proposed in this paper reaches 100%.

We also compare against a previous study [18]. Comparing the proposed method with the previous method, we plot the recognition rates of the two methods with different kernel functions in Fig. 6 (RBF kernel), Fig. 7 (linear kernel), and Fig. 8 (polynomial kernel). When the training set contains at most 2 faces per person (80 in total), SVMs with all three kernel functions outperform the previous method. When the training set contains fewer than 4 faces per person (160 in total), the previous method is superior. When the training set exceeds 160 faces, the recognition rate of our method is significantly higher than that of the previous method.

Fig. 6.

Comparison of two methods based on RBF kernel function recognition rate

Fig. 7.

Comparison of two methods based on LINEAR kernel function recognition rate

Fig. 8.

Comparison of two methods based on POLY kernel function recognition rate

5 Conclusions

In this paper, we propose a face recognition method based on the sym2 wavelet transform and validate it on the ORL data set. We extract face feature data with a 4-level sym2 wavelet transform and then use SVM to complete the classification and recognition. We use three different kernel functions and compare the method with two others (one using PCA and SVM, the other using the wavelet transform, PCA, and SVM). The experimental results show that our method outperforms both. We also compare the method with a previous method: as the number of training samples increases, its recognition rate exceeds that of the previous method, and when the number of training samples reaches 280, the recognition rate reaches 100%.