1 Introduction

The task of automatic human authentication and identification on the basis of face images is performed by a Face Recognition System (FaReS). A block diagram of a typical FaReS is shown in Fig. 1. The numbers 1–5 denote the elements: (1) initial processing, (2) face detection, (3) user registration and database creation/update, (4) feature extraction, (5) classification or comparison. Our article focuses on the feature extraction stage; the problem of face detection and localization is therefore not covered here. Some information on the authors' achievements in that field can be found in [14, 15]. The most complicated stages are 1, 2, 4 and 5, owing to the often unstable conditions of enrolling the input data, the large number of individuals with a small number of template images per person, the sophisticated processing methods and the ambiguous interpretation of results.

Fig. 1 Typical FaReS elements

The problems presented above mean that the task of building a universal face recognition system has still not been solved in any acceptable way [34]. They have also prompted international bodies to join their efforts to build common standards for digital portraits, which define the size, format, quality and content of facial images for various applications: visas, travel documents, identification cards. According to the ISO/IEC standard that is currently under discussion (ISO/IEC JTC 1/SC 37 N 506 : Part 5: Face Image Data), a facial portrait has to be an image of the indoor class, illuminated with natural ambient light, neither too light nor too dark, with minimum dimensions of 320 × 240 pixels. The face presented in the image must have frontal orientation. The facial area has to be of vertical orientation and cover at least 80% of the whole image. The distance between the centers of the eyes must be at least 60 pixels. The expression should be neutral (non-smiling), with both eyes open normally (i.e., not wide open) and mouth closed. If a person normally wears glasses, then they should wear glasses when the photograph is taken. Glasses shall be of clear, transparent glass so that the eye pupils and irises are clearly visible; the frames must not obscure the eyes, and there shall be no lighting artifacts or flash reflections on the glasses [3]. The recommended configuration of a facial image is presented in Fig. 2.

Fig. 2 Face image configuration together with sample numerical values [3]

As can be seen in Fig. 2, the parameter W, denoting the image width, binds the other dimensions and distances, such as the eye line and face proportions, which helps to obtain a highly standardized geometry.

The introduction of new standards for facial images has made it possible to interchange datasets and integrate them in order to improve current procedures in the international struggle against terrorism and in border control.

Databases involved in these tasks undergo fast and frequent changes of content, yet they are required to offer high performance and reliability, which conflicts with real-time operation. Hence, we propose a new structure for such a system, based on a 2-tier architecture. It is oriented at:

  1. integration of data (containing not only cropped faces, but also faces with background, and even non-faces),

  2. frequent updates of the database,

  3. a small number of face templates per person,

  4. high performance.

Such systems are aimed at various real-world applications involving fast face retrieval [21], but in many cases the publications describe only the general ideas, not the technical aspects used [40, 43]. It is worth noticing that such a system can be implemented mainly because of the current progress in defining standardized input data (dimensions, format, quality and content of the image). This paper presents some important elements of a prospective system especially oriented at the new biometric standards. However, databases meeting all these standards do not exist yet. That is why we work with three available datasets whose characteristics are close to the required ones: FERET (fa and fb sets)—indoor-class pictures containing frontal faces; BioID—faces in frontal orientation with a complicated background; ORL—one of the most popular benchmark databases.

According to the strategy presented above, the solution assumes two subsystems (tiers): face retrieval and face recognition, which is graphically presented in Fig. 3.

Fig. 3 The strategy of a 2-tier face recognition system

The first stage (face retrieval) solves the problem of fast image "filtering": it selects \(U_r\) objects from the whole set \(U_b\) based on the approximated similarity between the query face and the other faces (\(U_r \ll U_b\)). What matters most at this stage is very high performance and good reliability of the algorithms being used. Hence, this stage uses relatively simple methods and features. The second stage (face recognition) processes only the small subset of images chosen at the first stage. Algorithms at this stage have to give exact and highly reliable results. This subsystem involves several state-of-the-art methods, e.g., dimensionality reduction [16, 19] and complicated classifiers. In our paper we focus on the first-stage algorithms, especially the feature extractors, which determine the whole process of recognition.

All well-known FaReSes can be divided into three categories [11, 13, 23, 29, 35, 41] based on the way they represent and use face information (the "Features extractor" block in Fig. 1). The first group includes systems where a 3D model of the frontal part of the human head is used. The second group consists of systems which exploit anthropometrical face parameters, such as graph models and elastic 2D face models. The third group comprises systems where digital face images are represented by a certain set of primitive features (physical or mathematical), mainly distances and angles.

Modern FaReSes encountered in practice are usually classified as the third type. The initial (low-level) features of a face image in these systems are represented by the luminance value of every pixel. One can find several arguments which justify their popularity:

  • brightness features are a natural representation of digital images, even after scaling or rotation;

  • it is possible to find areas with strong changes of brightness (gradients, leaps) and assign them to specific face areas such as the eyes and pupils, corners of the eyes, eyebrows, nose, lips, hairline and the bottom part of the face oval; the estimation of these points is a starting point for creating a 3D model;

  • by using the brightness of pixels it is possible to build models of face images on the basis of their approximation in the eigenface space [13, 29].

Further analysis of the above presented class of FaReSes makes it possible to classify them on the basis of the feature extractor they use, which determines the way the face image is represented. In the discussed class of FaReSes we would like to focus on the following four groups [17]:

  1. Straightforward representation in the initial feature space (scaling, random selection of pixels).

  2. Representation in the feature space obtained by mathematical transformations of the input features (Fourier Transform, Cosine Transform, Wavelet Transform).

  3. Representation by selection of features obtained by the above methods.

  4. Compact representation obtained by dimensionality reduction methods (principal component analysis, linear discriminant analysis and many more).

Usually the work of the feature extractor ends at a very early stage. Further processing of the data (e.g., comparing or storing) is performed in the chosen feature space (see Fig. 1). It should be stressed that the complexity of the FaReS structure is, in fact, strongly influenced by the complexity of the feature extractor. Other, more complex procedures of that kind are the dimensionality reduction algorithms [2, 9, 10, 13, 24, 25, 34, 42]. Therefore we can divide FaReSes into simple and complex ones, remembering that a simple FaReS does not include a dimensionality reduction stage, whereas a complex FaReS can be built as a simple one with a dimensionality reduction block added at its output. The latest research (not only ours) shows that the potential of systems based on simple feature extractors is still not exhausted [34]. On the other hand, it is obvious that the performance of simple features (mostly appearance-based) may not be as good as the current state of the art (involving geometrical and anthropometrical features).

Let us notice that a wide range of scientific publications suggests that the accuracy of the face recognition task is a simple function of FaReS complexity, i.e., the more complex the FaReS, the better it works. Many scientists also claim that the presence of a dimensionality reduction stage automatically improves recognition accuracy. On the other hand, it is not common to investigate the accuracy of simple FaReSes (those not using dimensionality reduction). It should be emphasized that we are not against those claims, but we are convinced that the total accuracy of the recognition process depends mainly on the feature extraction stage. The problem is that the extractors and their investigations are not "standardized". The variety of test benches (facial image databases) has a negative influence, because each scientific group focuses on its own environment.

For the reasons stated above, in the remaining part of the article particular attention is paid to "simple extractors" and "simple FaReSes". We also present a comparative analysis of simple facial image feature extractors, together with new opportunities for this type of FaReS as well as new ways of implementing them. Their main application areas are: initial processing as a face retrieval stage in large-scale face recognition systems, direct hardware implementation, and utilization on mobile devices of limited computing power (e.g., Java 2 Micro Edition).

The remaining part of the article is structured as follows. Section 2 presents different means of feature extraction from facial images. In Sect. 3, the parameters of optimal feature extraction are introduced. Section 4 presents face recognition experiments on the Olivetti Research Lab (ORL) database of faces [27] as well as on the BioID [4] and FERET [30] databases. The article ends with a proposition of a parallel structure (committee) of feature extractors (Sect. 5) and a summary where conclusions and future work directions are presented.

2 Facial visual feature extraction

None of the benchmark databases used in our experiments and described in this paper meets the norms discussed above. However, they comply with at least some of them. Their distortions are: small variations in illumination (mostly directional lighting), changes in scale and orientation, and slightly different framing.

Let the input image for a FaReS contain a human face. This face should be the only one, or at least the biggest object, in the image space (see Fig. 4). We do not deal with face localization because it is outside the scope of this article. If the size of the head is M × N pixels, then, using the brightness of pixels for image representation, the size DIM of the feature vector equals MN. For example, if M = 112 and N = 92 (typical for the ORL database of faces), then DIM = 10,304.

Fig. 4 Features extracted from a face image

According to [7, 35, 38] such an image (understood as a pattern) may be represented by a feature vector of much lower size than MN. To achieve this, the following methods are exploited (see Fig. 4):

  • F1. Reducing the image resolution to size m × n and concatenating the pixels of the resulting image into a smaller feature vector (where m ≪ M and n ≪ N);

  • F2. Creating the feature vector by pseudo-randomly selecting single pixels from the input face image (the same points across all images in the set) which, to some extent, is invariant to pose and expression changes [38];

  • F3. Generating the feature vector from a set of selected spectral components obtained by performing orthogonal transformations on the input face image (DFT, DCT, Wavelets), which is motivated by the fact that these compact features are invariant or at least resistant to low image quality and noise [2, 7, 9, 22, 31, 36, 44];

  • F4. Calculating the feature vector from the intensity histogram of the input image—a simple yet effective method of creating an affine-transformation-invariant descriptor [18].

These features were chosen intentionally to keep the computing requirements of the FaReS as low as possible. They can be calculated using fixed-point arithmetic, which is very important for mobile devices that do not have hardware support for floating-point arithmetic [33]. Another important consideration is the strict memory limitation of simple computer systems; hence the analyzed methods must not require large memory or disk space. The most complicated procedure is, in fact, the FFT, which can be supported by a fast software library [6] or a hardware implementation [37]. The other algorithms presented here can be implemented using elementary arithmetic operations on integer numbers. The computational complexity of the F1, F2 and F4 methods is strictly linked to the dimensionality of the feature vector, hence it is equal to O(DIM). The complexities of the DCT and DFT are linked to operations involving matrix multiplications, and are equal to O(DIM log(DIM)) and O(DIM²), respectively.

3 Parameters of feature selection

The input image is represented by a set of physical features—luminance values—in the F1 and F2 methods.

In the F3 and F4 approaches, the input image is represented by a set of mathematical features. In the F3 case, the feature vector is taken from spectral features obtained using a discrete orthogonal transformation (e.g., the Discrete Cosine Transform or the Discrete Fourier Transform). Features obtained from a histogram are used in the F4 method.

3.1 The scaled representation (F1) approach

The F1 approach is well known and widely used in FaReSes [28, 29]. Its main goal is to decrease the resolution of the input image to the size m × n satisfying the following condition (see Fig. 4):

$$ \hbox{DIM}=mn < Z,$$
(1)

where Z is the number of all images in the FaReS database (with the additional condition that DIM = mn ≪ MN).

We should remember that scaling can be performed both by averaging/interpolation and by using the wavelet transform [32]. It should be remembered that the latter operation is orthogonal and does not change the distances in the image.
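To make the F1 procedure concrete, here is a minimal Python sketch (our illustration only; the authors' system was implemented in MATLAB, and block averaging is just one of the admissible scaling methods):

```python
import numpy as np

def f1_features(img, m=15, n=13):
    """F1: downscale an M x N grayscale image to m x n by block averaging
    and concatenate the pixels into a feature vector of DIM = m * n."""
    M, N = img.shape
    # crop so the image splits evenly into an m x n grid of blocks
    Mc, Nc = (M // m) * m, (N // n) * n
    blocks = img[:Mc, :Nc].astype(float).reshape(m, Mc // m, n, Nc // n)
    return blocks.mean(axis=(1, 3)).ravel()
```

For an ORL image (M = 112, N = 92) with m = 15 and n = 13 this yields DIM = 195, which satisfies Eq. (1) for the ORL set (Z = 400).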

3.2 The random points (F2) approach

The F2 method is based on the ideas described in [38]. While the approach presented above (F1) performs regular sampling of the image plane, this approach performs a certain type of irregular sampling. In the problem addressed in this paper, the image segment relevant to object recognition is proportional to the area of the image occupied by the object, and is still too large to be used efficiently for recognition due to the "curse of dimensionality". That is, a large number of pixels or features in the image (larger than the number of object classes) may not lead to improved recognition rates [1]. Instead, consider a smaller number of pixel values extracted along a line between two points in the image. This dimensionality is small enough for efficient classification, but of course may not capture all the information necessary for correctness. However, with some small probability (larger than random) the line predicts the correct object class. In the beginning of the algorithm, p lines are placed on the input image plane. In the next step we select at random d points on each line (see Fig. 5). The coordinates of each point describe a specific pixel location in the input image. Their positions are random, but fixed for all images in the set. The intensity values of all pd pixels define a set of features which describe the face image. It is important that the order of selection of the coordinates is preserved for all input images (both in the database and for new test images). As shown in [38], the influence of changes of face location, rotations and facial expression is not significant, because most of the points fall in face areas of uniform intensity. This approach can be used for images that meet the new standards of the digital face portrait (stated above).

Fig. 5 Sample paths of random points

The process of selecting the p lines and the d points is based on a random number generator. The initially generated random numbers (from the interval \(\left\langle 0, 1\right\rangle\)) are scaled and rounded according to the image dimensions, then sorted and corrected by shifting. The sorting procedure arranges the coordinates of the pixels on a line in ascending or descending order. The shifting process moves pixels toward the image center (into the face area), in order to cover the most important part of the face. In our experiments, to achieve the best pixel distribution over the face area, four alternative variants of lines were generated. They had the following general directions: from the lower left to the upper right corner, from the upper left to the lower right corner, along the horizontal axis and along the vertical axis. Only those lines which approximately cover the whole face area are chosen to form the feature vector.
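The following Python sketch illustrates the point generation described above (an illustrative reconstruction under our own assumptions about line placement; the original shifting and line-selection rules may differ in detail):

```python
import numpy as np

def f2_coordinates(M, N, p=10, d=20, seed=0):
    """F2: draw d pseudo-random points on each of p lines; the coordinates
    are generated once and reused for every image in the set."""
    rng = np.random.default_rng(seed)
    coords = []
    for _ in range(p):
        # uniform random line endpoints, shifted toward the central (face) area
        (r0, c0), (r1, c1) = rng.uniform(0.2, 0.8, size=(2, 2)) * (M, N)
        t = np.sort(rng.uniform(0.0, 1.0, size=d))   # sorted positions on the line
        rows = np.clip(np.round(r0 + t * (r1 - r0)).astype(int), 0, M - 1)
        cols = np.clip(np.round(c0 + t * (c1 - c0)).astype(int), 0, N - 1)
        coords.append(np.stack([rows, cols], axis=1))
    return np.concatenate(coords)                    # (p * d, 2) fixed pixel locations

def f2_features(img, coords):
    """Intensities of the p * d fixed pixels form the feature vector."""
    return img[coords[:, 0], coords[:, 1]].astype(float)
```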

A few examples of the distribution of lines on the input image plane are presented in Fig. 5, where three lines for each general direction are shown and * marks the locations of the coordinates on those lines (here d = 20).

A demonstration of the coordinate calculation based on the procedure described above is presented in Fig. 6. We selected 20 points on each of the 10 generated lines (p = 10 and d = 20). On each sample face image there are 200 points, whose intensities form the feature vector.

Fig. 6 Sample faces and points selected according to the F2 approach

Although each image is represented by randomly selected points, their positions are retained across the whole set. It should also be mentioned that the points are not linked to any specific face features (eyes, mouth, nose); rather, they represent the high-resolution image by a nonuniformly and pseudo-randomly sampled miniature (in the form of a vector). The stability of the features obtained in this way is high—the only important requirement is that they should be spread over the whole face area. In our experiments we tested both normally distributed and uniformly distributed pseudo-random numbers. We came to the conclusion that the latter are more suitable, and the lines based on them are much less sensitive to face movements. In our experiments we selected 200 random feature points; the mean recognition accuracy was 87%, with a standard deviation of 2%.

We also tested a few scenarios of recognition for p = {8, 10, 12, 14, 16, 18, 20} and d = {15, 17, 19, 21, 23, 25}. The results are presented in Table 1.

Table 1 Recognition rate (%) for different parameters of F2 approach (ORL DB)

The average recognition accuracy for the cases presented above was 87.4%, while the standard deviation was only 1.6%.

3.3 The spectral components (F3) approach

Among the most frequently and widely used spectral transformations, not only in FaReSes but also in general image recognition tasks, are the Discrete Fourier Transform (DFT) [10, 22, 31] and the Discrete Cosine Transform (DCT) [2, 7]. By using the DFT coefficients it is possible to obtain features which are invariant to periodic shifting of the face in the image plane. We assume, however, that the face localization stage gives acceptable results. As shown in [12], an acceptable reconstruction of a face requires a sub-block of 20 × 20 DFT components. Thanks to the symmetric nature of the DFT spectrum, the size of the feature vector in that case can be halved (DIM ≤ 200).

The feature vector calculated from the absolute values of the DFT components of the input image is derived as follows:

$$ C(p,q) = \left| \frac{1}{MN}{ \sum\limits_{m = 0}^{M - 1} {\sum\limits_{n = 0}^{N - 1} {X(m,n) e^ {\left({ - j \frac{2\pi qn} {N}} \right)} e^ {\left({ - j \frac{2\pi pm}{M}} \right)}} } } \right|, $$
(2)

where:

\(q \in \left\{ 0,1,2, \ldots,\frac{Q}{2} \right\} \cup \left\{ N - 1,N - 2, \ldots,N - \frac{Q}{2} \right\}\);

\(p = 0, 1, 2, \ldots, P - 1\);

X(m, n)—input image pixel at coordinates (m, n);

C(p, q)—spectral component at coordinates (p, q);

P, Q—size of the spectral window.

If we represent the coordinates p and q of Eq. (2) on the shifted (centered) DFT spectrum (see Fig. 4), then the spectral window covered by these coordinates is placed in the center, under the horizontal axis of symmetry of the spectral matrix.

All window components are included in the final feature vector except C(0, 0), which describes the average intensity of the input image. By eliminating this component we get a feature vector which is more sensitive to the shape than to the brightness of the face image.
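A minimal numpy sketch of this feature extraction (our illustrative reading of Eq. (2); the window sizes P = 10, Q = 20 follow the values used later for the ORL experiments):

```python
import numpy as np

def dft_features(img, P=10, Q=20):
    """F3/DFT: magnitude-spectrum window of Eq. (2), C(0, 0) excluded."""
    M, N = img.shape
    C = np.abs(np.fft.fft2(img) / (M * N))
    # q in {0..Q/2} U {N - Q/2..N - 1}, p in {0..P - 1}
    q = np.r_[0:Q // 2 + 1, N - Q // 2:N]
    window = C[np.ix_(np.arange(P), q)]
    return window.ravel()[1:]   # drop C(0, 0), the average-intensity term
```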

The use of the second transform—the DCT—is motivated by the fact that in many simple practical applications its basis functions are often used instead of the eigenfunctions of the Karhunen–Loève transform. The other motivation is the widespread availability of fast DCT algorithms in the field of digital image processing [7]. Moreover, the DCT makes it possible to describe the face image sufficiently with a small feature set—the spectral DCT components. As in the PCA/KLT, the high-energy components tend to concentrate in the top-left corner of the two-dimensional spectrum.

If the initial image is represented by a matrix \(X_{M \times N}\), then the calculation of the spectral features of the DCT can be described in the following matrix form:

$$ C_{P \times P} = T_{P \times M} X_{M \times N} T_{N \times P},$$
(3)

where \(T_{P \times M} = \left[ t_M^{(p,m)} \right]\) and \(T_{N \times P} = \left[ t_N^{(n,p)} \right]\) are the cosine transform matrices, with

$$ t_M^{(p,m)} = \begin{cases} \frac{1}{\sqrt M}, & p = 0; \quad m = 0, 1, \ldots, M - 1; \\ \sqrt{\frac{2}{M}} \cos \frac{\pi (2m + 1) p}{2M}, & p = 1, \ldots, P - 1; \quad m = 0, 1, \ldots, M - 1; \end{cases} $$

$$ t_N^{(n,p)} = \begin{cases} \frac{1}{\sqrt N}, & n = 0, 1, \ldots, N - 1; \quad p = 0; \\ \sqrt{\frac{2}{N}} \cos \frac{\pi (2n + 1) p}{2N}, & n = 0, 1, \ldots, N - 1; \quad p = 1, \ldots, P - 1. \end{cases} $$

In the resulting matrix \(C_{P \times P}\), only \(\frac{P(P + 1)}{2} - 1\) components are taken into consideration. When the coordinates p are projected onto the resulting DCT matrix (see Fig. 4), the window covering these coordinates is located in the top-left corner. A sample representation of these features is presented in Fig. 7, where the upper-left triangle of a 20 × 20 block of points creates a feature vector of DIM ≈ 200. It is worth noticing that, as in the DFT case, the component C(0, 0) is not used; eliminating it renders the feature vector more sensitive to the shape than to the brightness of the face image. The remaining components of the spectral window are then selected in a zig-zag fashion, which means that the most important components from the upper-left corner of the matrix are transferred to the beginning of the feature vector. The same scheme is used in JPEG/MPEG compression [39].

Fig. 7 Visual representation of DCT coefficients of sample images
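A minimal sketch of the DCT feature extraction with the zig-zag scan (our illustration; scipy's dctn is used as a stand-in for the matrix form of Eq. (3)):

```python
import numpy as np
from scipy.fft import dctn

def zigzag_indices(P):
    """(i, j) pairs of a P x P block in JPEG-style zig-zag order."""
    order = []
    for s in range(2 * P - 1):
        diag = [(i, s - i) for i in range(P) if 0 <= s - i < P]
        order.extend(diag if s % 2 else diag[::-1])
    return order

def dct_features(img, P=20):
    """F3/DCT: zig-zag scan of the top-left P x P window, C(0, 0) excluded."""
    spec = dctn(img.astype(float), norm='ortho')          # 2-D DCT-II
    feats = np.array([spec[i, j] for i, j in zigzag_indices(P)])[1:]
    return feats[:P * (P + 1) // 2 - 1]                   # upper-left triangle
```

For P = 20 this keeps 209 ≈ 200 components, matching the DIM reported above.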

3.4 The intensity histogram (F4) approach

We now proceed to the last, and in fact the simplest, method of feature extraction—histogram calculation [28, 35]. The histogram of a digital grayscale image represents information about the distribution of pixels in the intensity space. The shape of the histogram may be crucial in the process of comparing digital images. It is invariant to rotation (in the image plane) and to scaling (along each axis individually or both axes simultaneously). Considering geometrical transformations, the normalized histogram can thus be taken as an invariant representation of the image. For rotated images the similarity of the histograms follows from the definition of the histogram itself. For scaled images, or images with different scales along each axis, the similarity of histograms arises from the possibility of a coarse discretization of the intensity range (usually 256 levels) and of normalization. The discretization is characterized by the BIN parameter, and the resulting histogram is normalized by the number of pixels of its base image. In most image recognition problems BIN is usually set between 8 and 64. This not only eliminates the scale factor but also lowers the influence of head rotations.

The procedure of creating the feature vector on the basis of the intensity histogram is straightforward. Each histogram value H(j) is the number of pixels of intensity j = 0, 1, ..., 255 in the input image. In the next step the histogram H(j) is transformed into a histogram H(b) according to the following rule:

$$ H(b) = \sum\limits_{j = (b - 1)\left\lfloor \frac{256}{\mathrm{BIN}} \right\rfloor}^{b \left\lfloor \frac{256}{\mathrm{BIN}} \right\rfloor - 1} H(j), \quad b = 1, 2, \ldots, \hbox{BIN}. $$
(4)

Then H(b) is normalized:

$$ H^{\rm (norm)}(b) = \frac{H(b)}{MN}, \quad b = 1,2,\ldots,\hbox{BIN}, $$
(5)

where M, N are the number of rows and columns in the input image.

The resulting values \(H^{\rm (norm)}(b)\) are stored in the final feature vector.
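Eqs. (4)–(5) reduce to a few lines of Python (a sketch assuming 8-bit grayscale input):

```python
import numpy as np

def f4_features(img, BIN=32):
    """F4: coarse intensity histogram (Eq. 4) normalized by M * N (Eq. 5)."""
    width = 256 // BIN                       # floor(256 / BIN) intensities per bin
    h, _ = np.histogram(img, bins=BIN, range=(0, BIN * width))
    return h / img.size
```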

Figure 8 shows four face images from the ORL database. Under each face image the corresponding histogram is presented (here BIN = 16).

Fig. 8 Face images and their normalized histograms

One can observe that the first two faces (1 and 2) belong to the same class, while faces 3 and 4 belong to other classes. Importantly, the histograms of face images 1 and 2 have similar forms, while the histograms of images 3 and 4 are different, which is particularly useful in retrieval and recognition tasks.

It should be noted that the presence of extra objects in the input image and variations of lighting strongly change the shape of the histogram, which can make recognition impossible. However, if the database is normalized (standardized), these problems are eliminated and recognition is possible.

Experiments show that tuning the BIN parameter is crucial to achieving high recognition accuracy. When BIN is set correctly, we can eliminate the influence of geometrical image transformations and the problems coming from the presence of other objects in the scene.

3.5 Similarity measure

The main goal of our experiments was to compare a few different features under similar conditions. First we had to select a common similarity measure. We evaluated the following four approaches: the minimal distance criterion estimated by the L2 metric, the Mahalanobis distance, K-means, and the cosine of the angle between feature vectors. The K-means method was used in the following manner: the features of the test image were inserted into the feature space with a known number of classes; the K-means algorithm was then run to cluster the modified feature space, and the query image was assigned to one of the known classes. Figure 9 shows a comparison of three different methods of calculating the similarity between images described by DFT windows of different sizes. It can be clearly seen that the L2 metric gives the best results no matter which window size is chosen. In this experiment, ORL faces were used.
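For reference, the L2 metric and the cosine measure used in these comparisons can be sketched as follows (a minimal illustration; the function names are ours):

```python
import numpy as np

def l2_distance(u, v):
    """Minimal distance criterion: Euclidean (L2) distance between feature vectors."""
    return np.linalg.norm(u - v)

def cosine_of_angle(u, v):
    """Cosine of the angle between two feature vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```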

Fig. 9 Comparison of three methods of similarity measure for different sizes of the DFT window

The performance of the Mahalanobis distance and K-means in this case was lower, probably because in practice we deal with only one class of images (faces), whose elements have certain common features. These measures would work better for databases containing more classes of objects (i.e., different objects such as faces and non-faces).

Figure 10 shows the results of an experiment performed on FERET cropped faces. For each iteration, three faces were taken: two of the same class (Face 1 and Face 2) and one belonging to another class (Face 3). For each pair of these images the cosine of the angle between the feature vectors (DFT) was calculated. It can be seen that, in general, the cosine of the angle for images belonging to the same class is higher than for a pair from different classes, and the separation is good. However, there are some cases where the results for both pairs are similar, which suggests that this measure may not be optimal for the features presented in this article. Hence, in the remainder of the article a similarity measure based on the L2 metric is used.

Fig. 10 Comparison of the cosine of the angle between feature vectors describing faces from the same class (upper line) and different classes (lower line)

4 Experiments

All presented results were obtained with the authors' FaReS system implemented in MATLAB [26] and with a standalone Face Recognition System Modeler "FaReS-Mod" for the MS Windows environment [13]. In the recognition experiments presented here we used face images from three well-known benchmark databases: the ORL face database [27], BioID [4] and FERET [30]. The databases we use do not comply with the assumptions specified at the beginning of this paper (i.e., frontal, neutral pose, strict geometrical proportions and the lack of variable illumination), but datasets that do are not yet available. Moreover, we believe that if the algorithms presented here work on such data, they will also work on the standardized sets described in the introductory part of this article.

Despite the fact that the recognition process presented in the paper is performed as simply as possible (using distance metrics and the nearest neighbor rule), the accuracy is quite acceptable. The other important reason is the goal of low computing power requirements and ease of hardware implementation. We wanted our approach not to require high computing power or large memory space. That is why we do not use complicated operations such as covariance calculation, convolution, inversion of high-order matrices, dimensionality reduction and other linear algebra procedures. The most complicated operation we use is the DCT/DFT, which can be performed with fixed-point arithmetic and is commonly implemented in software libraries and in hardware through the FFT [5, 6].

4.1 The ORL database

The ORL database contains 40 different classes of face images (10 patterns for each class/person). They are all indoor, grayscale images of the same size, 112 × 92 pixels, stored in bitmap files. Parameters for the experiments were adjusted so as to give equal lengths of feature vectors across all the methods: m = 15, n = 13, p = 10, d = 20, P = 10 and Q = 20 for the DFT, and P = Q = 20 for the DCT. The histogram feature vector was defined for BIN = 32. The summary is presented in Table 2.

Table 2 Feature vector sizes

The structure of the input data for each experiment is presented in the upper part of Table 3. The recognition process was performed for all 40 classes and for different splits of the initial data (used for learning and testing) in every class. Eight possible variants of splits were taken into account. For example, the "1/9" variant means one template and nine test images per class, and "2/8" means 2 templates and 8 test images. The total number of templates stored in the FaReS database, as well as the total number of test images, are presented in the second row of the table (e.g., 160/240, 200/200, ...).

Table 3 ORL DB—Recognition rate (%) for different number of test images

Classification of the test images (understood as their membership of a given class) was carried out with the minimal distance criterion: we searched for the nearest neighbor of a given image by means of the L2 metric (a minimal sketch of this step is given after the list below). The percentage of correctly recognized test images (for every variant of data split) is shown for every method. Final recognition results were evaluated as the proportion of correctly recognized images to the total number of test images. The analysis of the results provides the following conclusions:

  • For all alternative sets of features and for all variants of the input data splits, satisfactory results were obtained.

  • Among all considered methods the most accurate approach is DFT (from the F3 group of methods).

  • The second place, with regard to the results of recognition, is taken by the histogram approach.

  • In a FaReS realized using the histogram method, the size of the feature vector equals 32 × 1; for the presented FaReS this is the smallest size of the feature space which still gave acceptable recognition results.
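The nearest-neighbor classification step can be sketched as follows (an illustrative helper, not the authors' exact implementation; any of the F1–F4 extractors can supply the vectors):

```python
import numpy as np

def classify(query, templates, labels):
    """Assign the query feature vector to the class of its nearest template (L2)."""
    distances = np.linalg.norm(templates - query, axis=1)  # L2 to every template
    return labels[int(np.argmin(distances))]
```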

It is well known that face recognition results depend not only on the quality of the testing data but also on the data stored in the database. Hence, it is possible that the good recognition results presented are associated with an appropriate selection of the test images and a "lucky" selection of templates during database creation. Moreover, it is possible that with a different choice of testing and training images the recognition results would be worse. To check these arguments we carried out additional experiments. Using the "7/3" variant of the initial data split of the ORL database we prepared ten different models of initial data divisions. Cyclic shifting was performed during element selection; no rearrangement was made.

According to the above procedure, eight variants of data splits, with three test images in each, were obtained. The number of images in the testing sets is presented in Table 3 on the left-hand side. The right side of the table presents the results of test image recognition for the respective variants.

In the lower part of Table 4, mean recognition rates are displayed. From the presented results it can be seen that the best performance is achieved by the spectral components of the DFT (98% mean and 100% maximum). Second place is taken by the spectral components of the DCT (97% mean and 99% maximum). This can be explained by the properties of the DCT basis functions (see the explanation in Sect. 3.3). Features obtained from the histogram are worse than the DCT spectral components, but not significantly: the recognition accuracy in that case reaches 96% (mean) and 99% (maximum) for BIN = 32 and the "7/3" variant of the ORL database division. In this case it profits from the representative characteristics of the dataset.

Table 4 ORL DB—Recognition rate (average, minimal, maximal and standard deviation) (%) for 7 test images (cross-validation used)

Finally, the presented features can provide some "extraordinary" recognition opportunities. Figure 12 shows the results of an experiment involving a simulation of tearing pictures apart. In this experiment, the test images were distorted in a way close to the situation when a paper picture is torn. Both the DFT/DCT and histogram approaches achieved successful recognition, retrieving the corresponding images from the database.

Fig. 11 Results of recognition of ORL faces; the first column contains the query image, the other five the images found in the database, sorted in decreasing similarity order; DCT features used

Fig. 12 The results of the experiment involving "torn" pictures from the ORL database

The presented results (Table 4) show that simple methods of feature extraction can give satisfactory effects, comparable to other, more complex approaches [10, 22, 35, 38, 44].

4.2 The BioID database

The BioID database includes 1,500 grayscale images of the "indoor" type with dimensions of 384 × 286 pixels [4]. Each image contains only one human face, which usually is not rotated (it looks straight ahead). In general, the background shows an office interior. Lighting conditions are not controlled, producing quite high variations. The face area occupies no less than one third and no more than half of the whole image. In fact, these are not typical face images (like, for example, those in the ORL database), but pictures of interiors with faces. Besides, images in the BioID database are not grouped into classes.

To estimate the recognition accuracy for images from the BioID database, we performed two sets of experiments. The first set works with original images from the BioID database, while the second—with images that contain only the cropped facial area. These sets were created in the following manner:

  1. 1,500 images: 30 persons (classes), 50 "raw" images per person; training and testing sets created in a random way, each containing 25 images;

  2. 300 images: 30 persons (classes), 10 cropped face images per person; training and testing sets created in a random way, each containing 5 images.
Our experiments show that the rate of correct classification of test images for the simplest histogram features is 96% (for BIN = 16 and BIN = 32), which is a very good result. For comparison, the recognition rate for the DFT features is approximately 94% (see Table 5).

Table 5 BioID DB—the average rate (%) of recognition for different features

The experiments confirmed that the overall shapes of faces are more important than their backgrounds. Moreover, the classification of a test image generally does not depend on head rotation, the size and location of the face area, or facial expression.

In further experiments we tested the recognition of cropped face images. The localization of the face area was performed automatically, directly on the input images, in a one-pass process—without changing or correcting the cropped face images [15, 20]. This scenario gave a recognition rate of 87% (averaged over all features used). The lower recognition accuracy, caused by imperfect face localization, does not, however, disqualify this method.

4.3 The FERET database

In the last part of our survey we used the FERET database, which originally contains more than 12,000 grayscale face images (portraits) of the "indoor" class. Each image is represented by a matrix of 256 × 384 pixels [30]. Images in each class cover the most important changes in lighting, contrast, background, face size (up to 4 times), hairstyle, facial expression, head orientation, age, clothes, etc.

In the experiment, 2,000 randomly selected images taken from the database (groups "fa" and "fb") were split into templates (images from the group "fa") and test images (portraits from the group "fb") according to the 1,000/1,000 variant. It should be mentioned that for every image from the first group there is only one image from the second one, with slightly different facial expression and head orientation. Our experiments were performed on three types of images: images taken directly from the database, and images cropped using an automatic face-detection method [20] with two types of face-area selection: the first cropping method involved tight face framing, while the second kept the whole face view (see Table 6).

Table 6 Input data form (FERET)

The best recognition rate for the presented data was more than 85%, obtained for the spectral features (DCT/DFT) discussed in this paper (see Table 7). The lower performance of the histogram-based features comes from the complicated background and strong illumination changes.

Table 7 Performance of FERET face recognition

5 Parallel structure of feature extractors

Our experiments show that an implementation of a FaReS based on the presented principles is possible [14]. To increase the overall recognition rate we have investigated the possibility of joining these simple feature extractors into a parallel structure (defined as a committee of extractors). The general framework of such a system is presented in Fig. 16, where "Proc" means initial processing, "FE" a feature extractor, "S/Red" an optional feature space reduction, "C" a comparator, and "DB" the appropriate database. The number of features used (r) depends mainly on the required performance.

Fig. 13 Experimental results for the BioID database. Query image in the first column; resulting images (DCT) in decreasing similarity order in the next three columns

Fig. 14 Experimental results of DFT-based face retrieval (FERET database). The first column contains the query face, the others the resulting faces in decreasing similarity order

Fig. 15 Sample results of recognition (retrieval) for FERET cropped faces based on the DFT. The first column contains the query face, the others the resulting faces (face number and distance indicated above each one)

Fig. 16 Framework of a combined face recognition system

Let us assume that an example system includes three feature extractors {FE\(_1\), FE\(_2\), FE\(_3\)} and, respectively, three comparators. We denote the mean recognition rates of the extractor–comparator pairs as \(P_1\), \(P_2\) and \(P_3\). The total recognition rate P of such a combined system (on the interval \(\left\langle 0,1\right\rangle\)) can be calculated according to the following formula:

$$P=P_1 P_2 P_3 + P_1 P_2 \bar{P_3} + P_1 \bar{P_2} P_3 + \bar{P_1} P_2 P_3; $$
(6)

where \(\bar{P_1}=1-P_1,\bar{P_2}=1-P_2,\bar{P_3}=1-P_3.\)
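Eq. (6) is simply the probability that at least two of the three independent extractor–comparator pairs answer correctly; a short function (our illustration) makes the computation explicit:

```python
def committee_accuracy(p1, p2, p3):
    """Total recognition rate of a 2-of-3 voting committee (Eq. 6)."""
    q1, q2, q3 = 1 - p1, 1 - p2, 1 - p3
    return p1 * p2 * p3 + p1 * p2 * q3 + p1 * q2 * p3 + q1 * p2 * p3

print(committee_accuracy(0.9, 0.9, 0.9))   # 0.972, as in the example that follows
```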

For example, if the recognition accuracy of a single feature–comparator pair is equal to 0.9, then the combined accuracy increases to 0.972. In our experiment we tried to simulate the case when a single image is input to the system and only one corresponding image (of the same person) is available in the database (ORL DB used). This case is close to the one included in Table 3 (column "1/9"), and it should be stressed that it is the "worst" scenario of a recognition problem. We built a parallel structure of r = 3 features and used the voting principle. It is important to notice that in this system we did not use any feature-space dimensionality reduction algorithm. On the other hand, methods such as PCA or LDA can obviously be used in this kind of system to increase recognition effectiveness [16, 19, 25, 29].

As can be seen, the best performance is obtained by the combined usage of the DFT, the DCT and the histogram or random points approaches. Hence, it is advisable to use the spectral features, which in this case have the best discriminative power. The average recognition rate (see Table 8) is, in general, much higher now regardless of the features used. A comparison of the results presented in Tables 3 and 8 shows that in the best case the difference is 46%, while in the worst case it is 16%. Increasing the number of images stored in the database would lower these disproportions, but the general trend persists.

Table 8 Joint feature extractors-recognition rates (%)

6 Conclusions

The experiments presented in this article and the observations made show that it is possible to use simple feature extractors in face recognition problems. An unexpected and interesting result is the performance of FaReSes built on the histogram feature extractor; it should, however, be remembered that using that extractor as a stand-alone approach can be risky.

In the above approaches we do not need to store huge amounts of data in order to build a working recognition system. It is also important that this class of features does not rely on the global distribution of features, so the process of extending the face database is straightforward. As opposed to methods described in the literature, the presented methods do not involve any complicated processing stage or the combination of many classifiers [8, 24]. The presented results show that the recognition rate is comparable to that of more complex systems [16, 19]. Finally, the results presented here show that the potential of simple systems based on the presented feature extractors is still not exhausted in comparison with more complex ones [2, 8, 10, 34, 35, 41].

It should be mentioned once again that the presented feature extractors make the structure of a FaReS extremely useful not only for hardware-based implementation, but also for large-scale systems, where fast and reliable initial processing is required.

Nowadays, when the high demand for biometric access control on devices ranging from desktop computers to mobile devices (cellular phones, pen-drives, PDAs) is still not satisfied, the presented simple methods can be successfully utilized. Even if the computing power of these popular devices is growing, their potential is very low in comparison to modern high-end computers. For example, the presented feature extractors are highly suitable for mobile devices implementing Java 2 Micro Edition, where natively only integer numbers can be used. The feature extractors we described do not require sophisticated calculations and can be implemented in low-memory systems.