1 Introduction

In this paper, the face of a person present in a video is identified and a prediction is made as to whether it belongs to class-1 or class-2 with the help of a linear support vector machine (SVM) classifier. Two persons are classified: class-1 denotes the face of the first person, while class-2 denotes that of the second person. A single person can exhibit many facial expressions, such as happiness and sadness, so the classification of faces under different facial expressions and illuminations in a video is a challenging problem. To address this, the video is divided into frames and key frames are extracted. If a face is present in an extracted key frame, it is detected and features are extracted from that key frame. The extracted feature values are used to build a model for prediction of the class labels (class-1 or class-2) using SVM. The classification of faces into different classes finds applications in the areas of content-based retrieval, face detection, face recognition and face tracking [1].

In [2], the authors propose a new pattern classification method called Nearest Feature Line (NFL), which has been shown to yield good results in face recognition and in audio classification and retrieval. In [3], the authors extend the NFL method to video retrieval. Unlike conventional methods such as NN and NC, the NFL method takes into consideration the temporal variations and correlations between key frames in a shot. The main idea is to use the lines passing through consecutive feature points in the feature space to approximate the trajectory of the feature points. In [4], the idea of Eigenfaces is introduced, one of the earliest successes in face recognition research, and the texture descriptor local binary pattern (LBP) is successfully applied to the face recognition problem. In [5], the authors propose the use of a sparse representation derived from training images for face recognition; the method is shown to be robust against occlusions.

In [6], the researchers use a reference set to improve the accuracy of face recognition and retrieval, and employ attribute classifiers, i.e., SVM classifiers trained on the reference set, for face verification. Methods for off-line recognition of hand-printed characters have successfully tackled the problem of intra-class variation due to differing writing styles. However, such approaches typically consider only a limited number of appearance classes and do not deal with variations in foreground/background color and texture [7].

The rest of the paper is organized as follows. Section 2 describes the proposed approach. The sequential approach to key frame extraction is discussed in Sect. 3. The statistical feature extraction using LBP is detailed in Sect. 4. Classification using SVM is explained in Sect. 5. The results on the considered data-set are presented in Sect. 6, where K-fold cross validation is used to estimate the accuracy of the model. Finally, we conclude in Sect. 7.

2 Proposed System

The proposed system (Fig. 1) consists of three main modules: key frame extraction, feature extraction using LBP, and classification using an SVM classifier. The system also involves two phases, a training phase and a testing phase. In the training phase, the training videos are divided into frames and only the key frames are extracted, using the sequential key frame technique. These key frames (images) are stored in a separate directory. Faces present in the images are then detected, features are extracted from them using LBP and, for convenience, a histogram is built to store the count of each LBP value. Next, a model is built with the SVM classifier using the training data-set. During training, labels are assigned to the classes considered: class-1 is given the label 1 and class-2 the label 0. In the testing phase, a separate, unlabeled testing data-set is considered, and the SVM classifier predicts whether each face belongs to class-1 or class-2. Class-1 is named face1 and class-2 is named face2. Thus, a linear SVM is used to classify two persons.
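As a concrete illustration of the feature-extraction module, the following is a minimal sketch assuming OpenCV's Haar cascade for face detection and scikit-image's LBP implementation; the paper does not name specific libraries, so these choices, as well as the function name lbp_histogram, are illustrative rather than the authors' exact implementation.

```python
# Hedged sketch of the per-key-frame feature extraction: detect a face,
# compute the LBP image, and store the count of each LBP value as a histogram.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lbp_histogram(key_frame_bgr):
    """Return the normalized LBP histogram of the first face found, or None."""
    gray = cv2.cvtColor(key_frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                          # no face in this key frame
    x, y, w, h = faces[0]                    # use the first detected face
    face = gray[y:y + h, x:x + w]
    # 8 neighbours on a radius-1 circle gives LBP codes in the range 0..255
    lbp = local_binary_pattern(face, P=8, R=1, method="default")
    hist, _ = np.histogram(lbp, bins=256, range=(0, 256))
    return hist / hist.sum()                 # count of each LBP value, normalized
```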

Fig. 1. System model showing the steps involved in the proposed system

3 Key Frame Extraction

A video contains a massive quantity of information at different levels in terms of scenes, shots and frames. The objective addressed here is the removal of redundant data, which makes further processing easier; key frame extraction is therefore a fundamental step in any video retrieval application. Key frames are the frames that provide a summarized representation of the complete video, and they are selected based on their uniqueness with respect to the subsequent frames. The dissimilarity between frames must be computed in order to detect the key frames. The proposed system makes use of the sequential comparison method, wherein the most recently extracted key frame is compared with the following frames until a sufficiently different frame is found, which then becomes the next key frame. The sequential key frame extraction method is easy to implement and has low computational complexity [8, 9].
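A minimal sketch of the sequential comparison idea is given below. The grey-level histogram difference used as the dissimilarity measure and the threshold value are assumptions made for illustration; the paper follows [8, 9] but does not specify these details.

```python
# Hedged sketch of sequential key-frame extraction: the current key frame is
# compared with each following frame, and a new key frame is declared once the
# dissimilarity exceeds a threshold.
import cv2
import numpy as np

def extract_key_frames(video_path, threshold=0.4):
    cap = cv2.VideoCapture(video_path)
    key_frames = []
    key_hist = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256]).ravel()
        hist /= hist.sum()
        if key_hist is None or np.abs(hist - key_hist).sum() > threshold:
            key_frames.append(frame)     # sufficiently different -> new key frame
            key_hist = hist
    cap.release()
    return key_frames
```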

Fig. 2. Phases involved in Local Binary Pattern

4 Feature Extraction Using LBP

Local Binary Pattern (LBP) is widely used in the field of computer vision as a visual texture descriptor. The algorithm serves as a powerful feature for texture classification. The steps carried out in LBP are as follows:

  1. LBP looks at 9 pixels at a time.

  2. It considers a 3 × 3 block of pixels and is particularly interested in the central pixel.

  3. For example, suppose the central pixel is 8, as shown in Fig. 3.

     Fig. 3. Conversion of 1 byte of data into decimal carried out during LBP computation

  4. The central pixel is compared with the neighboring 8 pixels. If the neighboring pixel value is greater than or equal to the central pixel value, then we assign 1, or else 0.

  5. The binary values from Fig. 3 are noted down as 11100010. This binary value is converted into a decimal number, which is used to train the system. Here the binary values are noted in a clockwise manner [10] (Fig. 2); a small sketch of this computation follows the list.
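The following sketch implements steps 1–5 for a single 3 × 3 neighbourhood. The clockwise neighbour ordering starting at the top-left corner and the example pixel values are assumptions (the exact values of Fig. 3 are not reproduced here); they are chosen so that the result matches the pattern 11100010 = 226 quoted in step 5.

```python
# Hedged sketch of the per-pixel LBP computation described in steps 1-5.
import numpy as np

def lbp_code(window_3x3):
    """Return the decimal LBP code of a 3x3 neighbourhood."""
    c = window_3x3[1, 1]                              # central pixel
    # clockwise walk around the centre: top row, right edge, bottom row, left edge
    neighbours = [window_3x3[0, 0], window_3x3[0, 1], window_3x3[0, 2],
                  window_3x3[1, 2], window_3x3[2, 2], window_3x3[2, 1],
                  window_3x3[2, 0], window_3x3[1, 0]]
    bits = [1 if n >= c else 0 for n in neighbours]   # 1 if neighbour >= centre
    return int("".join(map(str, bits)), 2)            # binary string -> decimal

# Example neighbourhood with central pixel 8; it yields 11100010, i.e. 226.
print(lbp_code(np.array([[9, 8, 8],
                         [1, 8, 2],
                         [9, 7, 3]])))
```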

The main advantage of LBP is that it is illumination invariant. If the lighting in the image is increased, the pixel values also rise, but the relative differences between the pixels remain the same.

Consider Fig. 4: since 32 is greater than 28, the comparison result, and hence the LBP value, remains the same irrespective of the illumination variation.

Fig. 4. Example for illumination invariant behavior of LBP
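A quick numeric check of this behaviour, using the 28/32 comparison of Fig. 4 (the +50 brightness offset is an arbitrary illustrative value):

```python
# Adding the same offset to every pixel does not change the neighbour-vs-centre
# comparison, so the resulting LBP bit (and hence the LBP code) is unchanged.
centre, neighbour = 28, 32
for offset in (0, 50):
    bit = 1 if neighbour + offset >= centre + offset else 0
    print(f"offset={offset}: {neighbour + offset} vs {centre + offset} -> bit {bit}")
# both iterations print bit 1
```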

Consider Fig. 5, which shows an image divided into 9 blocks. LBP also helps to detect the edges in a face, such as the outline of the mouth or the eyelids. In Fig. 5, three 1's are followed by a 0, and this transition indicates the presence of an edge. This makes it easy to distinguish the dark areas and light areas of the face. Essentially, there is a conversion from a high-dimensional space into a low-dimensional space that encodes only relative intensity values and, in doing so, encodes edges.

Fig. 5. Example to show how the transition of values helps to detect edges in LBP

5 Classification Using SVM

Classification is a process that involves separating classes based on extracted features, the resulting classes being distinct from each other. The classifier is built by learning the regularities present in the training data-set. Different classification algorithms such as k-nearest neighbors, decision tree learning and the support vector machine are widely used. The k-nearest neighbor algorithm (kNN) assigns a class directly from the training examples using the k nearest neighbors of the input. Algorithms such as C4.5 construct decision trees to predict the class of the input, while the SVM uses the concept of hyperplanes to predict the class of the input [11]. In the proposed system, the SVM is used for the classification of two persons.

The Support Vector Machine (SVM) makes use of a hyperplane that acts as the boundary dividing the two classes. The position of the hyperplane has to be chosen carefully for good classification. In Fig. 6, the circles represent the features belonging to class \( C_{1} \) (class-1) and the triangles represent the features belonging to class \( C_{2} \) (class-2). The position of the hyperplane shown in Fig. 6 is not desirable because it is biased in favor of class \( C_{2} \) while penalizing class \( C_{1} \).

Fig. 6. Undesirable positioning of hyperplane

The reason is that an interspace, represented as the black region in Fig. 7, is given to class \( C_{2} \), whereas the margin left for class \( C_{1} \) is small.

Fig. 7. Undesirable positioning of hyperplane

The classifier provides more appropriate results if the hyperplane is positioned as shown in Fig. 8, at an equal distance from the two classes. For the Support Vector Machine, 1 (class-1) and 0 (class-2) are assigned as labels for the two classes [12]. The circles and triangles that lie on the lines shown in Fig. 8 are the support vectors. The decision function of a Support Vector Machine (SVM) is completely determined by the subset of the data that defines the location of the hyperplane. The SVM criterion states that the best decision boundary is the one that lies farthest from any data point. Here, the classes are class-1 and class-2, and the features are the statistical texture features extracted from each image. Essentially, the SVM uses hyperplanes to separate the different classes through an optimal function learned from the training data, as follows:

Fig. 8. Desirable positioning of hyperplane

$$ g\left( x \right) = ax + b $$
(1)

\( ax + b = 0 \) represents a hyperplane in the d-dimensional feature space, where

  • \( a \) – vector perpendicular to the hyperplane

  • b – position of the hyperplane in the d-dimensional space

For every feature vector \( x \), the linear function \( ax + b \) has to be computed.

Considering two classes \( C_{1} \) and \( C_{2} \) and a feature vector \( x_{1} \), if the function lies on the positive side of the hyperplane then,

$$ g\left( {x_{1} } \right) = ax_{1} + b > 0,\quad x_{1} \in C_{1} $$
(2)

If function lies on the negative side of the hyperplane then,

$$ g\left( {x_{1} } \right) = ax_{1} + b < 0,\quad x_{1} \in C_{2} $$
(3)

and if function lies on the hyperplane then,

$$ g\left( {x_{1} } \right) = ax_{1} + b = 0 $$
(4)

The classifier has to be trained initially, i.e., the values of \( a \) and \( b \) have to be found; supervised learning is used for this training. With the computed \( a \) and \( b \), the value of \( ax + b \) is checked for every sample. If a sample from class \( C_{1} \) is chosen and the value of \( ax + b \) is not greater than zero, then \( a \) and \( b \) are modified so that the position of the hyperplane changes and the particular \( x \) taken from \( C_{1} \) moves to the positive side of the hyperplane.
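A minimal sketch of this training step is given below, assuming scikit-learn's LinearSVC (the paper does not name a library); the variables train_histograms, train_labels, test_names and test_histograms are hypothetical placeholders for the LBP features and labels produced earlier.

```python
# Hedged sketch: fit a linear SVM (i.e., find a and b) on the LBP histograms and
# predict the class of each test image; label 1 = class-1 (face1), 0 = class-2 (face2).
import numpy as np
from sklearn.svm import LinearSVC

X_train = np.vstack(train_histograms)      # one 256-bin LBP histogram per image
y_train = np.array(train_labels)           # 1 = class-1, 0 = class-2

clf = LinearSVC()                          # learns the hyperplane g(x) = ax + b
clf.fit(X_train, y_train)

for name, hist in zip(test_names, test_histograms):
    # predict() uses the sign of the decision function g(x) = ax + b
    label = "face1" if clf.predict(hist.reshape(1, -1))[0] == 1 else "face2"
    print(name, label)
```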

The aim of the SVM is to maintain the maximal distance between the feature vectors of the two separated classes [13]. For every \( x_{i} \) we have a label \( y_{i} \) which represents its class membership \( ( \pm 1) \). Therefore, \( y_{i} \left( {ax_{i} + b} \right) \) is always greater than zero irrespective of the class. Now \( ax_{i} + b = \gamma \), where \( \gamma \) is the margin, a measure of the distance of \( x_{i} \) from the separating plane. Considering \( ax + b = 0 \), the distance of a point \( x \) from the hyperplane satisfies,

$$ \frac{ax + b}{{\left| {\left| a \right|} \right|}} \ge \gamma $$
(5)

where \( \left\| a \right\| \) is the norm of \( a \), the vector that determines the orientation of the plane.

$$ ax + b \ge \gamma \left| {\left| a \right|} \right| $$
(6)

such that \( \gamma \left| {\left| a \right|} \right| = 1 \).

Hence,

$$ ax + b \ge 1\,if\,x \, \epsilon \, C_{1} $$
(7)
$$ ax + b \le - 1\,if\,x \, \epsilon \, C_{2} $$
(8)

we have,

$$ y_{i} \left( {ax_{i} + b} \right) \ge 1 $$
(9)

therefore, we can conclude that,

if \( y_{i} \left( {ax_{i} + b} \right) > 1 \), then \( x_{i} \) is not a support vector, and if \( y_{i} \left( {ax_{i} + b} \right) = 1 \), then \( x_{i} \) is a support vector.

SVM is a linear machine whose design is greatly influenced by the position of support vectors. The distance of the point \( x_{i} \) from the plane has to be maximized.

From Eq. 5, \( (ax + b) \) should be maximized and \( \left| {\left| a \right|} \right| \) should be minimized.

From Eq. 9, it can be observed that \( y_{i} \left( {ax_{i} + b} \right) \ge 1 \) acts as a constraint. This constrained problem can be converted into an unconstrained one by using Lagrange multipliers.

we have,

$$ L\left( {a,b,\alpha } \right) = \frac{1}{2}\left| {\left| a \right|} \right|^{2} - \sum\limits_{i} {\alpha_{i} \left[ {y_{i} \left( {ax_{i} + b} \right) - 1} \right]} $$
(10)

where \( \alpha_{i} \ge 0 \) are the Lagrange multipliers associated with the constraints of Eq. 9.

6 Results and Discussion

A set of 100 images covering both class-1 and class-2 was taken from the key frames. The collected images were then cropped manually to be of the same size. The training data-set has 40 class-1 images and 40 class-2 images, which are named c1 to c80. The remaining images (10 per class) are used for testing. All the images classified as class-1 are named face1 and the images classified as class-2 are named face2; they are named this way because the images named face1 show the face of one person, which is different from the face shown in the images named face2.

Figure 9 shows the training data-set that has been used to build the SVM model and also indicates which images belong to class-1 and which to class-2; Fig. 10 shows the testing data-set.

Fig. 9. Training data-set

Fig. 10. Testing data-set

Figure 11 shows the output, i.e., the prediction made by the SVM classifier. The output contains the image name and its classification as belonging to class-1 (face1) or class-2 (face2).

Fig. 11. Results of image classification using SVM classifier

The classifier built can be evaluated with an evaluation technique such as K-fold cross validation. Cross-validation is a model validation technique for estimating how the model will perform on an independent data-set. In K-fold cross validation, the entire data-set is partitioned so that 80 images are used for the training phase and the remaining 20 images for the testing phase, and the process is repeated over the complete data-set. Finally, the accuracy is obtained by computing the mean over the 5 repetitions. In this work, the number of folds considered is 5, and the accuracy is estimated to be 80.23%. The pseudo code is as follows:
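The paper's Algorithm 1 (the pseudo code referred to above) is not reproduced here; the following is only a hedged sketch of 5-fold cross validation using scikit-learn, assuming the LBP histograms are stacked in X and the labels (1 for class-1, 0 for class-2) in y.

```python
# Hedged 5-fold cross-validation sketch; X and y are placeholders for the
# feature matrix and label vector built from the 100-image data-set.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

scores = cross_val_score(LinearSVC(), X, y, cv=5)        # one accuracy per fold
print("per-fold accuracy (%):", np.round(scores * 100, 2))
print("mean accuracy: %.2f%%" % (scores.mean() * 100))
```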

Figure 12 shows the result of the procedure outlined in Algorithm 1, where the classification accuracies for the first, second, third, fourth and fifth iterations are 79.75%, 81.83%, 78.17%, 80.33% and 81.01%, respectively. The overall accuracy is calculated as follows:

Fig. 12. Results of K-fold cross validation

$$ Total\,accuracy = \frac{{Accuracy\,of\,\left( {iteration\,1 + iteration\,2 + iteration\,3 + iteration\,4 + iteration\,5} \right)}}{5} $$
(11)

7 Conclusion

In this paper, human faces are detected and classified using the LBP and SVM algorithms. Key frames are extracted from the videos; from these key frames, faces are identified and features are extracted using LBP, and the classification of the faces is carried out using a linear SVM classifier. Future work will consider more than two classes, where the implementation and processing can be carried out in a parallel and distributed manner; this is required because the time requirement of the SVM increases with the number of classes.