1 Introduction

Automatic face analysis plays an important role in human-robot interaction [1,2,3,4,5]. It includes face detection and localisation, face recognition, and facial expression and gender recognition. The main problem to be solved in face recognition is finding effective descriptors for the appearance of the face. Holistic methods such as principal component analysis (PCA) and linear discriminant analysis (LDA) are widely known and studied, but have limitations in accuracy and computational complexity [6,7,8,9,10,11,12].

Robustness to illumination variation is among the important properties that a face recognition system should have [13, 14]. Local descriptors have earned the attention of researchers due to their robustness to challenges such as illumination and pose changes. The local binary pattern (LBP) is one such descriptor and is among the best performing texture descriptors [15]. Its key advantages over other descriptors are its computational efficiency and its invariance to monotonic grey-level changes, which make it well suited to demanding image analysis tasks. When using texture methods it is not reasonable to attempt a holistic description of a face, because large-scale relations carry little useful information; experimental results have also shown that facial images can be effectively represented as a configuration of micro-patterns. This is why a local feature-based perspective is chosen [16,17,18].

Compared with other local descriptors, LBP performed equal to or slightly better than the Texton histogram and was clearly superior to the Difference histogram and Homogeneous texture descriptors in recognition rate [19,20,21].

LBP methods have also been studied for facial expression recognition and have been found to compete well with other state-of-the-art techniques [16]. For instance, LBP was found to be faster and less memory-intensive than the Gabor filter while achieving recognition performance in the same range [22, 23].

As with other face analysis problems, acquiring a competent representation of the original face images is critical for adequate gender classification. If poor features are selected, even the finest classifier can fail to obtain sufficiently accurate recognition rates. Being an efficient method for summarising the local pattern of an image, LBP has been utilised for determining the gender of a person based on face analysis [24,25,26,27]. As not all regions of an image contain discriminative information for recognising faces, various methods such as AdaBoost, the support vector machine (SVM), and the nonlinear SVM are used to extract the important features [26, 28, 29].

Real-time face recognition has been implemented in many ways and has been adopted on various stationary and mobile devices [30,31,32,33]. Viola and Jones suggested a system that reduces the computation time for face detection while maintaining high accuracy [32]. However, while efficient under controlled conditions, it becomes inaccurate beyond \({\pm }15^{\circ }\) of in-plane and \({\pm }45^{\circ }\) of out-of-plane rotation. Also, when lighting conditions are unfavourable, either the detection rate drops or the computational cost increases. In addition, a system based on the Compute Unified Device Architecture has been proposed that accelerates the recognition process through parallel computing, yet it still requires a considerable amount of processing power [34].

The main focus of this paper is implementing a real-time face recognition system on NAO humanoids [35], which have limited computational capacity; reducing computational complexity therefore plays an essential role. One way to compensate for the limited processing ability of individual devices is to let cloud computing servers do the costly data processing and send the results back to the device. For this, the MOCHA mobile-cloudlet-cloud architecture has been used, demonstrating that the technique is feasible but not yet sufficiently fast [36]. Although that approach is certainly promising, we propose a stand-alone technique whose computational complexity is considerably lower than that of many state-of-the-art techniques and which can easily be adopted on the NAO humanoids.

Existing real-time face detection and recognition methods require abundant processing power, which the NAO humanoid does not have. A method with lower computational complexity and added robustness is therefore needed, and the LBP approach fulfils these demands.

The remainder of this paper is organised as follows. The proposed face recognition algorithm is described in detail in Sect. 2. Section 3 presents the experimental results and discussion. Lastly, Sect. 4 concludes the paper.

2 The proposed real-time face recognition algorithm

In this research work, the proposed real-time face recognition system has been implemented on the NAO humanoid platform. First, a face is detected within a frame using Viola–Jones face detection [37]; note that the frame acquired by the NAO camera is in the YUV colour space. The local binary pattern (LBP) image of the detected face is then created in each of the Y, U and V colour channels [15]. Each LBP face is divided into blocks of \(16\times 16\) and the probability distribution function (PDF) of each block is calculated. A PDF is calculated from the histogram of a given block: the histogram records the occurrence of every pixel intensity, and since intensity values range from 0 to 255, the histogram has 256 bins on the X-axis, with the Y-axis giving the number of pixels that have a given intensity value. A PDF is obtained by dividing the occurrence of each intensity by the total number of pixels, as shown in Eq. 1

$$\begin{aligned} P=\frac{H}{\sum H}=\frac{[h_{0},h_{1},\ldots ,h_{255}]}{\sum H} \end{aligned}$$
(1)

where H represents the histogram values and P is the set of PDF values. The PDFs of the blocks in a given colour channel are concatenated to form one PDF for the LBP face in that channel. The obtained PDF is compared to the PDFs of the faces in the database using the Kullback–Leibler divergence (KLD) in the corresponding colour channel, and the most likely match is found, as shown in Eq. 2

$$\begin{aligned} \zeta _{C}(P,Q)&=\sum _{i}q(i)\log \frac{q(i)}{p(i)} \nonumber \\ C&=\{Y, U, V\} \end{aligned}$$
(2)

where P is the set of PDFs of the training images and Q is the PDF of the query image to be matched against an existing one in the database.
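
The paper publishes no source code, so the following Python/NumPy sketch is purely illustrative: it assumes the basic 8-neighbour LBP operator and \(16\times 16\)-pixel blocks, and all function names are ours rather than the authors'.

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour LBP [15]: threshold each pixel's 3x3
    neighbourhood against its centre and pack the 8 bits into a
    code in the range 0-255."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                            # centre pixels
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2),
               (2, 2), (2, 1), (2, 0), (1, 0)]   # clockwise neighbours
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        n = g[dy:dy + c.shape[0], dx:dx + c.shape[1]]
        code |= (n >= c).astype(np.int32) << bit
    return code.astype(np.uint8)

def block_pdfs(lbp, block=16):
    """Eq. 1: histogram of the LBP codes in every block, normalised
    by the number of pixels in the block, then concatenated over
    all blocks of the face."""
    h, w = lbp.shape
    pdfs = []
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            hist = np.bincount(lbp[y:y + block, x:x + block].ravel(),
                               minlength=256).astype(np.float64)
            pdfs.append(hist / hist.sum())
    return np.concatenate(pdfs)

def kld(q, p, eps=1e-10):
    """Eq. 2: KLD of the query PDF q from a stored training PDF p;
    eps guards against empty histogram bins."""
    q, p = q + eps, p + eps
    return float(np.sum(q * np.log(q / p)))
```

The concatenated output of block_pdfs is the per-channel face signature that is stored during training and scored with kld during testing.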

To realise the real-time, real-world scenario, the algorithm is divided into two parts: training and testing. During training, the robot takes several photographs per person in YUV, the default colour space for NAO, detects their faces [37] and produces the respective histograms for each colour channel; these are then used to generate the probability distributions that are saved to the database, as illustrated in Fig. 1.

For testing, the course of action is initially identical: images are taken and histograms are composed in the aforementioned manner. Then the PDFs of the individual colour channels are compared against the database and the leading matches are found via KLD, as illustrated in Fig. 2; an illustrative sketch of the training phase is given after the figures.

Fig. 1

The flow chart of the training phase

Fig. 2

The flow chart of the testing phase
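
As a hedged illustration of the training phase in Fig. 1, the sketch below enrols one person; detect_face and extract_pdf are hypothetical stand-ins for the Viola–Jones detector [37] and the LBP/block-PDF extraction sketched above, and the database layout (one list of per-channel PDF dictionaries per class) is our assumption.

```python
def enrol_person(frames_yuv, label, database, detect_face, extract_pdf):
    """Training phase (Fig. 1): for every YUV photograph of a person,
    detect the face and store the concatenated block PDFs of its LBP
    image for each colour channel. `detect_face` returns a cropped
    H x W x 3 YUV face or None; `extract_pdf` is, e.g., the
    composition of lbp_image and block_pdfs sketched earlier."""
    for frame in frames_yuv:
        face = detect_face(frame)
        if face is None:
            continue                      # no face found in this shot
        entry = {name: extract_pdf(face[:, :, ch])
                 for ch, name in enumerate('YUV')}
        database.setdefault(label, []).append(entry)
```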

The optimum matches for each channel are then fused by the majority voting principle [38], such that when the majority of channels agree on a match, it is regarded as the correct one. There are three versions of majority voting (MV), where the choice is the one: (1) that everyone agrees on (unanimous voting); (2) that receives more than 50% of the votes (simple majority); or (3) that has the maximum number of votes, regardless of whether or not it exceeds 50% (plurality voting). In this work plurality voting is adopted, as sketched below.
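
The three variants can be stated compactly; the sketch below is ours, and returning None when a variant reaches no decision is our convention rather than the paper's.

```python
from collections import Counter

def unanimous_vote(labels):
    """MV variant 1: a decision only if every channel agrees."""
    return labels[0] if len(set(labels)) == 1 else None

def simple_majority_vote(labels):
    """MV variant 2: a decision only with more than 50% of the votes."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None

def plurality_vote(labels):
    """MV variant 3, adopted in this work: the label with the most
    votes, whether or not it exceeds 50% of them."""
    return Counter(labels).most_common(1)[0][0]
```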

When all channels give different answers, the result with the best score (BS) is used instead. The BS is identified by finding the minimum KLD value among the corresponding KLD values of the chosen classes in each channel, as shown in Eq. 3.

$$\begin{aligned} \zeta _{BS}=\min (\zeta _{Y}, \zeta _{U}, \zeta _{V}) \end{aligned}$$
(3)

where \(\zeta _{Y}, \zeta _{U}\) and \(\zeta _{V}\) are the corresponding KLD values. As expected, MV boosts the results, since in the worst-case scenario, in which each colour channel selects a different class, the BS method is adopted. The recognition rates obtained by BS and MV are presented in the next section.
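
Combining plurality voting with the BS fall-back, a minimal fusion routine might read as follows; the (label, KLD value) data layout is our assumption.

```python
from collections import Counter

def fuse_channels(decisions):
    """Plurality voting over the per-channel matches, falling back to
    the best score (Eq. 3) when Y, U and V all disagree. `decisions`
    maps a channel name to a (label, KLD value) pair."""
    labels = [label for label, _ in decisions.values()]
    label, votes = Counter(labels).most_common(1)[0]
    if votes > 1:                 # at least two channels agree
        return label
    # worst case: three different classes -> minimum KLD decides
    return min(decisions.values(), key=lambda lv: lv[1])[0]
```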

Finally, the robot presents its conclusion as to who the person under consideration is.

Algorithm 1 summarises the proposed real-time face recognition system.

Algorithm 1 The proposed real-time face recognition system
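
A hedged end-to-end sketch of the testing loop summarised above, reusing the kld and fuse_channels helpers from the earlier sketches; all names are ours, not the authors'.

```python
def recognise_frame(frame_yuv, database, detect_face, extract_pdf):
    """Testing phase (Fig. 2): detect the face, extract the
    per-channel PDFs, find the minimum-KLD class per channel (Eq. 2)
    and fuse the three decisions (MV with BS fall-back)."""
    face = detect_face(frame_yuv)
    if face is None:
        return None                       # nothing to recognise
    decisions = {}
    for ch, name in enumerate('YUV'):
        q = extract_pdf(face[:, :, ch])
        # minimum KLD over every stored PDF of every class
        scores = {label: min(kld(q, entry[name]) for entry in entries)
                  for label, entries in database.items()}
        best = min(scores, key=scores.get)
        decisions[name] = (best, scores[best])
    return fuse_channels(decisions)
```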

3 Experimental results and discussion

In this work two different setups were used. First, the proposed method was applied to well-known databases in order to investigate its effectiveness in terms of recognition rate. Second, a real-time setup was prepared, in which a new database was produced using the NAO humanoid's top camera; this setup was then used to recognise the people in front of the robot. The details of the setups and the recognition rates are described in the subsections below.

3.1 Experimental results on well-known databases

For comparison purposes the proposed face recognition algorithm has been tested on the Essex University face database [39], the facial recognition technology (FERET) database [40], and the Head Pose (HP) face database [41]. As NAO humanoids acquire images directly in YUV, the entire recognition process is conducted only in the Y, U and V channels, so as not to introduce additional computational complexity into the algorithm.

The Essex face database includes 150 different classes with varying illumination and background, and each class has ten different samples. In each experiment the training face samples of a class were selected randomly, and the recognition results shown in Table 1 are the average of 500 iterations of the proposed algorithm. The best score algorithm and majority voting were adopted to fuse the decisions obtained in each of the aforementioned colour channels; Table 1 also reports their recognition rates.

Table 1 The correct recognition rate in percentage of the proposed face recognition algorithm for Essex University Face database in YUV colour space

The FERET database includes 50 different classes, each having ten samples, with varying facial expressions and poses, including frontal and profile poses. Unlike the Essex University database, which mostly includes pictures of only the facial area of a person (with some shoulder area along with the face in fewer than ten classes), the FERET images always contain some shoulder area and therefore less facial area per image. Most importantly, the portraits in the FERET database have up to \(90^{\circ }\) of rotation to either side, making them extremely difficult to recognise or even detect [37]. As expected, the results, shown in Table 2, are accordingly lower.

Table 2 The correct recognition rate in percentage of the proposed face recognition algorithm for the FERET database in YUV colour space

The Head Pose database includes 15 different classes with up to \(90^{\circ }\) of head pose rotation to either side; each class has ten samples. The results were superior to those on the FERET database, owing to the smaller number of classes, but not as good as on the Essex University database, because of the changes in head pose within each class. The results are shown in Table 3.

Table 3 The correct recognition rate in percentage of the proposed face recognition algorithm for the Head Pose database in YUV colour space

3.2 The real-time scenario’s setup

In order to handle real-world scenarios, which is the main goal of this research work, a new database has been created. The database consists of two main parts: (1) 310 images of 31 people with ten different poses each, acquired by the NAO upper camera, with the setup shown in Fig. 3; and (2) a series of video streams in which randomly selected people from the database walk along a predefined marked path, starting 6 m away from the NAO humanoid, until they reach the robot, with the setup shown in Fig. 4. Both the image database and the videos were created using NAO's camera.

Fig. 3

The setup for capturing still images of people with NAO’s camera

Fig. 4

The setup of acquiring video streams of people walking towards the robot

The aim of the video streams is not only to validate the proposed real-time face recognition technique, but also to analyse the effect of distance on the recognition performance. The lighting conditions in the video streams are approximately those of a typical room. While the image database was captured against a uniform background, the video streams were recorded against laboratory scenes. Figures 5 and 6 show some sample faces from the prepared NAO-camera database and some frames of one of the video streams, respectively.

Fig. 5

Sample classes of the prepared database acquired by NAO's camera

Fig. 6

Sample frames of one of the video sequences taken by NAO's camera

3.3 The experimental results for the real-time scenario

The correct recognition rate for the prepared face database is given in Table 4. The videos depict people walking towards the robot while the robot sits on a table. Each person started approaching from a distance of 6 m and stopped right in front of NAO, at which point the person's face was usually no longer in frame. Because Viola–Jones was unable to detect the face when the person was further than 2 m away, the recognition rate has been calculated only for the interval of 2–0.5 m from NAO. The recognition results for the six classes that exist in both databases are shown in Table 5.

Table 4 The correct recognition rate in percentage of the proposed face recognition algorithm for our own database in YUV colour space
Table 5 The correct recognition rate of real-time scenarios when the distance is less than 2 m away from NAO

In order to boost the recognition rate in the real-time scenario, we also employ majority voting on intra-sequence decisions with a window size of 5, voting among the last five decisions made by the robot in order to reach the final decision. With this intra-sequence MV, NAO declares the class as soon as the fifth decision has been made. This boosting increases the recognition rate, and the results are shown in Table 6.
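
A minimal sketch of this sliding-window voting (our own illustration; the class and its names are hypothetical):

```python
from collections import Counter, deque

class SequenceVoter:
    """Majority voting over the last `window` per-frame decisions
    (window = 5 in this work): a final class is declared as soon as
    `window` decisions have accumulated."""
    def __init__(self, window=5):
        self.decisions = deque(maxlen=window)

    def add(self, label):
        """Record one per-frame decision; return the voted class once
        the window is full, otherwise None."""
        self.decisions.append(label)
        if len(self.decisions) == self.decisions.maxlen:
            return Counter(self.decisions).most_common(1)[0][0]
        return None
```

Each per-frame decision from the channel-fusion step is fed to add, and the first non-None return is announced as the final class.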

Table 6 The correct recognition rate of real-time scenarios when the distance is less than 2 m away from NAO after employing MV on intra-sequence decision by window size of 5

4 Conclusion

In this work a real-time face recognition system for the NAO humanoid was introduced. The proposed method uses block processing of the local binary patterns of the face images captured by the NAO humanoid. The correct recognition rate was boosted by applying majority voting and best score ensemble approaches to the decisions obtained in the Y, U and V colour channels. A database acquired by the NAO humanoid's camera was also created. The effect of distance on the recognition of people by the NAO humanoid was studied, and it was shown that the robot cannot recognise people who are more than 2 m away from it. Finally, the recognition results in the real-time scenario were boosted by using MV on intra-sequence decisions with a window size of 5.