1 Introduction

The Internet of Things (IoT), in which sensors, actuators, and processor-equipped objects are connected and communicate with each other to achieve meaningful goals, has become a major trend in information technology. However, IoT systems that use RFID or other contactless wireless technologies as their sensors are not applicable in every situation. Besides, the cost of RFID tags must be taken into consideration when there are huge numbers of objects. To overcome these limitations, Visual IoT, a technique for extracting and using scene object information directly from images, has been proposed. With the help of cameras, Visual IoT can obtain object locations from image information of the scene, attach a visual label to each object, and then return information about scene objects to the network [14].

As personal security becomes a critical issue, biometric recognition systems for self-authentication based on the face or voice have received great attention in recent decades. In particular, face recognition is one of the most popular techniques because it is user-friendly and provokes little resistance owing to its non-contact, non-aggressive, and non-intrusive nature [12]. In addition, since facial features are unique to each individual, face recognition is one of the most appropriate tools for Visual IoT. Despite significant recent advances in face recognition, however, implementing face verification and recognition efficiently at scale in Visual IoT is very challenging because of limited computing power. Moreover, existing face recognition methods applicable to Visual IoT have only worked well in limited environments, such as constrained illumination conditions and approximately frontal poses.

Deep convolutional neural network-based methods have recently achieved significant improvements in face recognition. For example, Facebook’s DeepFace [22] and Google’s FaceNet [20] have achieved human-level face recognition performance. However, face recognition techniques based on huge deep neural networks are not applicable in IoT environments because of such devices’ limited processing and storage capabilities.

The goal of this work is the development of a face recognition system for Visual IoT using a deep neural network. Our main consideration is a practical system that achieves high accuracy and real-time performance with a deep neural network in an embedded environment for Visual IoT.

The paper is structured as follows: Sect. 2 briefly reviews previous studies on face recognition in general and in embedded environments. Section 3 describes our design considerations and implementation details. Section 4 describes the collection, analysis, and preprocessing of the dataset used for training and evaluation. Accuracy and performance are evaluated and compared with other face recognition techniques in Sect. 5, and finally, Sect. 6 summarizes this work.

2 Related work

Face recognition has been an active research topic in recent decades. Given an input image with one or more faces, a typical face recognition system consists of four stages: face detection, preprocessing, representation, and matching. The face detection stage isolates the faces and gives a list of bounding boxes. The preprocessing stage is required because the face is not a rigid object and images of the face can be taken from many different perspectives. The preprocessing stage normalizes the faces so that the eyes, nose, and mouth appear at similar locations. The representation stage translates the preprocessed face image into a low-dimensional representation (or embedding). Finally, the matching stage identifies or verifies enrolled users.

Numerous methods have been proposed in the literature. Among them, we distinguish those developed prior to deep learning, which we call “non-deep,” from those based on deep learning, which we call “deep.” Non-deep methods can be categorized into local feature-based and holistic approaches. Local feature-based approaches first extract hand-crafted local image descriptors such as SIFT, LBP, and HOG and then aggregate them into an overall face descriptor [7, 8], whereas holistic approaches are based on statistical techniques such as principal component analysis (PCA) and independent component analysis (ICA), which represent faces as combinations of eigenvectors or as features that characterize or separate two or more classes. Eigenfaces [23] and Fisherfaces [5] are the most well-known holistic techniques. Jafri and Arabnia [12] provide a thorough survey of face recognition methods developed up to 2009.

Deep methods are based on convolutional neural networks. Facebook’s DeepFace [22] and Google’s FaceNet [20] achieved the highest accuracy on the LFW dataset [17], which is a standard benchmark in face recognition research. The VGG Face Descriptor [18] and Lightened Convolutional Neural Networks (CNNs) [25] achieve comparable performance.

In mobile embedded environments, studies have often resorted to techniques that are far less accurate than state-of-the-art approaches because of the lack of computational resources. Soyata et al. [21] proposed an Eigenfaces-based face recognition system that spans a mobile device, a cloudlet, and the cloud. Hsu et al. [10] introduced a cloud-based face recognition service for drones. Ye et al. [26] presented a face authentication system on a distributed computing environment. The advance of efficient GPU architectures for mobile devices, such as NVIDIA’s Tegra, has improved computational performance [24]. However, such devices remain incapable of executing recent top-performing deep neural network-based face recognition techniques.

3 Methodology

3.1 Deep learning framework selection

In this section, we examine the embedded-platform compatibility of current deep learning frameworks. We do not compare other features or performance, as this is beyond the scope of this paper and good benchmarks already exist [4].

Caffe [13] was the first mainstream industry-grade deep learning framework, developed by the Berkeley Vision and Learning Center and by community contributors. It remains the most popular toolkit within the computer vision community. It has command line, Python, and MATLAB interfaces. Since it is written in C++, it can be compiled on various platforms; recent unofficial ports support mobile platforms such as Android and iOS.

Torch [6] is a scientific computing framework that supports a MATLAB-like environment built on Lua; it can provide C++ and Lua interfaces. Recent unofficial ports have supported running on mobile platforms such as iOS and Android.

Theano [2] is a symbolic expression compiler that efficiently defines, optimizes, and evaluates mathematical expressions that involve multi-dimensional arrays. It only supports a Python interface, and support for embedded platforms is unfortunately not yet considered a core feature.

TensorFlow [1] is an open-source framework for numerical computation using data flow graphs. It allows the deployment of computation to one or more CPUs or GPUs in a desktop, server, or mobile device environment with a single API, and it has APIs available in several languages such as Python, C++, Java, and Go. It is the only framework that officially supports mobile embedded environments such as Android and iOS at the time of writing this paper.

The Microsoft Cognitive Toolkit (CNTK) [27] is a unified deep learning toolkit that was recently released by Microsoft Research. It provides Python, C++, C#, and BrainScript interfaces. However, it does not yet support a mobile embedded platform.

As this overview shows, support for running deep neural networks in mobile embedded environments is growing, but many frameworks still do not officially support embedded devices at the time of writing. We therefore chose TensorFlow as our main deep learning framework. The same model can be run on a dedicated server or an embedded device without any code changes, and because it is implemented in C++, it can easily be ported to other IoT environments. Additionally, TensorFlow has a fast-growing community of users and contributors, making it the most promising deep learning framework.
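The portability argument can be illustrated with a minimal, hypothetical sketch: a frozen TensorFlow graph exported on a workstation can be loaded and executed through the same graph-mode API that the mobile runtimes expose. The file name and tensor names ("face_net.pb", "input:0", "embeddings:0") are illustrative placeholders, not the actual artifacts of this work.

```python
# Minimal sketch (not the authors' exact deployment code): load a frozen
# TensorFlow graph and run one inference. File and tensor names are
# hypothetical placeholders.
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # graph-mode API, matching the era of this paper

graph_def = tf1.GraphDef()
with tf.io.gfile.GFile("face_net.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf1.import_graph_def(graph_def, name="")

with tf1.Session(graph=graph) as sess:
    face = np.zeros((1, 96, 96, 3), dtype=np.float32)   # one aligned face crop
    emb = sess.run("embeddings:0", feed_dict={"input:0": face})
    print(emb.shape)                                     # e.g. (1, 128)
```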

3.2 Face recognition pipeline

Our face recognition pipeline consists of four stages: face detection, face alignment (or normalization), feature extraction, and recognition. The face detection stage gives a list of bounding boxes around the detected faces. The face alignment stage normalizes faces with respect to their geometric properties so that the eyes, nose, and mouth appear at similar locations. The feature extraction stage extracts facial features that represent certain aspects of a detected face. Finally, the recognition stage identifies or verifies enrolled users. Figure 1 shows our face recognition pipeline; a minimal skeleton of it is sketched after Fig. 1.

Fig. 1 Face recognition pipeline
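The pipeline of Fig. 1 can be summarized as a simple function composition. The sketch below is illustrative only; the stage functions are passed in as callables, and their concrete realizations (HOG detector, affine aligner, CNN embedder, nearest-neighbor matcher) are described in the following subsections.

```python
# Illustrative skeleton of the four-stage pipeline in Fig. 1; the stage
# functions are injected as callables rather than fixed implementations.
from typing import Callable, List, Optional

def recognize(frame,
              detect: Callable,      # frame -> list of face bounding boxes
              align: Callable,       # (frame, box) -> 96x96 aligned face crop
              embed: Callable,       # face crop -> 128-D unit-norm embedding
              match: Callable,       # embedding -> (identity, distance)
              threshold: float) -> List[Optional[str]]:
    identities = []
    for box in detect(frame):
        face = align(frame, box)
        identity, distance = match(embed(face))
        identities.append(identity if distance < threshold else None)
    return identities
```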

3.3 Face detection and preprocessing

The face detection stage gives a list of bounding boxes around the faces. We implement a histogram of oriented gradients (HOG)-based face detector with a structured support vector machine (SVM); pose variation is handled by the alignment stage, which normalizes the faces so that the eyes, nose, and mouth appear at similar locations. Many modern techniques such as DeepFace [9, 22] frontalize the face using a 3D model so that the image shows the face looking directly at the camera. Their computational complexity makes these 3D approaches unsuitable for mobile environments. Moreover, face pose variation in the mobile face recognition scenario is limited by the device’s viewing angle. Therefore, we align faces using a simple 2D affine transformation based on facial landmarks [15]. This is less accurate than the 3D methods but provides reasonable performance in constrained mobile scenarios. We then crop the aligned face and resize it to 96 \(\times \) 96 pixels; a brief sketch of this stage follows Fig. 2. Figure 2 shows details of the preprocessing stage: the red rectangle indicates the detected face, and yellow dots denote the facial landmark points used for alignment.

Fig. 2 Face detection and preprocessing
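One common way to realize this stage, sketched here for illustration and not necessarily the authors’ exact implementation, combines dlib’s HOG/structured-SVM frontal face detector and 68-point landmark predictor with an OpenCV affine warp. The landmark indices and template coordinates below are illustrative choices.

```python
# Sketch of detection + alignment with dlib (HOG + structured SVM detector,
# 68-point landmark predictor) and a 2D affine warp to a 96x96 crop.
# Landmark indices and template coordinates are illustrative, not the paper's.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# Target positions (in a 96x96 image) for outer eye corners and nose tip.
TEMPLATE = np.float32([[18, 30], [78, 30], [48, 62]])
LANDMARK_IDS = [36, 45, 30]   # left eye outer corner, right eye outer corner, nose tip

def detect_and_align(bgr_image, size=96):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    crops = []
    for rect in detector(gray, 1):
        shape = predictor(gray, rect)
        src = np.float32([[shape.part(i).x, shape.part(i).y] for i in LANDMARK_IDS])
        M = cv2.getAffineTransform(src, TEMPLATE)     # 2D affine from 3 landmarks
        crops.append(cv2.warpAffine(bgr_image, M, (size, size)))
    return crops
```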

3.4 Network architecture and training

Using a deep neural network on a mobile embedded device requires considering the number of parameters in the network, because it is tightly coupled with the total number of operations and the memory usage. For example, DeepFace [22] has 120M parameters, and VGG Face [18] requires 138M parameters, more than 500 MB of storage, and 15.47 billion floating-point operations (FLOPs); such models are unacceptable for a mobile environment. Thus, we need a more compact but still powerful network model. We choose nn4.small2 [3], a variation of the nn4 model from FaceNet [20] with fewer parameters, for mobile execution. The network model for the compact deep learned feature is shown in Table 1, where each row indicates a layer in the deep neural network. The total number of parameters in our model, including batch normalization, is 3.74 million, and inference requires 208.16 million FLOPs. The model occupies only 14.28 MB of memory in single-precision floating-point format.
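The memory figure follows directly from the parameter count; a back-of-the-envelope check, assuming 4 bytes per single-precision parameter:

```python
# Back-of-the-envelope check of the model size quoted above:
# 3.74 million single-precision parameters at 4 bytes each.
params = 3.74e6
bytes_per_param = 4                      # float32
size_mb = params * bytes_per_param / (1024 ** 2)
print(f"{size_mb:.2f} MB")               # ~14.27 MB, consistent with the reported 14.28 MB
```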

The convolution layer is the core building block of deep convolutional neural networks. A set of filters in each convolution layer produces activation maps that are the responses to some types of visual feature such as an edge or color in the lower layers, or some complex pattern in the network’s higher layers. The stack of these activation maps along the depth dimension is passed to the next layer.

The max pooling layer between successive convolution layers reduces the spatial size of the feature map along both width and height. It discards 75% of the responses, which reduces the number of parameters and the computation in the network. Similarly, the average pooling layer downsamples every depth slice of the feature map by a factor of 3 via an averaging operation. The “pool proj.” column of the inception layers in Table 1 describes the pooling type, kernel size, and stride.
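The 75% figure follows from pooling with stride 2 in both directions: each depth slice keeps only one quarter of its spatial positions. A quick numeric check (a minimal sketch, not part of the actual network code):

```python
# With stride 2 in both directions, the number of spatial positions drops to
# one quarter, i.e. 75% of the responses in each depth slice are discarded.
import numpy as np

x = np.arange(16, dtype=np.float32).reshape(4, 4)    # one 4x4 depth slice
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))      # 2x2 max pool, stride 2
print(x.size, "->", pooled.size)                      # 16 -> 4 (25% kept)
```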

The local response normalization (LRN) layer dampens responses that are uniformly large across a neighborhood and makes large activations more pronounced within their neighborhood. In other words, the LRN layer regularizes the responses obtained by different kernels. An LRN layer is used both before and after inception (2).

The inception layers [20] capture cross-channel correlations while ignoring spatial dimensions through \(1 \times 1\) convolutions, which dramatically reduces dimensionality along the filter dimension. Cross-spatial and cross-channel correlations are then captured via \(3 \times 3\) and \(5 \times 5\) kernels. Concatenating the responses of convolutional filters of different sizes covers different clusters of information. In addition, two types of pooling operation, max and \(L_{2}\), are used to reduce dimensionality prior to the convolutions, which allows both deeper and larger convolutional layers and more efficient computation. Our network uses four types of inception layer with small variations. The last seven columns of Table 1 describe the parameters of the inception layers from [20] and the number of parameters in each layer. The columns starting with “#N \(\times \) N” denote the depth of the output feature map, “#\(3 \times 3\) reduce” and “#\(5 \times 5\) reduce” give the number of \(1 \times 1\) filters used in the reduction layer before the \(3 \times 3\) and \(5 \times 5\) convolutions, and “pool proj.” describes the pooling type and the size of the dimension into which it is reduced.
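An inception block of this kind can be sketched with the Keras functional API. The branch widths below are placeholders rather than the exact values of Table 1, and FaceNet-style blocks additionally use \(L_{2}\) pooling in some places, which plain Keras pooling layers do not provide directly.

```python
# Illustrative inception block (Keras functional API). Branch widths are
# placeholders, not the exact Table 1 values.
import tensorflow as tf
from tensorflow.keras import layers

def inception_block(x, n1x1, n3x3_reduce, n3x3, n5x5_reduce, n5x5, pool_proj):
    b1 = layers.Conv2D(n1x1, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(n3x3_reduce, 1, padding="same", activation="relu")(x)   # 1x1 reduce
    b2 = layers.Conv2D(n3x3, 3, padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(n5x5_reduce, 1, padding="same", activation="relu")(x)   # 1x1 reduce
    b3 = layers.Conv2D(n5x5, 5, padding="same", activation="relu")(b3)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, 1, padding="same", activation="relu")(b4)    # pool proj.
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])                       # depth concat

inputs = tf.keras.Input(shape=(24, 24, 192))
outputs = inception_block(inputs, 64, 96, 128, 16, 32, 32)
print(tf.keras.Model(inputs, outputs).output_shape)   # (None, 24, 24, 256)
```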

The embedding layer is a composition of a fully connected layer and an \(L_{2}\) normalization layer. The fully connected layer linearly combines the \(1 \times 1 \times 736\) feature maps into a 128-dimensional vector, and the subsequent \(L_{2}\) normalization layer constrains this vector to the unit hypersphere.

Table 1 Details of the network model for the compact deep learned feature

The network is trained with 500K images from two of the largest public face recognition datasets, CASIA-WebFace and FaceScrub. Triplet loss [20] is used to produce embeddings on the unit hypersphere and to effectively represent the similarity between faces.
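Triplet loss pulls an anchor embedding toward a positive example of the same identity and pushes it away from a negative example of a different identity by at least a margin \(\alpha\). A minimal sketch of the formulation in [20], applied to \(L_{2}\)-normalized embeddings (the margin value 0.2 is FaceNet’s default, not necessarily the one used here):

```python
# Minimal sketch of the FaceNet triplet loss [20] on L2-normalized embeddings:
# for each triplet, ||a - p||^2 + alpha < ||a - n||^2 is encouraged.
import tensorflow as tf

def l2_normalize(x):
    return tf.math.l2_normalize(x, axis=-1)          # project onto the unit hypersphere

def triplet_loss(anchor, positive, negative, alpha=0.2):
    a, p, n = map(l2_normalize, (anchor, positive, negative))
    pos_dist = tf.reduce_sum(tf.square(a - p), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(a - n), axis=-1)
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + alpha, 0.0))
```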

4 Dataset

4.1 Dataset acquisition and analysis

We collected a face dataset with which to evaluate performance in the mobile face recognition scenario and in an embedded environment. In addition, the distribution of face poses is analyzed to better understand the mobile face recognition scenario and to develop an unobtrusive (non-self-aware) face recognition system on mobile devices.

The dataset was captured using a Samsung Galaxy S6 at a video resolution of 720\(\times \)960. The dataset consists of distinct sessions, usually separated by one or two weeks, and the data were captured over two months. In total, 10,360 images of 10 identities were captured.

Capturing the dataset on a mobile device is inherently difficult to control because the devices are handed to the users; data captured from a mobile device show high variability in face pose and illumination conditions. Ensuring that the captured data are meaningful and useful requires enforcing minimal constraints upon participants and validating the captured data.

Two constraints were placed upon users when recording their data: we asked that they were able to read the text shown on the screen and that most of their face was within the captured image. We provided simple random text and a live video feed to assist with this. Additionally, we asked that users be seated in an indoor office environment. In addition to these constraints, we validated the captured images.

Figure 3 visualizes the poses estimated from the images. Red dots represent facial landmarks, and blue dots represent the estimated camera locations. Interestingly, the horizontal and vertical face rotations in the collected dataset follow a Gaussian distribution centered at (0, \(-\,5\)) degrees. The rotation ranges are \(-\,30\) to 30\(\,^{\circ }\) horizontally and \(-\,20\) to 10\(\,^{\circ }\) vertically. Figure 4 shows the density of vertical and horizontal face rotations in the mobile face recognition dataset.

Fig. 3 3D camera location with respect to face

Fig. 4 Rotational distribution of the mobile face recognition dataset

4.2 Dataset preprocessing

Face detection and alignment are conducted on the collected mobile face recognition dataset. Faces are detected using the HOG-based detector described in Sect. 3.3, but some of the detected faces are blurry because of user motion and/or camera focus. We use a simple focus measure to filter out blurred faces: if a detected face has a focus measure response below a threshold \(\tau _{\mathrm{f}}\), it is discarded. For simplicity, the variance of the Laplacian (LAPV) [19] is used as the focus measure; a sketch of this filter follows Fig. 5. Face alignment is performed after blurry faces are filtered out. Consequently, 9798 faces from the 10,360 images are successfully aligned. Figure 5 shows example faces from our mobile face recognition dataset.

Fig. 5 Example images in the mobile face recognition dataset
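The LAPV focus measure is simply the variance of the Laplacian-filtered grayscale crop; faces whose response falls below \(\tau _{\mathrm{f}}\) are dropped. A sketch with OpenCV, where the threshold value is an illustrative placeholder rather than the paper’s \(\tau _{\mathrm{f}}\):

```python
# Sketch of the blur filter used during preprocessing: the variance of the
# Laplacian (LAPV) of the grayscale face crop is compared against a threshold.
# The threshold below is an illustrative placeholder.
import cv2

def is_sharp(face_bgr, tau_f=100.0):
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    lapv = cv2.Laplacian(gray, cv2.CV_64F).var()   # focus measure (LAPV)
    return lapv >= tau_f

# sharp_faces = [f for f in detected_faces if is_sharp(f)]   # keep only sharp faces
```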

5 Evaluation

In the following, we first test the proposed system using the LFW verification dataset to evaluate its accuracy in comparison with other techniques. Then, we evaluate the accuracy on the mobile face recognition dataset and the performance on mobile embedded devices.

5.1 LFW verification

The LFW dataset consists of 13,233 images of 5750 people; the verification test provides 6000 pairs separated into ten equally sized folds. The LFW verification test [11] asks whether a given pair of images shows the same person.

Table 2 shows the LFW verification accuracies of several techniques. The accuracy is obtained by tenfold cross-validation: nine folds are used for training the threshold, the remaining fold is used for testing, and this is repeated ten times.

A pair is labeled as the same person if the Euclidean distance between its embeddings is less than a threshold \(\tau _{\mathrm{d}}\); otherwise, it is labeled as different people. The best threshold found on the training folds is used for the remaining test fold; a minimal sketch of this decision rule follows Table 2. In nine out of ten experiments, the best threshold was 1.01. These results demonstrate that our accuracy is close to that of state-of-the-art deep learning-based techniques.

Table 2 LFW verification accuracies
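The verification decision amounts to thresholding the Euclidean distance between two unit-norm embeddings, with the threshold selected on the training folds. A minimal sketch, in which the candidate threshold grid is an illustrative choice:

```python
# Sketch of threshold-based verification in the LFW protocol: a pair is
# labeled "same person" when the embedding distance is below tau_d, with
# tau_d selected on the training folds.
import numpy as np

def verify(emb1, emb2, tau_d):
    return np.linalg.norm(emb1 - emb2) < tau_d

def best_threshold(distances, same_labels, candidates=np.arange(0.0, 4.0, 0.01)):
    # distances: pairwise embedding distances; same_labels: boolean array
    # (True = same person). Pick the threshold maximizing training accuracy.
    accs = [np.mean((distances < t) == same_labels) for t in candidates]
    return candidates[int(np.argmax(accs))]
```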

5.2 Mobile face recognition dataset

The recognition accuracy on the mobile face recognition dataset using 20 samples per person is 98.88%. The false accept rate (FAR) and false rejection rate (FRR) are 1.03% and 1.25%, respectively.

Figure 6 shows how accuracy varies with the training sample size. As can be seen in Fig. 6, accuracy increases rapidly up to 10 samples and only slowly beyond 20 samples. Thus, we use 20 samples in the enrollment stage of the mobile implementation as a compromise between efficiency and usability.

Table 3 compares recognition accuracy by classifier. As can be seen in Table 3, the nearest neighbor classifier with Euclidean distance achieves performance comparable to the others. For simplicity and ease of implementation on a mobile platform, we use the nearest neighbor classifier as our baseline; a sketch of it follows Fig. 6.

Fig. 6 Accuracy variation with respect to the training sample size
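Recognition then reduces to a nearest-neighbor search over the enrolled embeddings (here, 20 per person). A minimal sketch; the optional rejection threshold is an illustrative addition for handling unknown faces, not a parameter reported in the paper:

```python
# Sketch of the baseline nearest-neighbor classifier: the query embedding is
# assigned the identity of the closest enrolled embedding, optionally rejected
# as unknown when that distance exceeds a threshold.
import numpy as np

def nearest_neighbor(query, gallery, labels, reject_threshold=None):
    # gallery: (N, 128) enrolled embeddings; labels: length-N identity list
    dists = np.linalg.norm(gallery - query, axis=1)
    idx = int(np.argmin(dists))
    if reject_threshold is not None and dists[idx] > reject_threshold:
        return None, dists[idx]                     # unknown face
    return labels[idx], dists[idx]
```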

5.3 Device performance

The evaluation was conducted on a Samsung Galaxy S6 (1.5 GHz octa-core CPU, 3 GB RAM) running Android 6.0.1. The camera resolution was 480 \(\times \) 320 pixels, and a cropped 320 \(\times \) 320 pixel square was used as the initial input. Table 4 shows the execution time for all tasks. The total execution time on the device including all tasks was 277 ms (76 ms for the first two tasks, 200 ms for feature extraction, and 1 ms for authentication), i.e., 3.6 fps.

Table 3 Comparison of face recognition by classifier
Table 4 Comparison of execution time by device

6 Conclusion

This paper proposes a face recognition system for Visual IoT using a compact deep learned feature. A compact deep neural network is adopted to enable execution in mobile embedded environments while maintaining high accuracy. We show competitive accuracy and performance on the LFW verification benchmark; furthermore, we show promising results on a mobile face recognition dataset, and the proposed system runs in real time in a mobile embedded environment despite using a deep neural network. Additionally, face pose in the mobile face recognition scenario was analyzed. In future work, even more efficient and compact deep neural network-based face recognition methods for Visual IoT can be investigated.