1 Introduction

As the world's population ages, caring for the health of the elderly, and especially of those living alone, has become increasingly important. Falls are the leading cause of injury-related death among seniors, because advancing age, weakened muscle strength and chronic diseases all increase the risk of falling [16]. When seniors who live alone cannot help themselves after a fall, they may miss the best window for treatment, which can put their lives at risk. A real-time intelligent surveillance system for fall detection is therefore needed to protect people's health, especially that of seniors.

These considerations have led many researchers to propose intelligent surveillance systems that detect falls automatically while maintaining real-time performance [14]. With the arrival of the 5G era, the Internet of Things has grown rapidly [9, 18, 19], which makes it possible to reduce the cost of installing an intelligent fall detection environment at home. Real-time indoor fall detection methods based on computer vision only require a few cameras to be installed indoors; they automatically locate and track people and detect falls by analyzing their motion. Although wearable sensor-based methods achieve high accuracy [15], the elderly often forget or are unwilling to wear the sensors, which makes these methods difficult to popularize in practice. The biggest advantage of a static vision system is that people do not have to wear any sensor. Many fall detection methods are therefore based on video, and they generally achieve good accuracy. For example, Xu et al. [34] proposed a fall detection method based on 3D skeleton data obtained from the Microsoft Kinect and adopted long short-term memory (LSTM) networks for fall detection. Yao et al. [38] also used 3D skeleton data obtained from the Microsoft Kinect and built a fall detection model named the human torso motion model. Zerrouki et al. [40] detected falls from the variation of the human silhouette shape in video monitoring. Min et al. [13] first applied scene analysis based on a deep learning method and then used the spatial relation between the human and the furniture to detect falls in specific scenes.

At present, fall detection methods based on computer vision have low accuracy when falls must be distinguished from similar activities. In addition, deeper network structures lead to high computational complexity and long running times. Therefore, a novel real-time indoor fall detection method based on computer vision, using geometric features and a convolutional neural network (CNN), is proposed. The main contributions of this work are summarized as follows.

  1. A method that segments the head from the torso is proposed to extract the geometric features of each separately, which addresses the instability of traditional methods based on geometric features.

  2. We propose a shallow CNN that can learn enough features to achieve satisfactory accuracy. Meanwhile, a shallow structure converges fast and maintains real-time performance at low computational cost.

The rest of this paper is organized as follows. In Sect. 2, we give an overview of the current state-of-the-art fall detection methods related to our method. The head segmentation method and the feature extraction method are presented in Sect. 3. Section 4 describes the fall detection method based on CNN. The experimental results are discussed in Sect. 5. Finally, Sect. 6 provides the conclusion and discusses future research directions.

2 Related works

In recent years, deep learning [37, 41], cloud computing [20,21,22,23,24,25], big data [26, 41] and computer vision [10, 29] have been hot research topics.

Deep learning attracts broad attention because it performs well in a variety of visual tasks [30]. Wang et al. [31] applied an algorithm composed of three networks to improve segmentation performance when moving from synthetic data to real scenes. Zhou et al. [43] applied a new deep neural network that improved the performance of semantic segmentation. Real-time performance is important in many deep learning tasks. Zhou et al. [42] applied a lightweight network for real-time image semantic segmentation. Yang et al. [36] adopted a shallow network to ensure both real-time performance and a high recognition rate. Real-time performance is also very important in fall detection: if detection is not performed in real time, the person who has fallen cannot be found in time, and the detection loses its purpose. Therefore, a growing body of research focuses on real-time fall detection, and many methods based on computer vision have been proposed. However, because of the complexity of visual content and the similarity between falls and other ordinary human activities, many challenges remain in fall detection [7]. The main challenges are reducing the computational complexity and improving the accuracy. Many algorithms achieve high accuracy in fall detection but require considerable computation time [2, 28].

A classical approach to fall detection is to analyze the shape variation of a person in each frame of the video. Normally, a 2D global geometric shape is used to represent the person, and geometric features are then extracted from it to judge whether a fall has occurred. There are two classical geometric shape representations in the previous literature: the minimal external bounding box and the approximated ellipse.

  1. Minimal external bounding box: Min et al. [12] used the bounding box to extract geometric features of people, such as the ratio of height to width, to detect falls in different directions. Liu et al. [11] fused three features: the human aspect ratio, the effective area ratio and the center variation rate. This method reduced misjudgment and increased the fall detection rate. Chen et al. [1] used the height and the aspect ratio of the human body across multiple kinds of falls and reduced the false alarm rate by fusing multiple features.

  2. Approximated ellipse: Rougier et al. [28] applied the approximated ellipse to represent the shape of a pedestrian in each frame of a video sequence and detected falls by analyzing the deformation of the human shape. Yu et al. [39] extracted the human contour from the video and fitted it with an ellipse. Shape and position information were combined to describe the features of the shape contour, and an online one-class support vector machine was then used to distinguish different activities. Fan et al. [5] located the human body by a minimum area-enclosing ellipse and then developed a normalized directional histogram around the center of the ellipse to represent the human posture by multi-directional statistical analysis. Finally, a set of features was extracted and fed into a directed acyclic graph support vector machine to distinguish different human postures.

There is also another geometric representation: Chua et al. [3] used three points to represent a person and extracted motion features to detect falls. This method achieved high accuracy for human fall detection in real-time indoor video sequences.

Geometric features are useful in fall detection because they provide a lot of information about human posture. Min et al. [12] applied the bounding box to represent the human shape. However, the feature information extracted from the bounding box is not as accurate as that extracted from the ellipse [27, 32], and it cannot distinguish some similar activities: the bounding box changes shape considerably when pedestrians suddenly stretch out their arms during normal walking, which leads to misjudgment. Although ellipse fitting effectively reduces this problem and removes slender objects carried by pedestrians, some special activities, such as sitting down abruptly or squatting down abruptly, can still easily be misjudged as falls. Furthermore, human motion is highly non-rigid, and different parts of the human body move according to different rules. For highly similar activities, it is therefore inaccurate to use a single traditional geometric shape to represent the whole human body. To address this problem, a novel real-time fall detection method based on head segmentation is proposed.

Considering that the amplitude of the head motion is large during a fall, we extract the head motion as a new set of features. The head is segmented from the torso, and two different ellipses are used to represent the head and the torso, respectively. Then, three features (the ratio of the long axis to the short axis, the orientation angle and the vertical velocity) are extracted from each of the two ellipses and fused into a motion feature based on time series. Lastly, a shallow CNN is trained to learn the correlation between the two elliptic contour features in order to detect falls and distinguish them from similar activities.

3 Head segmentation and motion features extraction

3.1 Foreground detection

For foreground detection, the Gaussian mixture model (GMM) [33] uses multiple Gaussian models to represent the features of each pixel, and each pixel is regarded as a variable. Before foreground detection, the background is trained first, and the GMM is used to model the background in each frame. Then, in the test stage, the GMM is updated whenever a new frame is obtained, and each pixel in the current image is matched against the GMM. If the match succeeds, the pixel is considered background; otherwise, it is foreground. After foreground extraction, a shadow suppression method is applied to suppress shadows. Some holes and noise may remain in the image, and dilation and erosion operations are used to remove them.
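The matching and clean-up steps described above can be sketched in plain Python. This is a deliberately simplified illustration, not the implementation used in the paper: it keeps a single Gaussian per pixel instead of a full mixture, omits shadow suppression, and the parameter names and defaults (`alpha`, `k`, `min_var`) are assumptions chosen for the example.

```python
def update_background(mean, var, frame, alpha=0.05, k=2.5, min_var=4.0):
    """Classify each pixel against its Gaussian, then update matched pixels.

    mean, var and frame are 2D lists of equal shape; mean and var are
    modified in place. Returns a binary mask where 1 marks foreground.
    Simplification: one Gaussian per pixel rather than a mixture.
    """
    mask = []
    for r, row in enumerate(frame):
        mask_row = []
        for c, value in enumerate(row):
            diff = value - mean[r][c]
            # A pixel matches the background if it lies within k standard deviations.
            matched = diff * diff <= k * k * var[r][c]
            mask_row.append(0 if matched else 1)
            if matched:
                # Only matched pixels refine the background model.
                mean[r][c] += alpha * diff
                var[r][c] = max(min_var, var[r][c] + alpha * (diff * diff - var[r][c]))
        mask.append(mask_row)
    return mask


def _morph(mask, pick):
    """Apply a 3x3 morphological operator; windows are clipped at the border."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [mask[rr][cc]
                      for rr in range(max(0, r - 1), min(rows, r + 2))
                      for cc in range(max(0, c - 1), min(cols, c + 2))]
            out[r][c] = pick(window)
    return out


def dilate(mask):
    """A pixel becomes 1 if any pixel in its 3x3 neighbourhood is 1."""
    return _morph(mask, max)


def erode(mask):
    """A pixel stays 1 only if every pixel in its 3x3 neighbourhood is 1."""
    return _morph(mask, min)
```

Applying `erode(dilate(mask))` fills small holes inside the silhouette, while `dilate(erode(mask))` removes isolated noise pixels; the order and number of passes would need tuning on real footage.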

3.2 Head segmentation

When a single traditional ellipse is used to fit the whole human contour, it cannot effectively reflect the difference between some similar activities, which increases the false alarm rate. To improve the ability of our method to distinguish similar activities, the head motion is taken into account, because its amplitude is large during a fall. Therefore, a head segmentation method is proposed, and two different ellipses are used to fit the head and the torso of the human, respectively.

3.2.1 Head pre-location

A head pre-location method is proposed to approximately locate the head position, which is described in Fig. 1.

Fig. 1 The pre-location of the head

In Fig. 1, the foreground is first obtained from the input frame. Then, the head is segmented using the proportion of the head to the body height. Lastly, the approximate position of the head is obtained by fitting a bounding box.
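A minimal sketch of this pre-location step is given below. It assumes an upright person, so that the head occupies the top rows of the foreground silhouette. The head-to-height proportion is not stated in the text, so `head_ratio` is a hypothetical parameter, as are the function names.

```python
def bounding_box(mask):
    """Return (top, bottom, left, right), inclusive, of the nonzero pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return min(rows), max(rows), min(cols), max(cols)


def prelocate_head(mask, head_ratio=0.15):
    """Approximate the head box as the top `head_ratio` of the body height.

    head_ratio is an assumed value for illustration only.
    """
    body = bounding_box(mask)
    if body is None:
        return None
    top, bottom, _, _ = body
    height = bottom - top + 1
    head_rows = max(1, round(height * head_ratio))
    # Keep only the top slice of the silhouette, then fit a box around it.
    head_mask = [row if top <= r < top + head_rows else [0] * len(row)
                 for r, row in enumerate(mask)]
    return bounding_box(head_mask)
```

The resulting box only needs to be approximate, since it serves as the initial window for the tracking step that follows.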

3.2.2 Head tracking

After the head pre-location, the mean shift tracking method [4] is used to track the head. This algorithm has low computational complexity, and the target can be tracked in real time once the target area is known. It is also insensitive to edge occlusion, target rotation, deformation and background motion, so the target is located more accurately. The target model and the candidate model are computed from the distributions of the target region and the candidate region, respectively. Then, a similarity function measures the similarity between the target model of the initial frame and the candidate model of the current frame, and the candidate model that maximizes the similarity function is selected. This yields the mean shift vector of the target model, which is the vector along which the target moves from its initial position to its correct position. By iteratively calculating the mean shift vector, the algorithm converges to the real position of the target and thereby tracks it.

Fig. 2 The results of tracking head. a–f The tracking results under different angles and different people

Firstly, the approximate position of the head can be found out by the head pre-location. Then, the head is tracked by the mean shift tracking method. The results of our method are shown in Fig. 2.
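The iterative window update at the heart of mean shift can be illustrated as follows. This sketch assumes that a per-pixel weight map is already available (for example, a score of how well each pixel matches the target model); it does not build the target and candidate distributions or evaluate the similarity function itself. The function name, the window convention and the stopping rule are assumptions for the example.

```python
import math


def mean_shift(weights, window, max_iter=20):
    """Move a fixed-size window to the local centroid of a weight map.

    weights: 2D list of non-negative numbers (assumed precomputed).
    window: (row, col, height, width) with (row, col) the top-left corner.
    Returns the window after convergence or after max_iter iterations.
    """
    rows, cols = len(weights), len(weights[0])
    r0, c0, h, w = window
    for _ in range(max_iter):
        total = sum_r = sum_c = 0.0
        for r in range(max(0, r0), min(rows, r0 + h)):
            for c in range(max(0, c0), min(cols, c0 + w)):
                wt = weights[r][c]
                total += wt
                sum_r += wt * r
                sum_c += wt * c
        if total == 0:
            break  # no evidence inside the window; keep the current position
        # Re-centre the window on the weighted centroid (the mean shift step).
        new_r0 = math.floor(sum_r / total - (h - 1) / 2 + 0.5)
        new_c0 = math.floor(sum_c / total - (w - 1) / 2 + 0.5)
        if (new_r0, new_c0) == (r0, c0):
            break  # converged: the shift vector is zero
        r0, c0 = new_r0, new_c0
    return r0, c0, h, w
```

In this setting the initial window would come from the head pre-location, and the returned window from one frame would seed the search in the next.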

3.2.3 Ellipse fitting

After the head tracking stage, two ellipses are used to fit the head and the torso, respectively. As shown in Fig. 3, when the traditional ellipse fitting method [17] is used to fit the torso, it cannot effectively reflect the difference between the whole human body and the torso. Therefore, we modify the torso contour to obtain a compact torso elliptical contour. Firstly, the torso contour is approximated by a polygon. Secondly, the midpoints of the sides of the polygon are connected to form a new polygon, and this operation is repeated an odd number of times, after which the shape of the torso contour approaches an ellipse. Lastly, the ellipse is fitted to this modified contour to obtain a more compact representation. Figure 4 shows the torso ellipse extraction diagram, and Fig. 5 shows the results of the torso ellipse fitting: Fig. 5a, c shows the results obtained with the traditional method, while Fig. 5b, d shows those obtained with our method. Compared with the traditional method, a more compact torso ellipse is obtained. The same method is also used for the head fitting. Figure 6 shows the results of the head and torso ellipse fitting, where the blue, green and red ellipses represent the head, the torso and the whole body, respectively. Observing the ellipse fitting for different actions shows that two ellipses fit the human body more accurately than one ellipse.
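The midpoint operation on the polygon can be sketched as below, under the interpretation that each pass replaces the polygon by the polygon of its side midpoints. The number of passes is only described as odd in the text, so the default of three is an assumption, and the function names are hypothetical.

```python
def midpoint_polygon(points):
    """Replace a closed polygon by the polygon formed by its side midpoints."""
    n = len(points)
    return [((points[i][0] + points[(i + 1) % n][0]) / 2,
             (points[i][1] + points[(i + 1) % n][1]) / 2)
            for i in range(n)]


def smooth_contour(points, iterations=3):
    """Apply the midpoint operation repeatedly (iterations is an assumed value)."""
    for _ in range(iterations):
        points = midpoint_polygon(points)
    return points
```

Each pass keeps the centroid of the vertices fixed while shrinking and rounding the outline, which is why the subsequent ellipse fit is more compact than a fit to the raw contour.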

Fig. 3 The whole human body and the torso fitting results based on the traditional method. a, c The whole human body ellipse fittings with the traditional method. b, d The torso ellipse fittings with the traditional method

Fig. 4 The torso ellipse extraction diagram

Fig. 5 The results of the ellipse fitting based on our method. a, c The ellipse fittings based on the traditional method. b, d The ellipse fittings based on our method

Fig. 6 The fitting results of the different actions. a–d The different actions scenarios. The blue ellipse represents the head, the green ellipse represents the torso, and the red ellipse represents the whole body

3.3 Motion features extraction

After the two ellipses have been fitted to the head and the torso, the silhouette features and the velocity feature are extracted from each of them. The silhouette features are the inclination angle of the ellipse \(\Theta\) and the ratio of the long axis to the short axis of the ellipse, \(\rho =a/b\). When a person's action changes, both the angle \(\Theta\) and the ratio \(\rho\) change. When a fall occurs, the velocity in the vertical direction changes rapidly. The vertical velocity of the ellipse center is computed as in Eq. (1):

$$\begin{aligned} v_v=\sqrt{(y_n-y_{n-1})^2+(x_n-x_{n-1})^2}*F*\sin \Theta \end{aligned}$$
(1)

where \(v_v\) is the velocity in the vertical direction; \((x_{n-1}, y_{n-1})\) and \((x_n,y_n)\) are the coordinates of the ellipse center in the \((n-1)\)th and \(n\)th frames, respectively; \(F\) is the number of frames per second; and \(\sin \Theta\) is the sine of the inclination angle of the ellipse.
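Equation (1) translates directly into a few lines of code. The sketch below assumes that the angle is given in radians and that the centers are `(x, y)` pixel coordinates; the function and argument names are illustrative.

```python
import math


def vertical_velocity(prev_center, center, fps, theta):
    """Eq. (1): center displacement between frames, times F, times sin(theta).

    prev_center, center: (x, y) ellipse centers in frames n-1 and n.
    fps: frames per second (F). theta: inclination angle in radians (assumed unit).
    """
    dx = center[0] - prev_center[0]
    dy = center[1] - prev_center[1]
    return math.hypot(dx, dy) * fps * math.sin(theta)
```

For example, a center that moves by 3 pixels in x and 4 pixels in y between two frames at 15 fps, with \(\Theta = \pi/2\), gives \(5 \times 15 \times 1 = 75\) pixels per second.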

Fig. 7 A feature extraction sample. a The red ellipse is the whole body ellipse fitting; the blue ellipse is the head ellipse fitting; and the green ellipse is the torso ellipse fitting. b Schematic diagram describing the characteristics of falls

A feature extraction diagram is presented in Fig. 7. In Fig. 7b, \(a\) and \(b\) denote the long and short axes of the ellipse, respectively; \(\Theta\) denotes the inclination angle of the ellipse; and \(v_v\) denotes the vertical velocity of the center of the ellipse contour. The six extracted features (three from the head ellipse and three from the torso ellipse) are fused into a motion feature based on time series, as shown in Fig. 8.

Fig. 8 The motion feature based on time series
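One plausible way to organize the fused feature is a matrix with one row per feature and one column per frame. The row order used below (head features first, then torso features) is an assumption, since the exact layout is only shown in the figure.

```python
def fuse_features(head_frames, torso_frames):
    """Stack per-frame (rho, theta, v_v) tuples into a 6 x T matrix.

    Rows 0-2 hold the head features and rows 3-5 the torso features;
    column t holds the values for frame t. The row order is assumed.
    """
    if len(head_frames) != len(torso_frames):
        raise ValueError("head and torso sequences must have the same length")
    per_frame = [tuple(h) + tuple(t) for h, t in zip(head_frames, torso_frames)]
    # Transpose from T x 6 to 6 x T so that each row is one time series.
    return [list(channel) for channel in zip(*per_frame)]
```

A sliding window over the columns of this matrix then yields fixed-length inputs for the classifier.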

4 Real-time CNN-based fall detection

To capture the correlation between the head ellipse and the torso ellipse during a fall, deep learning is used to learn the motion features. However, the deeper network structures adopted in previous work suffer from high computational cost and slow convergence [6, 8, 40]. Therefore, we choose a shallow CNN structure. Because the preprocessing described above has already segmented the target from the background, the shallow CNN can learn enough features to achieve satisfactory accuracy. Compared with deeper structures, the shallow structure converges fast and maintains real-time performance at low computational cost.

4.1 Convolutional neural network

A CNN is a deep feed-forward artificial neural network [35] that is widely applied to image analysis. Its biggest advantage is that its weights can be optimized from a large training dataset without tedious manual operation, so as to achieve accurate classification. The main components of a CNN are briefly described below:

  • Convolutional layer: The raw input image is convolved with many trainable filters (also called convolutional kernels), and additive bias vectors are applied, to obtain multiple feature maps.

  • Pooling layer: Generally placed after the convolutional layer, it performs down-sampling to reduce the dimension of the features. The two most common pooling methods are max pooling and mean pooling.

  • Fully connected layer: After the original image has been processed by multiple convolutional and pooling layers, the output features are flattened into a one-dimensional vector and used for classification. Other features can be appended to this one-dimensional vector in this layer.
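The first two building blocks can be illustrated on a one-dimensional signal with plain Python. This is a didactic sketch only: it uses a single filter, no padding and a stride of one, and, as is common in deep learning libraries, the "convolution" is implemented as a sliding dot product without flipping the kernel.

```python
def conv1d(signal, kernel, bias=0.0):
    """Slide the kernel over the signal (valid mode) and add a bias."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Replace negative activations by zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Keep the maximum of each non-overlapping window of the given size."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]
```

For the signal `[0, 1, 3, 2, 5, 4, 1, 0]` and the kernel `[1, -1]`, the convolution gives `[-1, -2, 1, -3, 1, 3, 1]`, and pooling the rectified result with a window of two gives three values, 0, 1 and 3. Flattening such outputs from all filters produces the vector that the fully connected layer consumes.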

4.2 CNN-based fall detection

In this paper, the shallow CNN structure shown in Fig. 9 is trained on the motion features based on time series. There are 74 training videos and 28 test videos; the learning rate is set to 0.00001 and the number of epochs to 500. Firstly, 196 filters of size \(1 \times 12\) are used in the convolutional layer to learn from the three divided feature maps based on time series and obtain a rich feature representation of the data. The network contains only one convolutional layer. Then, after the ReLU activation function is applied to the 196 feature maps, a max pooling layer of size \(1 \times 4\) reduces their dimensionality by a factor of four. The feature maps output by the pooling layer are flattened and concatenated with some statistical features (e.g., the mean value), and the fully connected layer maps them to 1024 features. Finally, the output of the fully connected layer is passed to the softmax function, which computes the final classification. The model is trained to minimize the cross-entropy loss, augmented with \(l_2\)-norm regularization of the CNN weights. The back-propagation algorithm is used to calculate the gradient, and a modified stochastic gradient descent method is used to optimize the network parameters.

Fig. 9 The structure of the proposed shallow CNN
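To make the layer sizes concrete, the following sketch works through the shape bookkeeping for the described structure and shows the softmax step. Several details are assumptions because the text does not state them: the length of the input time window, the number of appended statistical features, valid convolution with a stride of one, non-overlapping pooling, and each filter producing a single output row.

```python
import math


def shallow_cnn_shapes(window_len, n_stats, n_filters=196, kernel=12,
                       pool=4, fc_units=1024):
    """Layer output sizes for one convolution, one pooling and one dense layer.

    window_len and n_stats are hypothetical inputs; the filter count, kernel
    size, pooling size and dense width follow the values given in the text.
    """
    conv_len = window_len - kernel + 1   # valid convolution, stride 1 (assumed)
    pooled_len = conv_len // pool        # non-overlapping max pooling (assumed)
    flat = n_filters * pooled_len        # flattened pooled feature maps
    return {"conv": (n_filters, conv_len),
            "pool": (n_filters, pooled_len),
            "fc_in": flat + n_stats,
            "fc_out": fc_units}


def softmax(logits):
    """Convert class scores into probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With a hypothetical window of 60 frames and 6 statistical features, the convolution yields 196 maps of length 49, pooling reduces them to length 12, and the fully connected layer receives 196 × 12 + 6 = 2358 inputs.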

5 Experiment results and analysis

All the experiments are carried out on a laptop PC with an Intel(R) Core(TM) i5-4300U CPU @ 1.9GHz and 4GB RAM. To test the CNN structure in this paper, we simulate falls and normal daily activities to collect a large number of video frame samples. Multiple monocular cameras are used to film 102 short videos from different views and heights. These videos include normal activities such as crouching down, walking, squatting down and sitting down, as well as simulated falls in different directions such as backward falls, forward falls and sideways falls. Figure 10 shows different normal activities and simulated falls in various scenes. In the test dataset, there are 30 simulated fall activities and 28 normal activities.

Fig. 10 Different activities on our self-collected dataset. a Sitting down. b, c Crouching down. d Walking. e Squatting down. f–h Different falling down

Sufficient training and test frame samples are collected from the self-collected dataset. The detailed description of the experimental data is given in Table 1: the training dataset contains 14284 positive sample frames and 18614 negative sample frames, while the test dataset contains 4247 positive sample frames and 5530 negative sample frames.

The six features are fused into a motion feature based on time series, as shown in Fig. 8, and used as the input for training and testing the CNN. The test results are shown in Table 2. The fall detection rate of this method is 90.5\(\%\), and the false alarm rate is 10.0\(\%\).
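The two reported metrics can be computed from confusion counts as sketched below. The formulas are not spelled out in the text, so the definitions here are assumptions: the detection rate is taken as the fraction of falls that are detected, and the false alarm rate as the fraction of normal activities flagged as falls. Other conventions exist, for instance dividing false alarms by the total number of alarms raised.

```python
def detection_rate(tp, fn):
    """Fraction of actual falls that were detected (assumed definition)."""
    return tp / (tp + fn)


def false_alarm_rate(fp, tn):
    """Fraction of normal activities reported as falls (assumed definition)."""
    return fp / (fp + tn)
```

With made-up counts of 9 detected falls, 1 missed fall, 2 false alarms and 18 correctly rejected normal activities, these functions return 0.9 and 0.1; the counts are illustrative and are not taken from the tables.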

Table 1 The detailed description of experimental data on the self-collected dataset
Table 2 The test results on self-collected dataset by our method

When a single elliptic contour is used to fit the human body and detect falls, some special activities, such as sitting down abruptly, can easily be misjudged as falls. For similar activities such as a sideward fall and crouching down, the single-ellipse approach also has a high false detection rate. Therefore, the accuracy of our method in discriminating between such similar activities is tested. Figures 11 and 12 show two groups of similar activities and their motion features based on time series, demonstrating the feasibility of our method for distinguishing similar activities. Figure 11a, c shows a sideward fall and crouching down, while Fig. 12a, c shows a backward fall and sitting down. The size and shape of the fitted ellipse are very similar within each group, but Figs. 11b, d and 12b, d show obvious differences in how the head and torso features of the two activities change. According to our experiments, when two ellipses are used to represent the head and the torso, respectively, the two similar activities can be effectively distinguished.

Fig. 11 The results analysis of sideward fall and crouching down activities. a Sideward fall. b The head and the torso features (sideward fall) based on time series, respectively. c Crouching down. d The head and the torso features (crouching down) based on time series, respectively

Fig. 12 The results analysis of backward fall and sitting down activities. a Backward fall. b The head and the torso features (backward fall) based on time series, respectively. c Sitting down. d The head and the torso features (sitting down) based on time series, respectively

To further demonstrate the effectiveness of this method, extensive experiments are carried out to compare it with some classical methods. Three classical algorithms are implemented in this paper: the bounding box ratio analysis approach [33], the ellipse shape analysis approach [27] and Chua’s approach [3]. The experimental results of these methods on the self-collected dataset are shown in Table 3.

The ellipse shape analysis approach [27] uses an ellipse to fit the whole person, and ellipse features and motion history images are fused to detect falls. The bounding box ratio analysis approach [33] uses a traditional bounding box to represent a person and detects a fall by analyzing the aspect ratio of the bounding box. Chua’s approach [3] uses three points to represent the human body and extracts features from the three points to detect falls. As shown in Table 3, our method achieves a fall detection rate of 90.5\(\%\) and a false alarm rate of 10.0\(\%\). The fall detection rates are as follows: the bounding box ratio analysis approach (60.0\(\%\)), the ellipse shape analysis approach (70.0\(\%\)), Chua’s approach (83.3\(\%\)) and our method (90.5\(\%\)). The false alarm rates are as follows: our method (10.0\(\%\)), Chua’s approach (13.8\(\%\)), the ellipse shape analysis approach (22.2\(\%\)) and the bounding box ratio analysis approach (25.0\(\%\)). Compared with the other traditional geometric feature methods, the two ellipses we use achieve a higher fall detection rate and a lower false alarm rate. There are two reasons for this. First, two ellipses fitted to the head and the torso, respectively, follow the contour of the human body more closely than other geometries and therefore yield more accurate motion features. Second, a shallow CNN is applied to learn the correlation between the two elliptic contour features, which allows some similar activities to be distinguished accurately.

Table 3 The experimental results of these methods on the self-collected dataset
Table 4 The experimental results of real-time frame rate

We have also carried out experiments to test real-time performance. As shown in Table 4, the camera used records video at 15 fps. In our tests, the head segmentation stage runs at 17.7 fps and the motion feature extraction stage at 17.0 fps. The CNN-based fall detection stage is also fast: processing the 9777 frames of the test data takes 66.3 s, which corresponds to about 147 frames per second. The proposed method therefore has good real-time performance.

6 Conclusions

In this paper, a novel real-time fall detection method based on head segmentation and a CNN is proposed. This method differs from traditional approaches that use a single geometric representation. Firstly, the head is segmented from the body, and two different ellipses are used to represent the head and the torso, respectively. Then, three features (the ratio of the long axis to the short axis, the orientation angle and the vertical velocity) are extracted from each of the two ellipses in every frame and fused into a motion feature based on time series. Finally, a shallow CNN is used to learn the relations between the two ellipses in order to distinguish similar activities. Compared with the other methods tested, our method distinguishes some similar activities more effectively, which increases the detection rate. The experiments also show that the proposed method has good real-time performance. In future research, we will look for ways to cope with occlusion in more realistic indoor environments and explore the possibility of applying this method outdoors.