1 Introduction

With the fast growing population of seniors, more and more elderly people in developed countries are living alone [23]. Each year, one out of three adults aged 65 years and older falls [10]. A fall at home is one of the major risks for elderly people, so immediate alarming and help are essential to reduce morbidity and mortality [20]. Since fall detection is one of the most important healthcare issues for elderly people at home, a number of research projects have been conducted to detect falls automatically. An excellent survey of recent developments can be found in [22].

Table 1 Image samples of the three states of single frames in the Multiple Cameras Fall Dataset and our dataset

Fall detection methods are normally divided into two categories: sensor-based and vision-based methods. Sensor-based approaches, including wearable devices and ambient devices, have been widely used in the past years to detect human falls. Wearable sensor-based devices, such as accelerometers and gyroscopes, are attached to the human body. These wearable devices collect and provide data to a computer system or an embedded system, which then analyzes the data to detect falls [5, 12, 15]. Wearable sensor-based methods have the advantages of low computational cost and easy installation. However, they are intrusive and often burdensome, so elderly people tend to dislike them and forget to wear these vital sensors. Ambient devices are sensors installed in the elders' active regions [1], such as vibration sensors on the floor, to improve the performance [6, 13]. Ambient sensor-based approaches eliminate the need to wear sensors all the time; however, sensors must be installed all over the environment. Moreover, the major problem is that ambient sensors often generate false alarms, leading to low detection accuracy [14]. In recent years, vision-based methods have become more and more popular, as they can effectively detect multiple events simultaneously with less intrusion.

In this paper, a new method is proposed for fall detection based on computer vision and machine learning. A simple and effective deep learning network, PCANet, is introduced to train a single-frame human fall detection model, which is then used to label every image. The structure of this paper is as follows. In Section 2, we review related work, whereas in Section 3, we present our human fall detection framework, which includes two stages. In more detail, Section 3.1 describes the feature learning algorithm with PCANet and Section 3.2 describes the fall action detection model with SVM. In Section 4, we evaluate the performance of our model on a publicly available dataset [2] and also on the dataset collected by ourselves. In Section 5, we give the final conclusions.

2 Related works

With the advances in computer vision in the last few decades, computer vision-based methods provide a promising way to detect falls. Analysis of the person's bounding box is a simple method for detecting a fall from surveillance video and is easy to implement [18, 19]. However, a bounding box cannot efficiently discriminate fall-down events from fall-like events, such as sitting, squatting, or other normal behaviors. The accuracy of this method highly depends on the relative position of the person and the field of view of the camera, and it works effectively only when the surveillance camera is placed sideways or at the same level as the human subject. To better represent the human shape, an ellipse, instead of a bounding box, was introduced and good accuracy was achieved in fall detection. Chen et al. [8] used an ellipse to represent the human in the video and combined human shape analysis with other analyses, namely motion analysis and posture estimation, to detect falls. In comparison with the bounding box method, the ellipse-based approach gives a better representation of the human shape and good accuracy in fall detection. However, fall-like events were still falsely detected as fall-down events using ellipses.

3 Our approach

In our approach, we apply feature learning methods to detect a fall action. The flow chart of our fall detection framework based on PCANet is shown in Fig. 1. We obtain two models after the training stage: one is a single-frame detection model and the other is an action model. In training stage 1, the training set is extracted from video sequences with different views. The training samples are sub-images containing the human at a specific resolution, extracted using ViBe [3], a powerful technique for background detection and subtraction in video sequences. The training samples for the single-frame detection model are labeled with three classes, namely Standing, Falling and Fallen, and then a PCANet model is trained on all the samples. A fall incident is a continuous event spanning many frames, so reliable fall detection should analyze a video sequence instead of a single frame. In training stage 2, based on the prediction results of the PCANet model for each frame, an action model is obtained by an SVM trained on the predicted labels of the frames in sub video clips.

Fig. 1 Flow chart of the human fall detection framework based on PCANet

3.1 Features extraction with PCANet

For image classification tasks, hand-crafted low-level features generally work well for specific tasks or datasets, like SIFT and HOG for object recognition [4, 11]. Yet they are not universal for all conditions. Therefore, the idea of learning features from the data of interest was proposed to make up for the limitations of hand-crafted features, and deep learning is treated as a better way to abstract high-level features that provide more invariance to intra-class variability. Compared with convolutional neural networks (CNN), a complex neural network architecture that requires expertise in parameter tuning and long training times, PCANet is superior for its easy training and better adaptation to different conditions.

The basic architecture of PCANet can be seen in Fig. 2. It is composed of three stages: two PCA stages followed by a hashing and histogram stage. Assume that we have \(N\) training images of size \(m \times n\). In each image, we take a patch of size \(k_1 \times k_2\) around each pixel, as illustrated in Fig. 3, collect all the patches, vectorize them, and combine them into a matrix with \(k_1 k_2\) rows and \((m - k_1 + 1) \times (n - k_2 + 1)\) columns.

Fig. 2 The structure of the two-stage PCANet

Fig. 3 Illustration of patch taking for a 5×5 image

For the \(i\)th image \(I_i\), we obtain such a matrix \(X_i\); subtracting the patch mean from each patch and concatenating over all images, we obtain:

$$ X = \left[\overline{X_1},\overline{X_2},\dots, \overline{X_N}\right]\in {R}^{k_1{k}_2\times Nc} $$
(1)

where \(c = (m - k_1 + 1)(n - k_2 + 1)\) denotes the number of patches (columns) of each \(X_i\). Then we compute the eigenvectors of \(XX^T\) and keep the ones corresponding to the \(L_1\) largest eigenvalues as the PCA filters, which can be expressed as:

$$ {W}_l^1={q}_l\left(X{X}^T\right)\in {R}^{k_1{k}_2},l=1,2,\dots, {L}_1 $$
(2)

The leading principal eigenvectors capture the main variation of all the mean-removed training patches. Thus, we finish the first stage.
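As a minimal illustration of this first stage (patch extraction around each pixel, patch-mean removal, and the eigendecomposition of \(XX^T\) in Eqs. (1)-(2)), the following NumPy sketch follows the formulas above; the function names, loop-based patch extraction, and shapes are our own choices and not taken from the original implementation.

```python
import numpy as np

def extract_mean_removed_patches(image, k1, k2):
    """Collect all k1 x k2 patches of a 2-D image (Fig. 3), vectorize them,
    and subtract the mean from each patch, as in Eq. (1).
    Returns a (k1*k2) x c matrix, with c = (m-k1+1)*(n-k2+1) patches."""
    m, n = image.shape
    cols = []
    for i in range(m - k1 + 1):
        for j in range(n - k2 + 1):
            patch = image[i:i + k1, j:j + k2].reshape(-1)
            cols.append(patch - patch.mean())
    return np.stack(cols, axis=1)

def pca_filters(X, L):
    """Return the L leading eigenvectors of X X^T as PCA filters, Eq. (2).
    Each column of the result is one vectorized k1*k2 filter."""
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:L]        # keep the L largest
    return eigvecs[:, order]

def stage1_filters(images, k1, k2, L1):
    """Stage 1: concatenate the mean-removed patch matrices of all N training
    images into X (Eq. 1) and learn the L1 filters (Eq. 2)."""
    X = np.hstack([extract_mean_removed_patches(I, k1, k2) for I in images])
    return pca_filters(X, L1)
```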

The second stage follows a similar process to stage 1. The input images \(I_i^l\) of stage 2 are:

$$ {I}_i^l={I}_i*{W}_l^1,i=1,2,\dots, N $$
(3)

where the boundary of \(I_i\) is zero-padded so that \(I_i^l\) has the same size as \(I_i\). We collect all the patches of \(I_i^l\), subtract the patch mean from each patch, and obtain:

$$ {Y}^l=\left[\overline{Y_1^l},\overline{Y_2^l},\dots, \overline{Y_N^l}\right]\in {R}^{k_1{k}_2\times Nc},l=1,2,\dots {L}_1 $$
(4)

Then we combine all \(Y^l\) into a single matrix:

$$ Y=\left[{Y}^1,{Y}^2,\dots, {Y}^{L_1}\right]\in {R}^{k_1{k}_2\times {L}_1Nc} $$
(5)

After that, we compute the eigenvectors of \(YY^T\) and keep the ones corresponding to the \(L_2\) largest eigenvalues as the PCA filters of the second stage:

$$ {W}_{\ell}^2={q}_{\ell}\left(Y{Y}^T\right)\in {R}^{k_1{k}_2},\ell =1,2,\dots, {L}_2 $$
(6)
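Under the same assumptions, the second stage can be sketched as follows: each image is filtered with the stage-1 filters (zero-padded so the output keeps the input size, Eq. (3)), and the PCA step is repeated on the mean-removed patches of all resulting maps (Eqs. (4)-(6)). `scipy.signal.convolve2d` is used here as a convenient stand-in for the 2-D filtering; `extract_mean_removed_patches` and `pca_filters` are the helpers from the stage-1 sketch.

```python
import numpy as np
from scipy.signal import convolve2d

def filter_outputs(image, filters, k1, k2):
    """Apply each filter to the zero-padded image (Eq. 3), so each output
    map has the same size as the input.  `filters` holds one vectorized
    k1 x k2 filter per column."""
    maps = []
    for l in range(filters.shape[1]):
        w = filters[:, l].reshape(k1, k2)
        maps.append(convolve2d(image, w, mode="same"))  # zero-padded boundary
    return maps

def stage2_filters(images, w1, k1, k2, L2):
    """Learn the L2 stage-2 filters (Eq. 6) from the mean-removed patches of
    all stage-1 output maps (Eqs. 4-5)."""
    Y = []
    for I in images:
        for fm in filter_outputs(I, w1, k1, k2):
            Y.append(extract_mean_removed_patches(fm, k1, k2))
    Y = np.hstack(Y)
    return pca_filters(Y, L2)
```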

In the final stage, for each input image of stage 2, we compute

$$ {T}_i^l={\displaystyle \sum_{\ell =1}^{L_2}{2}^{\ell -1}H\left({I}_i^l*{W}_{\ell}^2\right)},l=1,2,\dots {L}_1 $$
(7)

The function \(H(\cdot)\) binarizes these outputs, i.e. its value is one for positive inputs and zero otherwise. Each of the \(L_1\) images \(T_i^l\), \(l = 1, 2, \dots, L_1\), is partitioned into \(B\) blocks; we compute the histogram (with \(2^{L_2}\) bins, over the value range \(\left[0, 2^{L_2}-1\right]\)) of each block and vectorize the resulting \(2^{L_2} \times B\) matrix into a row vector \(Bhist(T_i^l)\). Finally, we concatenate the \(Bhist(T_i^l)\), \(l = 1, 2, \dots, L_1\), into the feature

$$ {f}_i={\left[ Bhist\left({T}_i^1\right),\dots, Bhist\left({T}_i^{L_1}\right)\right]}^T\in {R}^{\left({2}^{L_2}\right){L}_1B} $$
(8)

Depending on the specific application, the local blocks can be either overlapping or non-overlapping; in our work, they are non-overlapping [7].
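The hashing and block-histogram step of Eqs. (7)-(8) can then be sketched as follows, reusing `filter_outputs` from the stage-2 sketch. The non-overlapping block partition matches the choice stated above; how the original code handles blocks that do not divide the image size evenly is not specified, so any leftover border is simply ignored here.

```python
import numpy as np

def pcanet_feature(image, w1, w2, k1, k2, L2, block):
    """Compute the PCANet feature f_i of Eqs. (7)-(8) for one image:
    binarize the stage-2 outputs, combine them into integer-valued maps
    T_i^l, and concatenate non-overlapping block histograms with 2^L2 bins."""
    feature = []
    for stage1_map in filter_outputs(image, w1, k1, k2):          # L1 maps
        T = np.zeros_like(stage1_map)
        for ell, stage2_map in enumerate(filter_outputs(stage1_map, w2, k1, k2)):
            T += (2 ** ell) * (stage2_map > 0)                    # Eq. (7), H(.)
        # histogram of values in [0, 2^L2 - 1] in each block, Eq. (8)
        m, n = T.shape
        for i in range(0, m - block + 1, block):
            for j in range(0, n - block + 1, block):
                hist, _ = np.histogram(T[i:i + block, j:j + block],
                                       bins=2 ** L2, range=(0, 2 ** L2))
                feature.append(hist)
    return np.concatenate(feature)
```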

To use PCANet to extract features from fall images, we take a set of frames of standing and fallen people from videos recording a person falling down, and label each frame with its state. The model parameters of PCANet include the patch size \(k_1 \times k_2\), the numbers of filters \(L_1\) and \(L_2\), the number of stages, and the block size for the histograms. In our experiments, we set the image size to 60 × 60, the patch size to 10 × 10, the number of stages to 2, \(L_1 = L_2 = 10\), and the block size to 25 × 25. The features extracted by PCANet are then fed into an SVM for classification with the attached labels.
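Putting the pieces together, a hedged end-to-end sketch with the hyperparameters reported above might look as follows; the helper functions are the ones from the previous sketches, and the use of scikit-learn's `LinearSVC` as the linear SVM is our assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hyperparameters as reported in the text.
K1 = K2 = 10          # patch size
L1 = L2 = 10          # number of filters per stage
BLOCK = 25            # non-overlapping block size for the histograms

def train_single_frame_model(images, labels):
    """images: list of 60x60 grayscale arrays; labels: 0/1/2 frame states.
    Learns the two PCANet stages, extracts features, and fits a linear SVM."""
    w1 = stage1_filters(images, K1, K2, L1)          # stage-1 sketch
    w2 = stage2_filters(images, w1, K1, K2, L2)      # stage-2 sketch
    feats = np.stack([pcanet_feature(I, w1, w2, K1, K2, L2, BLOCK)
                      for I in images])
    clf = LinearSVC().fit(feats, labels)
    return w1, w2, clf
```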

3.2 Fall action detection with SVM

In our work, we trained two linear SVM models. The single-frame SVM model is used to predict the labels of individual frames generated from video sequences, while the action model predicts the label of a video. Single frames would normally be classified into two states, "Fall" and "Not Fall", but for some frames it is hard to assign a label. For example, a fall that ends on the knees is an ambiguous case, because the person may still be able to move, i.e. they might stand up or might subsequently lie down. Thus, we use three states for a single frame: Standing, Falling and Fallen. A frame in which the person is clearly standing up has the state "Standing", a frame in which the person has completely fallen down has the state "Fallen", and a frame between standing and fallen down has the state "Falling" (see Table 1). The features obtained from the PCANet model are then fed into a linear SVM classifier to train a single-frame fall detection model.

Considering the fact that a fall is a continuous action spanning many frames, reliable fall detection should analyze a video sequence rather than a single frame. Thus, we take every 30 consecutive frames, which can capture a complete fall incident, as a sub video sequence. The sub video sequences for training are classified into two states: "Fall" and "Not Fall". We use the single-frame SVM model to predict the frames of each sub video, obtaining a sequence of 30 prediction labels. Combined with the sub video's state, "Fall" or "Not Fall", each sub video produces one training sample. The prediction results of all the sub videos form the training set for the second-stage SVM classifier.
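A sketch of this second training stage under the same assumptions: each 30-frame sub video is passed through the single-frame model, and the resulting sequence of 30 predicted frame labels, together with the sub video's "Fall"/"Not Fall" state, forms one training sample for the action-level SVM. The text does not state any further encoding of the label sequence, so the raw 30-label vector is used directly here; `K1`, `K2`, `L2` and `BLOCK` are the constants from the previous sketch.

```python
import numpy as np
from sklearn.svm import LinearSVC

WINDOW = 30  # frames per sub video sequence, as stated in the text

def action_training_set(sub_videos, video_labels, w1, w2, frame_clf):
    """sub_videos: list of 30-frame lists of 60x60 sub-images;
    video_labels: 1 for 'Fall', 0 for 'Not Fall'.
    Each sample is the sequence of 30 frame-level predictions."""
    X = []
    for frames in sub_videos:
        feats = np.stack([pcanet_feature(I, w1, w2, K1, K2, L2, BLOCK)
                          for I in frames])
        X.append(frame_clf.predict(feats))       # 30 predicted labels
    return np.stack(X), np.asarray(video_labels)

def train_action_model(sub_videos, video_labels, w1, w2, frame_clf):
    X, y = action_training_set(sub_videos, video_labels, w1, w2, frame_clf)
    return LinearSVC().fit(X, y)
```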

4 Experiments

To evaluate the performance of the proposed method, we apply our model to the publicly available Multiple Cameras Fall Dataset and to a dataset collected by ourselves. The former dataset records simulated falls and normal daily activities from eight cameras mounted on the walls in oblique settings at 30 fps with a resolution of 720×480 pixels. It consists of 24 kinds of fall incidents (11 crouching, 9 sitting and 4 lying on a sofa) captured by the eight cameras. We extensively evaluate the proposed method on the dataset collected by ourselves. Our dataset includes 192 fall videos from four cameras mounted on the walls in oblique settings at 30 fps with a resolution of 352×288 pixels. In the process of collecting the fall videos, we simulated fall activities in four different directions, with four different postures in each direction, and each posture was repeated three times.

We carry out eight experiments on the Multiple Cameras Fall Dataset and four experiments on our dataset; each experiment tests the videos captured from one camera view. For each experiment, we cyclically choose one video as the test set and use the remaining videos as the training set for the single-frame model. The Multiple Cameras Fall Dataset records 24 kinds of short videos of falls and normal activities, and our dataset records 48 kinds of fall events. Table 1 shows image samples of the three states of single frames in the Multiple Cameras Fall Dataset and our dataset. The three states are "Standing", "Falling" and "Fallen", with corresponding labels 0, 1, and 2.

The procedure for generating the training set and the test set is shown in Table 2. First, the background model is obtained with ViBe, and then the foreground mask containing the human is extracted. After some morphological operations and connected component analysis, we can locate the human in the foreground mask with a bounding box. With a series of post-processing operations we obtain the sub foreground images. These images are then normalized to a resolution of 60×60 pixels for prediction by the PCANet model. The last row of Table 2 shows the predicted labels.

Table 2 A sub video sequence of 30 frames used to train the action model in the second stage. The second row is the original image sequence; the third row is the foreground mask; the fourth row is the sub foreground image; the last row is the predicted label of the single-frame detection model
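ViBe itself is patented and not bundled with standard OpenCV builds, so the following sketch substitutes OpenCV's MOG2 background subtractor to illustrate the same pre-processing chain (foreground mask, morphological cleaning, largest connected component, bounding box, and normalization of the crop to 60 × 60); it approximates the described pipeline rather than reproducing the authors' implementation.

```python
import cv2

def extract_person_subimages(video_path, size=(60, 60)):
    """Approximate pre-processing chain: background subtraction, morphological
    cleaning, largest connected component, bounding box, 60x60 normalization."""
    cap = cv2.VideoCapture(video_path)
    bg = cv2.createBackgroundSubtractorMOG2(detectShadows=False)  # stand-in for ViBe
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    crops = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = bg.apply(frame)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4 API
        if not contours:
            continue
        x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        crops.append(cv2.resize(gray[y:y + h, x:x + w], size))
    cap.release()
    return crops
```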

We used sensitivity and specificity, which were commonly used in the fall detection literature [13], as indices to evaluate our proposed method.

$$ \mathrm{Sensitivity}:\; S_1=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
(9)
$$ \mathrm{Specificity}:\; S_2=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} $$
(10)

where:

  • TP is the number of falls correctly predicted by the system. It corresponds to the value in the first row and the first column of a confusion matrix.

  • FN is the number of actual falls missed by the system.

  • FP is the number of false detections of falls.

  • TN is the number of normal activities which are correctly predicted as ‘not falls’.
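For reference, a minimal computation of the two indices from true and predicted video labels (1 = "Fall", 0 = "Not Fall"); the function and variable names are ours.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN), Specificity = TN / (TN + FP),
    with label 1 = 'Fall' and 0 = 'Not Fall'."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)
```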

The proposed method was compared quantitatively with five state-of-the-art approaches on the same publicly available dataset provided in [2, 17]: bounding box ratio analysis [21], Chen's approach, the MHI-based approach [16], the biomechanics approach, and Chua's approach [9] based on a three-point representation. The experiments show that the proposed method achieves reliable results compared with the other common methods on the Multiple Cameras Fall Dataset (see Table 3). The biomechanics approach can achieve 100 % sensitivity and specificity, but people need to wear burdensome sensors all the time. On our dataset, the proposed method achieves 93.81 % sensitivity and 98.4 % specificity for camera 1, which is higher than the other vision-based methods (see Table 4). The average performance of 88.87 % sensitivity and 98.9 % specificity is also better than the other methods in general.

Table 3 Comparison of the proposed method with other state-of-the-art approaches
Table 4 Results of the proposed method on the dataset collected from four cameras in different views

5 Conclusions

In this paper, we proposed a new framework for fall detection based on automatic feature learning. We extracted features with PCANet and trained two SVM classifiers to detect fall incidents. The experiments show that, with the two models, our proposed method achieves nearly equal performance on the Multiple Cameras Fall Dataset compared with other state-of-the-art methods. Moreover, we obtained better performance when increasing the number of training samples in our dataset. We believe the performance can be further improved if a larger-scale dataset is used and more camera views are involved.