
1 Introduction

Today, smartphones have become a natural part of our daily lives; we rely on them more than ever. Their functionalities are diverse, yet their full potential remains to be unleashed, and one untapped capability is the power of their built-in sensors. Figure 1 illustrates several of the many embedded sensors commonly found in modern smartphones, including the accelerometer, gyroscope and magnetometer, all of which are standard in modern COTS (commercial off-the-shelf) devices. Beyond smartphones, sensors (and in particular motion sensors) play an important role in the design of many “smart” products. They have well-known applications such as security and games, but also less obvious ones such as human activity recognition, which detects changes in the movement of different body parts. Because smartphones are widely available commercial devices, using them as a basis for human activity recognition opens the door to widespread usage and numerous applications. It can also enable large-scale data mining and significantly accelerate research in the behavioural and social sciences.

Fig. 1. Embedded sensors in a smartphone

Prior research has already been devoted to determining the effectiveness of sensors in the field of activity recognition. In one of the first works in this area, Bao and Intille [4] investigated the performance of recognition algorithms using five accelerometers attached around the body, achieving a high overall accuracy of 84 %. Subsequent research has expanded on this line of work, introducing different methods of data collection (varying the number, type and on-body position of the sensors), different classification models, and different lists of target activities.

To date, these efforts cover a wide range of scenarios in which such a system may be used. For example, Kwapisz et al. [13] explore an outdoor environment with common actions that reflect changes in body movement and posture, whilst Hung et al. [10] focus on recognising social actions at informal gatherings. Ermes et al. [6] take an interesting approach by classifying a wide range of sport-related actions, including rowing, running, football and cycling. Other works focus on drawing conclusions about human behaviour; for example, Hung et al. [9] use a worn accelerometer to track body movement with the aim of detecting conversing groups in a dense social setting, and from this analyse social behaviour within these groups, such as dominance, leadership and cohesion.

However, much less work has targeted practical everyday situations. For example, Yang et al. [25] require the sensors to be distributed all around the body and to be strictly oriented in their proper positions, otherwise accuracy is compromised. Such a system is clearly impractical to incorporate into our everyday activities. If a system were designed for widespread or commercial use, the more important criteria would include availability, accessibility, flexibility and ease of use.

Therefore, this paper extends beyond prior work by simplifying the activity recognition process so that this field of research becomes practical for everyday applications. We aim to accomplish activity recognition using widely available, non-research devices with minimal intrusion into everyday activities, whilst maintaining an acceptable level of classification accuracy and efficiency. The smartphone therefore stands out as the most appropriate device for this task. This goal was investigated with much success by Kwapisz et al. [13], who also used a smartphone for activity recognition; however, they only accessed the accelerometer and only explored a few outdoor motion-based activities. In this paper, we develop an Android system that can accurately and efficiently recognize basic human actions and postures using several sensors embedded in a smartphone, including the accelerometer, gyroscope and magnetometer. We intend to use more of the smartphone's sensors and to cover a wider range of everyday human actions. Our main contributions are summarized as follows:

  • We address a real-time human activity recognition (HAR) problem using a COTS smartphone. Our approach is lightweight, low-cost, and unobtrusive in the sense that only a smartphone placed in a pocket is required. It relaxes the requirement that people wear multiple devices (e.g., sensors or transceivers) for daily activity recognition.

  • We compare a series of classification methods, including k Nearest Neighbors, Decision Tree and LDA, in terms of recognition accuracy and computational efficiency, which paves the way for deploying machine learning techniques in a practical, everyday, less computation-demanding human activity recognition system.

  • We conduct extensive experiments to validate our proposed approach. The experimental results demonstrate that our system achieves up to 100 % accuracy in real-world environments. In particular, we implement the system on an Android smartphone and release the APK (Android application package) file and a demo video, an important step towards a real-time, practical HAR system.

The rest of the paper is organized as follows. In Sect. 2, we illustrate our HAR system in terms of hardware and activity list. We describe our proposed approach in Sect. 3. In Sects. 4 and 5, we report the experimental results. We overview the related work in Sect. 6 and wrap up the paper in Sect. 7 with conclusions and a discussion of future research.

2 System Overview

This section provides an outline of our proposed system. We first give a brief overview of the embedded smartphone sensors, then define an appropriate scenario from which the activity list for this system is devised.

2.1 Built-In Sensors

The sensors used in this paper are the accelerometer, the gyroscope and the magnetometer. They are found in almost all modern smartphones and provide a decent baseline for distinguishing between actions. An Android phone is used to supply the required sensors because it is popular, open-source, widely available and, most importantly, easily accessible. Unlike much prior work, our system does not require any separate sensors; consequently, there is less freedom in deciding where to attach the sensing device. To keep our work practical, we only consider realistic and common places to carry a smartphone, such as pockets (shirt, front or rear pants), bags, cases worn around the waist or simply the hand. Weighing the commonness of each location against its perceived effectiveness for action recognition, we consider the shirt pocket the best option for this system: it is a reasonably common place for a smartphone, and it is very effective at differentiating almost all postures because it easily captures changes and movements in the upper body. Hence, the shirt pocket is the chosen placement of the smartphone in our work.

Fig. 2. List of actions and postures to be recognized

2.2 Defining Activity List

Prior efforts have investigated numerous activities in many different scenarios. In this paper, the aim is not to improve existing models or methods, but rather to move this research area towards a more practical focus. Thus, whereas past works defined lists of related activities primarily suited to their research purposes, our system attempts to cover the range of possible actions one may be performing. We first consider all the possible states the human body may be in, and then define the following five basic actions that cover all these states:

Walk: Body in motion
Stand: Stationary vertical position with 180\(^{\circ }\) straight knees
Sit: Stationary vertical position with 90\(^{\circ }\) bent knees
Crouch: Stationary vertical position with knees bent less than 90\(^{\circ }\)
Lie: Stationary horizontal position

Although these actions have clear and precise definitions in a normal everyday context, here we loosen the definitions slightly so that each action acts as a class and encompasses many similar actions, provided that the definition above is satisfied (for example, under this definition a squat would also be considered a crouch). This eliminates the need to specify an unnecessarily large number of actions, yet still covers all possible human actions.

Given a particular action class, it is possible to separate the encompassed states into what we define as posture. In this context, postures are simply states that are variations of the same action. In this paper, we consider the postures for standing, sitting and lying.

Standing postures considered include:

Backward: Standing position with backwards lean
Straight: Straight standing position
Forward: Standing position with forwards lean
Bend: Standing position with body bent forwards to about 90\(^{\circ }\)

Sitting postures considered in this paper include:

Lean: Sitting position with backwards lean
Upright: Upright sitting position
Slouch: Sitting position with forward slouch
Rest: Sitting position with forward lean to rest on some surface

Lying postures considered in this paper include:

Back: Lying position with chest facing upwards
Side: Lying position with chest perpendicular to the bed
Stomach: Lying position with chest facing downwards

Given any one posture, a person can also perform a range of activities. For example, a leaning posture whilst sitting may include activities such as watching, reading, and many others. However, preliminary testing indicated that distinguishing activities with the same posture was very hard to achieve given the setup of our system (with only one sensor at chest position). Thus, we will only consider recognizing the different actions and postures. Figure 2 illustrates the set of defined actions and postures that are to be trained for use with the system.

Fig. 3. Sensor readings for each of the basic actions

3 Methodology

3.1 Collection of Training Data

An intermediate version of the final system was developed to perform the training data collection. This intermediate system involves setting up the sensors outlined in Sect. 2.1, and then periodically accessing these sensors and writing the sensor data to a CSV file (Fig. 3).

To collect the required training data, a number of people were asked to perform the sequence of actions and postures specified in Sect. 2.2. The training system was installed on an Android smartphone, which was placed in their shirt pockets.

We treat each data point as an instantaneous reading from each of the three sensors in all three dimensions (x, y and z), giving a 9-dimensional data point. For each of the specified actions and postures, we collected almost 1000 data points, or in some cases a number sufficient for that particular action to be distinguished. Our pilot experiments confirmed that the sensor readings of these actions differ in one or more of the 9 dimensions, supporting our assumption.
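
To make the data layout concrete, the following minimal sketch (in Python) shows how such 9-dimensional samples could be loaded from the logged CSV file into a feature matrix and a label vector. The column names and the presence of a label column are assumptions made for illustration; they are not the exact format produced by our collection app.

```python
# Sketch: load logged sensor readings into a 9-dimensional feature matrix.
# The CSV layout (one row per sampling instant, a final "label" column) is an
# assumption for illustration, not the exact format written by our logger.
import numpy as np
import pandas as pd

SENSOR_COLUMNS = [
    "acc_x", "acc_y", "acc_z",     # accelerometer
    "gyro_x", "gyro_y", "gyro_z",  # gyroscope
    "mag_x", "mag_y", "mag_z",     # magnetometer
]

def load_training_data(csv_path):
    """Return (X, y): X is an (N, 9) array of readings, y the action/posture labels."""
    df = pd.read_csv(csv_path)
    X = df[SENSOR_COLUMNS].to_numpy(dtype=float)
    y = df["label"].to_numpy()
    return X, y
```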

3.2 Classification Algorithms

This paper explores three well-known classification algorithms, namely k-Nearest Neighbor, Decision Tree Learning and Linear Discriminant Analysis.

k Nearest Neighbour. k Nearest Neighbour (kNN) is a non-parametric classification algorithm and one of the simplest to implement. It assigns the output class by a majority vote of the k nearest neighbours, where neighbours can be computed using a variety of distance functions; in this paper we use the Euclidean distance. Given a set of training sensor data and a test observation, the action label is estimated from the training samples whose sensor readings have the minimal distance to the test observation. Assume we have a training dataset \(\mathbf {T} = \{(\mathbf {s}_1,y_1),(\mathbf {s}_2,y_2),...,(\mathbf {s}_N,y_N)\}\) with N samples, where \(\mathbf {s}_i\in \mathbb {R}^{D}\) is a vector of sensor readings and \(y_i\in \mathbf {l} = \{l_1, ..., l_{J}\}\) is the corresponding action label. Then, given a distance measure and a test reading \(\mathbf {o}\), we search for its k nearest neighbours, denoted \(N_k(\mathbf {o})\). Finally, the test data is classified by a majority vote of its neighbours, being assigned the most common action label \(y^{*}\) among its k nearest neighbours:

$$\begin{aligned} y^{*} = \text {arg}\max _{l_j}\sum _{\mathbf {s}_i\in N_k(\mathbf {o})}\mathbb {I}(y_i=l_j) \end{aligned}$$
(1)

where \(i=1,2,...,N; j=1,2,...,J\), and \(\mathbb {I}\) is an indicator function that equals 1 if \(y_i=l_j\) and 0 otherwise. In the case of tied votes (which may occur with an even value of k), we break the tie by choosing the single nearest neighbour among the k nearest neighbours.
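
As a concrete illustration of Eq. (1), the following Python sketch re-implements this rule (Euclidean distance, majority vote, ties broken by the single nearest neighbour). It is an illustrative re-implementation, not the code used in our system.

```python
# Sketch of the kNN rule in Eq. (1): Euclidean distance, majority vote,
# ties broken by the single nearest neighbour, as described above.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, o, k=1):
    """Classify one test reading o (shape (9,)) given training arrays X_train, y_train."""
    dists = np.linalg.norm(X_train - o, axis=1)   # Euclidean distance to every training sample
    nn_idx = np.argsort(dists)[:k]                # indices of the k nearest neighbours
    votes = Counter(y_train[nn_idx]).most_common()
    # Majority vote; if the two top labels tie, fall back to the single nearest neighbour.
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return y_train[nn_idx[0]]
    return votes[0][0]
```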

Decision Tree Learning. Decision Tree Learning (DTL) is a very popular classification algorithm based on inductive inference. A decision tree (or classification tree) is a tree in which each internal (non-leaf) node is labelled with an input feature, the arcs leaving a node are labelled with the possible values of that feature, and each leaf is labelled with a class or a probability distribution over the classes. A decision tree is built from the features of the training data, and new instances are classified by traversing the tree from the root node to a leaf. This paper, however, aims to accurately distinguish multiple activities, which is essentially a supervised learning problem with several outputs to predict. When there is no correlation between the outputs, a very simple solution is to build J independent models, one per output, and use them to predict each of the J outputs independently. However, because the output values associated with the same input are likely to be correlated, it is often better to build a single model capable of predicting all J outputs simultaneously: first, it requires less training time, since only a single estimator is built; second, the generalization accuracy of the resulting estimator may often be increased. With regard to decision trees, we adopt this strategy to support multi-output problems.
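
For illustration, the sketch below shows this multi-output strategy with a single scikit-learn decision tree that predicts an action label and a posture label jointly. The random placeholder data and the label encoding are assumptions made purely for the example; the paper's evaluation itself was carried out in MATLAB.

```python
# Sketch: one decision tree predicting action and posture jointly (multi-output).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 9))                         # placeholder 9-D sensor readings
y_action = rng.integers(0, 5, size=100)          # e.g. walk/stand/sit/crouch/lie
y_posture = rng.integers(0, 4, size=100)         # e.g. posture within an action
Y = np.column_stack([y_action, y_posture])       # (N, 2): two outputs, one estimator

tree = DecisionTreeClassifier().fit(X, Y)
print(tree.predict(X[:3]))                       # each row contains both predicted outputs
```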

Linear Discriminant Analysis. Linear Discriminant Analysis (LDA) is another feature-based classification method. The idea is to find linear combinations of the features of the training data that produce an optimal separation of the classes. LDA is also commonly used as a dimensionality reduction technique in the pre-processing step of classification applications: the goal is to project a dataset onto a lower-dimensional space with good class separability, in order to avoid over-fitting (the curse of dimensionality) and to reduce computational costs, whilst retaining the information needed to distinguish between the classes.
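
A minimal sketch of LDA in this dual role (projection onto a lower-dimensional, class-separating space, and direct classification) is given below, again on placeholder data; scikit-learn merely stands in for the tools actually used in our evaluation.

```python
# Sketch: LDA as dimensionality reduction and as a classifier, on placeholder data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.random((200, 9))                 # placeholder 9-D sensor readings
y = rng.integers(0, 5, size=200)         # placeholder action labels (5 classes)

lda = LinearDiscriminantAnalysis(n_components=2)
X_low = lda.fit_transform(X, y)          # projection to at most (n_classes - 1) dimensions
pred = lda.predict(X)                    # LDA also acts directly as a classifier
print(X_low.shape, pred[:10])
```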

4 Evaluation

4.1 Comparison of Different Methods

These three algorithms were analyzed in terms of implementation, classification accuracy and efficiency (i.e., model training and testing time). The analysis was carried out in MATLAB.
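
The actual comparison was run in MATLAB; the sketch below outlines an equivalent harness in Python that records accuracy together with training and classification time for the three candidate classifiers, using random placeholder data in place of our recorded sensor readings.

```python
# Sketch of the comparison harness: accuracy plus training / classification time.
# Placeholder data stands in for the 9-D sensor readings described in Sect. 3.1.
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.random((1000, 9))
y = rng.integers(0, 5, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=1),
    "DTL": DecisionTreeClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
}
for name, model in models.items():
    t0 = time.time()
    model.fit(X_tr, y_tr)                 # training time
    train_t = time.time() - t0
    t0 = time.time()
    acc = model.score(X_te, y_te)         # classification time and accuracy
    test_t = time.time() - t0
    print(f"{name}: accuracy={acc:.3f}, train={train_t:.4f}s, classify={test_t:.4f}s")
```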

Fig. 4. Comparison of different classification algorithms

Fig. 5. Comparison of the average classification time

Figure 4 shows the overall performance of each algorithm using the training data collected in each of the four detection modes. kNN gives the best overall performance compared to DTL and LDA. DTL has lower accuracy and higher training and classification times (in all four detection modes) than kNN, and is therefore inferior to kNN in all aspects (Fig. 5). Compared with kNN, LDA has slightly lower accuracy but a considerably lower classification time and may therefore be useful for speeding up classification; however, classification time was not perceived to cause any performance issues in this system. Therefore, kNN is the best option of the three (Fig. 6).

Fig. 6. Comparison of the average training time

Fig. 7. Effect of varying k on classification accuracy

4.2 Optimal Selection of Parameters

As discussed in Sect. 4.1, kNN was chosen for the classification process in the final system. Its main parameter, k (the number of neighbours used), can significantly affect the prediction accuracy and therefore requires some analysis to determine its optimal value for our system. This was done in MATLAB using 10-fold cross-validation: k was varied from 1 to 500 and the prediction accuracy was recorded for each value. Across all four detection modes there is a clear trend of accuracy decreasing as k increases, with the maximum accuracy achieved at \(k=1\) (Fig. 7).
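
A sketch of this parameter sweep is shown below. It mirrors the 10-fold cross-validation procedure on placeholder data and samples only a few k values from the 1–500 range for brevity; the reported results were obtained in MATLAB on the real sensor data.

```python
# Sketch of the k sweep with 10-fold cross-validation on placeholder data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 9))                # placeholder 9-D readings
y = rng.integers(0, 5, size=1000)        # placeholder action labels

for k in [1, 5, 10, 50, 100, 500]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    print(f"k={k:4d}  mean accuracy={scores.mean():.3f}")
```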

This is an important observation. The fact that all cases show the same trend and that the optimal value of k is always 1 suggests that this may be a recurring feature of our system (provided that the data continues to be generated in the same way, thus maintaining approximately constant size and noise). Although this is not sufficient evidence to draw a definitive conclusion, it can nevertheless form a basis for further exploration. The implication is that the system and/or the datasets can be expanded without the need to re-calculate the optimal value of k.

4.3 Development of Real-Time HAR System

The final stage of the system involved integrating the material in the previous sections and completing the development of the system.

Figure 8 presents the system in its completed state (Footnote 1). Every second, the system retrieves the sensor readings, predicts the action performed using kNN on the stored training data, and automatically updates the displayed word, renders the corresponding image and outputs the appropriate sound for that action. The implementation of the classification algorithm is sourced from a modified version [15] of the open-source library WEKA.
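
Conceptually, the per-second recognition loop can be sketched as follows. The deployed app is written for Android on top of the modified WEKA kNN, so read_sensors() and update_ui() below are hypothetical placeholders standing in for the Android sensor and user-interface calls.

```python
# Sketch of the per-second recognition loop; read_sensors() and update_ui()
# are hypothetical placeholders for the Android sensor and UI APIs.
import time
import numpy as np

def recognition_loop(knn_model, read_sensors, update_ui, period_s=1.0):
    """Every period_s seconds, classify the latest 9-D reading and refresh the UI."""
    while True:
        reading = np.asarray(read_sensors()).reshape(1, -1)  # one 9-D sample
        action = knn_model.predict(reading)[0]               # kNN prediction
        update_ui(action)                                    # update word, image and sound
        time.sleep(period_s)
```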

Fig. 8. Developed system as Android app

A settings menu was added to enable the user to change the detection mode, mute the sound, and/or switch to training mode. The first option is accomplished by replacing the current training dataset with a new training dataset loaded from memory. The last option simply brings the app back to the intermediate state described in Sect. 3.1 and is used to obtain further training data.

5 In-situ Experiments

With the completed system, we performed numerous tests, similar to the process in which the training data was collected, except that we are now interested in the action output by the system. We evaluated the system's performance in each of the detection modes using the precision and recall metrics. The results of this analysis are shown in Tables 1, 2, 3 and 4.

Table 1. Basic actions
Table 2. Standing postures
Table 3. Sitting postures
Table 4. Lying postures
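
The per-class precision and recall reported in these tables can be computed as in the brief sketch below; the label sequences shown are illustrative only and are not our experimental results.

```python
# Sketch: per-class precision and recall from observed vs. predicted actions.
from sklearn.metrics import classification_report

y_true = ["walk", "stand", "sit", "sit", "lie", "crouch", "stand"]   # illustrative only
y_pred = ["walk", "stand", "sit", "stand", "lie", "crouch", "stand"]
print(classification_report(y_true, y_pred, zero_division=0))
```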

Overall, the precision and recall values for all the models are very high, indicating that the system is able to accurately recognize the specified actions and postures with minimal error. Of the four detection modes, the basic actions category is the least accurate, with some slight confusion between the similar actions of walking, standing and sitting. Nevertheless, the achieved accuracy is acceptable for most applications (Footnote 2).

6 Related Work

The goal of activity recognition is to detect human physical activities from data collected through various sensors. There are generally three main research directions: (i) attaching multiple extra sensors and RFID tags to the human body, (ii) deploying sensors or transceivers in the environment so that people do not have to carry them, and (iii) utilizing a COTS smartphone that almost everyone already has, without adding extra cost or location constraints.

6.1 Wearable Sensors Based HAR

Wearable sensors such as accelerometers and gyroscopes are commonly used for recognizing activities [4, 12]. For example, the authors in [11] design a network of three-axis accelerometers distributed over a user's body; activities can then be inferred from the information the accelerometers provide about the orientation and movement of the corresponding body parts. Bao and Intille [4] investigated the performance of recognition algorithms using five accelerometers attached around the body, achieving a high overall accuracy of 84 %. Apart from sensors, RFID has been increasingly explored in HAR systems. Some research efforts propose to realize human activity recognition by combining passive RFID tags with traditional sensors (e.g., accelerometers) [5, 17, 23], while others are dedicated to exploiting the potential of “pure” RFID techniques for activity recognition [16, 26]. For example, Wang et al. [24] use RFID radio patterns to extract both spatial and temporal features, which are in turn used to characterize various activities. Asadzadeh et al. [3] propose to recognize gestures with passive tags combined with multiple subtags to tackle the uncertainty of RFID readings. However, all of these efforts usually require people to carry sensors or tags, or even RFID readers (e.g., wearing a bracelet). In summary, wearable sensor based approaches have obvious disadvantages, including the discomfort of wires attached to the body and the irritation caused by wearing sensors for long durations.

6.2 Environmental Sensors Based HAR

As a result, research efforts exploring environmental sensor based HAR (also called device-free HAR) have emerged recently [19, 20]. Such approaches exploit radio transmitters installed in the environment, so that people are free from carrying any receiver or transmitter. Most device-free approaches concentrate on analyzing and learning the distribution of received signal strength or radio links. For example, Ichnaea [21] realizes device-free human motion tracking by exploiting existing wireless networks: it first uses statistical anomaly detection methods to achieve its detection capability and then employs an anomaly-score-based particle filter and a human motion model to track a single entity in the monitored area. Zhang et al. [29] develop a tag-free human sensing approach using an RFID tag array. More recently, the authors of [8] and [22] propose device-free activity recognition approaches using sensor arrays. RF-Care [27, 28] recognizes human falls and activities in a device-free manner based on a passive RFID array. WiVi [1, 2] uses the ISAR technique to track the RF beam, enabling through-wall human posture recognition. Though promising, such HAR systems require extra hardware and also place a strict constraint on human mobility (i.e., users are limited to the area where the environmental sensors are deployed).

6.3 Smartphone Based HAR

Recently, smartphone-based HAR systems have also become very popular due to their low cost and low intrusiveness [6, 10, 14, 18]. These methods utilize the accelerometers and gyroscopes embedded in smartphones to recognize human activities [13]. The HAR system presented in this paper belongs to this category. Compared with the other two techniques, the smartphone-based approach has three advantages: (i) it needs no additional hardware and hence adds no financial burden; (ii) it substantially relaxes the restriction on human motion areas, unlike environmental sensor based systems that assume the target user is always located within a specific area; and (iii) human daily activity contexts recognized by a smartphone can be integrated much more easily into modern IoT (Internet of Things) infrastructures, given the built-in Internet connectivity, computation and storage capabilities of smartphones.

To date, many smartphone-based attempts have been made. For example, Kwapisz et al. [13] introduce a system that uses phone-based accelerometers to perform activity recognition: it first collects labelled accelerometer data from twenty-nine users and then uses the resulting training data to induce a predictive recognition model. Hung et al. [10] adopt the accelerometer to automatically recognize socially relevant actions, including speaking, stepping, drinking and laughing. Henpraserttae et al. [7] propose a method (using a transformation matrix to project sensor data) that allows data collected from different positions around the body to be rectified into one universal coordinate system. Ermes et al. [6] take an interesting approach by classifying a wide range of sport-related actions, including rowing, running, football and cycling. Some researchers also aim to mine useful social behaviours; for example, Hung et al. [9] use a worn accelerometer to track body movement and detect conversing groups in a dense social setting, and from this further analyze social behaviour within these groups, including dominance, leadership and cohesion.

However, much less work has focused on targeting this research at practical everyday situations, where systems that are cumbersome to set up are clearly impractical to incorporate into our everyday activities. For a system designed for widespread or commercial purposes, the more important criteria include availability, accessibility, flexibility and ease of use. Therefore, our system extends beyond prior work by simplifying the activity recognition process so that this field of research becomes practical for everyday applications. We aim to accomplish activity recognition using widely available, non-research devices with minimal intrusion into our everyday activities, whilst maintaining an acceptable level of classification accuracy and efficiency.

7 Conclusion

In this paper, we have demonstrated the possibility of using the sensors embedded in smartphones to accurately perform action recognition. We defined actions and postures that cover the range of possible states the human body may be in, and differentiated between them with very high precision and recall values using k Nearest Neighbour as the classification algorithm.

Our approach towards practicability has proven feasible in this paper. The fact that we can now obtain action and posture data without any human effort allows for large-scale (and possibly more accurate) data collection. This opens up a number of possibilities and applications, including human activity monitoring (for health and/or research purposes), enhanced features for leisure apps such as games involving body-part movements, and automatic action identification systems incorporated into social networking apps.

Yet there is still much room for improvement. One limitation of this system is the requirement that the phone be placed in the shirt pocket. This is clearly impractical for widespread usage, since different people carry their phones in different ways; relaxing this constraint would be a worthwhile extension to our system.

Another area for improvement is devising a method to recognize the activities associated with each posture. As mentioned earlier, this is very difficult (perhaps even impossible) given the current setup, because activities with the same posture mainly differ in arm movements, which cannot be detected with a single sensor on the chest. One possibility may be to access additional smartphone sensors, such as the microphone or light sensor, to provide more information for classifying certain activities. Overall, the ways to extend this system are vast, and the potential applications are even greater.