1 Introduction

Recent smartphones not only serve as communication devices but also provide a rich set of embedded sensors, such as an accelerometer, digital compass, gyroscope, GPS receiver, microphone, and camera. As mobile phones carry an increasingly diverse set of sensors, the sensed information is used in a wide variety of domains such as healthcare, social networks, safety, environmental monitoring, and transportation [1].

Because current mobile devices include built-in sensors that provide real-time data gathered from the surroundings of the devices, it is possible to infer the user’s current status [2]. Activity is a useful context for providing personalized services such as health monitoring, user-adaptive interfaces, and content recommendation. Because of this importance, many researchers have investigated activity classification using smartphones. In most cases, machine learning techniques are used to classify activities on smartphones [3].

Recognizing human activities from data sequences is a challenging issue [4]. To implement practical activity-aware systems, the underlying recognition module has to handle the noisy data and complexities of the real world [5]. Furthermore, such systems must respect important constraints such as relatively limited memory capacity, lower CPU power (data-processing speed), and limited battery life [6]. Many classification methods have been investigated. Some research incorporated simple heuristic classifiers [7]. Other studies used more generic machine learning methods, including decision trees, Bayesian networks [6, 8–10], support vector machines [11], neural networks [12], and Markov chains [13–15].

Probabilistic models such as Bayesian networks and hidden Markov models (HMMs) are appropriate for dealing with the vagueness and uncertainty of real-life data in context-aware services. However, it is difficult to apply them on mobile devices because they require a lot of memory and CPU time. A hierarchical probabilistic model is a combination of several separate classification modules, and this modular structure is suitable for overcoming the limitations of the mobile environment. This paper proposes an activity-aware system using a hierarchical probabilistic model, layered HMMs, to recognize a user’s activities in real time. The activity recognition system is developed on an Android smartphone and has two layers of HMMs to deal with both short-term and long-term activities. To evaluate the usefulness of the system, we collected mobile sensing data from graduate students and compared the accuracy of the system with alternative methods. Experimental results demonstrate the superior performance of the proposed method over the alternatives in classifying both long-term and short-term activities. The proposed method performed up to 10 % better than the alternatives.

2 Related works

2.1 Activity recognition with mobile sensors

There are many studies on recognizing a user’s activities with mobile sensors. Kwapisz et al. proposed an activity recognition system using the accelerometer on an Android phone [16]. Longstaff et al. presented a semi-supervised learning method to train a classifier for monitoring patients [3]. Maguire et al. classified human activities using k-nearest neighbor and decision tree classifiers [17]. They extracted features such as mean, standard deviation, energy, and correlation from acceleration and heart rate data. Győrbíró et al. developed a system to recognize a person’s activities from acceleration data using a feed-forward neural network [18]. Song et al. proposed an activity recognition system for the elderly using a wearable sensor module including an accelerometer [12]. They used a multi-layer perceptron to recognize activities of daily life. Zappi et al. used HMMs to recognize a user’s activities from acceleration data [13]. Berchtold et al. proposed a fuzzy-inference-based classifier to recognize a user’s activities [19]. Recently, many researchers have attempted to combine various sensors placed at different positions to attack the problem [20–22]. Table 1 summarizes the studies on recognizing human activities using mobile sensors.

Table 1 Activity recognition using accelerometers

Most studies in activity recognition have used discriminative approaches such as support vector machines (SVMs) and decision trees, ignoring the time-series characteristics of sensor signals. Although these models are easy to implement, they require a rich set of features, which in turn increases computational cost in exchange for algorithmic simplicity. To take advantage of the inherent time-series characteristics of sensor signals, we propose an activity recognition system using layered HMMs to deal with multi-modal sensors efficiently.

2.2 Hierarchical probabilistic models

There are many probabilistic models with a hierarchical structure, such as the hierarchical Bayesian network (HBN), hierarchical dynamic Bayesian network (HDBN), hierarchical hidden Markov model (HHMM), and layered HMMs (LHMM). A model with a hierarchical structure is an effective solution to problems that can be divided into smaller units. For instance, to recognize human activities, the model of a human can be divided into smaller parts such as the body, head, and arms. Hierarchical models can improve the accuracy and speed of activity recognition.

Min and Cho [11] proposed a method to recognize activities by combining multiple classifiers such as a support vector machine (SVM) and a Bayesian network (BN). Park and Aggarwal presented a method for the recognition of two-person interactions using a hierarchical Bayesian network. They divided the overall body pose into separate body parts. The pose of each body part is modeled at the low level of the BN, and the overall pose is estimated at the high level of the BN [8]. Mengistu et al. used a hierarchical HMM for a spoken dialogue system [27]. Wang and Tung recognized dynamic human gestures using dynamic Bayesian networks, which represent a multi-level probabilistic process like a hierarchical HMM [9]. Du et al. proposed a dynamic Bayesian network based method to recognize activities. They divided the features for human activities into global and local features, and built a hierarchical DBN model to combine the two [10].

Table 2 summarizes examples of hierarchical models. In most cases, a hierarchical model is used to recognize an activity that consists of smaller behaviors or time-series patterns.

Table 2 Related works using hierarchical model

3 Activity recognition using layered HMM

The proposed structure analyzes sensor data and recognizes a user’s behaviors in two phases. In the first phase, after data are collected from the sensors on a smartphone, they are passed to the first-level HMMs and preprocessing units to classify the user’s short-term activities. In the second phase, the second-level HMMs recognize the user’s long-term activities from the temporal pattern of the short-term activities and the other features. Figure 1 shows the process of the entire system.

Fig. 1 Activity recognition system overview

3.1 Feature selection for HMM state recognition

In many applications of activity recognition on a mobile device, the data are high-dimensional. Since high-dimensional data often require a large amount of memory and CPU time to analyze, dimensionality reduction is essential for recognizing behaviors in a mobile environment. To reduce the dimensionality of raw data, most classifiers extract features from the data. However, it is difficult to extract features that reduce dimensionality without degrading recognition performance. Usually, feature extraction involves a function that measures the capability of the feature set to discriminate between the classes [29].

This section presents the feature selection for the activity recognition system. First, the window size used to collect raw samples and turn them into features is reviewed. Then, the information gain of each feature is calculated as a measure for feature selection. Finally, the classification method for each feature is chosen according to this criterion.

A window of 5 s is used as the period of activity recognition. Window sizes in a mobile environment vary with the sensor types and the activities to be recognized. Huynh and Schiele compared diverse window sizes (0.25, 0.5, 1, 2, and 4 s) for recognizing behaviors such as ‘jogging,’ ‘walking,’ ‘skipping,’ ‘hopping,’ and ‘standing’ [30]. Bao and Intille used a window size of 6.7 s for activity recognition [31]. Kern et al. used 1 s to detect human activity using body-worn sensors [32]. A smaller window is not effective for capturing certain long-term activities, and a larger window may include noise since it could span multiple activities. The window size of 5 s was used in our previous work [33] for classifying the classes of activities that this work targets.
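The windowing step can be illustrated with a minimal sketch. The sampling rate, the function name `sliding_windows`, and the choice of non-overlapping windows are assumptions made for illustration; the text only fixes the window length at 5 s.

```python
def sliding_windows(samples, rate_hz, window_s=5.0):
    """Split a list of samples into consecutive, non-overlapping windows."""
    size = int(rate_hz * window_s)
    if size <= 0:
        raise ValueError("window must contain at least one sample")
    # Drop the trailing partial window so every window has the same length.
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


# Hypothetical stream: 12 s of readings at an assumed 10 Hz sampling rate.
stream = list(range(120))
windows = sliding_windows(stream, rate_hz=10)
print(len(windows), len(windows[0]))
```

Each returned window would then be turned into one feature vector or one observation sequence for the first-layer models.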

To measure the “value” of each feature, its information gain (or predictive power) is calculated [34]. Let F be the set of all features and X the set of all training samples, and let value(x, f) with x ∈ X denote the value of a specific sample x for feature f ∈ F. E denotes the entropy. The information gain for a feature f ∈ F is defined as follows:

$${\text{IG}}(X,f) = E(X) - \sum\limits_{v \in {\text{values}}(f)} \frac{|\{ x \in X \mid {\text{value}}(x,f) = v\} |}{|X|} \cdot E(\{ x \in X \mid {\text{value}}(x,f) = v\} )$$
(1)

The information gain is the difference between the total entropy and the weighted entropy that remains once the value of a specific feature is known. It can be used as a score to measure the power of prediction [35].
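As a rough illustration of Eq. (1), the following sketch computes the information gain of a categorical feature with respect to activity labels. The function names and the toy data are hypothetical, and a continuous sensor value would first have to be discretized, a step this sketch does not cover.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """IG(X, f): entropy of the labels minus the weighted entropy of each
    subset of samples that share the same value of feature f."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data: the first feature separates the classes, the second does not.
labels = ["walk", "walk", "stay", "stay"]
ig_informative = information_gain(["high", "high", "low", "low"], labels)
ig_uninformative = information_gain(["a", "b", "a", "b"], labels)
print(ig_informative, ig_uninformative)
```

A feature that splits the samples into pure groups attains the full label entropy as its gain, while a feature that is independent of the labels scores zero.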

As Table 3 shows, orientation, pitch, roll, acceleration, and magnetic field have relatively high information gain scores. On the other hand, the other sensors, such as the light sensor, location, and proximity sensor, have lower information gain scores. This result implies the need to analyze the acceleration, magnetic field, and orientation values more carefully. To capture temporal patterns in detail, a hidden Markov model (HMM) is applied to the values of the three sensors with high information gain scores. The other features are processed using simple rules with pre-defined thresholds.

Table 3 Information gain scores for feature selection

3.2 First layer HMMs for short-term activity recognition

An HMM is a probabilistic model based on Markov chains, and it is suitable for handling time-series data in areas such as speech processing and stochastic signal processing [36]. The HMM λ consists of three elements as follows:

$$\lambda = (A,B,\pi )$$
(2)

where λ represents an HMM, A is the state transition probability distribution, B is the observation probability distribution, and π is the initial state distribution. Let us assume that the model has M observation symbols and N states. $A = \{a_{ij}\}$, containing the transition probability from state i to state j, is defined as follows:

$$a_{ij} = P(q_{t + 1} = S_{j} |q_{t} = S_{i} ), \quad 1 \le i, j \le N.$$
(3)

where $a_{ij} > 0$ for all i, j. The observation probability in state j, $B = \{b_{j}(k)\}$, is defined as follows:

$$b_{j} (k) = P(x_{k}\, {\text{at}} \;t|q_{t} = S_{j} ), \quad1 \le j \le N, \quad1 \le k \le M.$$
(4)

The initial state probability in state i, $\pi = \{\pi_{i}\}$, is defined as follows:

$$\pi_{i} = P(q_{1} = S_{i} ), \quad1 \le i \le N.$$
(5)

The probability of the observation sequence $X = x_{1}, x_{2}, \ldots, x_{T}$ given the model λ, P(X|λ), is calculated by enumerating every possible state sequence of length T (the number of observations). A fixed state sequence Q is defined as follows:

$$Q = q_{1} , q_{2} , \ldots , q_{T}$$
(6)

where $q_{1}$ is the initial state. The probability of the observation sequence X for the state sequence of Eq. (6) is

$$P(X|Q,\lambda ) = \prod\limits_{t = 1}^{T} P\left( x_{t} |q_{t} ,\lambda \right) = b_{q_{1}} (x_{1}) \cdot b_{q_{2}} (x_{2}) \cdot \ldots \cdot b_{q_{T}} (x_{T}).$$
(7)

The probability of such a state sequence Q can be written as

$$P(Q|\lambda ) = \pi_{{q_{1} }} a_{{q_{1} q_{2} }} a_{{q_{2} q_{3} }} \ldots a_{{q_{T - 1} q_{T} }} .$$
(8)

The joint probability of X and Q, that is, the probability that X and Q occur simultaneously, is simply the product of the above two terms as follows.

$$P(X,Q|\lambda ) = P(X|Q,\lambda )P(Q|\lambda ).$$
(9)

The probability of X (given the model) is obtained by summing the joint probability in Eq. (9) over all possible state sequences Q as follows.

$$P(X|\lambda ) = \sum\limits_{all\;Q} {P(X|Q,\lambda )P(Q|\lambda ) = \sum\limits_{{q_{1} ,q_{2} , \ldots ,q_{T} }} {\pi_{{q_{1} }} b_{{q_{1} }} (x_{1} )a_{{q_{1} q_{2} }} b_{{q_{2} }} (x_{2} ) \ldots a_{{q_{T - 1} q_{T} }} b_{{q_{T} }} (x_{T} )} }$$
(10)
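The enumeration in Eqs. (7) to (10) can be written out directly for a very small model. The sketch below is only an illustration: the function name and the two-state, two-symbol parameters are made up, and the loop visits all N^T state sequences, so it is practical only for short sequences. The forward recursion used later in the paper computes the same quantity far more cheaply.

```python
from itertools import product


def likelihood_by_enumeration(pi, A, B, obs):
    """P(X | lambda) as the sum over every state sequence Q of
    P(X | Q, lambda) * P(Q | lambda)."""
    n_states = len(pi)
    total = 0.0
    for Q in product(range(n_states), repeat=len(obs)):
        # pi_{q1} * b_{q1}(x1), then a_{q(t-1) q(t)} * b_{q(t)}(x_t) for t > 1.
        p = pi[Q[0]] * B[Q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][obs[t]]
        total += p
    return total


# Toy model; all numbers are invented for illustration.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j]: transition from state i to state j
B = [[0.9, 0.1], [0.2, 0.8]]   # B[j][k]: probability of symbol k in state j
p = likelihood_by_enumeration(pi, A, B, [0, 1, 0])
print(p)
```

A quick sanity check is that the likelihoods of all possible observation sequences of a given length sum to one.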

There are two types of HMMs, continuous and discrete, according to the data type. While the discrete HMM deals with discrete data from a categorical distribution, the continuous HMM uses a single Gaussian or a mixture of Gaussians as the continuous observation distribution. Using Gaussian mixtures for observation distributions requires evaluation of the probability densities in the mixture for each feature vector at each state in the HMM. Since the evaluations are computationally complex, they account for much of the time spent in activity recognition [37]. To recognize activities more efficiently, we quantize the feature vectors into a finite set of symbols prior to activity recognition, as shown in Fig. 2.

Fig. 2 First layer HMMs for short-term activity

K-means clustering is performed to quantize the feature vectors into a finite set of symbols. Clustering is a method of assigning a set of samples to groups according to a distance metric [30]. K-means clustering aims to partition n observations into k groups in which each observation belongs to the group with the nearest mean. The algorithm uses an iterative refinement technique. Given an initial set of k means $m_{1}, \ldots, m_{k}$, the algorithm proceeds by iterating the following two steps:

  • Assignment step: assign each observation to the cluster with the closest mean

    $$C_{i}^{(t)} = \{ x_{p} :||x_{p} - m_{i}^{(t)} || \le ||x_{p} - m_{j}^{(t)} ||,\ \forall\, 1 \le j \le k\}$$

    where each $x_{p}$ is assigned to exactly one $C_{i}^{(t)}$, even if it could be assigned to two or more of them. Here $x_{p}$ is an observation, $m_{i}^{(t)}$ is the mean of the ith cluster at time step t, and $C_{i}^{(t)}$ is the set of observations assigned to the ith cluster at time step t.

  • Update step: calculate the new means to be the centroid of the observations in the cluster as follows

    $$m_{i}^{(t + 1)} = \frac{1}{{|C_{i}^{(t)} |}}\sum\limits_{{x_{j} \in C_{i}^{(t)} }} {x_{j} }$$
    (11)
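The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the initialization from k randomly chosen points, the fixed seed, the iteration cap, and the toy two-dimensional points are all assumptions. The index of the nearest mean is what would serve as the discrete observation symbol for an HMM.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, means):
    """Index of the mean closest to point (ties go to the lowest index)."""
    return min(range(len(means)), key=lambda i: squared_distance(point, means[i]))


def kmeans(points, k, iterations=50, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # assumed initialization: k random points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the closest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, means)].append(p)
        # Update step: each mean becomes the centroid of its cluster;
        # an empty cluster keeps its previous mean.
        new_means = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else means[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_means == means:
            break
        means = new_means
    return means


# Toy feature vectors forming two well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
means = kmeans(points, k=2)
symbols = [nearest(p, means) for p in points]  # quantized observation symbols
print(symbols)
```

New feature vectors are quantized at recognition time by calling `nearest` against the stored means, which is much cheaper than evaluating Gaussian mixture densities.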

As mentioned in the previous section, discrete HMMs are trained on orientation, acceleration, and magnetic field data to analyze a user’s activity. The data are discretized into 20 discrete symbols by k-means clustering with k = 20. Our previous work [33] used continuous HMMs to recognize activities from acceleration, where the observations were assumed to follow a Gaussian distribution. Although the acceleration data follow a Gaussian distribution, the magnetic field and orientation data do not, so continuous HMMs for these sensors did not perform well in the experiments. The HMMs at the first layer recognize five short-term activities (stay, walk, run, vehicle, and subway), and they have five hidden states.

3.3 Second layer HMMs for long-term activity recognition

After the first-layer HMMs, we obtain the inferred short-term activity among stay, walk, run, vehicle, and subway. In addition, the light, proximity, and time features and the locations visited at that time constitute a feature vector, which is passed to the next layer of activity recognition as shown in Fig. 3. The models at this level are also discrete HMMs, with one HMM per long-term activity to classify. This layer of HMMs takes the sequence of feature vectors over about 5 min to handle concepts that have longer temporal patterns. The long-term activities recognized by the system are relaxing, moving, working, and eating.

Fig. 3 Second layer HMMs for long-term activity

The final goal of the system is to decompose, in real time, the temporal sequence obtained from the sensors into concepts at different levels of abstraction or temporal granularity. At each level, we use the forward–backward algorithm to compute the likelihood of a sequence given a particular model. The HMM with the highest likelihood is selected to perform inference with the LHMMs. In this approach, the most probable short-term activity is used as an input to the HMMs at the next level [28].

Let us suppose that we train K HMMs at level L of the hierarchy, $\lambda_{k}^{L}$, with k = 1, …, K. The log-likelihood of the observed sequence $X_{1:T}$ for model $\lambda_{k}^{L}$, $L(k)_{T}^{L}$, is defined as follows:

$$L(k)_{T}^{L} = \log \left( {P\left( {X_{1:T} |\lambda_{k}^{L} } \right)} \right) = \log \sum\limits_{i} {\alpha_{T} \left( {i;\lambda_{k}^{L} } \right)}$$
(12)

where $\alpha_{T}(i;\lambda_{k}^{L})$ is the alpha (forward) variable of the standard Baum–Welch algorithm at time T and state i for model $\lambda_{k}^{L}$. Equation (13) gives the recursion for $\alpha_{T}(i;\lambda_{k}^{L})$.

$$\begin{aligned} \alpha_{T + 1} \left( {j;\lambda_{k}^{L} } \right) = \left[ \sum\limits_{i = 1}^{N} \alpha_{T} \left( {i;\lambda_{k}^{L} } \right) a_{ij}^{\lambda_{k}^{L}} \right] b_{j} \left( {x_{T + 1};\lambda_{k}^{L} } \right) \hfill \\ a_{ij}^{\lambda_{k}^{L}}{:}{\text{ the transition probability from state}}\;i\;{\text{to state}}\;j\;{\text{in model}}\;\lambda_{k}^{L} \hfill \\ b_{j} \left( {x_{T + 1};\lambda_{k}^{L} } \right){:}{\text{ the probability for state}}\;j\;{\text{in model}}\;\lambda_{k}^{L}\;{\text{of observing}}\;x_{T + 1} \hfill \\ \end{aligned}$$
(13)

At that level, we classify the observations by declaring the class $\text{Class}(T)^{L}$ as follows.

$${\text{Class}}(T)^{L} = \arg \max\limits_{k} L(k)_{T}^{L} , \quad k = 1, \ldots , K$$
(14)
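The likelihood computation and the arg max decision can be sketched as follows. This is a minimal illustration under stated assumptions: each model is represented as a hypothetical `(pi, A, B)` tuple with `A[i][j]` the transition probability from state i to state j, the two toy models and their parameters are invented, and the per-step rescaling of the forward variable is a common numerical safeguard that the text does not mention.

```python
import math


def log_likelihood(model, obs):
    """log P(X | lambda) via the forward recursion, rescaling alpha at each
    step to avoid numerical underflow on long sequences."""
    pi, A, B = model
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(x_t)
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob


def classify(models, obs):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(models[name], obs))


# Two toy models over two symbols; all parameters are made up.
transitions = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "stay": ([0.5, 0.5], transitions, [[0.9, 0.1], [0.8, 0.2]]),
    "walk": ([0.5, 0.5], transitions, [[0.2, 0.8], [0.1, 0.9]]),
}
print(classify(models, [0, 0, 0, 1, 0]), classify(models, [1, 1, 0, 1, 1]))
```

In a layered setting, the label returned for each short window would become one element of the observation sequence for the next level.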

The window size varies with the granularity of each level. At the first level of the hierarchy, the samples of the time window are extracted from the raw sensor data. At the second level of the hierarchy, the inference outputs of the previous level are used as part of the samples. The sensors other than acceleration, orientation, and magnetic field are analyzed to extract three features, defined in Eqs. (15), (16), and (17).

$${\text{sum}}_{X} = \sum\limits_{i = 1}^{S} \sqrt{\left( x_{i} - x_{i - 1} \right)^{2}}$$
(15)
$${\text{mean}}_{X} = \frac{\sum\nolimits_{i = 1}^{S} \sqrt{\left( x_{i} - x_{i - 1} \right)^{2}}}{S}$$
(16)
$${\text{std}}_{X} = \sqrt{\frac{1}{S}\sum\limits_{i = 1}^{S} \left( \sqrt{\left( x_{i} - x_{i - 1} \right)^{2}} - {\text{mean}}_{X} \right)^{2}}$$
(17)

where $x_{i}$ is the value from a specific sensor at time step i, and S is the total number of samples in a window.
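A small sketch of these three window statistics follows. It rests on assumptions that should be read carefully: the features are taken to be the sum, mean, and population standard deviation of the absolute differences between successive samples, and the normalization uses the number of differences in the window (one fewer than the number of samples) rather than S. The function name and the light-sensor readings are hypothetical.

```python
import math


def difference_features(window):
    """Return (sum, mean, std) of absolute successive differences."""
    diffs = [abs(b - a) for a, b in zip(window, window[1:])]
    if not diffs:
        raise ValueError("need at least two samples")
    total = sum(diffs)
    mean = total / len(diffs)
    # Population standard deviation of the absolute differences (assumed form).
    std = math.sqrt(sum((d - mean) ** 2 for d in diffs) / len(diffs))
    return total, mean, std


# Hypothetical light-sensor readings within one window.
light = [10.0, 12.0, 11.0, 15.0]
total, mean, std = difference_features(light)
print(total, mean, std)
```

A constant signal yields zeros for all three features, so the features respond to how much a sensor value changes within the window rather than to its absolute level.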

3.4 Mobile interface for visualization

The recognized activities are displayed by mobile applications. We developed two applications to manage personal information such as visited places, short-term activities, and long-term activities. Figure 4 shows the application for short-term activity visualization. It has three interfaces: activity representation by text, activity representation by graph, and place representation on a map. The other application summarizes the long-term activities in a day, as shown in Fig. 5.

Fig. 4 Short-term activity visualization on a mobile phone

Fig. 5 Long-term activity visualization on a mobile phone

4 Experiments

4.1 Data collection

Mobile sensor data were collected for over a week from four graduate students aged 26–32 years. Figure 6 shows part of the logs, which illustrates the correlations between the temporal patterns and the short-term activities. In this paper, attending a lecture and studying were regarded as working, one of the long-term activities, since the users were students. Table 4 summarizes the details of the data collected from a mobile phone.

Fig. 6 Acceleration and magnetic field data for each short-term activity

Table 4 A summary of the data sets used

4.2 Evaluation of HMMs for short-term activity recognition

To compare the proposed k-means clustering-based HMM (KMC + HMM) with other classification methods, we conducted an experiment using the collected data set. In the experiment, we applied naïve Bayes (NB), multi-layer perceptron (MLP), J48 decision tree (J48), support vector machine (SVM), and Bayesian network (BN) classifiers as well as a single-layer HMM with quantization by k-means clustering. These models were taken from Weka (http://www.cs.waikato.ac.nz/~ml/weka). Because most of these classifiers cannot handle time-series data, features such as the mean, standard deviation, and summation within a window are used to recognize the short-term activities (stay, walk, run, vehicle, and subway). Figure 7 shows some of the short-term activities.

Fig. 7 Short-term activities to be recognized by KMC + HMM

We measured the precision and recall of the recognized short-term activities by comparing them with the actual labels provided by the users, and achieved about 80 % for the short-term activities, as shown in Table 5. The proposed first-level HMM shows performance comparable to the other classification methods, even though it is not the best on either criterion. However, the proposed model produces consistent results. Tables 6 and 7 show the performance for each short-term activity. Some subjects used the subway during data collection. Although location information is not used at this level, subway use is recognized well because of the magnetic field sensor.
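For reference, per-activity precision and recall can be computed from paired label lists as in the sketch below. The function name and the label sequences are invented for illustration and are not taken from the collected data.

```python
def precision_recall(actual, predicted, label):
    """Precision and recall of one class from paired label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for six windows.
actual = ["stay", "stay", "walk", "walk", "run", "walk"]
predicted = ["stay", "walk", "walk", "walk", "run", "stay"]
for activity in ("stay", "walk", "run"):
    print(activity, precision_recall(actual, predicted, activity))
```

Precision penalizes windows wrongly assigned to an activity, while recall penalizes windows of that activity that were missed, which is why the two are reported separately for each activity.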

Table 5 Performance comparison of classification methods for short-term activity
Table 6 Comparison of precision for short-term activity
Table 7 Comparison of recall for short-term activity

4.3 Evaluation of LHMMs for long-term activity recognition

We attempt to recognize four kinds of long-term activities (moving, eating, relaxing, and working) as shown in Fig. 8. It is assumed that the activities have different temporal patterns correlated with the sensed values. For instance, when a user is moving around a department store, he or she tends to alternate between walking and standing regularly. On the other hand, if a user works at the office, the acceleration pattern may be similar to ‘staying’ because the user sits in a seat, and the location is fixed to ‘office.’ The long-term activities can be recognized by considering location and time as well as the sequential pattern of short-term activities. In particular, location is important information for estimating a user’s activities. Figure 9 shows the locations for each long-term activity.

Fig. 8 Long-term activities to be recognized by LHMM

Fig. 9 Locations for each long-term activity

In Table 8, the precision and recall of the proposed layered HMM are compared with other classification methods such as BN, NB, MLP, SVM, and J48. The layered HMM performs better on average than the other methods. For precision, Relax is the only long-term activity on which the proposed method was not perfect, and for recall, the proposed method was perfect except for Eat. The table also shows that the naïve Bayes classifier performs worse than all the other methods. The difference between NB and BN implies that a hierarchical structure is suitable for recognizing long-term activities.

Table 8 Performance comparison of classification methods for long-term activity

Tables 9 and 10 show the performance of the classifiers for each long-term activity. The experiment was conducted with tenfold cross-validation. J48 and LHMM perform better than the other methods for all activities. However, it is sometimes difficult to distinguish the ‘relax’ and ‘eat’ activities because they have similar short-term activity patterns, locations, and times.

Table 9 Comparison of precision for long-term activity
Table 10 Comparison of recall for long-term activity

5 Concluding remarks

In this paper, we attempted to recognize short-term and long-term activities in real time using mobile sensors on the Android platform. The layered HMM structure is used to model temporal patterns in multi-dimensional data. An HMM is a Markov chain with both hidden and observable stochastic processes; for activity recognition, the observable components are the sensor signals, while the hidden element is the user’s activity. For short-term activity recognition, the proposed method shows performance comparable to the other classification methods. For long-term activity recognition, LHMM achieves the best accuracy among the compared classification methods.

Many problems remain to be solved for activity recognition on a mobile phone. In our experiment, the ‘relax’ and ‘eat’ activities have patterns that are difficult to classify, and we need to use more wearable sensors. It is also necessary to model variations among users for personalized services. Moreover, comparison with more diverse methods such as dynamic time warping (DTW), hierarchical HMM (HHMM), and hierarchical dynamic Bayesian network (HDBN) remains important future work.