Introduction

In the twenty-first century, geriatric care has attracted unprecedented attention across almost all research fields seeking to address the challenges faced by elderly people. The elderly population (aged 60 years or above) in India accounted for 7.4% of the total population in 2001 and 8.6% (104 million; 53 million females and 51 million males) in 2011, and has been projected to reach 19% by the year 2050 [1]. Moreover, owing to rapid changes in social trends, elderly people increasingly live independently. These facts call for assisted living solutions [2], in which the activities of residents are monitored remotely with the support of intelligent systems.

Today, the most prominent form of assisted living solution is the sensor-based adaptive monitoring system, which identifies mild cognitive impairments in geriatric residents [3]. Cognitive impairments are detected through analysis of the activities of daily living; activity monitoring, or activity detection, is therefore a prerequisite for this analysis. Sensors are placed at strategic locations within the apartment to capture data; the challenge is to recognize activities by suitably processing the sensory data.

Activities of Daily Living (ADL) [4] are classified into two types: Basic ADL (BADL), such as Sleeping and Walking, and Instrumental ADL (IADL). Two types of sensors, ambient and wearable, are used for detecting activities. Basic activities such as Cooking, Sleeping, Eating, and Bed to Toilet can be detected using ambient sensors, whereas wearable sensors are needed to detect activities such as Walking and Running. Hybrid sensing may offer a blended approach for monitoring a person across all possible activities [5]. In general, ambient sensors are non-obtrusive, whereas wearable sensors may cause some degree of annoyance to geriatric people.

Human activity detection mostly relies on supervised learning, applying classification techniques to a large body of training data. Existing works can be categorized by the classification approach used: generative approaches model the underlying distributions of the classes, whereas discriminative techniques focus on learning the class boundaries. Naive Bayes, Decision Trees, KNN, and Hidden Markov Models are examples of the former family [5,6,7,8], while Support Vector Machines (SVM) [9] and linear-chain CRFs [10] are commonly used discriminative approaches for activity recognition.

The existing recognition approaches may not suit some typical geriatric care problem domains. Supervised classification assumes that a person performs a scripted set of activities several times to prepare the training data set; it is often impractical for an aged person to repeat activities on demand. Moreover, wearable sensors may annoy older people, and residents commonly forget to wear them (as dementia patients often do). A solution that requires no training data set is therefore more realistic and feasible. To cope with these issues, we define the scope of the activity detection problem for a single-inhabitant apartment under the assumption that only ambient sensors are available for detecting Basic ADL, without training data. Most existing solutions in the activity discovery domain are offline: they assume that a data set of significant volume, collected through the sensors, is available beforehand for analysis. In contrast, this work offers an online discovery approach that identifies activities from the most recent data stream.

The paper is structured as follows. “Related Work” reviews related research in the activity discovery field. “Scope of the Problem” defines the scope of the problem. In “Proposed Solution”, the proposed method of activity discovery is discussed in detail. “Experiment and Result” describes the validation experiments and reports the findings. In “Comparative Study”, a comparative study is presented, and “Conclusions” concludes the discussion, highlighting the specific contributions of the work.

Related Work

The objective of activity discovery is to identify activities from sensor data without a training data set. The discovered activities are annotated and subsequently used for recognition. Activity discovery is not an untouched research area; the general practice is to find motifs [3, 11, 12]. A motif captures a pattern, i.e., a frequently occurring, repeated sequence within a data set, and plays a crucial role in detecting activities: similar patterns form a specific activity. Sequence mining is the key approach for finding such patterns; it searches event sequences in order of increasing length, and the number of sequence patterns can grow exponentially with the input data. Conventional sequence mining approaches are not appropriate for the activity discovery domain because they search for patterns without considering the order among the events participating in a pattern, where order refers to events triggering one after another. More recent work [12] also considers changing behaviour using probabilistic approaches: human behaviour is characterized by small day-to-day changes, and minor activity changes are incorporated, in contrast to traditional data mining approaches. The discovery module is created using offline training data. Enhancements of those solutions are used in the geriatric care domain [3, 11].

It is common to obtain a set of similar patterns, rather than a single pattern, for an activity; similar patterns with minor variations imply the same activity. Some works [11, 13] have addressed this issue. Analysis shows that pattern-based activity discovery fails to discover infrequent activities, even though these could constitute significant information for a specific domain such as geriatric care. The execution time is also comparatively long, as patterns are searched for repeatedly. The existing pattern-based approaches [3, 11, 12] are generally offline, since the complete data set must be present beforehand.

Another widely used approach is feature-based activity discovery, in which similar data points are grouped into clusters using algorithms such as K-Means, DBSCAN, and agglomerative clustering [14, 15]. The similarity among data points is measured using different features [16], and similar data points form a cluster. The feature set varies with sensor type: in general, mean, median, standard deviation, skewness, kurtosis, root mean square, etc. are used as features for wearable sensors, while location and time are the most important features for ambient sensors [16]. For wearable sensors, finding the similarity among sensor events is not challenging, as distinct values correspond to different activities. In an ambient sensor environment, however, especially with binary motion sensors, measuring the similarity among motion sensors becomes a challenging task. Location and time act as the major features in the existing works [17,18,19].

Domain knowledge is needed to measure the correlation among motion sensors. Most existing attempts [3, 11] do not consider the temporal characteristics of activities and, as a result, fail to detect the number of occurrences of an identified activity. Temporal clustering [15] is a process of grouping sensor events or data points that are similar in both space and time. Recent work [20] also considers latent knowledge alongside domain knowledge. Latent knowledge is learnt from explicit activity sequences in an unsupervised manner using an HMM or a Bayesian network; it is used to improve the performance of activity detection and is comparable with deep learning approaches.

Most existing activity discovery methodologies [3, 11] are restricted to the offline setting, where the entire data set must be available beforehand. In general, the existing discovery approaches collect the data (perhaps over a whole day), execute the analysis offline, find the typical patterns, and thus discover the activities. Online activity discovery is far more complex: activities must be discovered from the most recent stream of data. One solution has addressed this issue and discovered activities online [16]; however, it relies on wearable sensors (numerical data) for detection. The challenge is to offer an online discovery algorithm that detects the basic activities of a single resident using only ambient sensors.

Fig. 1 Activity discovery framework

Fig. 2 Floor plan as mentioned in ARUBA

Fig. 3 Network structure of the floor plan as given in ARUBA

Scope of the Problem

Consider the problem scenario. An aged person lives in an apartment and is capable of performing all daily activities. The person is monitored through signals generated spontaneously by ambient (motion) sensors placed at strategic locations within the house. No training data on the activities performed by the resident is available. Our concern is to detect basic activities, such as Sleeping, Bed to Toilet, Eating, and Meal Preparation, together with their respective frequencies. The frequency of an activity is the number of its occurred instances and may vary from day to day. We also assume that the activities are performed sequentially. The challenge is to discover the activities online, since preparing training data by having an old person execute the same activities several times is not acceptable. Here, the term “online” means that the proposed solution discovers activities from the most recent data stream. Moreover, the solution should be capable of discovering infrequent activities, as these could constitute critical information for geriatric care.

The ARUBA [21] baseline data, created by Washington State University, is considered one of the benchmark data sets in this domain. The scope includes implementing the algorithm on the ARUBA baseline data and comparing the output with the ground truth to measure the effectiveness of the proposed algorithm. The same experimentation needs to be performed on other benchmark data sets, such as TULUM and KYOTO [22], to ensure the effectiveness of the algorithm irrespective of the data set.

Proposed Solution

In general, the activity discovery framework is conceptualized as given in Fig. 1. The sensors record data as and when an activity occurs, generating a sequence of events as a data stream. An event signifies the triggering of a sensor, with status either on or off, at a specific time. The data stream is analyzed online for activity discovery; the concern of this paper is the design of the activity discovery block depicted in Fig. 1. The proposed solution discovers activities through the creation of clusters. The generated stream data, as captured by the sensors, is treated as a collection of events, and homogeneous data points (events) are mapped into a cluster, leading to the detection of an activity. Feature selection is a prerequisite of cluster creation: similar events are clustered based on the identified features. Each cluster formed through this process is treated as a probable activity, while clusters with few, scattered data points are treated as outliers. Human intervention is needed at this stage for annotation, after which the activities are considered “identified activities”. When an instance of an identified activity is determined, the corresponding features are extracted; whenever another instance of that activity occurs later, it can be identified using the derived feature set of the corresponding activity, so no further human intervention is required for each instance. A detailed discussion of each phase of the proposed discovery process follows in this section.

Data Set

We consider the floor plan given in ARUBA [21], depicted in Fig. 2. Table 1 shows a snapshot of the motion-sensor data given in ARUBA [21]. A typical row of data is treated as an event, described by the date, the time, and the value (On/Off) of a specific motion sensor at that instant. For example, the first row in the data set implies that sensor M018 is “on”; here the row is annotated with “Meal_Preparation Begin”. The benchmark data set with the necessary annotations is taken as ground truth. The data set assumes that all events are generated sequentially, and a specific collection of events creates an activity. For example, the first ten events in Table 1 constitute a typical Meal Preparation activity. In total, ARUBA [21] considers eleven activities within the data set; the activities of interest include Sleeping, Meal Preparation, Eating, Work, and Relax.

Table 1 Raw data as triggered in ARUBA data set
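To make the row layout concrete, the following minimal sketch parses one such event row into its fields. The example line and the exact column layout (date, time with optional fractional seconds, sensor id, On/Off value, optional activity annotation) are assumptions based on the description above, not taken verbatim from the data set.

```python
from datetime import datetime

def parse_event(line):
    """Parse one ARUBA-style row into (timestamp, sensor, value, annotation).

    Assumed layout: date, time, sensor id, ON/OFF value and, optionally,
    an activity label such as 'Meal_Preparation begin'.
    """
    parts = line.split()
    # drop fractional seconds before parsing the timestamp
    ts = datetime.strptime(parts[0] + " " + parts[1].split(".")[0],
                           "%Y-%m-%d %H:%M:%S")
    sensor, value = parts[2], parts[3]
    annotation = " ".join(parts[4:]) if len(parts) > 4 else None
    return ts, sensor, value, annotation

# hypothetical row in the assumed format
ts, sensor, value, ann = parse_event(
    "2010-11-04 09:26:29.02 M018 ON Meal_Preparation begin")
```

The optional annotation field is exactly what the ground-truth comparison later relies on; unannotated rows simply yield `None`.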

Feature Selection

Feature selection is the key to identifying homogeneous events; here, the events associated with a specific activity are considered homogeneous. In this work, the selected features are location and time. The absolute time (e.g., 08:11:43) is taken as the time feature. Each activity has special characteristics based on the location where it occurs and the time at which its sensors are triggered. The proposed solution assumes that an activity occurs if the sensors within close proximity are triggered and remain active for a significant duration. Moreover, the inclusion of the time feature helps to identify the duration of each identified instance of an activity.

We have prepared a network, shown in Fig. 3, from the floor plan given in Fig. 2 and use it to measure the location similarity between two sensors. The concept behind building the network is as follows. Each sensor has its own detection radius zone [23] and is treated as a vertex; if a person is within the zone, the corresponding sensor fires. Several motion sensors may change their state while a person moves from one location to another. For example, the Bed to Toilet activity is not just the triggering of M003 and M004: the intermediate sensors also fire in sequence. One possible route for Bed to Toilet may trigger the sensors in the sequence [M003, M005, M007, M004], so edges are formed between the pairs [M003, M005], [M005, M007], and [M007, M004]. Similarly, all possible routes are explored and the corresponding edges formed. The network is thus a graph (V, E), where V denotes the sensor nodes and the edge set E represents the probable connections between motion sensors. The graph, based on the floor plan, provides a correlation between the triggered sensors and their corresponding locations.
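The graph construction and the hop-count distances it yields can be sketched as follows. The edge list here is a hypothetical fragment covering only the Bed to Toilet route described above plus two assumed adjacencies; the full ARUBA network contains many more edges.

```python
from collections import deque

# Hypothetical fragment of the floor-plan network: each motion sensor is a
# vertex; an edge joins sensors whose detection zones are adjacent on some
# walking route (e.g. Bed to Toilet via M003-M005-M007-M004).
EDGES = [("M003", "M005"), ("M005", "M007"), ("M007", "M004"),
         ("M001", "M002"), ("M002", "M003")]

def build_adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def hop_distance(adj, src, dst):
    """Shortest hop count between two sensor vertices (breadth-first search)."""
    if src == dst:
        return 0
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return float("inf")  # no route between the two zones

adj = build_adjacency(EDGES)
# dist(M003, M004) along the Bed-to-Toilet route is 3 hops
```

Running `hop_distance` over all sensor pairs fills the distance matrix D used by the similarity metric of Eq. 1.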

Each activity triggers a specific set of sensors from a specific region; for example, the Sleeping activity triggers sensors M003 and M002. It is observed that each activity has dominating sensors: the sensors triggered most often while a specific activity is in progress, which are treated as identifiers for that activity. Activities are categorized into two types: steady activities and scattered activities. For steady activities, the set of dominating sensors is almost unchanged, whereas for scattered activities (those not confined to a particular location) no consistent dominating sensors are found. In ARUBA, for example, sensors M002 and M003 are the dominating sensors for the steady activity Sleeping. The proposed concept of dominating sensors is one of the key components of cluster formation.

Online Event-Based Activity Discovery (OEAD)

An online clustering approach is used here to create clusters based on the location and absolute-time features. A variant of online k-means is used, in which new clusters are spawned as needed, since the number of clusters is not known in advance. The process starts with a single cluster containing the first event received from the data stream. The next event is mapped into the same cluster depending on the similarity between the two events in terms of location and time. The location similarity is measured using the proposed similarity metric given in Eq. 1.

The similarity between the dominating sensors and a new incoming event is computed, and the event is placed accordingly. The second feature, time similarity, is calculated by comparing the mean time of the cluster (the mean time of the existing events in that cluster) with the time of the newly arrived event. The mean time of the cluster is the expected interval between two consecutive events and serves as a pre-set time for that cluster at a given moment. Each cluster remains active for a pre-set duration, which is determined dynamically for each activity based on the data pattern. As a result, the derived duration of a specific activity, such as Sleeping, may vary between data sets (e.g., ARUBA versus TULUM), and it also varies across different activities within a single data set. In the cluster formation approach, a cluster is considered an activity instance provided no incoming event arrives within the derived pre-set duration for that cluster. Subsequent annotation identifies the activity with a typical feature set, and further instances of the same activity are recognized using this knowledge. The duration of an activity is simply the difference between the arrival times of the first and last events in its cluster. A minimum-distance-based similarity measure is used for computing the location similarity. As the floor plan can be viewed as a network, the distance between two nodes, dist(i, j), is calculated in terms of hop count, and the distance matrix D collects the distances between every pair of nodes (i, j). The similarity between two sensor nodes is then as given in Eq. 1.

$$\begin{aligned} Sim(i,j)= {\left\{ \begin{array}{ll} 1 &{} \text {if } i=j\\ 1-\frac{dist(i,j)}{Max(D)} &{} \text {if } i\ne j\\ \end{array}\right. } \end{aligned}$$
(1)

The similarity decreases as the distance between two nodes increases. For the given ARUBA floor plan, for example, the distance between M001 and M002 is 1 and max(D) = 10, so the similarity between M001 and M002 is 1 − (1/10) = 0.9. This implies the two sensors are strongly connected and could be involved in a single activity.
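Eq. 1 and the worked M001/M002 example translate directly into code. The hop distance and max(D) values below are the ones quoted in the text.

```python
def location_similarity(dist, max_d):
    """Eq. (1): similarity of two sensors from their hop distance.

    dist  -- hop count between the two sensor vertices (0 if identical)
    max_d -- the largest entry of the distance matrix D
    """
    if dist == 0:          # i == j case of Eq. (1)
        return 1.0
    return 1.0 - dist / max_d

# Worked example from the text: dist(M001, M002) = 1, max(D) = 10
sim = location_similarity(1, 10)  # 0.9
```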

As the clusters grow dynamically with the generation of events, the cluster statistics (dominating sensors, mean time) change continually. Each dynamic cluster has a waiting period, say t: the time for a cluster to mature, after which it is output as a probable activity. The value of t is selected in the experiment phase. Clusters formed with few events are considered outliers. The stated discovery mechanism is depicted in Algorithm 1.

Algorithm 1

Annotation: Automatic Labeling of Instances of Pre-identified Activities

Each discovered instance is represented as a sequence of sensor events. Let \([E_{1}, E_{2}, E_{3},\ldots ,E_{n}]\) be the set of sensor events triggered for a particular activity instance. Several features are extracted to label (annotate) later occurrences of the same activity. A feature vector is formed from three features in general: the triggering sensors, the frequencies of the triggering sensors, and the duration of the activity instance. This works well when activities rarely share a common set of motion sensors. Sometimes, however, the same set of sensors serves more than one activity; in ARUBA [21], for example, Wash Dishes and Meal Preparation fire mostly identical sets of sensors. Identifying the sensors and their frequencies is therefore not sufficient to annotate such instances; instead, one should infer the triggering pattern of the sensors, i.e., an ordering based on frequency. We consider the relative frequency of the sensors in a pattern. Suppose five sensors fire in two different orders, \([S_{5}, S_{2}, S_{3}, S_{1}, S_{4}]\) and \([S_{4}, S_{3}, S_{2}, S_{1}, S_{5}]\), where the ordering is by frequency; these two sequences then represent two distinct activities. From the first sequence, we apprehend that \(S_{5}\) triggers most often, followed by \(S_{2}\), \(S_{3}\), \(S_{1}\), and \(S_{4}\). We concentrate not on the exact frequency values but on the relative nature of the frequencies. After extracting the feature set from a pre-discovered activity, the next step is to match it against the unannotated matured clusters; on a match, the unannotated matured cluster is labeled accordingly. The details of the algorithm are described in Algorithm 2.

Algorithm 2
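The relative-frequency signature and the matching step can be sketched as follows. The duration tolerance `dur_tol` and the exact-match rule on the ranking are illustrative assumptions; the paper's Algorithm 2 may use a different matching criterion.

```python
from collections import Counter

def signature(events):
    """Feature set of an instance (a time-ordered list of (timestamp, sensor)
    events): sensors ranked by triggering frequency, relative order only,
    plus the instance duration."""
    counts = Counter(s for _, s in events)
    ranked = tuple(s for s, _ in counts.most_common())
    return ranked, events[-1][0] - events[0][0]

def annotate(cluster, known, dur_tol=0.5):
    """Label a matured cluster when its sensor ranking matches a
    pre-identified activity and the durations are comparable."""
    ranked, dur = signature(cluster)
    for label, (known_ranked, known_dur) in known.items():
        if ranked == known_ranked and abs(dur - known_dur) <= dur_tol * known_dur:
            return label
    return None  # no match: the cluster still needs human annotation
```

Because only the ranking is compared, two instances of the same activity with different absolute sensor counts still match, which is exactly the "relative nature of the frequencies" argument above.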

Experiment and Result

The proposed discovery algorithm needs to be validated to measure its effectiveness. The same experiment is performed with the ARUBA [21], TULUM [22], and KYOTO [22] data sets. The details of the experiments, including the findings, are stated below.

Experiment on ARUBA

The ARUBA [21] data set consists of 9 months of data. The last 6 months have been considered for discovering the activities as well as dynamically annotating the pre-discovered activity instances. The transitions between activities are removed before the discovery process starts, and only the data generated by motion sensors is considered. The reported results use threshold values for “loc_sim” (location-wise similarity) and “wait_time” of 0.7 and 300 s, respectively, selected through experimentation. As an outcome of the discovery algorithm, 2199 clusters are found initially; these indicate all the instances of the identified activities occurring within the 6 months. The identified activities are Sleeping, Bed to Toilet, Meal Preparation, Eating, Wash Dishes, Relax, Work, Housekeeping, Enter Home, Leave Home, and Respirate.

The proposed algorithm effectively removes short outliers. We set a minimal waiting time for tracing quick switching between activities, and it works in most cases. However, with this small waiting time, some activities whose events are separated by long intervals, such as Sleeping, are split into several consecutive clusters; the result is treated as two different occurrences of the Sleeping activity when it is really one activity of long duration. To overcome this problem, the algorithm is enhanced with a refinement: a matured cluster is stored in a temporary buffer instead of being declared an activity, and the next matured cluster is compared with it. The two clusters are merged if their dominating sensors are location-wise highly correlated and no other sensor event was triggered between the two clusters (since a single resident is present).
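The buffer-and-merge refinement can be sketched as below. It assumes clusters arrive in stream order from a single resident, so the "no intervening sensor event" condition is implicit in consecutive positions; `sim` is the Eq. 1 location similarity and the 0.7 threshold mirrors loc_sim.

```python
from collections import Counter

def dominating(cluster):
    return Counter(s for _, s in cluster).most_common(1)[0][0]

def refine(matured, sim, loc_threshold=0.7):
    """Merge consecutive matured clusters whose dominating sensors are
    location-wise highly correlated: one long activity split by a gap."""
    merged = []
    for cluster in matured:
        if merged and sim(dominating(merged[-1]), dominating(cluster)) >= loc_threshold:
            merged[-1] = merged[-1] + cluster   # same activity, rejoin the halves
        else:
            merged.append(cluster)
    return merged
```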

The proposed OEAD algorithm discovers all the steady activities but, in most cases, fails to detect scattered activities such as Housekeeping. We are, however, able to find the clusters of infrequent activities, such as Bed to Toilet; an infrequent activity is one with low frequency (possibly a single occurrence). We can detect infrequent activities that have a satisfactory number of data points (events), since at least a minimum number of events is needed to identify the dominating sensor. Some instances of short-span activities are merged with long activities because of high location-wise similarity; for example, Wash Dishes is merged with Meal Preparation, as these two activities occurred consecutively in almost the same locations. The annotated 6 months of ARUBA [21] data have been taken as ground truth for validating the proposed algorithm, where ground truth means the annotations available in the benchmark data set. The proposed OEAD algorithm is executed on these 6 months of data, and the activity instances are detected as follows: Sleeping (145), Bed to Toilet (72), Meal Preparation (394), Eating (58), Wash Dishes (4), Relax (1037), Work (53), Housekeeping (2), Respirate (2), Leave Home (216), Enter Home (216). The corresponding detection ratio is given in Table 2. The precision, recall, and F1 measure are also calculated for the successful cases to measure the quality of the detected activities (clusters); they are given in Table 3.

Another major observation is that, relative to the ground truth, fewer instances of the steady activities are detected than of the other activities. The reason may be as follows. In the offered solution, because of the absence of training data, we do not have the exact durations of the concerned activities. The proposed OEAD algorithm assumes that an activity instance occurs if the sensors within close proximity are triggered and remain active for a significant duration; this duration is experimentally determined and fixed for each activity. In general, steady activities such as Sleeping and Meal Preparation continue for a longer span than dynamic activities such as Bed to Toilet or Housekeeping. Moreover, the time gap between two events in these activities may be large enough that the events are split across several clusters. In OEAD, a cluster is formed as a potential activity only when a minimum number of events (a threshold used in the algorithm) occurs, so some activity instances are treated as outliers for lack of a minimum number of events. Thus, the algorithm fails to detect a significant number of instances of these steady activities, which in turn reduces their counts compared to the ground-truth ARUBA data set.

Table 2 Detection ratio of OEAD on ARUBA
Table 3 Accuracy measure of OEAD on ARUBA

Enhancement of OEAD Algorithm

The experimental results show that the proposed OEAD fails to distinguish consecutive activities sharing the same set of dominant sensors. In the ARUBA [21] data set, for example, Meal Preparation and Wash Dishes are mostly performed in the same place and in a consecutive manner, and OEAD fails to discriminate such a scenario. Thus, in the enhancement of OEAD, the sequence of sensor triggering is used for improvement. This is needed when the same set of sensors is responsible for more than one activity and is triggered in quick succession. The sequence of sensors is simply their triggering pattern, and this pattern is identified via entropy. In ARUBA, Meal Preparation and Wash Dishes are the two activities that trigger almost the same set of sensors, and OEAD fails to discriminate them when they are performed in quick succession. An information gain-based solution [24] has therefore been used to discriminate these consecutive activities. The method detects change points in time series data, assuming that the number of activities to be segregated is known beforehand; it divides a given time series into a pre-specified number of non-overlapping segments, splitting the data points where the information gain is maximal. Information gain is measured through entropy, using the cost function \((\mathcal {L}_k)\) to compute the gain at a given point. Assuming the time series X is divided into k segments, the cost function is defined as:

$$\begin{aligned} \mathcal {L}_k=H(X)-\sum _{j=1}^{k} \frac{|s_j|}{|X|}H(s_j) \end{aligned}$$
(2)

where \(s_j\) is the jth segment, \(|\cdot |\) is the length operator, H(X) is the entropy of the whole time series taken as one segment, and \(H(s_j)\) is the entropy of segment \(s_j\), measured as:

$$\begin{aligned} H(s_j)=-\sum _{i=1}^{m} p_{ji} \log p_{ji} \end{aligned}$$
(3)

where m is the number of distinct sensor events and \(p_{ji}\) (\(p_{ji} \le 1, \sum _{i=1}^{m} p_{ji}=1\)) is the mass ratio of sensor event \(c_{i}\) in segment \(s_{j}\) (i.e., the frequency of \(c_{i}\) in segment \(s_j\) divided by the sum of the frequencies of all sensor events in \(s_j\)), defined as:

$$\begin{aligned} p_{ji}=\frac{\sum _{q \in s_j} q(c_i)}{\sum _{d=1}^m \sum _{q \in s_j} q(c_d)} \end{aligned}$$
(4)

where \(q(c_i)\) denotes the value of the qth point of the ith time series dimension within the segment \(s_j\).
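Eqs. 2–4 can be sketched for the common case k = 2, which is what the Meal Preparation / Wash Dishes scenario needs: scan every two-way split of the event sequence and keep the one maximizing the information gain. The sensor ids in the example are illustrative.

```python
from math import log2

def entropy(segment):
    """Shannon entropy (Eq. 3) of the sensor-event distribution in a segment,
    with mass ratios p as in Eq. 4."""
    n = len(segment)
    counts = {}
    for s in segment:
        counts[s] = counts.get(s, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def best_split(events):
    """Return (index, gain) of the two-way split maximizing Eq. (2):
    L = H(X) - sum_j (|s_j|/|X|) H(s_j)."""
    n = len(events)
    h_all = entropy(events)
    best_i, best_gain = None, float("-inf")
    for i in range(1, n):
        gain = (h_all
                - (i / n) * entropy(events[:i])
                - ((n - i) / n) * entropy(events[i:]))
        if gain > best_gain:
            best_i, best_gain = i, gain
    return best_i, best_gain

# e.g. a run of kitchen-counter events followed by a run of sink events
split, gain = best_split(["M018"] * 6 + ["M014"] * 4)  # split at index 6
```

When both segments are pure, each segment entropy is zero and the gain equals H(X), so the correct boundary scores highest.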

Table 4 Raw data for meal preparation and wash dishes as triggered in ARUBA data set

Let us illustrate the scenario with an example. Consider the snippet of the ARUBA [21] data set given in Table 4, which contains two consecutive activities with an overlapping set of dominant sensors: Meal Preparation and Wash Dishes. The proposed OEAD algorithm fails to distinguish these two activities and treats them as one. As shown in Table 4, M019, M018, M017, M015, and M014 are the top five dominating sensors of both activities. For this specific case, the maximum information gain is found at the change point 04-11-2010 09:56:25 M014 OFF; the change point divides the time series into two halves, the first indicating Meal Preparation and the second Wash Dishes.

An experiment has been conducted on the ARUBA [21] data set to show the effectiveness of the proposed information gain-based segmentation technique, with the results shown in Table 4. As an example from Table 4, the first row shows that the technique splits the data points at time 09:26:29, whereas in the original data file the splitting time is 09:28:19. To characterize the performance of the segmentation technique, we therefore introduce an error measure for this shifted time, calculated against the ground truth as the total squared difference between the estimated mean time of the predicted sequence and the mean time of the actual sequence. The error is defined in Eq. 5 (Fig. 4).

Fig. 4 Information gain based activity segmentation

$$\begin{aligned} Error=\sum _{i=1}^{n}(y_{i} - x_{i})^{2} \end{aligned}$$
(5)

where n is the number of activities, \(y_{i}\) is the mean time of the predicted sequence, and \(x_{i}\) is the mean time of the ground truth sequence.

Experiment on TULUM

The proposed algorithm is also executed on the TULUM [22] data set to verify the performance of the proposed method across various benchmark data sets. The TULUM data set comprises sensor events collected in the WSU smart apartment testbed from April to July of 2009, during which a married couple resided in the apartment and performed their normal daily activities. There are 18 motion sensors (M001 through M018) and two temperature sensors (T001 and T002); we consider only the motion sensor data in our experiment and focus on identifying activities irrespective of the person performing them.

We have taken 3 months of data to validate the proposed OEAD algorithm. The detection ratio of the activities is given in Table 5. Here also, the accuracy is measured in terms of precision, recall, and F1 measure; the accuracy of the successful occurrences is given in Table 6. Activities of short duration are not detected, as we have failed to identify the dominating sensor for such short activities. Moreover, the accuracy measure of some activities, such as R1 Breakfast, is poor because the transition data is included in the activity data within the data set.

Table 5 Detection ratio of OEAD on TULUM
Table 6 Accuracy measure of OEAD on TULUM

Experiment on KYOTO

The proposed algorithm is further executed on the KYOTO data set [22] for validation purposes. This data set represents sensor events collected in the WSU smart apartment testbed during the summer of 2009. The apartment housed two residents, R1 and R2, who performed their normal daily activities. The sensors primarily consist of several types of motion sensors, item sensors, and temperature sensors. We include only motion sensors in our experiment, as we intend to detect activities using motion sensors alone. Thirty-nine days of data have been taken to discover the activities. The number of activity instances and their accuracy measures are given in Tables 7 and 8, respectively. The accuracy is poor compared to the ARUBA data set, as only motion sensors are considered; in the case of the KYOTO [22] data set, motion sensors and item sensors contribute equally to the performance of an activity.

Table 7 Detection ratio in KYOTO
Table 8 Accuracy measure of OEAD using KYOTO

Experiment on atmo1 Data Set

The experiment is also performed on a relatively newer data set (2015). This data set [25] contains sensor data described by the attributes date, time, sensor identifier, sensor state, and an optional activity label. The proposed OEAD algorithm is executed on this data set and the relevant findings are stated in Table 9 in terms of the detection ratio metric.

Table 9 Detection ratio of OEAD on atmo1

Validation of Proposed Approach

We have used another approach to validate the quality of the outcome of the proposed discovery method. Activity recognition identifies activities based on training data. Thus, to validate the quality of the discovery method OEAD, we use the result of OEAD as training data for some standard recognition approaches. If the accuracy of the proposed discovery algorithm is good enough, the recognition output should be more accurate, which in turn establishes the efficiency of the discovery algorithm. Here, the standard approaches Random Forest and SVM are used for recognition. The output of the recognition algorithm is compared with the ground truth and is depicted in Fig. 5. The output shows that the recognition is satisfactory, which also confirms the quality of the proposed algorithm.
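The validation loop above can be sketched as follows. The paper uses Random Forest and SVM; to keep the example self-contained, the sketch substitutes a simple nearest-centroid classifier and represents each discovered segment as a normalized sensor-count vector. Function names and the feature encoding are assumptions for illustration.

```python
from collections import Counter

def featurize(segment, sensors):
    """Turn a segment (list of sensor IDs) into a normalized count vector."""
    counts = Counter(segment)
    total = len(segment)
    return [counts.get(s, 0) / total for s in sensors]

def nearest_centroid(train, test_vec):
    """train: {activity label: list of feature vectors}. Predict the label
    whose centroid is closest (squared Euclidean) to test_vec."""
    best_label, best_dist = None, float("inf")
    for label, vecs in train.items():
        dim = len(vecs[0])
        centroid = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
        dist = sum((c - x) ** 2 for c, x in zip(centroid, test_vec))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

Segments labeled by OEAD populate `train`; predictions on held-out segments are then compared with the ground truth, exactly as the Random Forest and SVM outputs are compared in Fig. 5.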

Fig. 5
figure 5

Validation through activity recognition

Comparative Study

The existing discovery approaches are classified into two groups: pattern-based and feature-based. Thus, for the comparative study, we have selected one solution from each group. [11] is an offline pattern-based approach, where the entire data set needs to be present beforehand. In the pattern-based methodology, the goal is to find recurring patterns in the data set that in turn reflect the activities. The patterns obtained after executing the algorithm on one day of ARUBA [21] data are given in Table 10. We have executed the discovery method described in [11] and OEAD on all three data sets, and the results are depicted in Fig. 6. The figure shows that the proposed solution performs well in the case of ARUBA [21], while the performance is comparable for the TULUM [22] and KYOTO [22] data sets.

Fig. 6
figure 6

Comparative analysis on the number of detected activities

Table 10 Identified patterns in ARUBA

However, there are significant observations, which are listed below.

  • The proposed method identifies the infrequent activity “Bed to Toilet”, whereas the existing work [11] fails to detect infrequent activities. This is due to the low occurrence of such activities: for example, since “Bed to Toilet” occurs only once in the data set, it remains undetected by [11].

  • In contrast, we have failed to identify scattered activities, such as “Housekeeping”, whereas [11] can identify such scattered activities.

  • [11] fails to discriminate between two activities having almost similar patterns and considers them as a single activity. In the case of Meal Preparation and Wash Dishes, there is only a slight variation in sensor usage; thus, after applying edit distance, the two activities are merged into one. For example, the patterns [‘M016’, ‘M016’, ‘M019’, ‘M019’] and [‘M019’, ‘M015’, ‘M018’, ‘M019’] are quite similar and are treated as a single activity. The obtained patterns are shown in Table 10.
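The pattern merging in [11] rests on edit distance over sensor-ID sequences. A standard Levenshtein distance suffices to reproduce the effect (a minimal sketch; the merge threshold itself is specific to [11] and not reproduced here):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sensor-ID sequences, computed
    with the standard dynamic-programming recurrence."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]
```

For the two length-4 patterns quoted above, the distance is 3, so any merge criterion based on a loose distance threshold collapses the two activities into one, which is exactly the failure mode observed.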

In contrast to the offline approach, [18] deals with online activity-based segmentation with the help of training data. Each segment generated through the execution of the approach in [18] is considered an activity instance. The method operates on stream data. We have compared it with the proposed discovery method OEAD through the necessary experiments. The comparative results for the ARUBA [21], KYOTO [22], and TULUM [22] data sets are given in Fig. 7.

Fig. 7
figure 7

Comparison of detection ratio

Conclusions

The overall attempt is to propose a solution for identifying activities through the analysis of sensor stream data, especially for geriatric care applications. Keeping in mind the typical requirements of geriatric care, the contributions of the work are given below.

  • The proposed solution is able to identify activities without any training data set. Preparing training data through the repetitive execution of an activity by an aged person may not always be feasible.

  • The proposed solution offers a method considering only ambient sensor data. Incorporating wearable sensor data would make the task easier, but the typical scenario of geriatric care restricts us from using it: domain knowledge suggests that an aged person may be reluctant to wear a specific device, so the method should work without these data.

  • The proposed solution offers an online approach. A significantly large data set on which analysis is performed to discover the activities is not needed; the solution considers stream data and discovers the activities with the same accuracy.

  • Another major contribution is an exhaustive experiment. We have executed the proposed solution on three benchmark data sets to form an overall idea of the performance of OEAD irrespective of any specific data set. Moreover, a comparison with notable accepted approaches, both offline and online, is also performed, and the experimental results establish the effectiveness of the proposed OEAD algorithm.

A scattered activity is an activity that may be related to more than one location. As our target domain is geriatric care, the occurrence of scattered activities is normal. Another issue that emerged from the experimentation is that the transitions between activities need to be identified more accurately; otherwise, the transitions are merged with the activities, which lowers the accuracy of activity discovery. A limitation of the OEAD algorithm is that it cannot be applied in real time, as it is not possible to know whether an activity cluster has been merged or not; later, in the offline annotation phase, merged clusters are identified through manual intervention. The above-mentioned issues remain unaddressed and can certainly be considered as future work.