1 Introduction

Over the last few years, facilitated by the development of smart spaces, researchers and manufacturers have shown interest in analysing human behaviour via data collected by Internet of Things (IoT) devices. This information is then used to get insights about the behaviour of the user (e.g., sleep tracking), or to perform automated actions for the user (e.g., automatically opening the blinds).

While both process mining (PM) and smart spaces have evolved quickly as separate fields of study in recent years, researchers have recently explored combining the two disciplines and obtained interesting results. Applying PM techniques to smart spaces data enables modelling and visualising human habits as processes [19]. However, even though process models can be extracted from smart spaces data, multiple problems arise when techniques designed for business processes (BPs) are applied to human habits [19].

This paper studies how current approaches deal with three well-known challenges in applying PM to smart spaces data and human behaviour [19]: the modelling formalism for representing human behaviour, the abstraction gap between sensor logs and event logs, and the segmentation of logs into traces. The contribution of this article to the research community is therefore threefold: (1) providing an overview and comparison of PM techniques applied to smart spaces, (2) analysing how these techniques currently deal with the three challenges identified, and (3) providing an outline for future work.

The remainder of this paper is structured as follows: Sect. 2 introduces some background concepts and commonly used terminology in the fields of smart spaces and PM. Section 3 describes the related work. The methodology followed to perform the survey is defined in Sect. 4. Results are reported in Sect. 5. Section 6 discusses the results and provides an outline for future work. Lastly, Sect. 7 concludes the paper with an overview of the key findings.

2 Background

2.1 Smart Spaces

Smart spaces are cyber-physical environments where an information system takes raw sensor measurements as input, analyses them to obtain a higher-level understanding of what is happening in the environment (i.e., the current context), and eventually uses this information to trigger automated actions through a set of actuators, following the preferences of the final user. At runtime, a smart space produces a sequence of sensor measurements called a sensor log, in the form shown in Table 1.

The following terminology is usually employed [21]:

  • Activities, i.e., groups of human atomic interactions with the environment (actions) that are performed with a final goal (e.g., cleaning the house).

  • Habits, routines, or behaviour patterns, i.e., an activity, or a group of actions or activities that happen in specific contextual conditions (e.g., what the user usually does in the morning between 08:00 and 10:00).

Human Activity Recognition (HAR) is a common task in smart spaces that aims at recognizing various human activities (e.g., walking, sleeping, watching tv) using machine learning techniques based on data gathered from IoT environments [16]. [24] argues that HAR is part of a bigger picture with the ultimate aim to provide assistance, assessment, prediction and intervention related to the identified activities.

2.2 Process Mining in Smart Spaces

The main goal of applying PM in a smart space is to automatically discover models of the behaviour of the user(s) of the smart space based on a log of the sensors present in the environment. Models can represent activities (or habits) that users perform in the smart space, e.g., eating, working, sleeping. It is important to highlight the following differences between PM and smart spaces:

  • Whereas smart spaces techniques usually take sensor logs as input, process mining techniques use event logs. Events in event logs are executions of business activities, while sensor logs contain fine-grained sensor measurements.

  • The term business process in PM may correspond to the terms activity, habit, routine, or behaviour pattern in the smart space community.

  • While event logs are typically split into traces (process executions), sensor logs are not segmented and may contain information related to different activities or habits.

Smart spaces usually produce and analyse data in the form of sensor logs. According to [27], in order to apply techniques from the PM area, the sensor log must be converted into an event log. The entries of an event log must contain at least three elements: (i) the case ID, which identifies a specific process instance, (ii) the label of the activity performed, and (iii) the timestamp. The conversion from a sensor log to an event log usually consists of two steps: (i) bridging the granularity gap between sensor measurements and events, and (ii) segmenting the event log into traces, i.e., assigning a case ID to each event.
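As a minimal sketch of the three required elements, the following Python fragment represents an event log as a list of records and groups it into traces by case ID. The `Event` type, the `group_into_traces` helper and the sample data are hypothetical and only serve as an illustration; they are not taken from any of the surveyed studies.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    case_id: str         # identifies the process instance (trace)
    activity: str        # label of the activity performed
    timestamp: datetime  # when the event occurred


def group_into_traces(events):
    """Group events by case ID and order each trace by timestamp."""
    by_case = defaultdict(list)
    for event in events:
        by_case[event.case_id].append(event)
    return {
        case_id: [e.activity for e in sorted(evts, key=lambda e: e.timestamp)]
        for case_id, evts in by_case.items()
    }


# Hypothetical event log in which each day is treated as one case.
event_log = [
    Event("day-1", "eating", datetime(2022, 4, 5, 8, 30)),
    Event("day-1", "sleeping", datetime(2022, 4, 5, 23, 0)),
    Event("day-1", "working", datetime(2022, 4, 5, 9, 15)),
    Event("day-2", "eating", datetime(2022, 4, 6, 8, 45)),
]

traces = group_into_traces(event_log)
```

A sensor log lacks the first two fields: it carries timestamps and sensor readings, but neither activity labels nor case IDs, which is why the two conversion steps are needed.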

Table 1. Example of a sensor log used in smart spaces

3 Related Work

This section provides a short summary of the surveys and reviews that have previously been performed on the application of PM on human behaviour discovery.

 [21] surveyed the modelling and mining techniques used to model human behaviour. They studied the model lifecycle of each approach and identified important challenges that typically came up when performing HAR. However, they reviewed all sorts of techniques used in HAR, not focusing on PM techniques.

 [24] performed a literature review and created a taxonomy on the application of HAR and process discovery techniques in industrial environments. While focusing on PM for HAR, this study is restricted to one application domain.

 [13] analysed how classic PM tasks (i.e., process discovery, conformance checking, enhancement) have taken advantage of artificial intelligence (AI) capabilities. The survey specifically focused on two different strategies: (1) using explicit domain knowledge and (2) the exploitation of auxiliary AI tasks. While [13] briefly covers the application of PM to smart spaces, this section is rather short as their focus lies on PM in general.

No recent survey has identified which existing PM approaches were applied to smart spaces and how these approaches deal with the challenges identified in [19].

4 Methodology

To perform the survey, a systematic literature review protocol was followed to maximise the reproducibility, reliability and transparency of the results [17]. The protocol consists of six phases: (1) specify research questions, (2) define search criteria, (3) identify studies, (4) screening, (5) data extraction and (6) results. Figure 1 shows the number of studies reviewed and excluded in each phase and the reasoning behind the exclusion.

Fig. 1. Search methodology: included and excluded papers.

4.1 Research Questions

In this article, we will study the following research questions (RQs), focusing on the challenges identified in [19]:

  • RQ-1: how do primary studies represent human behaviour? One of the challenges when applying PM to smart spaces data is to choose an appropriate formalism that can model human behaviour.

  • RQ-2: how do PM techniques address the gap between sensor events and process events? The low-level sensor logs from smart spaces have to be translated to higher-level event logs [32, 35].

  • RQ-3: how do PM techniques tackle logs that are not split into traces? PM requires the log to be segmented into traces, which is typically not the case for sensor logs.

4.2 Search Criteria and Studies Identification

Since this paper is about using PM to model human behaviour from smart spaces data, three groups were identified: group 1 represents PM, group 2 represents human behaviour modelling and group 3 represents the smart space environment. Frequently used synonyms were added to ensure full coverage of the relevant literature on each topic, yielding the following search query:

(“process mining” OR “process discovery”) AND (“behaviour pattern” OR “behavior pattern” OR “habit” OR “routine” OR “activity of daily living” OR “activities of daily living” OR “daily life activities” OR “daily-life activities” OR “daily behaviour” OR “daily behavior”) AND (“smart space” OR “smart home” OR “smart environment” OR “smart building”)

The base set of papers was identified by searching the title, abstract and keywords using the Scopus and Limo online search engines, providing access to articles published by Springer, IEEE, Elsevier, Sage, ACM, MDPI, CEUR-WS, and IOS Press. The final set of articles was retrieved on 05/04/2022.

4.3 Screening

The papers identified by the search string must pass a quality and relevance assessment in order to be included in the survey. The assessment consists of exclusion and inclusion criteria.

The exclusion criteria EQ-x are defined as follows:

  • EQ-1: the study is not written in English.

  • EQ-2: the item is not fully accessible through the university’s online libraries.

  • EQ-3: the paper is a duplicate of an item already included in the review.

  • EQ-4: the study is a survey or literature review primarily summarising previous work where no new contribution related to the research topic is provided.

The inclusion criterion IQ-x is defined as follows:

  • IQ-1: the study is about discovering and modelling human behaviour by applying PM techniques to smart spaces data, and answers at least one research question.

Table 2. Overview of included primary studies

The first set of primary studies was formed by all articles that remain after the inclusion and exclusion criteria screening. Once these studies were selected, forward and backward snowballing was performed. Articles identified through snowballing were screened using the same criteria.

4.4 Data Extraction

First, generic information was extracted such as title, authors, year of publication, and the environment in which the included study is situated. Afterwards, the research questions were answered based on the content of each article.

5 Results

Table 2 gives an overview of the studies included in the survey, and provides general information about each study. Figure 2a shows the publication trend over the years.

5.1 Modelling Formalisms

An overview of the modelling formalisms used by the papers surveyed is shown in Fig. 2b (note that some papers used more than one modelling language). Petri Nets are by far the most used formalism, which is consistent with the fact that they are a very popular process modelling formalism and the output of several state-of-the-art discovery algorithms.

Petri Nets are followed by weighted directed graphs, mostly as the output of the fuzzy miner algorithm [14], which allows flexible models to be mined.
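To illustrate what a weighted directed graph over activities looks like, the sketch below counts how often one activity directly follows another and keeps only the frequent edges. This is not the fuzzy miner of [14], which relies on more elaborate significance and correlation metrics; it is a deliberately simplified frequency-based stand-in, and the function names, threshold and sample traces are assumptions made for the example.

```python
from collections import Counter


def directly_follows_graph(traces):
    """Return edge weights: how often activity a is immediately followed by b."""
    edges = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += 1
    return edges


def filter_edges(edges, min_weight):
    """Keep only edges observed at least min_weight times (assumed threshold)."""
    return {edge: w for edge, w in edges.items() if w >= min_weight}


# Hypothetical traces, one per day.
traces = [
    ["sleeping", "eating", "working"],
    ["sleeping", "eating", "watching tv"],
    ["sleeping", "working"],
]

graph = directly_follows_graph(traces)
frequent = filter_edges(graph, min_weight=2)
```

Dropping infrequent edges yields a simpler and more readable model at the cost of precision, which is the trade-off usually associated with this family of formalisms.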

A third noteworthy modelling language is timed parallel automata, a formalism introduced in [12] that is designed to be particularly expressive. Other formalisms are less widespread, each being used by at most two studies. In addition, only S20 uses a modelling formalism that incorporates the process execution context. Also note that S9 only derived an event log from the sensor log and did not mine a model, hence no formalism is used.

Fig. 2. Statistics about the studies.

5.2 Abstraction Gap Between Sensor Events and Process Events

This section gives an overview of the techniques that the primary studies use to convert sensor events into process events. Among them, S14, S15, S20 and S21 do not require any conversion step because they already work with event logs instead of sensor logs. In particular, S20 and S21 make use of synthetic event logs produced by a simulator. All the other studies have validated their approaches with real-life datasets, as shown in Table 2. Six studies (S1, S2, S9, S11, S15, S19) performed the validation step on datasets they generated themselves, while all the others applied their methodologies to state-of-the-art datasets, namely [6, 25, 30, 31, 38].

Two general approaches to grouping sensor measurements into higher-level events can be identified from the literature: (i) classical window-based segmentation, either time-based or event-based, and (ii) more complex time-series analysis.
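The two window-based variants can be sketched as follows, assuming a sensor log given as (timestamp, sensor ID) pairs. The function names, the window parameters and the sample readings are hypothetical; the surveyed studies may define their windows differently.

```python
from datetime import datetime, timedelta


def time_windows(measurements, width):
    """Time-based windows: group measurements into consecutive slots of fixed
    duration, counted from the first measurement. Empty slots are skipped."""
    if not measurements:
        return []
    ordered = sorted(measurements)
    start = ordered[0][0]
    slots = {}
    for ts, sensor in ordered:
        index = (ts - start) // width  # timedelta // timedelta gives an int
        slots.setdefault(index, []).append((ts, sensor))
    return [slots[i] for i in sorted(slots)]


def event_windows(measurements, size):
    """Event-based windows: group a fixed number of consecutive measurements."""
    ordered = sorted(measurements)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Hypothetical motion-sensor readings.
t0 = datetime(2022, 4, 5, 8, 0, 0)
readings = [
    (t0, "M001"),
    (t0 + timedelta(seconds=20), "M001"),
    (t0 + timedelta(seconds=70), "M002"),
    (t0 + timedelta(seconds=75), "M002"),
    (t0 + timedelta(seconds=200), "M003"),
]

by_time = time_windows(readings, timedelta(seconds=60))
by_count = event_windows(readings, 3)
```

Time-based windows follow the clock regardless of how many sensors fire, whereas event-based windows adapt to the firing rate but may span very different durations.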

In order to translate raw sensor measurements into proper event labels, the most common method is to derive information from the sensor’s location, as in S1, S5, S11, S12, S15, and S19. For example, if the triggered sensor is above the bed, then the activity “sleeping” is derived. However, this method has its drawbacks, as acknowledged in S4: the information provided by motion sensors is not always detailed enough to derive activities accurately. These ambiguities could be addressed by introducing other types of sensors in the environment (e.g., cameras), but this would make the approach more intrusive.
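A location-based labelling step of this kind can be sketched as a pair of lookup tables. The sensor IDs, locations and activity names below are invented for the example, and merging consecutive identical labels is an assumption of this sketch rather than a step reported by the cited studies.

```python
# Hypothetical deployment: which sensor sits where, and which activity
# each location is assumed to imply.
SENSOR_LOCATION = {"M001": "bed", "M002": "kitchen_table", "M003": "sofa"}
LOCATION_ACTIVITY = {
    "bed": "sleeping",
    "kitchen_table": "eating",
    "sofa": "watching tv",
}


def label_measurements(measurements):
    """Turn (timestamp, sensor_id) pairs into (timestamp, activity) events.

    Measurements from unknown sensors are ignored, and consecutive
    measurements that map to the same activity are merged into one event.
    """
    events = []
    last_label = None
    for timestamp, sensor_id in measurements:
        location = SENSOR_LOCATION.get(sensor_id)
        label = LOCATION_ACTIVITY.get(location)
        if label is None:
            continue
        if label != last_label:
            events.append((timestamp, label))
            last_label = label
    return events


readings = [
    ("2022-04-05T08:00:00", "M002"),
    ("2022-04-05T08:01:00", "M002"),
    ("2022-04-05T08:30:00", "M003"),
    ("2022-04-05T08:40:00", "M999"),  # sensor missing from the mapping
    ("2022-04-05T23:00:00", "M001"),
]

events = label_measurements(readings)
```

The sketch also makes the drawback visible: a motion sensor near the sofa cannot tell watching television apart from reading or napping, so the quality of the labels depends entirely on the mapping.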

In S13, the authors perform the conversion task by adapting an existing algorithm to automatically segment and assign labels to human actions (i.e., MOVEMENT, AREA or STAY), combined with their relative location inside the smart environment (e.g., STAY Kitchen_table).

Using a labelled dataset facilitates this conversion task. Studies S8, S10 and S16 have used such labelling to manually deduce event names. However, this approach can be very time-consuming and error-prone, and labels often correspond to activities at a higher level of abstraction than atomic events.

5.3 Log Segmentation into Traces

PM techniques typically need a log to be segmented into traces with a case ID [27], a requirement that is often not met by sensor logs (only the sensor log in S10 meets this requirement). To account for this, most of the included studies use a form of segmentation to obtain an event log made of distinct cases, as shown in Table 2, where T is time-based and A is activity-based. We assume that all studies, even those that do not state it explicitly, at least segment the sensor log into one trace per day to meet the requirement posed by PM techniques.

There are two types of segmentations applied in the studies: manual vs automatic. The following studies perform a manual activity-based segmentation:

  • S7 performs activity-based segmentation to create one trace per day. Their approach uses the ‘sleeping’ activity to determine where two consecutive days should be split.

  • S12 uses activity-based segmentation to segment a day into activities. Based on the annotations added by the user, artificial trace start and end events are added to the sensor log (e.g., when a user indicates that he or she is starting the ‘cooking’ activity, a start event is added to the sensor log).
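As an illustration of activity-based segmentation into days, the sketch below starts a new trace whenever a run of a boundary activity ends. Placing the cut at the moment the user stops sleeping is an assumption of this example; the text above does not specify how S7 positions the split, and the function name and sample sequence are hypothetical.

```python
def split_days_on_sleep(activities, boundary="sleeping"):
    """Split a sequence of activity labels into one trace per day.

    A new trace starts at the first activity that follows a run of the
    boundary activity (assumed here to mark the end of a day).
    """
    traces, current = [], []
    previous = None
    for activity in activities:
        if previous == boundary and activity != boundary and current:
            traces.append(current)
            current = []
        current.append(activity)
        previous = activity
    if current:
        traces.append(current)
    return traces


# Hypothetical labelled sequence spanning three days.
sequence = [
    "eating", "working", "sleeping", "sleeping",
    "eating", "watching tv", "sleeping",
    "working",
]

day_traces = split_days_on_sleep(sequence)
```

Unlike a fixed cut-off time, this kind of split follows the actual rhythm of the user, but it presupposes that the boundary activity has already been labelled reliably.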

Alternatively, some approaches try to automatically segment the log. This solution appears more feasible in real scenarios than manual labelling, which is time-consuming and error-prone. In the analysed works, automatic segmentation is performed according to the time dimension following different strategies:

  • Splitting the log into days using midnight as the cut-off point, as in S4 and S5.

  • Segmenting each day into activities or visits by measuring the gap between two consecutive sensor events. When the gap is larger than a predefined threshold, the log is split into two traces, as in S11 and S21.
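Both time-based strategies can be sketched in a few lines, assuming events given as (timestamp, activity) pairs. The function names, the 30-minute threshold and the sample events are assumptions made for the example and do not reproduce the settings of the cited studies.

```python
from datetime import datetime, timedelta
from itertools import groupby


def split_at_midnight(events):
    """One trace per calendar day, using midnight as the cut-off point."""
    ordered = sorted(events)
    return [list(group) for _, group in groupby(ordered, key=lambda e: e[0].date())]


def split_on_gap(events, max_gap):
    """Start a new trace whenever two consecutive events are more than
    max_gap apart."""
    traces = []
    for event in sorted(events):
        if traces and event[0] - traces[-1][-1][0] <= max_gap:
            traces[-1].append(event)
        else:
            traces.append([event])
    return traces


# Hypothetical events around a night of sleep.
events = [
    (datetime(2022, 4, 5, 22, 30), "watching tv"),
    (datetime(2022, 4, 5, 23, 50), "sleeping"),
    (datetime(2022, 4, 6, 0, 10), "sleeping"),
    (datetime(2022, 4, 6, 8, 0), "eating"),
    (datetime(2022, 4, 6, 8, 20), "working"),
]

daily = split_at_midnight(events)
by_gap = split_on_gap(events, timedelta(minutes=30))
```

In this example the midnight cut-off separates the two ‘sleeping’ events into different traces, while the gap-based split keeps them together; on the other hand, the gap-based result depends heavily on the chosen threshold.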

In addition, if the sensor log contains different human routines, a clustering step is usually implemented, as in S21.

6 Discussion

This section discusses the investigated challenges and identifies future lines of research.

6.1 Modelling Formalisms

As discussed in Sect. 5.1, papers applying PM to smart spaces data must explicitly or implicitly choose a formalism to represent human processes.

Interestingly, while it is suggested in [19] that human routines are volatile and unpredictable, the most used formalism in the reviewed studies is Petri Nets, an imperative modelling language. This may simply be because Petri Nets are one of the most widely used languages in PM and allow, among other things, process checking, simulation and enactment.

A certain number of studies opted for more flexible formalisms, e.g., weighted directed graphs. This enables the discovery of clearer and potentially better fitting models, though less precise and actionable. A solution to make those more actionable is to implement prediction techniques, as in S8. It is also remarkable that none of the studies mined declarative models, a widespread flexible paradigm that could be able to cope with the volatility of human behaviour. This may be explained by the fact that declarative models are usually harder to understand than imperative models, making it more complex for the users to interact with the smart space system.

Finally, another important aspect in smart spaces is context-awareness: the process model should be context-aware to adapt to the changes in the environment [1]. This is surprisingly still neglected in current research about PM applied to smart spaces. Only S20 supports the modelling of context adaptive routines by using context-adaptive task models and process trees.

6.2 Abstraction Gap Between Sensor Events and Process Events

The abstraction gap has been recognized as one of the main challenges in BP applied to IoT data [40].

The main challenge here is that the solutions proposed in the literature are dataset- and/or sensor-specific. In most cases only PIR sensor data are available, witnessing the human performing actions in specific areas of the house. This also makes the techniques proposed very sensitive to the distribution of sensors across the environment. In addition, the scarce availability of datasets makes it difficult to evaluate the proposed approaches across multiple scenarios. In most cases, datasets from the CASAS project are used. This does not provide sufficient heterogeneity to ensure a reliable evaluation.

Finally, input from the broader PM literature could help address this issue. More specifically, generic event abstraction techniques used in PM could also be used to abstract sensor events into process events (see [39]). In addition, IoT PM methodologies also propose techniques to extract an event log from sensor data, as in S17; a deeper dive into this literature could identify relevant abstraction techniques for smart spaces.

6.3 Log Segmentation into Traces

The proposed approaches for segmentation are usually either naive (e.g., automatic day-based segmentation) or reliant on extensive input from the user (i.e., manual activity-based segmentation). From this point of view, the open research challenge is to perform segmentation by using the process semantics and the context. An initial proposal has been given in [10], where process model quality measures are used to iteratively segment the log.

In addition to this, segmentation is only a part of the problem, as traces must be clustered in order to produce event logs that are homogeneous from the point of view of instances, which is a prerequisite for PM. This is analogous to the general issue of case ID definition in PM, i.e., pinpointing what an instance of the process is.

6.4 Future Work

First of all, the study of the best modelling formalism for human behaviour is to be continued, as many different languages are used and some languages showing potentially useful characteristics have not been used yet (e.g., declarative models). The choice of formalism may need to be adapted to the specific application, and transformations between formalisms may also be a viable option to meet diverse needs (understandability, actionability, expressiveness, flexibility, etc.). In addition, the use of contextual information to create more meaningful models remains largely unexplored.

Another issue that stands out is the frequent usage of the same datasets by the included studies. A large portion of the included studies use one of the most common datasets from smart homes to perform their research (see Table 2). The scarce availability of these datasets may explain the trend of studies focusing on the home environment (see Table 2). While the use of a common dataset makes it easier to compare the different methods, it might make some of the techniques less generalisable to other data and other environments.

Another suggestion for future work is to source datasets from more varied environments. Diversifying the application scenario could benefit the research community as this might lead to new insights or techniques. Additionally, simulators could be developed to generate labelled datasets that can be used to develop and validate PM techniques for different kinds of smart spaces and types of sensors.

7 Conclusions

In this paper, we surveyed the application of PM to smart spaces data. A total of 21 studies were included in the survey and classified according to how they handle three main identified challenges PM techniques need to deal with when analysing smart spaces’ data [19]: 1) use of a suitable formalism to represent human behaviour; 2) abstraction gap between sensor events and process events; 3) log segmentation into traces.

The results showed that there are already some suitable solutions for these challenges, achieving the mining from sensor measurements to activities, and sometimes going a step further by identifying habits. However, some important issues still need to be addressed in future work, such as the selection of an appropriate modelling formalism for human behaviour mining, the exploitation of context information, the generalisability of the developed techniques or the challenge of multi-user environments.