1 Introduction

Home medical devices (e.g. infusion pumps, inhalers, nebulizers), which patients use at home on their own, are becoming more and more prevalent due to their cost-saving advantages. However, non-professional users, especially elderly people with cognitive decline, may sometimes operate a medical device incorrectly (e.g. not following the required operating procedures). More seriously, such use errors can lead to fatal results. For example, according to [13], during 2005–2010, 710 deaths were linked to the use of one kind of home medical device, the infusion pump, which intravenously delivers life-critical drugs, food and other solutions to patients. Therefore, it is critical to have external mechanisms that supervise a patient's use of these devices and keep use errors from happening. One straightforward solution would be to let a professional play the supervisor's role. However, because this contradicts the main objective of home medical devices, i.e. reducing cost, it is clearly infeasible. Instead of relying on "expensive" human supervision, in this paper we propose a cognitive assistive system whose objective is to automatically monitor the use of home medical devices.

The cognitive assistive system has two main functions: perception and recognition. On one hand, the system should be able to perceive the user's operations. Since various advanced sensors have been developed over the past decades, we are able to perceive the user's operations from many different aspects. For example, the recently introduced Kinect provides not only RGB information but also depth, which cannot be captured by traditional cameras. On the other hand, another important requirement for building a successful cognitive assistive system is the ability to recognize the operations so that use errors can be identified [9]. However, unlike the well-developed perception module, it is still unknown whether current techniques are mature enough to fulfill this requirement. Even though many techniques have been proposed to recognize human actions [1, 6, 15, 16, 20], none of them has been applied to recognizing the operations involved in using home medical devices, and therefore we are not sure whether they are adequate in such a scenario.

Since the lack of a corresponding database is the main reason for this situation, we construct a database (called PUMP) specially designed for studying the use of home medical devices and present it in this paper. In particular, we take patients using an infusion pump as a typical example of home medical device operation when collecting this database. An infusion pump is a device that infuses fluids, medication or nutrients into a patient's bloodstream, generally intravenously. Because they connect directly into a person's circulatory system, infusion pumps are a source of major patient safety concerns. Because of this significant impact, we used an infusion pump as the sample device. To collect the data, we first defined an operation protocol for correct use of an infusion pump [7]. Then, each user was asked to simulate the use of the infusion pump several times. Their operations were recorded by three Kinect cameras from different views, as shown in Fig. 1. Seventeen volunteers participated in the data collection, and 68 multi-view operation sequences were generated. The operation sequences were then manually annotated by locating the temporal intervals of all operations in each sequence. The database will be released to the public for research purposes.

Fig. 1

Examples of data recorded in PUMP database. The first row is the RGB data while the second row is the corresponding depth data. All user operations were captured by three Kinect cameras from different views

After building the database, we evaluated the recognition performance of existing approaches on it. Even when using a state-of-the-art approach that demonstrates near-perfect performance in recognizing general human actions, we observe a significant performance drop when applying it to device operations. A subtle and often overlooked characteristic of the actions involved in using devices restrains the performance of existing action recognition algorithms.

The uniqueness is illustrated in Fig. 2, which shows four actions and the corresponding extracted MoSIFT features [4]. The regions relevant to the target actions are indicated by green boxes. Figure 2a–c show the actions of "running", "jumping" and "hula hoop", selected from three popular action recognition datasets [3, 11, 18]. For all three cases, most feature points in the whole frame are inside the green box. Because the "noise" points outside the box are relatively few, it is safe to use all features in the whole frame to model an action. However, as shown in Fig. 2d, for the action "turning a device on", a typical action in using a home medical device, only a very small fraction of the features lies in the green box compared to all the features extracted from the whole frame. In this case, it is no longer reasonable to use all the features to represent the action, since the representation would be contaminated by a substantial amount of essentially random noise. This difference in feature distribution can be attributed to the fact that the relevant motion in Fig. 2d is non-dominant and covers a relatively small area compared to the co-occurring non-relevant motion in the frame. We call this type of action tiny actions. Most device operations are tiny actions, because we usually operate a device with only a single body part, such as a hand or foot, instead of the whole body.

Fig. 2

Comparison between four actions and the corresponding extracted MoSIFT features [4]. Only features in the green boxes are relevant to the actions by definition. (a) is "running" from [11], (b) is "jumping" from [3], (c) is "hula hoop" from [18] and (d) is "turning a device on" recorded by ourselves (Color figure online)

To recognize tiny actions, it is critical to focus on the local area where the target action happens, namely the region of interest (ROI). Therefore, in the second part of this paper, we introduce a simple but effective approach to estimating the ROI for recognizing tiny actions. Specifically, the method learns the ROI for an action by analyzing the correlation between the action and sub-regions of the frame. The estimated ROI is then used as a filter for building more accurate action representations. The experiments show improvements in recognition precision over traditional methods. Note that the proposed ROI estimation method can be used as a preprocessing step before applying any existing action recognition method.

The paper is organized as follows. We first review related work in Sect. 2 and then introduce the PUMP database in Sect. 3. In Sect. 4, we describe the ROI estimation method for recognizing tiny actions, followed by experiments on the PUMP database in Sect. 5. Section 6 outlines an interaction model for coaching the use of home medical devices. We conclude and discuss future work in Sect. 7.

2 Related Works

2.1 Existing Action Databases

We review existing action databases and compare them to the proposed one along three dimensions: (1) RGB videos or RGB+Depth videos, (2) single camera or multiple cameras, and (3) significant actions or tiny actions.

2.1.1 RGB Videos Versus RGB+Depth Videos

Most existing databases contain RGB videos captured by traditional cameras, such as UCF50 [18], Hollywood2 [12], HMDB [10], KTH [21], Weizmann [3], UT-Interaction [19] and IXMAS [24]. Thanks to the greater availability of RGB+Depth cameras (e.g. Kinect), a few 3D databases containing RGB+Depth videos have recently been proposed, for example MSR3D [23]. Compared to traditional RGB videos, RGB+Depth videos preserve additional depth information that can be useful for action analysis. PUMP is an RGB+Depth database.

2.1.2 Single Camera Versus Multiple Cameras

Single camera databases record each action with only one camera at a time, while in multi-camera databases each action is simultaneously recorded by multiple cameras with overlapping views. Again, most current databases are single camera ones (e.g. UCF50, Hollywood2, HMDB, KTH, Weizmann, UT-Interaction and MSR3D). Only a few multi-camera action databases have been published, such as IXMAS, where each action was captured by five cameras from different views. Since we use three cameras in PUMP, it is a multi-camera database.

2.1.3 Significant Actions Versus Tiny Actions

We classify actions into two types in terms of their motion strength and their area relative to the whole motion region. The motion of a significant action is strong and dominant compared to other co-occurring motion in the frame. In contrast, the motion of a tiny action is weak, non-dominant and covers a relatively small area. By this definition, the actions in KTH, Weizmann, UT-Interaction, MSR3D and IXMAS are mainly significant actions, while UCF50, Hollywood2 and HMDB contain videos with both significant and tiny actions. As shown in Fig. 4, all actions in the PUMP database are hand operations, whose movement is weak and occupies only a small area compared to the co-occurring motion of the body, head, etc. Therefore, they are all tiny actions.

2.2 Recognizing Human-Object Interaction

Strictly speaking, the actions recognized in a cognitive assistive system can be further categorized as human-object interactions, an important group of action recognition problems [1]. As [1] indicates, existing methods for recognizing interactions between humans and objects can generally be divided into two classes, depending on whether objects and actions are recognized independently or collaboratively. For methods in the first class, object recognition serves the subsequent action recognition; for example, objects are usually recognized first and then actions are recognized by analyzing the object's motion. For methods in the second class, objects and actions are recognized collaboratively, and the recognition of each serves the other. Our work falls into the first category. However, different from previous works, we use the object (the device) as a novel cue for estimating the ROI.

2.3 Detecting Salient Region

Similar to ROI detection, salient region detection also finds a sub-region of an image or video that is considered to be salient. Despite the similarity, their difference is worth noting. The saliency of a region is defined by its visual uniqueness, unpredictability and rarity, and is caused by variations in image attributes such as color, gradient, edges and boundaries [5]. In other words, saliency detection is not task-dependent but relies on the above cues. In contrast, ROI detection in a vision-based coaching system is task-dependent. For example, the region containing the device operation may not be salient, due to the weak motion, but it is of interest. Because of this difference, existing approaches for salient region detection cannot be directly applied to our problem.

3 PUMP Database

3.1 Data Collection Methodologies

We selected the Abbott Laboratories infusion pump (AIM Plus Ambulatory Infusion Manager) as an example home medical device for this research. With the help of a medical device expert, we first defined an operation protocol for correct use of the infusion pump, as shown in Table 1. Then, as illustrated in Fig. 3, we set up a "workplace" for data recording. Specifically, we used three Kinect cameras to record the user's operations from three views: front, side and above. Before each recording, an infusion pump in the off state, two refilled syringes and a box of alcohol pads were prepared on the table.

Fig. 3

An illustration of data collection setting for PUMP database [8]

Table 1 The proposed operation protocol for correct use of infusion pump

Each user was asked to perform the operation several times, following specified procedures. Each time, they followed either the exact procedure described in Table 1 (the correct operation protocol) or one of the predefined wrong procedures (to simulate use errors). Table 2 lists the four types of predefined wrong procedures; they differ from the correct operation protocol by reordering or omitting steps. During the recording, there were videos in which users unintentionally deviated from the operation protocol they were asked to follow. Since these videos in fact reflect use errors that users make in real life, we kept them in our database and provided additional error descriptions when the errors did not belong to any of the four predefined types (due to limited space, we do not include these descriptions in the paper but keep them as a separate file in the database).

Table 2 The four types of predefined wrong operation protocols

Since some steps in the operation protocol are in fact the same action, we categorized them into one action class. We further merged classes involving the same action but different devices, which finally led to seven action classes. We list the aggregated classes in Table 3 and show snapshots of the corresponding actions in Fig. 4.

Fig. 4

An illustration of the seven action classes generalized from the operation protocol of the PUMP database. (a) Turn the pump on/off. (b) Press buttons. (c) Uncap tube end/arm port. (d) Cap tube end/arm port. (e) Clean tube end/arm port. (f) Flush using syringe. (g) Connect/disconnect tube end and arm port

Table 3 The action categories generalized from operation protocol for PUMP database

Based on the generalized action categories, we manually annotated all sequences by locating the temporal intervals of all actions in each sequence.

3.2 Database Statistics and Recording Details

Seventeen volunteers participated in the data recording. Each user was asked to operate the infusion pump four times with different appearances. Specifically, two recordings were correct operations while the other two followed the wrong operation protocols. We finally constructed a database containing 68 operation sequences, where each sequence has three synchronized RGB+Depth videos recorded from different views.

We adopted OpenNI for Kinect recording and stored the raw data in the ONI file format. To facilitate the use of this database, we also provide calibrated RGB videos and depth videos extracted from the raw OpenNI data. Table 4 shows detailed statistics of the PUMP database. The average video duration is 4.31 min and the total video duration is 14.64 h.

Table 4 The inventory of PUMP database

4 ROI Estimation for Recognizing Tiny Actions

4.1 Notations

Let (x, y, z) be the coordinates of the corner of a cuboid region, \(A_{j}\) be an action of class j, and M be the number of action classes. Let \(T^{A_{j}}(x,y,z)\) be a density map that describes the probability of the region at (x, y, z) belonging to the ROI of \(A_{j}\). We call \(T^{A_{j}}(x,y,z)\) the ROI template.

4.2 Action-Region Correlation Estimation

Noting that the ROI for an action can be interpreted as the set of regions strongly correlated with that action, we estimate the correlation between an action \(A_{j}\) and each region at (x, y, z) in the "3D" frame recorded by the Kinect cameras.

To represent each region, we generate sliding windows starting from the origin of the "3D" frame and calculate a bag-of-words (BoW) [22] representation for each cuboid window. To reduce the computational cost of the following steps, we then apply principal component analysis (PCA) to the BoW of each window and keep only the dimensions corresponding to the top K largest eigenvalues. The Fisher score [2] is then used to estimate the correlation between each region and an action.
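
To make the region representation concrete, below is a minimal sketch of the sliding-window BoW and PCA step in Python. The function names, the assumption that each MoSIFT feature is available as an (x, y, z) coordinate plus a codebook index, and the use of scikit-learn's PCA are our own illustrative choices, not the exact implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def region_bow(points, word_ids, corner, size, vocab_size):
    """Histogram of visual words falling inside one cuboid window.

    points:   (N, 3) feature coordinates (x, y, z); word_ids: (N,) codebook indices.
    """
    lo = np.asarray(corner)
    hi = lo + np.asarray(size)
    inside = np.all((points >= lo) & (points < hi), axis=1)
    return np.bincount(word_ids[inside], minlength=vocab_size).astype(float)

def sliding_window_bows(points, word_ids, frame_shape, size, step, vocab_size, k=100):
    """BoW for every cuboid window generated from the origin of the '3D' frame."""
    corners, bows = [], []
    for x in range(0, frame_shape[0] - size[0] + 1, step):
        for y in range(0, frame_shape[1] - size[1] + 1, step):
            for z in range(0, frame_shape[2] - size[2] + 1, step):
                corners.append((x, y, z))
                bows.append(region_bow(points, word_ids, (x, y, z), size, vocab_size))
    bows = np.vstack(bows)
    # Keep only the dimensions corresponding to the top-K largest eigenvalues.
    reduced = PCA(n_components=min(k, bows.shape[0], bows.shape[1])).fit_transform(bows)
    return corners, reduced
```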

Let \(D_{b}^{(x,y,z),A_{j}}\) and \(D_{w}^{(x,y,z),A_{j}}\) be the between-class distance and the within-class distance [2] of action class \(A_{j}\), respectively, computed over the regions at (x, y, z) of all training data. They are defined as:

$$\displaystyle\begin{array}{rcl} D_{b}^{(x,y,z),A_{j} }& =& \sum _{k=1}^{M}\left (\mu _{\delta (k=j)}^{(x,y,z)} -\mu ^{(x,y,z)}\right )^{T}\left (\mu _{\delta (k=j)}^{(x,y,z)} -\mu ^{(x,y,z)}\right ) {}\\ D_{w}^{(x,y,z),A_{j} }& =& \sum _{k=1}^{M}\sum _{ b\in B_{k}^{(x,y,z)}}\left (b -\mu _{\delta (k=j)}^{(x,y,z)}\right )^{T}\left (b -\mu _{\delta (k=j)}^{(x,y,z)}\right ) {}\\ \end{array}$$

where \(\delta(\cdot)\) is an indicator function that outputs 1 if its argument is true and 0 otherwise, \(B_{k}^{(x,y,z)}\) is the set of BoWs of \(A_{k}\) at region (x, y, z), \(\mu^{(x,y,z)}\) is the mean of the BoWs at region (x, y, z) over all action classes, and \(\mu_{\delta(k=j)}^{(x,y,z)}\) is the mean of the BoWs at region (x, y, z) for action class \(A_{j}\) or for the action classes other than \(A_{j}\) (depending on whether k = j). The Fisher score \(F^{(x,y,z),A_{j}}\) for an action class \(A_{j}\) and the region at (x, y, z) is then simply:

$$\displaystyle\begin{array}{rcl} F^{(x,y,z),A_{j} }& =& \frac{D_{b}^{(x,y,z),A_{j}}} {D_{w}^{(x,y,z),A_{j}}}. {}\\ \end{array}$$

If a region (x, y, z) is highly correlated with action \(A_{j}\), it has a relatively small within-class distance \(D_{w}^{(x,y,z),A_{j}}\) and a large between-class distance \(D_{b}^{(x,y,z),A_{j}}\), which yields a large Fisher score \(F^{(x,y,z),A_{j}}\). The Fisher score can therefore serve as an indicator of the correlation between an action and a region.

By normalizing the Fisher scores over all regions, we obtain the ROI template \(T^{A_{j}}(x,y,z)\):

$$\displaystyle\begin{array}{rcl} T^{A_{j} }(x,y,z)& =& \frac{F^{(x,y,z),A_{j}}} {\sum _{x,y,z}F^{(x,y,z),A_{j}}}. {}\\ \end{array}$$
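
The following sketch shows how the Fisher scores and the normalized ROI template could be computed from PCA-reduced region BoWs. For simplicity it collapses the per-class summation above into a one-vs-rest grouping (samples of \(A_{j}\) versus all others), which differs from the formula only in how the terms are grouped; the data layout and function name are assumptions.

```python
import numpy as np

def roi_template(region_bows, labels, target_class):
    """Normalized ROI template T^{A_j}(x, y, z) from per-region Fisher scores.

    region_bows: dict mapping a region corner (x, y, z) to an (N, K) array of
                 PCA-reduced BoWs of the N training samples at that region.
    labels:      (N,) array of action class labels aligned with those rows.
    """
    scores = {}
    for corner, bows in region_bows.items():
        mu = bows.mean(axis=0)                               # mean over all classes
        groups = [bows[labels == target_class], bows[labels != target_class]]
        d_b, d_w = 0.0, 0.0
        for group in groups:
            if len(group) == 0:
                continue
            mu_g = group.mean(axis=0)
            d_b += float((mu_g - mu) @ (mu_g - mu))          # between-class distance
            d_w += float(((group - mu_g) ** 2).sum())        # within-class distance
        scores[corner] = d_b / d_w if d_w > 0 else 0.0       # Fisher score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total > 0 else scores
```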

4.3 ROI Adaptation and Noise Filtering

For a given input video sequence, we simply attach the ROI template \(T^{A_{j}}(x,y,z)\) to each frame of the sequence. Then, all feature points with an ROI score lower than a threshold λ are removed. In this way, we model the action based only on features inside the ROI.
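
A minimal sketch of this filtering step is given below. It assumes the ROI template is stored as a mapping from region corners to normalized scores (as produced by the sketch in Sect. 4.2) and that a feature is kept if any region containing it scores at least λ; these conventions are our own assumptions.

```python
import numpy as np

def filter_by_roi(points, word_ids, template, window_size, lam):
    """Keep only visual words whose enclosing region scores at least lam in the ROI template.

    template:    dict mapping a region corner (x, y, z) to its normalized ROI score.
    window_size: (w, h, d) extent of each region along x, y, z.
    """
    keep = np.zeros(len(points), dtype=bool)
    for corner, score in template.items():
        if score < lam:
            continue
        lo = np.asarray(corner)
        hi = lo + np.asarray(window_size)
        keep |= np.all((points >= lo) & (points < hi), axis=1)
    return word_ids[keep]
```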

5 Experiments

5.1 Experimental Setting

The 68 videos are divided into two folds, and we use each fold in turn as training data and the other as testing data. For each video, we extract MoSIFT [4] as low-level features and encode them into visual words [22] using a codebook with a vocabulary size of 1,000. Three different methods are then used to generate the action representations (see Sect. 5.2 for details). An SVM classifier with an RBF kernel is adopted for action classification, and two-fold cross validation is used for training the classification model. Training and testing are done independently for each view. To evaluate the effectiveness of the different methods, we compute the mean average precision (MAP) [17], i.e. the average precision (AP) averaged over all actions.
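
The evaluation protocol could be implemented roughly as follows with scikit-learn, training one one-vs-rest RBF SVM per action class and averaging the per-class average precision into MAP. The placeholder fold variables and the use of probability outputs as ranking scores are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score

def evaluate_fold(train_X, train_y, test_X, test_y, n_classes=7):
    """Train one one-vs-rest RBF SVM per action class and return MAP on the test fold."""
    aps = []
    for c in range(n_classes):
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(train_X, (train_y == c).astype(int))
        scores = clf.predict_proba(test_X)[:, 1]             # ranking scores for class c
        aps.append(average_precision_score((test_y == c).astype(int), scores))
    return float(np.mean(aps))                               # mean average precision

# Two-fold protocol: each fold serves once as training and once as test data.
# X1, y1 and X2, y2 are hypothetical placeholders for the BoW features and labels
# of the two folds of one view; the folds are swapped and the two MAP values averaged.
# map_score = 0.5 * (evaluate_fold(X1, y1, X2, y2) + evaluate_fold(X2, y2, X1, y1))
```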

5.2 Action Representation Methods

The bag-of-words (BoW) model is adopted for action representation. Each high-dimensional local feature point (e.g. MoSIFT) is first mapped to the closest cluster center of the pre-trained codebook, and the cluster's id is assigned to the feature as its "visual word". A pooling step is then applied to compute statistics of all the visual words in a video segment and represent it as a vector with the same dimension as the codebook. This vector is called the BoW of the video segment. In this paper, we experiment with the three pooling methods for action representation introduced below.
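
A minimal sketch of the visual-word assignment and pooling steps is shown below; the brute-force nearest-center search and the function names are illustrative assumptions.

```python
import numpy as np

def assign_visual_words(descriptors, codebook):
    """Map each local descriptor (e.g. MoSIFT) to the id of its nearest codebook center.

    descriptors: (N, D) descriptor matrix; codebook: (V, D) pre-trained cluster centers.
    """
    # Brute-force squared Euclidean distance from every descriptor to every center.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def pool_bow(word_ids, vocab_size):
    """Normalized visual-word frequency histogram for one video segment."""
    hist = np.bincount(word_ids, minlength=vocab_size).astype(float)
    return hist / hist.sum() if hist.sum() > 0 else hist
```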

5.2.1 Whole Frame Based Pooling (WF-BoW)

Most existing BoW-based action recognition approaches [4, 21] use this pooling method. All visual words in the whole frame are aggregated, the frequency of each visual word is calculated, and the normalized frequency histogram is used as the final representation.

5.2.2 Depth-Layered Multi-Channel Based Pooling (DLMC-BoW)

Since the videos are recorded by Kinect cameras, each feature point extracted from a key frame has not only x and y coordinates but also a depth coordinate z. Based on this observation, [14] first divides the z axis into several depth-layered channels and then pools the features within each channel independently, resulting in a multi-channel depth histogram representation. In our implementation, we uniformly divide the depth space into five channels.
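
The depth-layered pooling could be sketched as follows; the assumed depth range of the sensor and the function name are illustrative, and the actual implementation in [14] may divide the depth space differently.

```python
import numpy as np

def dlmc_bow(word_ids, depths, vocab_size, n_channels=5, depth_range=(0.0, 4000.0)):
    """Pool visual words independently within uniform depth channels and concatenate.

    depths:      (N,) z coordinates of the features, in the sensor's depth units.
    depth_range: assumed working range of the Kinect depth sensor (illustrative).
    """
    edges = np.linspace(depth_range[0], depth_range[1], n_channels + 1)
    channels = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_channel = (depths >= lo) & (depths < hi)
        hist = np.bincount(word_ids[in_channel], minlength=vocab_size).astype(float)
        channels.append(hist / hist.sum() if hist.sum() > 0 else hist)
    return np.concatenate(channels)      # length = n_channels * vocab_size
```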

5.2.3 ROI Based Pooling (ROI-BoW)

To estimate the ROI, sliding cuboid windows of 200 × 200 × 100 pixels (along the x, y and z axes) with a moving step of 40 pixels are generated and represented as BoWs by aggregating all visual words inside each cuboid window. We then apply PCA to the BoW of each window and keep only the dimensions corresponding to the top 100 largest eigenvalues. After that, the ROI template is calculated using the method described in Sect. 4.2. Note that all of these steps can be performed offline. At the online testing stage, all visual words with an ROI score lower than 0.5 are filtered out, and the final BoW representation is built only from the remaining visual words, i.e. those belonging to the ROI.

5.3 Experimental Results and Analysis

5.3.1 ROI Visualizations

To qualitatively evaluate the proposed ROI estimation method, Fig. 5 visualizes the estimated ROIs for each action in all three views as density maps, where higher intensity means a higher probability of belonging to the ROI. Because the cameras are placed at different positions, the ROIs for the same action can differ across views. Comparing the action examples shown in Fig. 3 and the corresponding ROIs in Fig. 5, we find that the estimated ROIs make sense intuitively. For example, in the "front" view, the ROI is in the middle of the frame for the action "press buttons" but on the right for the action "flush using syringe". This is because pressing buttons always happens in the middle of the frame, while to flush using a syringe users always need to take the syringe from the syringe bag placed on the right side of the frame. We also observe that some actions' ROIs are dense and sharp (e.g. "connect/disconnect" in the side view) while others are relatively sparse and soft (e.g. "cap tube end/arm port" in the front view). This difference can be attributed to the different motion patterns of the actions: the motion of "connect/disconnect" concentrates in a narrow area, whereas the motion of actions like "cap tube end/arm port" is distributed over a relatively larger area.

Fig. 5

The visualizations of estimated ROI for each action. Each row corresponds to one of the three views. The action examples of different views can be found in Fig. 3

5.3.2 Action Recognition Performance

Table 5 summarizes the results of the three action representation methods. For each method, the first three columns correspond to the performance on the three views, while the last column gives the fusion result obtained by manually selecting the best performance among the three views. The fusion result can be interpreted as the best performance the cognitive assistive system achieves with that method. Comparing the average fusion results of the three methods, we see that ROI-BoW achieves the best performance and improves MAP by 3.33 % over WF-BoW, while DLMC-BoW performs almost the same as WF-BoW.

Table 5 MAP comparison of three action representation methods on PUMP database

If we further compare the three methods on each single view, we observe a different pattern. For the "front" and "side" views, both ROI-BoW and DLMC-BoW show significant improvements over WF-BoW: on average, DLMC-BoW improves MAP by 4.42 % and 10.20 %, and ROI-BoW by 8.84 % and 19.28 %, on those two views respectively. However, for the "above" view, neither ROI-BoW nor DLMC-BoW shows improvement; in fact, both slightly decrease performance compared to WF-BoW. The reason for this inconsistency across views is illustrated in Fig. 6, which visualizes the extracted MoSIFT features in example frames with the same time stamp but different views for the action "flush using syringe". Again, we use green boxes to indicate the regions relevant to the target action. For the "front" and "side" views, only a very small fraction of the features lies inside the green box, whereas for the "above" view most features are inside it. Therefore, due to the differences in camera position, actions recorded by the "front" and "side" cameras are the most typical tiny actions. Because both DLMC-BoW and ROI-BoW can be interpreted as location-based visual word weighting methods (DLMC-BoW does so implicitly by letting the SVM weight features at different depths differently, while ROI-BoW does so explicitly by hard-weighting features inside the ROI with 1 and outside with 0), they are most effective when the actions are typical tiny actions.

Fig. 6

The visualization of extracted MoSIFT features in frames with the same time stamp but of different views

However, even though ROI-BoW improves the performance significantly, looking at the absolute performance of each action we see that only the actions "Turn the pump on/off", "Press buttons", "Clean tube end/arm port" and "Flush using syringe" achieve reasonably high average precision, while the performance on the remaining three actions is still low. To build an applicable cognitive assistive system, we have to recognize all the actions accurately. Therefore, further effort is still required to boost the performance on the difficult actions.

6 An Interaction Model for Coaching Use of Home Medical Devices

Separate from the detection problem is the interaction model for dealing with the user. The detection module, as described in the sections above, handles the analysis of the camera feeds and reports whether or not a step has been detected as correctly performed, and with what level of confidence. This is reported to an interaction module. The interaction module has prior knowledge of the typical average detection confidence for each particular step, the acceptable sequences of steps, and the importance of each step based on the scale described below. The user may also interact with the system to go through the instructions using a mouse.

6.1 Importance Levels

We assign one of four levels of importance to each step in the procedure. The interaction module behaves differently depending on the importance level of the step. A harmless or obvious step, such as turning the system on, is one where it is unnecessary to warn the user if it is not performed; such a step is categorized as level 0 importance. A very important step is placed at level 3 importance, which requires the system to make it hard for the user to ignore the error warnings, e.g. by combining loud and flashing audio-visual notifications. Table 6 shows the different actions taken by the system given the importance levels.

Table 6 Example of warnings and reminders at different levels of error importance and confidence

6.2 Event Detection Certainty

One way to prevent user frustration with system misrecognitions is to have the system communicate the certainty of its detections, with appropriately adapted feedback. The quickest way for a user to begin doubting and ultimately ignoring the system is for it to report an event as detected or not detected when (to the user) it is clearly the opposite. To counter this, the system incorporates an algorithm to decide whether an action is detected as incorrect or not detected. A warning for a user error with a sufficiently high confidence rating is shown to the user, and the system adapts the presentation of the message to convey the confidence level, as sketched after the list below.

  • Generally, moderate to high confidence detection of a correct step is only indicated with a green light, without otherwise disrupting the user.

  • Less confident detections of the correct step might be indicated in yellow.

  • Moderate to high confidence detection of an incorrect action triggers a strong warning.

  • Low confidence detection of an incorrect action would trigger a suggestion or reminder, without commitment by the system that an actual error was made.
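
As a rough illustration of how the interaction module might combine step importance and detection confidence into the feedback behaviors listed above (and in Table 6), consider the following sketch. The enumeration values, the confidence threshold and the exact mapping are hypothetical; the paper does not prescribe a specific implementation.

```python
from enum import Enum

class Feedback(Enum):
    GREEN_LIGHT = "correct step, indicated unobtrusively"
    YELLOW_LIGHT = "tentatively correct step, indicated in yellow"
    STRONG_WARNING = "audio-visual warning that is hard to ignore"
    GENTLE_REMINDER = "non-committal suggestion or reminder"
    SILENT = "no feedback"

def choose_feedback(step_correct, confidence, importance, conf_threshold=0.6):
    """Map a detection result to user feedback.

    step_correct: True if the step was detected as correctly performed.
    confidence:   detector confidence in [0, 1].
    importance:   step importance level, 0 (harmless/obvious) to 3 (critical).
    """
    if step_correct:
        return Feedback.GREEN_LIGHT if confidence >= conf_threshold else Feedback.YELLOW_LIGHT
    if importance == 0:
        return Feedback.SILENT            # level-0 steps never trigger a warning
    if confidence >= conf_threshold:
        return Feedback.STRONG_WARNING    # presentation intensity would scale with importance
    return Feedback.GENTLE_REMINDER       # low confidence: suggest rather than assert
```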

6.3 Warning Levels

Instead of confusing the user with percentages of confidence in the event detection, warnings are presented so as to denote both error severity and confidence level. This prevents the system from making blatant statements that might contradict reality, such as declaring an event not detected when, to the user, it clearly happened. These confidence- and severity-appropriate warnings allow the system to remain useful even if occasional events are misclassified.

7 Conclusion and Future Work

In this paper, we proposed a cognitive assistive system to monitor the use of home medical devices. To build such a system, accurate recognition of user actions is one of the most essential problems. However, since little research has been done in this specific direction, it is still unknown whether current techniques are adequate to solve the problem. To facilitate research in this area, we made three contributions in this paper. First, we constructed a database in which users were asked to simulate the use of an infusion pump following predefined procedures. The operations were recorded by three Kinect cameras from different views, and all data were manually labeled for experimental purposes. Second, we performed a formal evaluation of existing approaches on the proposed database. Because we found that current methods can hardly deal with the tiny actions involved in using home medical devices, we made our third contribution by introducing an ROI estimation method and applying the ROI to build more accurate action representations. The experiments show significant performance improvements over traditional methods.

Finally, we outlined an interaction model describing how to handle different levels of errors recognized with varying certainty, and how to provide feedback to the user that reflects both the severity/importance of the mistake and the system's confidence in the correctness of the recognition.

Currently, we treat all the actions in operating a home medical device as independent and recognize them separately. However, they are obviously mutually related, because they are operations within a single procedure. In future work, we will therefore focus on leveraging the relations between actions to further improve recognition performance.