
1 Introduction

Activity Recognition (AR) is expected to be a core component of numerous future Internet of Things applications such as healthcare, smart homes, and security [5, 22, 23]. Therefore, evaluating the effectiveness of different AR algorithms is essential. Metrics such as accuracy, or plotting recall against precision, are common and easy to understand and interpret even by non-experts. These metrics are well suited to discrete instances and pre-segmented data sequences [5], where a predicted instance is either correct or incorrect. However, concepts in AR are durative; thus, a predicted concept can be correct in one period and incorrect or only partially correct in another [26]. Accordingly, as shown in Fig. 1, terms that are well defined in traditional systems, such as true positive (TP), false positive (FP), and false negative (FN), are not suitable for durative concepts [26].

Fig. 1.

a) Classical instances b) Durative instances. A durative instance may be partially correct and partially incorrect, while a classical one is either correct or incorrect.

However, it is often assumed that time-frame, event-based, or classifier performance reflects the whole system's performance [4, 5, 22, 23]. This assumption neglects practical scenarios and may misleadingly present convincing results (Sect. 2). Despite its importance, the evaluation of durative concepts is not well developed even in other areas: there is still no universally accepted formula for evaluating the effectiveness of systems with durative concepts.

This paper proposes a novel mathematical method for evaluating different properties of AR systems. It redefines TP, FP, and FN to consider various properties such as detection, total duration, relative duration, boundary alignment, and uniformity between ground-truth and predicted events. Therefore, confusion-matrix-based metrics such as recall, precision, and f-score can be calculated to evaluate and compare different systems. Furthermore, the method is simple, time-efficient, extensible, and customizable, and it overcomes the limitations of existing methods. Moreover, it can help select an appropriate algorithm for a new application by prioritizing properties differently. The experiments show that our method outperforms state-of-the-art methods with enhanced generalization capability.

Table 1. The notations used in this paper.

2 Preliminaries and Related Work

Evaluating the performance of AR systems is usually done by comparing predicted events (PEs) with ground-truth events (GTEs) [16]. This can be viewed as the matching of two time series. However, it is not easy to determine the time boundaries of ground-truth labels perfectly; moreover, the distinction between activities is not always clear [5]. Therefore, some decision functions accommodate offsets using an ambiguous range [10], fuzzy event boundaries [20], time-series matching techniques (such as dynamic time warping and longest common sub-sequences [7]), or a categorical probability distribution [9]; however, they fail to distinguish different types of errors (e.g., fragmentation) [27]. Common approaches to evaluating AR systems include time-frame, event-based, and classifier performance [12, 17, 23]. Time-frame-based methods use fixed-period intervals as atomic units and facilitate comparing different AR algorithms, since each frame is independent of both the GTEs and the PEs [12, 17]. Nevertheless, the interpretation of errors is not the same in different applications. Hence, each frame’s error is classified as insertion (detection of an activity when nothing actually happened), overfill (time before and after the occurrence of an activity that is incorrectly identified as part of the activity), or merge (covering multiple GTEs by a single PE) as sources of FP errors, and as deletion (failure to detect an activity), substitution (wrongly detected as another class), underfill (undetected duration at the beginning and end of the activity), or fragmentation (detection of a GTE by multiple PEs) as sources of FN errors [17]. Moreover, event-based methods are essential to consider alongside time-frame ones [27]. Event-based errors are categorized as insertion, deletion, fragmentation, merge, and fragmented-merge (the occurrence of both merge and fragmentation errors) [27]. However, an expert must perform a time-consuming analysis of these massive and heterogeneous diagrams, matrices, and statistics; therefore, combining them into a scalar metric is complex. Besides, these approaches consider only the total duration of positional errors and do not provide an event-based, tunable model for them.

From the behavior analysis perspective, each activity needs a different evaluation method [1]; e.g., duration-sensitive activities need to be evaluated differently from frequency-sensitive ones. Timeliness is another metric, used for online and real-time prediction [24]. It is defined as the duration of continuous correct prediction of an activity without switching to an incorrect prediction. To compare different AR algorithms in a similar situation, a competition was held, using the time-frame f\(_1\)-score, recognition delay, installation complexity, user acceptance, and interoperability as evaluation criteria [8].

In sound event detection (SED) [4], video action detection [3], anomaly detection [26], and video abnormal event detection [11], among others, concepts are also durative. The IEEE Audio and Acoustic Signal Processing challenge [25] highlights the need for an appropriate metric in SED. Still, researchers mainly use collar, segment (time-frame based), and PSDS (polyphonic sound detection score) methods [4, 16]. However, these cannot show the different sources of errors. Our recent work on multimodal metrics for SED systems [18] provides some evaluation approaches depending on the hypotheses and constraints of SED applications. The National Institute of Standards and Technology (NIST) developed a challenge for detecting activities in video (ActEV) [3]. It initially used the false alarm rate (instance-based) and the missed detection probability (instance-based) as evaluation metrics; however, in 2019 it adopted a time-frame method for calculating the false alarm rate [3]. Other metrics used in video abnormal event detection are the false rejection rate, equal error rate, decidability index, receiver operating characteristic curves, and area under the curve [6, 11]. However, the equal error rate can be misleading in the anomaly detection setting [15]. The Numenta anomaly benchmark [14] is designed to evaluate different anomaly detection algorithms. It uses a scaled sigmoidal scoring function for the relative position of each detection; however, it ignores fragmented predictions. To resolve the previously mentioned issues, the researchers in [26] redefine precision and recall for time series (particularly for anomaly detection). Their definitions require some functions to be explicitly defined for a given application: \(\gamma \) (to account for fragmented events), \(\delta \) (to account for the positional relation between PE and GTE), \(\mathrm {overlap}\) (the rate of correctly detected events, e.g., \( \mathrm {overlap(x,y,}\delta \mathrm {())}=\mathcal {T}\mathrm {(x}\cap \mathrm {y)}/\mathcal {T}\mathrm {(x)}\)), and a coefficient \(\alpha \). They are formulated in Eq. (1) using the notations of Table 1.

$$\begin{aligned}&\mathrm {exist(e,X)}\!=\!\mathrm {[e}\cap \mathrm {X}\ne \emptyset ],\qquad {\mathrm {score}}\mathrm {(e, X)}\!=\!\gamma \mathrm {(e, X)}\!\times \!\mathop {\Sigma }_{\mathrm {x}\in \mathrm {X}}\mathrm {overlap}\mathrm {(e, e} \cap \mathrm {x}, \delta \mathrm {())}, \\&\mathrm {Recall}\!=\!\frac{1}{\mathopen |\mathrm {R}\mathclose |}\!\sum _{\mathrm {r}\in \mathrm {R}}\alpha \!\times \!\mathrm {exist(r, P)} +(1\!-\!\alpha )\! \times \! \mathrm {score(r, P)}, \quad \quad \mathrm {Precision}\!=\! \frac{1}{\mathopen |\mathrm {P}\mathclose |}\!\sum _{\mathrm {p} \in \mathrm {P}}\mathrm {score(p, R)} \nonumber \end{aligned}$$
(1)
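To make the critique below concrete, the following minimal Python sketch is one possible reading of Eq. (1). It assumes that events are (start, end) tuples, that \(\delta \) is flat (so \(\mathrm {overlap(e, e}\cap \mathrm {x},\delta \mathrm {())}\) reduces to \(\mathcal {T}\mathrm {(e}\cap \mathrm {x)}/\mathcal {T}\mathrm {(e)}\)), and that \(\gamma (e,X)=\mathopen |e\cap X\mathclose |^{-1}\) as the authors suggest; the function names are ours, not from [26].

```python
def dur(e):
    """Duration T(e) of an interval e = (start, end)."""
    return e[1] - e[0]

def inter(a, b):
    """Intersection of two intervals, or None if they do not overlap."""
    s, t = max(a[0], b[0]), min(a[1], b[1])
    return (s, t) if s < t else None

def score(e, X):
    """score(e, X) of Eq. (1) with a flat delta and gamma = 1/|e ∩ X|."""
    hits = [h for h in (inter(e, x) for x in X) if h]
    if not hits:
        return 0.0
    return (1.0 / len(hits)) * sum(dur(h) / dur(e) for h in hits)

def tatbul_recall(R, P, alpha=0.5):
    exist = lambda r: 1.0 if any(inter(r, p) for p in P) else 0.0
    return sum(alpha * exist(r) + (1 - alpha) * score(r, P) for r in R) / len(R)

def tatbul_precision(R, P):
    # Note: alpha does not appear here -- the inconsistency criticised below.
    return sum(score(p, R) for p in P) / len(P)
```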

The issues in [26] (Eq. (1)) are analysed in detail in the following:

  1.

    It surprisingly ignores the coefficient \(\alpha \) in calculating precision. Therefore, it gives inconsistent weights to the overlap function in calculating recall and precision, and to prevent misleading interpretations, the two cannot be used as complements (e.g., in calculating the f1-score).

  2.

    Fragmented PEs receive a significant positive score in precision; e.g., in Fig. 2, the precision of (a) is much higher than that of (b). A similar situation occurs for recall.

  3.

    It normalizes the duration of events to avoid duration impacts. Briefly, the precision calculation is \(\underset{p\in P}{avg}(\frac{TP}{\mathcal {T}(p)})\) and the recall calculation is \(\underset{r\in R}{avg}(\frac{TP}{\mathcal {T}(r)})\). This normalization looks reasonable for a single PE and GTE; however, in total, it yields different values of TP in recall and precision. Therefore, they are not calculated within the same mathematical model and cannot be used as complements (e.g., for the f1-score). Equation (2) presents these calculations for Fig. 2 (d).

    $$\begin{aligned}&\mathrm {Precision}=\frac{\frac{\mathrm {TP}_1}{\mathrm {P}_1}+\frac{\mathrm {TP}_2}{\mathrm {P}_2}}{1+1} =\frac{\Sigma \text {normalized TPs based on PEs}}{\Sigma \text {normalized PEs}}\nonumber \\&\mathrm {Recall}\!=\!\frac{\frac{\mathrm {TP}_1}{\mathrm {R}_1}\!+\!\frac{\mathrm {TP}_2}{\mathrm {R}_2}\!+\!\frac{0}{\mathrm {R}_3}}{1+1+1}\!=\!\frac{\Sigma \text {normalized TPs based on GTEs}}{\Sigma \text {normalized GTEs}} \end{aligned}$$
    (2)
  4.

    Defining an appropriate cardinality function is complex. Furthermore, it is difficult to adjust and tune this formula, since the dependencies between cardinality, position, and overlap are not clear [10]. For example, in Fig. 2 (c), the first and second GTEs have the same recall (0.33) (using \(\gamma (e,X)=\mathopen |e\cap X\mathclose |^{-1}\) as suggested by the authors). The same holds when calculating precision for merged PEs.

  5.

    This approach cannot be applied to duration-sensitive activities [1].

  6.

    Adding a new property (e.g., total duration) is not straightforward.

Fig. 2.

Example activities that help to explain the drawbacks in [26].

Fig. 3.

Evaluation of AR systems that use different segmentation approaches.

The issue with classifier metrics is the inability to compare algorithms in a unified space, since AR systems may use various segmentation (windowing) algorithms. Figure 3 illustrates two algorithms. Activity \(A_1\) is not detected in segments \(C_1\), \(T_1\), and \(T_2\). Thus, the classifier accuracy of the first approach is 50% while that of the second is 60%. Clearly, the difference in their performance is due to the effects of the different segmentation procedures. Accordingly, classifier accuracy may misleadingly present convincing results and cannot capture duration-specific properties, although it is widely used in several papers [5, 7, 13, 19, 23]. Time-frame accuracy is a more consistent metric [12]; however, it cannot display different properties of an AR system such as uniformity, the detection of each event, or boundary alignment. Additionally, a long event affects the whole result.

As a result, a new metric is needed to better evaluate AR algorithms while paying attention to the peculiarities of the applications and activities.

3 Proposed Metric

An evaluation method should determine the different properties of AR algorithms. We define a measurement (in terms of recall and precision) for each property, and together they constitute our proposed metrics. A weighted combination of them can produce a scalar value, or they can be used collectively as a multi-objective metric. Because of our approach’s modularity, it can easily be extended with a measurement for a new property. Our metric is based on the following assumptions: 1- R and P are given as input. 2- Times in concepts are durative and specified. 3- The acceptable time shift for a PE to be considered detected is within the GTE range, i.e., a PE and a GTE are related when they have some overlap. 4- Only a single activity class exists. For multi-class cases, each class is evaluated individually as the positive class against the rest as the negative one. This allows using different parameters for each activity class, which is a necessary feature for AR [1]. 5- Only one instance of an activity class occurs at a time.

We use ground truths as references in the normalization process because they are independent of the predictions of different algorithms. Therefore, we cluster GTEs and PEs such that \(\mathrm {C}\!=\!\{\mathrm {(r,ps)}|\mathrm {r}\in \mathrm {R} \wedge \mathrm {ps}\!=\!\{\mathrm {p}\! \in \! \mathrm {P}|\mathrm {r}\cap \mathrm {p}\!\ne \! \emptyset \}\wedge \mathrm {ps}\ne \emptyset \}\). Orphan PEs are collected as \(\overline{\mathrm {C}}=\{\mathrm {p}\in \mathrm {P}|\mathrm {p}\cap \mathrm {R}=\emptyset \}\).
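As an illustration, the minimal Python sketch below builds these clusters, assuming every event is a (start, end) tuple on a common time axis; the helper names (`dur`, `inter`, `cluster`) are ours and are reused in the later sketches.

```python
def dur(e):
    """Duration T(e) of an interval e = (start, end); 0 for None."""
    return e[1] - e[0] if e else 0.0

def inter(a, b):
    """Intersection of two intervals, or None if they do not overlap."""
    s, t = max(a[0], b[0]), min(a[1], b[1])
    return (s, t) if s < t else None

def cluster(R, P):
    """Build C = {(r, ps)} for every GTE with at least one overlapping PE,
    plus the orphan set C-bar of PEs that overlap no GTE."""
    C = [(r, [p for p in P if inter(r, p)]) for r in R]
    C = [(r, ps) for r, ps in C if ps]                  # keep ps != {} only
    orphans = [p for p in P if all(inter(r, p) is None for r in R)]
    return C, orphans
```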

Each instance in the classical model is either correctly predicted or not (each TP, FP, or FN is either 0 or 1). However, in the durative model, a GTE may be partially covered by positive PEs. Therefore, we allow partial values for TP, FP, and FN. In the following, we present the properties, which are drawn from the state of the art, together with our formulas for measuring their values.

Detection (D) Property measures the detection of a GTE even by a small (at least \(\theta \) [10]) PE; i.e., it checks for the existence of overlaps between PEs and GTEs. A GTE is a TP if it is detected at least once and an FN otherwise. PEs that do not intersect any GTE are considered FP. This property is useful in applications like alarm systems [26].

$$\begin{aligned}&\mathrm {TP}^\mathrm {D}\!=\!\sum \limits _{\mathrm {(r,ps): C}} \!\left[ \sum _{\mathrm {p:ps}}\frac{\mathcal {T}(\mathrm {r}\cap \mathrm {p})}{\mathcal {T}\mathrm {(r)}}> \theta _{\mathrm {tp}}\right] \!, \quad \mathrm {FP}^\mathrm {D}\!=\!\sum _{\mathrm {(r,ps): C}}\!\left[ \sum _{\mathrm {p:ps}}\frac{\mathcal {T}\mathrm {(p)}-\mathcal {T}\mathrm {(r}\cap \mathrm {p)}}{\mathcal {T}\mathrm {(r)}}> \theta _{\mathrm {fp}}\right] +\mathopen |\overline{\mathrm {C}}\mathclose | \\ \nonumber&\mathrm {FN}^\mathrm {D}=\mathopen |\mathrm {R}\mathclose |-\mathrm {TP}^\mathrm {D}, \end{aligned}$$
(3)

Therefore, a GTE is considered a TP when at least a fraction \(\theta _{\mathrm {tp}}\) of it is correctly identified; otherwise, it is considered an FN. FP counts the orphan PEs (\(\mathopen |\overline{C}\mathclose |\)) plus the clusters in which the rate of the wrongly predicted parts is higher than \(\theta _{\mathrm {fp}}\).
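A sketch of Eq. (3), reusing `dur`, `inter`, and `cluster` from the clustering sketch above; the default thresholds are those used later in Sect. 4.

```python
def detection_counts(R, P, theta_tp=0.0, theta_fp=1.0):
    """Eq. (3): TP, FP, FN for the Detection (D) property."""
    C, orphans = cluster(R, P)
    tp = sum(1 for r, ps in C
             if sum(dur(inter(r, p)) for p in ps) / dur(r) > theta_tp)
    fp = len(orphans) + sum(
        1 for r, ps in C
        if sum(dur(p) - dur(inter(r, p)) for p in ps) / dur(r) > theta_fp)
    fn = len(R) - tp
    return tp, fp, fn
```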

Uniformity (U) Property considers the detection of a GTE by a single PE instead of multiple fragmented ones. For example, in a taking-medicine event, detecting two taking-medicine events instead of one indicates a disorder; here, the duration is not as important as the number of occurrences. The researchers in [26, 27] consider uniformity an essential property; however, they do not formulate it. Event analysis [27] leads us to consider a GTE a TP if it is identified by exactly one PE. In this case, all other PEs are considered FP or FN.

$$\begin{aligned}&\mathrm {TP}^\mathrm {U}\!=\!\sum _{\mathrm {(r,ps):C}}\!\left[ \mathopen |\mathrm {ps}\cap \mathrm {R}\mathclose |\!=\!1\right] , \quad \mathrm {FN}^\mathrm {U}\!=\!\sum _{\mathrm {(r,ps):C}}\!\left[ \mathopen |\mathrm {ps}\cap \mathrm {R}\mathclose |\!>\!1\right] , \quad \mathrm {FP}^\mathrm {U}\!=\!\mathopen |\mathrm {P}\mathclose |-\mathopen |\overline{\mathrm {C}}\mathclose |\!-\!\mathrm {TP}^\mathrm {U} \end{aligned}$$
(4)

Thus, a recognized GTE is considered a TP if it is detected by one PE and that PE does not identify any other GTE; otherwise, it is considered an FN. Similarly, a PE that is neither a TP nor an orphan is considered an FP.
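A sketch of Eq. (4) that follows the textual definition above (we read \(\mathopen |\mathrm {ps}\cap \mathrm {R}\mathclose |\) as the number of GTE/PE associations a cluster takes part in); `cluster` is reused from the earlier sketch.

```python
def uniformity_counts(R, P):
    """Eq. (4): a GTE is a TP when exactly one PE covers it and that PE
    overlaps no other GTE; otherwise the GTE counts as an FN."""
    C, orphans = cluster(R, P)
    def uniform(ps):
        return len(ps) == 1 and sum(ps[0] in qs for _, qs in C) == 1
    tp = sum(1 for r, ps in C if uniform(ps))
    fn = len(C) - tp
    fp = len(P) - len(orphans) - tp    # matched PEs that are not TPs
    return tp, fp, fn
```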

Total Duration (T) Property is well-known and is similar to time-frame-based methods. It divides the PEs and GTEs by their boundaries; therefore, each frame is either TP, FP, FN, or TN [12].

$$\begin{aligned}&\mathrm {TP}^\mathrm {T}=\sum _{\mathrm {(r,ps): C}}\mathcal {T}\mathrm {(r}\cap \mathrm {ps)},\qquad \mathrm {FN}^\mathrm {T}= \mathcal {T}\mathrm {(R)-TP}^\mathrm {T},\qquad \mathrm {FP}^\mathrm {T}= \mathcal {T}\mathrm {(P)-TP}^\mathrm {T} \end{aligned}$$
(5)

Relative Duration (R) Property normalizes the duration of each event individually to lessen the effect of varying durations of events.

$$\begin{aligned} \mathrm {TP}^\mathrm {R}=&\sum _{\mathrm {(r,ps): C}}\frac{\mathcal {T}(\mathrm {r}\cap \mathrm {ps)}}{\mathcal {T}\mathrm {(r)}},\qquad \mathrm {FP}^\mathrm {R}=\sum _{\mathrm {(r,ps): C}}\mathrm {min(1,}\sum _{\mathrm {p:ps}}\frac{\mathcal {T}\mathrm {(p)}-\mathcal {T}(\mathrm {r}\cap \mathrm {p)}}{\mathcal {T}\mathrm {(r)}}\mathrm {)}, \nonumber \\ \mathrm {FN}^\mathrm {R}=&\mathopen |\mathrm {C}\mathclose |-\mathrm {TP}^\mathrm {R} \end{aligned}$$
(6)

Consequently, TP (FN) is the sum of the normalized durations of the correctly detected (undetected) parts of GTEs. The FP calculation is similar; however, the FP of each cluster cannot exceed 1.
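Sketches of Eqs. (5) and (6), reusing the earlier helpers; for simplicity they assume that, per assumption 5, the events within each of R and P do not overlap one another, so \(\mathcal {T}\mathrm {(r}\cap \mathrm {ps)}=\sum _{\mathrm {p:ps}}\mathcal {T}\mathrm {(r}\cap \mathrm {p)}\).

```python
def t_cap(r, ps):
    """T(r ∩ ps), assuming the PEs in ps do not overlap one another."""
    return sum(dur(inter(r, p)) for p in ps)

def total_duration_counts(R, P):
    """Eq. (5): time-frame style counts over absolute durations."""
    C, _ = cluster(R, P)
    tp = sum(t_cap(r, ps) for r, ps in C)
    fn = sum(dur(r) for r in R) - tp
    fp = sum(dur(p) for p in P) - tp
    return tp, fp, fn

def relative_duration_counts(R, P):
    """Eq. (6): the same idea, normalized per event."""
    C, _ = cluster(R, P)
    tp = sum(t_cap(r, ps) / dur(r) for r, ps in C)
    fp = sum(min(1.0, sum(dur(p) - dur(inter(r, p)) for p in ps) / dur(r))
             for r, ps in C)
    fn = len(C) - tp
    return tp, fp, fn
```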

Boundary Alignment (\(\mathbf{B} _{{\boldsymbol{t}}}\)) Property rewards TP when a GTE's boundaries precisely match the boundaries of its related PEs; otherwise, it loses some score through FN (underfill error, see footnote 1) or FP (overfill error, see footnote 1) [27]. This property concentrates only on the alignment error and relates to the needs identified in [26, 27]. The parameter t specifies the kind of alignment (start (\(\mathrm {B}_\mathrm {s}\)) or end (\(\mathrm {B}_\mathrm {e}\))).

$$\begin{aligned} \begin{aligned} \forall \mathrm {t:}\{\mathrm {start,end}\}\mathrm {:}\quad&\mathrm {fn}_1\mathrm {(r,ps)}=\ \mathbf{if} \;{ \mathrm {ps}\ne \emptyset }\mathbf{then} \; 1-\mathrm {e}^{-\beta _\mathrm {t}\frac{\mathrm {underfill}_\mathrm {t(r,ps)}}{\mathcal {T}\mathrm {(r)}}} \mathbf{else} \; 0 \\&\mathrm {fp}_1\mathrm {(r,ps)}=\ \mathbf{if} \;{ \mathrm {ps}\ne \emptyset }\mathbf{then} \;1-\mathrm {e}^{-\beta _\mathrm {t}\frac{\mathrm {overfill}_\mathrm {t(r,ps)}}{\mathcal {T}\mathrm {(r)}}} \mathbf{else} \;0 \\&\mathrm {TP}^{\mathrm {B}_\mathrm {t}}=\sum _{\mathrm {(r,ps):C}}\mathrm {max} \mathrm {(0,1}- \mathrm {fp}_1 \mathrm {(r,ps)}-\mathrm {fn}_1 \mathrm {(r,ps))}\\&\mathrm {FN}^{\mathrm {B}_\mathrm {t}}=\sum _{\mathrm {(r,ps):C}}\ \mathrm {fn}_1\mathrm {(r,ps)}, \qquad \mathrm {FP}^{\mathrm {B}_\mathrm {t}}=\sum _{\mathrm {(r,ps):C}}\ \mathrm {fp}_1\mathrm {(r,ps)} \end{aligned} \end{aligned}$$
(7)

Accordingly, the TP of each cluster is adjusted by the alignment error between predictions and ground truths. In addition, errors increase exponentially (adjustable via \(\beta _t\)) with the distance between the boundaries of PEs and GTEs. Increasing the parameter \(\beta _t\) penalizes longer positional errors more heavily.
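A sketch of Eq. (7) under our reading of underfill/overfill from [27]: on side t, \(\mathrm {underfill}_\mathrm {t}\) (\(\mathrm {overfill}_\mathrm {t}\)) is the gap by which the outermost PE edge falls inside (outside) the GTE boundary. `cluster` and `dur` are reused from above.

```python
import math

def boundary_counts(R, P, beta=2.0, side='start'):
    """Eq. (7): boundary-alignment counts for side in {'start', 'end'}."""
    C, _ = cluster(R, P)
    tp = fp = fn = 0.0
    for r, ps in C:
        if side == 'start':
            gap = min(p[0] for p in ps) - r[0]   # >0: underfill, <0: overfill
        else:
            gap = r[1] - max(p[1] for p in ps)   # >0: underfill, <0: overfill
        underfill, overfill = max(0.0, gap), max(0.0, -gap)
        fn1 = 1 - math.exp(-beta * underfill / dur(r))
        fp1 = 1 - math.exp(-beta * overfill / dur(r))
        tp += max(0.0, 1 - fp1 - fn1)
        fn += fn1
        fp += fp1
    return tp, fp, fn
```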

Precision, Recall, and F-Score are calculated with the well-known formulas below, using the TPs, FPs, and FNs defined earlier for each AR property.

$$\begin{aligned}&\forall \mathrm {f} \in \{\mathrm {D},\mathrm {T},\mathrm {R},\mathrm {B}_\mathrm {s},\mathrm {B}_\mathrm {e},\mathrm {U}\}\mathrm {:}\text { //Abbreviations of the properties}\\&\mathrm {Recall}^\mathrm {f}=\frac{\mathrm {TP}^\mathrm {f}}{\mathrm {TP}^\mathrm {f}+\mathrm {FN}^\mathrm {f}},\qquad \mathrm {Precision}^\mathrm {f}=\frac{\mathrm {TP}^\mathrm {f}}{\mathrm {TP}^\mathrm {f}+\mathrm {FP}^\mathrm {f}},\qquad \mathrm {F}_{1}^\mathrm {f}=2\,\frac{\mathrm {Precision}^\mathrm {f}\cdot \mathrm {Recall}^\mathrm {f}}{\mathrm {Precision}^\mathrm {f}+\mathrm {Recall}^\mathrm {f}}\nonumber \end{aligned}$$
(8)
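The final step is mechanical. The sketch below combines the counts defined above; the events in the usage example are toy values for illustration only.

```python
def prf(tp, fp, fn):
    """Eq. (8): precision, recall and f1 from (possibly fractional) counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: two GTEs, one fragmented/offset prediction and one orphan PE.
R = [(0, 10), (20, 30)]
P = [(2, 7), (8, 12), (40, 45)]
for name, counts in [("D", detection_counts(R, P)),
                     ("U", uniformity_counts(R, P)),
                     ("T", total_duration_counts(R, P)),
                     ("R", relative_duration_counts(R, P)),
                     ("B_s", boundary_counts(R, P, side='start')),
                     ("B_e", boundary_counts(R, P, side='end'))]:
    print(name, prf(*counts))
```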

Computation Complexity of the presented formulas is \(O(\mathopen |R\mathclose |\times \mathopen |P\mathclose |)\), because the elements of both sets P and R are iterated. Since each element of R needs only its related PEs, an interval tree helps us optimize this to \(O(\mathopen |R\mathclose |log\mathopen |R\mathclose |+\mathopen |P\mathclose |log\mathopen |P\mathclose |)\). In the case where P and R are sorted by time, the complexity can be reduced to \(O(\mathopen |R\mathclose |+\mathopen |P\mathclose |)\) by exploiting the time relationships of P and R, as sketched below.
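One possible linear-time variant of the `cluster` sketch, assuming both lists are sorted by start time and that the events within each list do not overlap one another (assumption 5):

```python
def cluster_sorted(R, P):
    """O(|R| + |P|) clustering for sorted, internally non-overlapping R, P."""
    C, matched, j = [], [False] * len(P), 0
    for r in R:
        while j < len(P) and P[j][1] <= r[0]:   # PE ends before r starts
            j += 1
        ps, k = [], j
        while k < len(P) and P[k][0] < r[1]:    # PE starts before r ends
            ps.append(P[k])
            matched[k] = True
            k += 1
        if ps:
            C.append((r, ps))
    orphans = [p for p, m in zip(P, matched) if not m]
    return C, orphans
```

A merged PE that spans several GTEs is revisited once per GTE it covers, so the bound is linear in the size of the input plus the output.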

4 Experimental Results

This section presents an experimental study of our metric. The first experiment is done on small, visualizable data. The second one compares two algorithms on a real-world dataset. The parameters of each property of our metric are set as follows. The thresholds \(\theta _{\mathrm {tp}}\) and \(\theta _{\mathrm {fp}}\) are needed for an appropriate detection property. In these experiments, if a PE has any overlap with a GTE (\(\theta _{\mathrm {tp}}=0\)), we consider it a TP; additionally, if the incorrect part of a PE is longer than the related GTE's duration (\(\theta _{\mathrm {fp}}=1\)), we consider it an FP. We also use \(\beta _t=2\) to obtain a near-linear boundary error. The code and datasets are available in our repository at https://github.com/modaresimr/AR-MME-EVAL.

Fig. 4.

Ground truths and output of two algorithms used in [27].

Table 2. Details of our metric for the algorithms of Fig. 4. The spider chart (right image) shows the f1-score of each property for those algorithms.

Our Proposed Metric on Small Data is explored in this experiment for simplicity in visualization. The data contains a subset of the 13 relations between two intervals in Allen's interval algebra [21]. The data and our metrics' outputs are illustrated in Fig. 4 and Table 2. Clearly, more PEs of Alg.a are incorrectly predicted than of Alg.b in Fig. 4, while the number of undetected GTEs is the same. The precision and recall of the detection measurement confirm this observation. The uniformity of Alg.b is higher than that of Alg.a, since most of the GTEs in Alg.b are detected by a single PE instead of multiple fragmented PEs. For the total duration measurement, we can see that the correctly predicted time frames (TP) of Alg.b exceed those of Alg.a, while the inverse holds for the incorrect ones. The relative duration normalizes events individually and then applies the total duration measurement; it shows that Alg.b predicts a larger part of each recognized concept than Alg.a. Since the concepts' durations are similar, the total duration shows a similar result. In the boundary measurement, we can observe that almost all predictions of Alg.a cover the end boundary of the GTEs. Therefore, the end parts of all GTEs are well detected (recall = 0.99); however, some parts of the predictions extend beyond the end of the GTEs' boundaries and are incorrectly predicted (precision = 0.78).

Our Proposed Metric on a Public Dataset is explored in this experiment. We compare a non-overlapping sliding time window of 30 s (SW, see footnote 2) with a Hierarchical Hidden Markov Model (H-HMM) [2] to show how our metric works. The WSU CASAS Home1 dataset [13], which contains 32 sensors, 400,000 events, and about 3000 durative concepts (activities), is used in this experiment. We use its first 20% for testing and the remainder for training (see footnote 3). We then evaluate the effectiveness on the take medicine activity and the macro average over all classes (see footnote 4). We compare the metrics of [26] and [27] with ours. The issues of classifier metrics were discussed in Sect. 2.

Table 5 (b) shows that 50% of the time, the HHMM algorithm does not detect the concepts, and 29% of the time it cannot detect the start boundary, while almost none of its predictions are incorrect. For the SW algorithm, it shows great performance, except that around 16% of the time the prediction is fragmented. However, our metric (Table 3) shows that this observation is incomplete. Analysing the data shows that the total duration of 5% of the concepts is equal to that of all the others. Therefore, they dominate the system's quality when using time-frame metrics (e.g., Ward's time metrics) and classifier metrics (see footnote 5). Table 5 (a) helps to understand more about the predictions from an event-analysis perspective. It shows that 28% and 40% of the predictions of the SW and HHMM algorithms, respectively, are incorrect (in contrast to the observation from Table 5 (b)). However, almost all of the concepts are recognized by the SW algorithm, while nearly half of them are not recognized at all by the HHMM algorithm. It also shows that the predicted concepts of both the HHMM and SW algorithms are mostly uniform (have few fragmented or merged predictions). These observations are clearly shown by our detection and uniformity properties in Table 3. Our proposed metric also correctly shows the quality of detecting the boundaries of concepts, while Table 5 (b) displays this information only in aggregate. Since the duration of this class is much less than the total duration of the dataset, while the class constitutes 13% of the concepts in the dataset, the last four errors in Table 5 (b) are close to zero. The relative duration property in Table 3 shows that SW either recognizes a whole ground-truth concept (recall = 0.92) or does not recognize the concept at all; however, its predictions exceed the boundaries (precision < 0.6).

Table 3. Our metric and the spider chart of f1 over two algorithms for one class.
Table 4. Tatbul's metric [26] with several parameters and its f1 chart for one class.
Table 5. Ward's proposed metrics for evaluating the two algorithms for one class.

Table 4 shows the metric proposed in [26] with different parameters. We can observe that the \(\gamma \) function, which accounts for fragmented and merged predictions, has a small effect on recall and precision. As observable from our uniformity property in Table 3, the predictions of both algorithms are uniform, but HHMM works better; this observation cannot be captured by Tatbul's metric. As analysed at the end of Sect. 2, the main issue of Tatbul's metric is that recall and precision are not calculated within the same model and cannot be used as complements (e.g., changing the \(\alpha \) parameter affects only recall). Lastly, the \(\delta \) parameter in Table 4 is proposed by the authors to account for boundary alignment errors; however, changing it does not produce significant changes in recall and precision, while our boundary properties (Table 3) clearly convey the situation of the predictions. This experiment ends with Table 6, which compares the macro average of our metric across all classes of this dataset.

Table 6. Macro average of all classes by our metric over two algorithms.

5 Conclusions

In general, activity events in AR are durative. Choosing an appropriate evaluation metric is an essential step in comparing AR systems. However, in the absence of an appropriate one, researchers often use time-frame, event-based, or classifier performance, which can misleadingly present convincing performance for an AR system. This paper proposes a new mathematical model for evaluating AR algorithms which is expressive (capturing several properties of an AR algorithm, such as detection, total duration, relative duration, boundary alignment, and uniformity), customizable (the adjustable parameters can support a wide range of applications and can give more weight to some properties of AR algorithms), and extensible (adding a new property is straightforward and independent of the others). Although our method gives more meaningful information about AR algorithms, its computational complexity remains linear in the size of the predictions and ground truths. Our metric has been tested on several datasets, and its ability to measure different properties of AR algorithms has been shown. One exciting outcome of this formulation is the possibility of generating a profile (in terms of properties) for each algorithm; therefore, it can be used as a heuristic for faster algorithm selection, which will be explored further in future research. We are also interested in including fuzziness in our properties.