
1 Introduction

Human behaviour analysis is an important research area in computer vision. Automatically understanding what people do by analyzing visual streams recorded from surveillance cameras is a challenging task: it implies recognizing the activities of the people of interest, the environment where they operate, the other people with whom they interact, the objects they manipulate and even their future intentions. While much progress has been made in this area, recent works [1] have demonstrated that the traditional “third-person” perspective (i.e. employing fixed cameras mounted around the user’s environment) may be insufficient for understanding people’s activities and intentions, and that wearable cameras can provide an alternative or complementary source of information. Wearable cameras can be employed in many different applications, such as driver assistance systems, monitoring assembly operations in manufacturing, ambient assisted living and, more recently, the so-called “life-logging” [2, 3] (i.e. a first-person camera continuously recording a whole day of its wearer’s life).

Fig. 1. Overview of our proposed multi-task clustering approach for First Person Vision activity recognition (best viewed in color).

In this paper, we focus specifically on everyday activity recognition from a “first-person” vision (FPV) perspective. This problem poses several challenges. Wearable cameras typically record several hours of video, generating a large amount of data for which labels are not available, as annotation would require a massive human effort. Thus, accurate recognition requires algorithms which are both scalable and able to operate in an unsupervised setting. Moreover, designing effective visual feature representations in this unconstrained FPV scenario is much more challenging than in the case of fixed cameras. In this paper, we propose to address the problem of everyday activity recognition from unlabeled visual data within a multi-task learning framework. When considering the tasks of recognizing the activities of daily living of many individuals, it is natural to assume that these tasks are related. For example, people working in an office environment tend to perform the same kinds of activities (e.g. typing on a keyboard in front of a personal computer, reading and writing documents). Similarly, most people drink coffee and brush their teeth when they wake up in the morning. Thus, it is intuitive that, when performing activity recognition, learning from the data of all the individuals simultaneously is advantageous with respect to considering each person separately. However, the data distributions of the single tasks can differ, since visual data corresponding to different people may exhibit different features. In particular, if there are limited data for a single person, typical clustering methods may fail to discover the correct clusters. In this case, using data from other people as an auxiliary source of information can improve clustering results. However, simply pooling data from different people and applying a traditional clustering approach does not necessarily increase accuracy, because the data distributions of the single tasks can differ, violating i.i.d. assumptions. To address this problem, we invoke the novel paradigm of multi-task clustering (MTC). Specifically, we introduce two novel methods derived from a common framework based on the minimization of an objective function balancing two terms: one ensures that the data of each single task are clustered appropriately, the other enforces coherence between the clustering results of related tasks. We demonstrate the effectiveness of our approaches on two recent FPV datasets, the FPV activity of daily living dataset [3] and the coupled ego-motion and eye-motion dataset introduced in [4], comparing them with several single-task and multi-task learning methods. Figure 1 depicts an overview of the proposed method.

The main contributions of this work are the following: (i) To our knowledge, this paper is the first to address the problem of everyday activity recognition within a MTC framework. While our framework can also be used to analyze visual streams recorded from fixed cameras, we tackle the more challenging scenario of egocentric vision. (ii) The two proposed multi-task clustering approaches are novel, and two efficient algorithms are derived for solving the associated optimization problems. (iii) Our experimental evaluation demonstrates that, independently of the adopted feature representations, a multi-task learning framework is greatly advantageous for FPV activity recognition with respect to traditional single-task approaches. (iv) The proposed MTC algorithms are general and can be applied to many other computer vision and pattern recognition problems.

2 Related Work

Activity Recognition in Egocentric Videos. Analysing human behavior from data collected with wearable devices has received considerable attention recently, not only in computer vision but also in related research areas, e.g. ubiquitous computing [5, 6]. While many recent works are based on RFID tags or inertial sensors, systems based on first-person cameras still play an important role, being generally cheap and easy to deploy. Aghazadeh et al. [7] considered the problem of discovering anomalous events by analysing the video stream captured by a small video camera attached to a person’s chest. In [2] a summarization approach targeted to egocentric videos is presented. Fathi et al. [8] introduced a method for detecting social interactions in first-person videos collected during social events. Some recent works have faced the multiple challenges of recognizing complex activities of everyday life from an egocentric perspective in different scenarios (e.g. kitchen, office, home) [3, 4, 9, 10]. In [3] the authors demonstrated the importance of features based on object detectors’ outputs in the challenging unconstrained scenario of everyday at-home activity recognition. In [9] RGB-D sensors are employed for fine-grained recognition of kitchen activities. In [4] the task of recognizing egocentric activities in an office environment is considered, and motion descriptors extracted from an outside-looking camera are used jointly with features modeling the eye movements of the wearer captured by an inside-looking camera. In [10] activity recognition in a kitchen scenario (multiple subjects preparing different recipes) is considered, and a codebook learning framework is proposed to alleviate the problem of large within-class data variability due to the different execution styles and speeds of different subjects.

In this paper, we address the problem of analysing activities of daily living from the perspective of multi-task learning. Multi-task learning methods have been previously investigated in the context of vision-based activity recognition from fixed cameras and in a supervised setting [11–13]. In this paper, we consider the more challenging scenario where no annotated data are provided, which is typical when analyzing visual streams from wearable cameras.

Multi-task Learning. Recently, multi-task learning (MTL) approaches [14] have demonstrated their effectiveness in several computer vision applications, such as object detection [15], indoor localization [16], face verification [17] and head pose estimation [18]. The idea of multi-task learning is simple: learning from the data of multiple related tasks simultaneously produces more accurate classification and regression models than learning on each single task independently. While many works have introduced supervised MTL approaches, only a few have considered an unsupervised setting [19–21], i.e. the scenario where all the data are unlabeled and the aim is to predict the cluster labels in each single task. Gu et al. [19] proposed to learn a subspace shared by all the tasks, through which the knowledge of one task can be transferred to the others. Zhang et al. [21] introduced a MTC approach based on a pairwise agreement term which encourages coherence among the clustering results of multiple tasks. In [20] the \(k\)-means algorithm is revisited from a Bayesian nonparametric viewpoint and extended to MTL.

In this paper, we propose two novel approaches for multi-task clustering. The first is inspired by the work in [21], but it is based on a different objective function and thus on a radically different optimization algorithm; furthermore, in the considered application, it provides superior accuracy with respect to [21]. Our second approach instead allows prior knowledge about the tasks and the data of each task (e.g. temporal consistency among subsequent video clips) to be easily integrated. Moreover, it relies on a convex optimization problem, thus avoiding the local-minima issues of previous methods [19–21].

3 Multi-task Clustering for FPV Activity Recognition

In this paper, we focus on the problem of everyday activity recognition from wearable cameras. More specifically, we consider several video clips collected by a certain number of people involved in activities of daily living. No labeled data are provided. We only assume that people perform roughly the same tasks, a very reasonable assumption in the context of everyday activity analysis. Considering each individual’s data as a specific task, we propose a MTC approach. To stress the generality of our method, we apply it in two different scenarios: an office environment, where people are involved in typical activities such as browsing the web or writing documents, and a home environment, where a chest-mounted camera records users’ activities such as opening a fridge or preparing tea. To perform experiments we use two publicly available datasets corresponding to the scenarios described above: the FPV office dataset introduced in [4] and the FPV activity of daily living dataset described in [3]. Both datasets contain visual streams recorded from an outside-looking wearable camera, while the office dataset also has information about eye movements acquired by an inside-looking camera. Further details about the datasets are provided in the experimental section. In the following we describe the adopted feature descriptors and the proposed MTC framework.

Fig. 2. Feature extraction pipeline on the FPV office dataset. Some frames corresponding to the actions read, browse and copy are shown together with the corresponding optical flow features (top) and eye-gaze patterns depicted on the 2-D plane (bottom). It is interesting to observe the different gaze patterns among these activities.

3.1 Features Extraction in Egocentric Videos

Due to the large variability of visual data collected from wearable cameras, there exist no standard feature descriptors; rather, different representations are adopted depending on the context. While in some situations extracting motion information by computing optical flow vectors may suffice [4], in other cases motion patterns may be too noisy and other kinds of information (e.g. the presence/absence of objects) must be exploited. In this paper we demonstrate that, independently of the employed feature descriptors, MTC is an effective strategy for recognizing everyday activities. We now describe the adopted feature representations for the office and home scenarios, respectively.

FPV office dataset. We follow [4] and extract features describing both the eye motion (obtained from the inside-looking camera) and the head and body motion (computed by processing the outside-looking camera’s stream). To calculate the eye motion features, we consider the gaze coordinates provided with the dataset and smooth them with a median filter. The continuous wavelet transform is then adopted for saccade detection, separately on the \(x\) and \(y\) motion components [22]. The resulting signals are quantized according to magnitude and direction and coded as a sequence of discrete symbols. To analyze the stream of the outside-looking camera, for each frame the global optical flow is computed by tracking corner points over consecutive frames and taking the mean flow in the \(x\) and \(y\) directions. The optical flow vectors are then quantized according to magnitude and direction with the same procedure adopted in the eye motion case. The obtained sequences of symbols are then processed to get the final video clip descriptors. We use a temporal sliding window approach to build an \(n\)-gram dictionary over the whole dataset. Each video is then divided into 15 s segments, each representing a video clip. For each sequence of symbols associated with a video clip, a histogram over the dictionary is computed. The final feature descriptor \(\mathbf {x}_i\) is obtained by computing statistics over the clip histogram: the maximum, the average, the variance, the number of unique \(n\)-grams, and the difference between the maximum and minimum counts. Figure 2 shows the feature extraction pipeline.
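As an illustration of this pipeline, the sketch below quantizes 2-D motion vectors into discrete symbols and computes the histogram statistics described above. The bin counts, the \(n\)-gram length and the magnitude cap are illustrative assumptions, not the settings of [4].

```python
import numpy as np

def quantize_motion(vectors, n_mag_bins=4, n_dir_bins=8, max_mag=10.0):
    """Code each 2-D motion vector (optical flow or saccade) as a discrete
    symbol, quantizing by magnitude and direction."""
    mag = np.clip(np.linalg.norm(vectors, axis=1), 0.0, max_mag - 1e-9)
    ang = np.arctan2(vectors[:, 1], vectors[:, 0]) % (2 * np.pi)
    m = (mag / max_mag * n_mag_bins).astype(int)
    d = (ang / (2 * np.pi) * n_dir_bins).astype(int)
    return m * n_dir_bins + d          # one symbol id per frame

def clip_descriptor(symbols, n=3):
    """n-gram histogram statistics for one 15 s clip: max, mean, variance,
    number of unique n-grams, and max-min count."""
    hist = {}
    for i in range(len(symbols) - n + 1):
        g = tuple(symbols[i:i + n])
        hist[g] = hist.get(g, 0) + 1
    h = np.array(list(hist.values()), dtype=float)
    return np.array([h.max(), h.mean(), h.var(), len(h), h.max() - h.min()])

# e.g. flow = np.random.randn(375, 2)  # ~15 s of per-frame mean flow at 25 fps
# x_i = clip_descriptor(quantize_motion(flow))
```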

Fig. 3. FPV home dataset: frames depicting examples of the activities making cold food/snack and making tea, together with the detected objects.

FPV home dataset. We adopt the same object-centric approach proposed in [3], i.e. to compute features for each video clip we consider the outputs of several object detectors. More specifically, we use the pre-segmented video clips and the active object models of [3]. Active object models exploit the fact that objects look different when being interacted with (e.g. an open versus a closed fridge): in [3], additional detectors are trained using a subset of training images depicting object appearance while the objects are used by people. Figure 3 shows some frames associated with the activities making cold food/snack and making tea, with the outputs of the object detectors depicted. To obtain object-centric features, for each frame a score is computed for each object model and each location; the maximum scores of all the object models are then used as frame features. To compute the final clip descriptors \(\mathbf x _i\), two approaches are adopted: one based on “bag of features” (accumulating frame features over time) and the other based on temporal pyramids. The temporal pyramid features are obtained by concatenating several histograms constructed by accumulation: the first is a histogram over the full temporal extent of a video clip, the next is the concatenation of two histograms obtained by temporally segmenting the video into two parts, and so on.
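A minimal sketch of the two clip descriptors, assuming the per-frame object scores are given as a \(T \times n_{objects}\) array (the number of pyramid levels is an illustrative choice):

```python
import numpy as np

def bag_of_features(frame_scores):
    """Accumulate per-frame max detector scores over the whole clip."""
    return frame_scores.sum(axis=0)

def temporal_pyramid(frame_scores, levels=3):
    """Concatenate accumulated histograms over a temporal pyramid: level l
    splits the clip into 2**l equal segments, accumulated separately."""
    T = len(frame_scores)
    feats = []
    for l in range(levels):
        n_seg = 2 ** l
        for s in range(n_seg):
            seg = frame_scores[T * s // n_seg : T * (s + 1) // n_seg]
            feats.append(bag_of_features(seg))
    return np.concatenate(feats)

# frame_scores: (n_frames x n_objects) per-frame max detection scores
# x_i = temporal_pyramid(frame_scores)  # (1 + 2 + 4) * n_objects dimensions
```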

3.2 Multi-task Clustering

We consider \(T\) related tasks corresponding to \(T\) different people. For each task \(t\), a set of data samples \(X^t = \{\mathbf {x}^t_{1}, \mathbf {x}^t_{2}, ... ,\mathbf {x}^t_{N_t}\}\) is available, where \(\mathbf {x}_j^{t} \in I\!R^d\) is the \(d\)-dimensional feature vector describing the \(j\)-th video clip and \(N_t\) is the total number of samples associated with the \(t\)-th task. In the following we denote with \((\cdot )'\) the transpose operator, \(N=\sum _{t=1}^T N_t\) is the total number of datapoints, while \(\mathbf {X} \in I\!R^{N \times d}\), \(\mathbf {X}=[{\mathbf {X}^1}' \ {\mathbf {X}^2}' \ \dots \ {\mathbf {X}^T}']'\) is the data matrix obtained by concatenating the task-specific matrices \(\mathbf {X}^t = [{\mathbf {x}^t_{1}} \ {\mathbf {x}^t_{2}} \ ... \ {\mathbf {x}^t_{N_t}}]' \ \in I\!R^{N_t \times d}\). To discover people’s activities, we want to segment each video into parts, i.e. we want the data in the set \(X^t\) to be grouped into \(K_t\) clusters, where the number of required partitions can differ across tasks. Furthermore, as we assume the tasks to be related, we also require that the resulting partitions are consistent with each other; this is reasonable in the context of everyday activity recognition, where people perform roughly the same activities. This can be obtained by defining the following optimization problem:

$$\begin{aligned} \min _{\varvec{\Theta }^t} \ \ \sum _{t=1}^T \varLambda (\mathbf {X}^t,\varvec{\Theta }^t)+ \sum _{t=1}^T\sum _{s=t+1}^T R(\varvec{\Theta }^t, \varvec{\Theta }^s) \end{aligned}$$
(1)

Problem (1) corresponds to the general problem of multi-task clustering: the term \(\varLambda (\cdot )\) represents a reconstruction error which must be minimized by learning the optimal task-specific model parameters \(\varvec{\Theta }^t\) (typically the cluster centroids and the associated assignment matrix), while \(R(\cdot )\) is an “agreement” term imposing that, since the multiple tasks are related, the associated model parameters should also be similar. Within this framework, we propose two different approaches for MTC which differ mainly in the definition of the agreement term. In the following subsections we present them in detail.

Notation. In the following, \(\mathbf {A}_{i.}\) and \(\mathbf {A}_{.j}\) denote respectively the \(i\)-th row and the \(j\)-th column of the matrix \(\mathbf {A}\).

3.3 Earth Mover’s Distance Multi-task Clustering

Given the task data matrices \(\mathbf {X}^t\), we are interested in finding the centroid matrices \(\mathbf {C}^t \in I\!R^{K_t \times d}\) and the cluster indicator matrices \(\mathbf {W}^t \in I\!R^{N_t \times K_t}\) by solving the following optimization problem:

$$\begin{aligned} \min _{\mathbf {C}^t,\, \mathbf {W}^t,\, f^{st}_{ij} \ge 0} \ \ \sum _{t=1}^T \left\| \mathbf {X}^t - \mathbf {W}^t \mathbf {C}^t \right\| _F^2 + \lambda \sum _{t=1}^T \sum _{s=t+1}^T \sum _{i=1}^{K_t} \sum _{j=1}^{K_s} f^{st}_{ij} \left\| \mathbf {C}^t_{i.} - \mathbf {C}^s_{j.} \right\| _2^2 \end{aligned}$$
(2)

The first term in the objective function is a relaxation of the traditional \(k\)-means objective function for \(T\) separate data sources. The second term, i.e. the agreement term, is added to explore the relationships between the clusters of different data sources. It consists of the popular Earth Mover’s Distance (EMD) [23] computed on the signatures \(\mathcal {T}\) and \(\mathcal {S}\) obtained by clustering the data associated with tasks \(t\) and \(s\) separately, i.e. \(\mathcal {T}=\{(\mathbf {C}^t_{1.},w_t^1), \ \dots , \ (\mathbf {C}^t_{K_t .},w^{K_t}_t) \}\), \(w_t^i=\sum _{n = 1}^{N_t } \mathbf {W}^t_{ni}\), and \(\mathcal {S}=\{(\mathbf {C}^s_{1.},w_s^1), \ \dots , \ (\mathbf {C}^s_{K_s .},w^{K_s}_s) \}\), \(w_s^i=\sum _{n = 1}^{N_s } \mathbf {W}^s_{ni}\). Here \(\mathbf {C}^t_{i.}\) and \(\mathbf {C}^s_{j.}\) are the cluster centroids, while \(w^t_i\) and \(w^s_i\) denote the weights associated with each cluster (roughly reflecting the number of datapoints in each cluster). In effect, the second term is a sum of distances between distributions, and minimizing it enforces consistency between the partitions of related tasks. The variables \(f^{st}_{ij}\) are flow variables, as follows from the definition of EMD as a transportation problem [23].
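For intuition, the following sketch evaluates the agreement term for two fixed signatures by solving the underlying transportation problem with an off-the-shelf LP solver. A squared-Euclidean ground distance is used, matching the \(\text{tr}(\mathbf {MPXX'P'M'F})\) term below; in the actual method the flows and centroids are optimized jointly, so this is illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

def emd_agreement(C_t, w_t, C_s, w_s):
    """EMD between two cluster signatures (centroid rows, cluster weights),
    solved as a balanced transportation LP over the flows f_ij."""
    Kt, Ks = len(w_t), len(w_s)
    # squared-Euclidean ground distance between centroids
    D = ((C_t[:, None, :] - C_s[None, :, :]) ** 2).sum(-1).ravel()
    # marginal constraints: row sums of f match w_t, column sums match w_s
    A = np.zeros((Kt + Ks, Kt * Ks))
    for i in range(Kt):
        A[i, i * Ks:(i + 1) * Ks] = 1.0
    for j in range(Ks):
        A[Kt + j, j::Ks] = 1.0
    b = np.concatenate([w_t / w_t.sum(), w_s / w_s.sum()])  # normalized
    res = linprog(D, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
    return res.fun   # value of the agreement term for this task pair
```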

In (2) there are no constraints on the \(\mathbf {C}^t\) values. In this paper we define the matrix \(\mathbf {C} \in I\!R^{K \times d}\), \(\mathbf {C}=[{\mathbf {C}^1}' \dots {\mathbf {C}^T}']'\), \(K=\sum _{t=1}^T K_t\), and we impose that the rows of \(\mathbf {C}\) are weighted sums of the data points, i.e. \(\mathbf {C} = \mathbf {PX}\), where \(\mathbf {P}=\) blkdiag\((\mathbf {P}^1, \dots ,\mathbf {P}^T)\), \(\mathbf {P} \in I\!R^{K\times N} \). In the following, for the sake of simplicity and easy interpretation, we consider only a two-task problem; the extension to \(T\) tasks is straightforward. Defining \(\mathbf {F}=\) diag\((f_{11} \dots f_{K_1 K_2})\), \(\mathbf {F}\in I\!R^{K_1 K_2 \times K_1 K_2}\), and the block diagonal matrix \(\mathbf {W} = \) blkdiag\((\mathbf {W}^1,\mathbf {W}^2)\), \(\mathbf {W} \in I\!R^{N \times K}\), the optimization problem (2) becomes:

$$\begin{aligned} \varDelta (\mathbf {P},\mathbf {W},\mathbf {F} )= \min _{\mathbf {P},\mathbf {W},\mathbf {F} \ge 0 }&\left\| {\mathbf {X} - \mathbf {WPX}} \right\| _F^2 + \lambda \text{ tr }(\mathbf {MPXX' P' M' F}) \\ \text{ s.t. } \quad&\Vert {\mathbf {P}^t_{i.} } \Vert _1 = 1 , \ \ \forall i=1,\dots ,K_t, \ \ \forall t=1,2 \end{aligned}$$
(3)
$$\begin{aligned}&\text{ tr }(\mathbf {I}_j \mathbf {F}) = \sum \limits _{i = 1}^{N } \mathbf {W}_{ij}, \ \ \forall j = 1,\dots ,K \\&\text{ tr }(\mathbf {F}) = 1 \end{aligned}$$
(4)

where \(\mathbf {I}_j\in I\!R^{K_1 K_2 \times K_1 K_2} \) and \(\mathbf {M}\in I\!R^{K_1 K_2 \times K}\) are appropriately defined selection matrices: \(\mathbf {I}_j\) is a diagonal 0/1 matrix selecting the flow variables involving cluster \(j\), while each of the \(K_1 K_2\) rows of \(\mathbf {M}\) contains a \(1\) in one of the first \(K_1\) columns (a cluster of the first task) and a \(-1\) in one of the last \(K_2\) columns (a cluster of the second task), enumerating all cluster pairs, so that the rows of \(\mathbf {MC}\) are the centroid differences \(\mathbf {C}^1_{i.} - \mathbf {C}^2_{j.}\).

Optimization. To solve problem (3), we first note that its optimal solution can be found by adopting an alternating optimization scheme, i.e. optimizing (3) first with respect to \(\mathbf {P}\) and then with respect to \(\mathbf {W}\) and \(\mathbf {F}\) jointly. In both cases, a non-negative least squares problem with constraints arises, for which standard solvers can be employed. However, for computational efficiency, in this paper we consider an approximation of (3), replacing the constraints (4) with \(\text{ tr }(\mathbf {I}_j \mathbf {F}) = \mathbf {e}_j\), where \(\mathbf {e}\in I\!R^{K}\), \(\mathbf {e}_j=\frac{1}{K_1}\) if \(j \le K_1\), \(\mathbf {e}_j=\frac{1}{K_2}\) otherwise. This approximation implies that, for each task, the same number of datapoints is assigned to each cluster, and allows a more efficient solver to be devised. Specifically, we adopt an alternating optimization strategy, i.e. we optimize (3) separately with respect to \(\mathbf {F}\), \(\mathbf {W}\) and \(\mathbf {P}\) until convergence. The procedure is summarized in Algorithm 1.

Algorithm 1. Alternating optimization for EMD multi-task clustering (EMD-MTC).
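A rough, illustrative sketch of this alternating scheme for two tasks is given below, under two simplifying assumptions: the centroids \(\mathbf {C}^t\) are updated directly in closed form (the paper instead keeps the \(\mathbf {C}=\mathbf {PX}\) parametrization, which enables kernelization), and the \(\mathbf {W}\)-step is a plain row-wise non-negative least squares. It is a sketch of the idea, not the paper’s exact solver.

```python
import numpy as np
from scipy.optimize import nnls, linprog

def emd_mtc_two_tasks(X1, X2, K1, K2, lam=0.1, n_iter=20, seed=0):
    """Illustrative alternating scheme for two tasks: F-step (transportation
    LP with the fixed marginals e of the approximation), W-step (row-wise
    NNLS), C-step (joint closed-form centroid update)."""
    rng = np.random.default_rng(seed)
    C1 = X1[rng.choice(len(X1), K1, replace=False)].astype(float)
    C2 = X2[rng.choice(len(X2), K2, replace=False)].astype(float)
    e = np.concatenate([np.full(K1, 1.0 / K1), np.full(K2, 1.0 / K2)])
    A = np.zeros((K1 + K2, K1 * K2))       # transportation constraints
    for i in range(K1):
        A[i, i * K2:(i + 1) * K2] = 1.0
    for j in range(K2):
        A[K1 + j, j::K2] = 1.0
    for _ in range(n_iter):
        # F-step: min sum_ij f_ij ||C1_i - C2_j||^2 with marginals fixed to e
        D = ((C1[:, None, :] - C2[None, :, :]) ** 2).sum(-1)
        F = linprog(D.ravel(), A_eq=A, b_eq=e, bounds=(0, None),
                    method="highs").x.reshape(K1, K2)
        # W-step: min_{w >= 0} ||x_n - C' w||_2 for every datapoint
        W1 = np.array([nnls(C1.T, x)[0] for x in X1])
        W2 = np.array([nnls(C2.T, x)[0] for x in X2])
        # C-step: the objective is quadratic in (C1, C2); solve jointly
        S = np.block([
            [W1.T @ W1 + lam * np.diag(F.sum(1)), -lam * F],
            [-lam * F.T, W2.T @ W2 + lam * np.diag(F.sum(0))]])
        rhs = np.vstack([W1.T @ X1, W2.T @ X2])
        C = np.linalg.lstsq(S, rhs, rcond=None)[0]
        C1, C2 = C[:K1], C[K1:]
    return W1.argmax(1), W2.argmax(1)      # cluster labels for each task
```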

Kernelization. Finally, to kernelize the proposed method we consider a feature mapping \(\phi (\cdot )\) and the associated kernel matrix \(\mathbf { K_X=\phi (X) \phi (X)' }\). The objective function of (3) then becomes:

$$\begin{aligned} \left\| {\phi (\mathbf {X}) - \mathbf {WP} \, \phi (\mathbf {X}) } \right\| _F^2 + \lambda \text{ tr }(\mathbf {MP}\,\phi (\mathbf {X}) \phi (\mathbf {X})' \,\mathbf {P' M' F}) = \nonumber \\ \text{ tr }(\mathbf {K_X} - 2 \mathbf {K_X} \mathbf {P' W'} + \mathbf {W P} \mathbf {K_X} \mathbf {P' W'} + \lambda \mathbf {MP} \mathbf {K_X} \mathbf {P' M' F} ) \end{aligned}$$

The update rules of the kernelized version of our method can be derived as in the linear case, using \(\mathbf {K_X}\) in place of \(\mathbf {X X'}\).
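For reference, a direct numpy transcription of this kernelized objective (the function and variable names are ours):

```python
import numpy as np

def kernel_objective(K_X, W, P, M, F, lam):
    """Value of the kernelized objective:
    tr(K_X - 2 K_X P'W' + W P K_X P'W') + lam * tr(M P K_X P'M'F)."""
    WP = W @ P
    rec = (np.trace(K_X) - 2.0 * np.trace(K_X @ P.T @ W.T)
           + np.trace(WP @ K_X @ WP.T))
    agr = lam * np.trace(M @ P @ K_X @ P.T @ M.T @ F)
    return rec + agr
```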

3.4 Convex Multi-task Clustering

Given the task-specific training sets \(X^t\), we propose to learn the sets of cluster centroids \(\varPi ^t = \{\varvec{\pi }^t_{1}, \varvec{\pi }^t_{2}, ...,\varvec{\pi }^t_{N_t}\}, \varvec{\pi }^t_{i} \in I\!R^d\) (one centroid per datapoint), by solving the following optimization problem:

$$\begin{aligned} \min _{\varvec{\pi }^t_i} \sum _{t=1}^T ( \sum _{i=1}^{N_t}\Vert \mathbf {x}^t_i-\varvec{\pi }^t_i\Vert ^2_2+ \lambda _t \sum _{ \mathop {i,j=1}\limits _{j>i}}^{N_t} w^t_{ij}\Vert \varvec{\pi }^t_i-\varvec{\pi }^t_j\Vert _1 ) + \lambda _2 \sum \limits _{ \mathop {t,s=1}\limits _{s>t}}^T \gamma _{st} \sum _{i=1}^{N_t}\sum _{j=1}^{N_s}\Vert \varvec{\pi }^t_i-\varvec{\pi }^s_j\Vert ^2_2\quad \end{aligned}$$
(5)

In (5) the first two terms guarantee that the data of each task are clustered: with \(\lambda _t=0\) the found centroids are equal to the datapoints, while as \(\lambda _t\) increases the number of distinct centroids \(\varvec{\pi }^t_i\) decreases. The last term in (5) instead forces the found centroids to be similar if the tasks are related. The relatedness between tasks is modeled by the parameter \(\gamma _{st}\), which can be set using an appropriate measure between distributions. We consider the Maximum Mean Discrepancy \(\mathcal {D}(X^t,X^s)=\Vert \frac{1}{N_t}\sum _{i=1}^{N_t}\phi (\mathbf {x}_i^t)-\frac{1}{N_s} \sum _{i=1}^{N_s}\phi (\mathbf {x}_i^s) \Vert ^2\) [24], which we compute using a linear kernel, and we set \(\gamma _{st}= e^{-\beta \mathcal {D}(X^t,X^s)}\), with \(\beta \) a user-defined parameter (\(\beta =0.1\) in our experiments). The parameters \(w^t_{ij}\) are used to encourage datapoints of the same task to be assigned to the same cluster, and can be set according to some a priori knowledge or so that the found partition structure reflects the density of the original data distributions.
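With a linear kernel the MMD reduces to the squared distance between the empirical task means, so \(\gamma _{st}\) can be computed in a few lines (the function name is ours):

```python
import numpy as np

def task_relatedness(X_t, X_s, beta=0.1):
    """gamma_st = exp(-beta * MMD(X^t, X^s)); with a linear kernel the MMD
    is the squared distance between the empirical means of the two tasks."""
    mmd = np.sum((X_t.mean(axis=0) - X_s.mean(axis=0)) ** 2)
    return np.exp(-beta * mmd)
```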

Algorithm 2. ADMM-based optimization for convex multi-task clustering (CMTC).

Optimization. To solve (5) we propose an algorithm based on the alternating direction method of multipliers (ADMM) [25]. We consider the matrix \(\varvec{\Pi }=[{\varvec{\Pi }^1}' \ {\varvec{\Pi }^2}' \ \dots \ {\varvec{\Pi }^T}']'\), \(\varvec{\Pi } \in I\!R^{N \times d}\), obtained by concatenating the task-specific matrices \(\varvec{\Pi }^t = [\varvec{\pi }^t_{1} \ \varvec{\pi }^t_{2} \ ... \ \varvec{\pi }^t_{N_t}]'\). Problem (5) can then be solved by considering \(d\) separate minimization subproblems (one for each column of \(\mathbf {X}\)) as follows:

$$\begin{aligned} \min \nolimits _{\mathbf {q},~\varvec{\Pi }_{.j}}&\Vert \mathbf {X}_{.j}-\varvec{\Pi }_{.j}\Vert ^2_2+\Vert \mathbf {q} \Vert _1 + \lambda _2 \Vert \mathbf {B}\varvec{\Pi }_{.j}\Vert ^2_2 \nonumber \\ \text{ s.t. } \quad&\mathbf {E}\varvec{\Pi }_{.j} - \mathbf {q} = 0 \end{aligned}$$
(6)

where \(\mathbf {E}\) is a block diagonal matrix defined as \(\mathbf {E}\) = blkdiag(\(\mathbf {E}^1,\mathbf {E}^2,\dots ,\mathbf {E}^T\)) and \(\mathbf {E}^t \in I\!R^{|\mathcal {E}_t|\times N_t }\) is a matrix with \(|\mathcal {E}_t|= \frac{N_t(N_t-1)}{2}\) rows: the row associated with the pair \((i,j)\) is all zeros except in position \(i\), where it has the value \(\lambda _t w^t_{ij}\), and in position \(j\), where it has the value \(-\lambda _t w^t_{ij}\). Similarly, the matrix \(\mathbf {B}\in I\!R^{|\mathcal {B}|\times N}\), with \(|\mathcal {B}|=\frac{T(T-1)}{2}\), imposes smoothness between the parameters of related tasks: the row associated with the task pair \((t,s)\) is all zeros except in the entries corresponding to the datapoints of the \(t\)-th task, which are set to \(\gamma _{st}\), and in the entries corresponding to the datapoints of the \(s\)-th task, which are set to \(-\gamma _{st}\). To solve (6) we consider the associated augmented Lagrangian \(L_\rho (\varvec{\Pi }_{.j},\mathbf {q},\mathbf {p})\):

$$\begin{aligned} \Vert \mathbf {X}_{.j}-\varvec{\Pi }_{.j}\Vert ^2_2 + \Vert \mathbf {q} \Vert _1 + \lambda _2 \Vert \mathbf {B}\varvec{\Pi }_{.j}\Vert ^2_2 + \mathbf {p}' ({\mathbf {E}\varvec{\Pi }_{.j} - \mathbf {q}}) + \frac{\rho }{2}\left\| {\mathbf {E}\varvec{\Pi }_{.j} - \mathbf {q}} \right\| _2^2 \end{aligned}$$

with \(\mathbf {p}\) being the vector of augmented Lagrangian multipliers and \(\rho \) the dual update step length. We devise an ADMM-based algorithm in which three steps are performed, corresponding to the updates of the three variables \(\varvec{\Pi }_{.j}\), \(\mathbf {q}\) and \(\mathbf {p}\). We summarize our approach in Algorithm 2.
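In scaled form (with \(\mathbf {u}=\mathbf {p}/\rho \)), the three updates have simple expressions: a linear solve for \(\varvec{\Pi }_{.j}\), soft-thresholding for \(\mathbf {q}\) and a dual ascent step for \(\mathbf {u}\). A minimal sketch for one column subproblem, assuming \(\mathbf {E}\) and \(\mathbf {B}\) are built as described above and using a fixed iteration count instead of the usual convergence checks:

```python
import numpy as np

def soft_threshold(v, thr):
    """Proximal operator of the l1 norm (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def cmtc_admm_column(x, E, B, lam2, rho=1.0, n_iter=200):
    """ADMM (scaled form, u = p / rho) for one column subproblem (6):
    min ||x - pi||^2 + ||q||_1 + lam2 ||B pi||^2  s.t.  E pi = q."""
    N = x.shape[0]
    q = np.zeros(E.shape[0])
    u = np.zeros(E.shape[0])
    # the pi-update matrix is constant, so invert (or factor) it once
    H_inv = np.linalg.inv(2.0 * np.eye(N) + 2.0 * lam2 * B.T @ B
                          + rho * E.T @ E)
    for _ in range(n_iter):
        pi = H_inv @ (2.0 * x + rho * E.T @ (q - u))   # quadratic pi-step
        q = soft_threshold(E @ pi + u, 1.0 / rho)      # l1 prox q-step
        u = u + E @ pi - q                             # dual ascent p-step
    return pi
```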

Fig. 4. FPV office dataset. Temporal video segmentation on the second sequence of subject-3 (13 min): comparison of different methods (best viewed in color).

4 Experimental Results

4.1 Datasets and Experimental Setup

The growing interest of the vision community in novel approaches for FPV analysis has led to the creation of several publicly available datasets [2–4, 8]. In this paper we consider two of them: the FPV office dataset [4] and the FPV home dataset [3].

FPV office dataset [4]. This dataset consists of five activities which frequently occur in an office environment (reading a book, watching a video, copying text from screen to screen, writing sentences on paper and browsing the internet). Each action was performed by five subjects, who were instructed to execute each task for about two minutes, while 30 s intervals of a void class were placed between target tasks. To provide a natural experimental setting, the void class contains a wide variety of actions such as conversing, singing and random head motions. The sequence of five actions was repeated twice to induce interclass variance. The dataset consists of over two hours of data, where the video from each subject is a continuous 25–30 min recording.

FPV home dataset [3]. This dataset contains videos recorded with chest-mounted cameras by 20 different users. The users perform 18 non-scripted daily activities in the house, like brushing teeth, washing dishes or making tea. The length of the videos is in the range of 20–60 min. Annotations of the presence of 42 relevant objects (e.g. kettle, mug, fridge) and of the temporal segmentation are also provided.

Setup. In the experiments, we compare our methods (EMD multi-task clustering with linear and RBF kernels and convex multi-task clustering, denoted EMD-MTC, KEMD-MTC and CMTC respectively) with single-task clustering approaches, i.e. \(k\)-means (KM), kernel \(k\)-means (KKM), convex (CNMF) and semi (SemiNMF) nonnegative matrix factorization [26]. We also consider recent multi-task clustering algorithms, such as the SemiEMD-MTC proposed in [21], its kernel version KSemiEMD-MTC, and the LS-MTC method of [19]. To evaluate the clustering results, we adopt two metrics widely used in the literature: the clustering accuracy (Acc) and the normalized mutual information (NMI). For all the methods except CMTC, 10 runs are performed, corresponding to different initialization conditions, and the average results are reported. In CMTC the parameters \(\lambda _t\) are varied in order to obtain the desired number of clusters. The values of the regularization parameters of our approaches (\(\lambda \) for the methods based on EMD regularization and \(\lambda _2\) for CMTC) are set in the range \(\{0.01, 0.1, 1, 10, 100\}\).
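Both metrics are standard; for concreteness, a small sketch of one common way to compute them (Acc via the Hungarian matching of cluster ids to class ids, NMI via scikit-learn):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Acc: fraction of correct labels under the best one-to-one matching
    of cluster ids to class ids (Hungarian algorithm)."""
    K = int(max(y_true.max(), y_pred.max())) + 1
    cont = np.zeros((K, K), dtype=int)          # contingency table
    for t, p in zip(y_true, y_pred):
        cont[t, p] += 1
    rows, cols = linear_sum_assignment(-cont)   # maximize matched counts
    return cont[rows, cols].sum() / len(y_true)

# acc = clustering_accuracy(y_true, y_pred)
# nmi = normalized_mutual_info_score(y_true, y_pred)
```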

Fig. 5. FPV office dataset. Confusion matrices using saccade+motion features obtained with the KEMD-MTC (left) and CMTC (right) methods.

4.2 Results

FPV office dataset [4]. To conduct experiments on this dataset, we consider \(T=5\) tasks, as the dataset contains videos corresponding to five people. As each datapoint corresponds to a video clip, we set the parameters \(w^t_{ij}\) in CMTC so as to enforce temporal consistency, i.e. for each task \(t\), \(w^t_{ij}=1\) if the feature vectors \(\mathbf {x}_i^t\) and \(\mathbf {x}_j^t\) correspond to temporally adjacent video clips, and \(w^t_{ij}=0\) otherwise (see the sketch below). Table 1 compares the results of different clustering methods based on different types of features (i.e. only saccade, only motion, and saccade+motion features); the last three rows correspond to methods which employ a non-linear kernel. Several observations can be made. First, independently of the adopted feature representation, multi-task clustering approaches always perform better than single-task clustering methods (e.g. SemiEMD-MTC outperforms SemiNMF, EMD-MTC provides higher accuracy than CNMF, and a value of \(\lambda _2\) greater than 0 improves accuracy and NMI in CMTC). Confirming the findings reported in [4], we also observe that combining motion and saccade features is advantageous with respect to considering each feature representation separately. Notably, our methods are among the best performers, with KEMD-MTC reaching the highest values of accuracy and NMI, probably owing to both the use of kernels and the adoption of the multi-task learning paradigm. Moreover, CMTC outperforms EMD-MTC by up to 4 %, which indicates that incorporating information about temporal consistency in the learning process is beneficial; furthermore, in this case the Maximum Mean Discrepancy may capture the relationships among tasks better than EMD. Figure 4 shows some qualitative temporal segmentation results on the second sequence of subject-3. In this case, for example, the CMTC method outperforms all the other approaches, and the effect of enforcing temporal consistency among clips is evident.
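Building these temporal-consistency weights is straightforward; a minimal sketch (the function name is ours):

```python
import numpy as np

def temporal_weights(n_clips):
    """w_ij = 1 for temporally adjacent clips of the same task, 0 otherwise."""
    W = np.zeros((n_clips, n_clips))
    idx = np.arange(n_clips - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return W
```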

Table 1. Clustering results on FPV office dataset: comparison of different methods using saccade (S), motion (M) and saccade+motion (S+M) features.

Finally, Fig. 5 shows the confusion matrices associated with our methods KEMD-MTC and CMTC. Examining the matrix associated with KEMD-MTC, we observe that the void, copy and write actions achieve relatively high recognition accuracies compared with the video and browse actions. It is also interesting to note that 25 % and 17 % of the video actions are recognized as browse actions by KEMD-MTC and CMTC respectively, because of the similarity of the motion and eye-gaze patterns of the two activities.

Fig. 6. Comparison of different methods using bag of features (left) and temporal pyramid features (right) on the FPV home dataset (best viewed in color).

Fig. 7. FPV home dataset: performance variation of EMD-MTC and KEMD-MTC for different values of \(\lambda \), using bag of features (left) and temporal pyramid features (right).

FPV home dataset [3]. This dataset contains 18 different non-scripted activities. Since each person typically performs a small subset of the 18 activities, in our experiments we consider a series of three-task problems, selecting the videos associated with three randomly chosen users but imposing the condition that the videos of the three users have at least three activities in common. We perform 10 different runs. Figure 6 shows the results (average accuracy) obtained with different clustering methods for both the bag-of-features and the temporal pyramid representations. In this series of experiments we did not cluster video clips of fixed size as in the office dataset, but considered the pre-segmented clips provided with the dataset. In this scenario it does not make sense to set \(w^t_{ij}\) in CMTC to model temporal consistency; therefore, we set \(w^t_{ij}= e^{-\Vert \mathbf {x}^t_i-\mathbf {x}^t_j \Vert ^2}\) if \(e^{-\Vert \mathbf {x}^t_i-\mathbf {x}^t_j \Vert ^2} \ge \theta \) and \(w^t_{ij}=0\) otherwise, so that the found partition structure reflects the density of the original data distributions. Analyzing the results in Fig. 6, it is evident that the MTC approaches outperform their single-task versions (e.g. CMTC outperforms CMTC with \(\lambda _2=0\), EMD-MTC outperforms CNMF, SemiEMD-MTC outperforms SemiNMF). Moreover, our algorithms based on EMD regularization and CMTC achieve considerably higher accuracy than all the other methods. Finally, we also investigate the effect of different values of the regularization parameter \(\lambda \) in (3) on clustering performance. As shown in Fig. 7, independently of the adopted feature representation, the accuracy values are sensitive to the choice of \(\lambda \); however, choosing \(\lambda = 0.1\) always leads to similar or superior performance with respect to adopting a single-task clustering approach (\(\lambda =0\)). The value \(\lambda =0.1\) corresponds to the results reported in Fig. 6. This clearly confirms the advantage of using a MTC approach for FPV analysis.

5 Conclusions

In this paper, we considered the problem of egocentric activity recognition from unlabeled data within a multi-task clustering framework. Two novel MTC algorithms have been proposed and evaluated extensively on two FPV datasets. Our experimental results clearly demonstrate the advantage of sharing information among tasks over single-task algorithms. Among our methods, the approach based on EMD regularization achieves the best performance when used in its kernel version. Our second algorithm is also effective, as it is based on a convex optimization problem, and it is particularly useful when one needs to incorporate some a priori knowledge. In this paper we considered embedding information about temporal consistency, but the CMTC method also permits the integration of a priori knowledge about task dependencies, if available (e.g. people performing the same activities in the same rooms correspond to more related tasks than people operating in different rooms); this can easily be done by defining an appropriate matrix \(\mathbf {B}\). Future work includes exploring the suitability of the proposed algorithms for other vision applications, as well as investigating how to improve our MTC methods (e.g. by detecting outlier tasks).