1 Introduction

Understanding visual data captured by depth sensors such as Kinect (Shotton et al. 2013) has been receiving increasing interest in the computer vision community thanks to the recent advent of cost-effective depth sensors. In addition to the RGB visual data captured by conventional cameras, depth data encode rich 3D structural information of the entire scene, which is one of their key benefits. Such information has been shown to be helpful in reducing background noise, and has already been widely leveraged in pose estimation (Xu and Cheng 2013), layout estimation (Zhang et al. 2013), tracking (Zhou et al. 2015), image understanding (Wang et al. 2015), surface recovery (El et al. 2015), and action recognition (Kong and Fu 2015).

Action recognition from easy-to-use and low-cost depth sensors, such as Kinect sensors, attempts to predict action labels from RGB-D videos. Thanks to the extra depth data, background noise that is frequently seen in action videos can be markedly reduced, thereby boosting classification performance. Previous work Shotton et al. (2013), Oreifej and Liu (2013), Wang et al. (2012b) and Hadfield and Bowden (2013) showed that effective use of 3D structural information facilitates recognition tasks, as it simplifies intra-class motion variations and reduces cluttered background noise. A variety of action descriptors specifically designed for depth action videos have been proposed, for example, the action graph (Li et al. 2010), the histogram of oriented 4D normals (Oreifej and Liu 2013), the super normal vector (Yang et al. 2014), and depth spatiotemporal interest points (Xia and Aggarwal 2013; Hadfield and Bowden 2013).

Despite their effectiveness, these methods are limited to scenarios where depth data are available. The depth-based methods developed in Yang et al. (2014), Oreifej and Liu (2013), Xia and Aggarwal (2013) and Hadfield and Bowden (2013) would fail if depth data are unavailable due to the failure of depth sensors in RGB-D sensing devices. In addition, a working depth sensor may fail to compute a depth measurement due to fundamental physical limitations, which causes the missing depth data problem, for example, when a human subject is too close to or too far from the sensor, or when interacting objects have reflective surfaces. Furthermore, depth data normally contain spatiotemporally discontinuous regions. These regions make the depth data very noisy, and consequently hinder the application of feature extraction methods such as surface normals (Yang et al. 2014; Oreifej and Liu 2013) and spatiotemporal interest points (Xia and Aggarwal 2013; Hadfield and Bowden 2013) in these regions. If the discontinuous regions unfortunately appear on body parts that are supposed to provide discriminative cues, such as arms or legs, the recognition performance will undoubtedly be degraded when depth information is used as the only cue.

Visual data and depth data can be complementary to each other. Recent work El et al. (2015), Jia et al. (2014) and Wang et al. (2015) has demonstrated that the fusion of visual data and depth data can notably improve the performance. It was also shown in Jia et al. (2014) and Kong and Fu (2015) that implicit correlations between visual and depth data can be learned to handle the case where one of them is unavailable. Moreover, RGB data are robust, with none of the discontinuities discussed above. Numerous feature descriptors (e.g. gradient and optical flow) can be extracted from RGB data, providing abundant and robust features for recognition tasks. Furthermore, human bodies consist of multiple structural parts, and thus the motions of human body parts are highly correlated. Existing work on action recognition from depth sequences (Yang et al. 2014; Oreifej and Liu 2013) attempted to capture spatiotemporal correlation information of body part movements by aggregating features from neighborhoods. However, the information would unfortunately be collapsed (Tenenbaum and Freeman 2000) if co-occurrence features are concatenated into a high-dimensional vector and then linearly projected onto a subspace.

In this paper, we propose a novel max-margin heterogeneous information machine (MMHIM) for action recognition from RGB-D sequences. MMHIM treats visual and depth data as two modalities, and learns both features shared between them and private features for classification. More specifically, we project the original features of the two modalities onto a shared space, and learn cross-modal features shared between them for classification in order to effectively capture cross-modal knowledge. The learned cross-modal features inherit the characteristics of both RGB and depth data, capturing motion, 3D structural, and spatiotemporal relationship information. In addition, private features are learned by projecting the original features onto modality-specific spaces. These private features capture unique and intrinsic information of a modality, for example, the edge cue in the visual modality and the distance cue in the depth modality. The use of both shared cross-modal features and private features allows us to leverage all discriminative information for classification. These two types of features are learned through data compression and noise “filtering” during the projection procedure, and are jointly optimized with the learning of the action classifiers. This automatically discovers compact yet discriminative features for the classifiers, and thus improves recognition performance (see Fig. 1). A structured prediction model is also proposed in this work, which allows us to model the agreement between low-level heterogeneous features and high-level structured outputs. We show in the experiments that the learned features are expressive and discriminative for differentiating action categories, even if one modality is missing in training or testing.

Fig. 1

MMHIM projects and compresses both RGB visual features and depth features to a learned shared feature space and modality-specific private feature spaces. Features in these spaces are all used for learning classification boundaries. These two steps iterate until convergence

We represent both visual and depth features in a matrix form, which naturally encodes spatiotemporal structural relationships. Even though feature matrices are projected onto a low-dimensional space, the structural information of body parts is conserved and motion information is compressed and denoised. This overcomes the aforementioned problem of the collapsed information in feature vectors.

The recognition problem is formulated in a low-rank bilinear framework, particularly designed for feature representations in a matrix form. The proposed model learns feature projection matrices and classification parameter matrices, which act as feature weightings on the columns and rows of the feature matrices, respectively. The projection matrices are optimized to map the original heterogeneous visual and depth features onto a shared feature space and private feature spaces. The shared space is the optimal space for building robust and effective cross-modal features for recognition, while the features in a private space inherit unique information of the corresponding modality. An information measure is incorporated in the learning of the projection matrices to help reduce noise in the feature projection procedure. Classification is performed using both the learned cross-modal features and the private features. The rank of the model is minimized to increase the generalization power and decrease the computational cost (Wolf et al. 2007).

We propose an efficient algorithm to optimize MMHIM. Without approximations or hard constraints on the rank of the parameter matrices, we present a regularized risk minimization problem that produces low-rank projection matrices and action classifiers by minimizing the Frobenius norm of the parameter matrices. This allows us to use existing efficient SVM solvers. The learning problem is solved iteratively, with a bundle method (Teo et al. 2007; Do and Artieres 2009) as the solver for the inner optimization problem.

The main contribution of this work is the MMHIM, a novel formalism for RGB-D action recognition. With inputs of feature matrices rather than vectors, MMHIM keeps inherent spatiotemporal structural information within features, which plays a key role in recognition. In addition, MMHIM learns a shared space for fusing heterogeneous data (RGB and depth data in this work), where knowledge can be shared between them. MMHIM directly minimizes the rank of parameter matrices, and produces compact yet expressive cross-modal features through the use of an information measure. MMHIM is also able to handle structured prediction problems. An efficient solver is developed for MMHIM in this work. We show that MMHIM achieves superior performance over state-of-the-art methods.

1.1 Overview of Our Approach

We study the use of both visual and depth data for action recognition from RGB-D videos. Note that an RGB-D video has two channels: an RGB color video that is similar to a conventional color video used in action recognition, and an extra depth video that encodes the distance from the human subject to the camera.

The flowchart of the training procedure in our approach is illustrated in Fig. 2. Our approach consists of the following four main steps:

  1. Preprocessing. Our approach uniformly samples a fixed number K of frames for all videos in a dataset. This makes all videos equal in length.

  2. Low-level feature extraction. Given an RGB-D video sample (containing a color video and a depth video), we compute the gradient and the optical flow over both the color video and the depth video.

  3. Feature representation. The histograms of oriented gradients (HOG) and histograms of optical flow (HOF) descriptors are adopted to represent the gradient and optical flow, respectively. Consequently, each channel (color video or depth video) of an RGB-D video generates two features, i.e., HOG features and HOF features.

  4. Model learning. The proposed MMHIM takes the low-level HOG and HOF features extracted from the color and depth videos as inputs. Compact yet discriminative features are then summarized by MMHIM and used for classification. Note that MMHIM jointly learns the discriminative features and the action classifiers.

Given a testing RGB-D video, our approach first samples K frames from each channel of the testing sample. Then HOG and HOF features are extracted from both the preprocessed color video and the depth video. The HOG and HOF action descriptors are fed into the trained MMHIM to compute the shared and private features, and finally the action label is predicted based on the computed features.
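The uniform temporal sampling used in preprocessing can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the video is given as a frame-indexed array; the function name and the default K = 10 (the number of frames sampled in this work) are our own choices.

```python
import numpy as np

def sample_frames(video, K=10):
    """Uniformly sample K frames from a video given as a (T, H, W, C) array.

    A sketch of the preprocessing step: evenly spaced temporal indices are
    taken so that every video ends up with the same number of frames.
    """
    T = video.shape[0]
    idx = np.linspace(0, T - 1, num=K).round().astype(int)  # duplicates frames if T < K
    return video[idx]
```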

Fig. 2

Flowchart of our approach in training. Please refer to Sect. 1.1 for details

2 Related Work

2.1 Action Recognition from RGB-D Videos

Previous action recognition approaches mainly focus on RGB action videos (Kong et al. 2014; Raptis and Sigal 2013; Tang et al. 2012; Ji et al. 2013). These studies used low-level interest point features (Tang et al. 2012), mid-level semantic features (Kong et al. 2014) or human pose (Raptis and Sigal 2013), or learned features using deep learning techniques (Ji et al. 2013). However, misclassifications still occur due to large intra-class motion and pose variations.

Thanks to the advent of low-cost Kinect sensors (Shotton et al. 2013), much effort has been devoted to object recognition (Chen et al. 2014; Bo et al. 2011; Wang et al. 2015) and action recognition (Li et al. 2010; Oreifej and Liu 2013; Yang et al. 2014; Hadfield and Bowden 2013; Wang et al. 2012a, b) from depth images. A group of RGB-D or depth action video datasets were introduced in Ni et al. (2011), Wang et al. (2012b), Oreifej and Liu (2013), Hadfield and Bowden (2013) and Ofli et al. (2013), such as the RGBD-HuDaAct dataset, the MSR Action Pairs dataset, and the Hollywood 3D dataset. These studies showed that depth data capture 3D structural information, which helps reduce background noise and intra-class variations. Effective features have been proposed for the recognition task using depth data, such as the histogram of oriented 4D normals (Oreifej and Liu 2013; Yang et al. 2014) and depth spatiotemporal interest points (Xia and Aggarwal 2013; Hadfield and Bowden 2013). Features from depth sequences can be encoded using the method of Luo et al. (2013), or be used to build actionlets (Wang et al. 2012b) for recognition. The work in Sung et al. (2012) and Koppula and Saxena (2013) built layered action graph structures to model actions and subactions in an RGB-D video. Recent work Liu and Shao (2013) also showed that features of RGB-D data can be learned using deep learning techniques.

The methods in Li et al. (2010), Oreifej and Liu (2013), Yang et al. (2014), Hadfield and Bowden (2013), Wang et al. (2012a), and Luo et al. (2013) only use depth data, and thus would fail if depth data were missing. In contrast, our method uses both RGB and depth data, and can handle the case where one modality is missing. Even though existing work Hu et al. (2015), Jia et al. (2014), Lin et al. (2014), Liu and Shao (2013) and Wang et al. (2012b) used both RGB and depth modalities, they only learned features shared between the two modalities and did not learn modality-specific or private features. Private features capture unique information of one modality and provide extra discriminative information for classification. In addition, the methods in Hu et al. (2015), Lin et al. (2014), Liu and Shao (2013) and Wang et al. (2012b) were not developed to deal with the missing modality problem, and their performance in the missing-modality scenario is unknown. Moreover, they used features in a vector form, in which spatiotemporal structures are not well represented (Tenenbaum and Freeman 2000; Kobayashi 2014). In this work, we use features in a matrix form (Pirsiavash et al. 2009), which naturally captures both spatiotemporal structural information and motion information. Our experiments show that features in a matrix format significantly improve the performance even when the rank of the parameter matrices in MMHIM is constrained to be 1.

An unsupervised setting was considered in Wu et al. (2015). In their work, long-range action relations, such as the occurrence of put-milk-back-to-fridge and fetch-milk-from-fridge, are modeled in RGB-D temporal sequences. Heterogeneous feature learning was also investigated in Hu et al. (2015). Their method projects various types of features, including skeleton features and local HOG features, into a shared feature space, and then uses SVMs as the classifiers. The projection matrices are learned by minimizing the reconstruction loss. Different from this work, our approach jointly learns heterogeneous features and action classifiers. The projection matrices in our work are learned by minimizing both the noise after projection and the classification error of the projected features. Recognition approaches that leverage auxiliary databases were studied in Jia et al. (2014) and Lin et al. (2014). Their methods assume actions can be reconstructed by entries in the auxiliary databases. Instead of using an auxiliary database to compute action representations, our method uses the information from the classifiers to guide the learning of discriminative action representations, which yields features that are optimized for classification. An efficient binary range-sample feature for depth data was proposed in Lu et al. (2014). This depth feature has been shown to be invariant to changes in scale, viewpoint, and background, and it is fast to compute due to its binary nature.

2.2 Action Recognition from Color Videos

In recent studies, human actions have been popularly represented by local salient features detected by spatio-temporal interest points (Dollar et al. 2005; Schüldt et al. 2004; Laptev 2005; Klaser et al. 2008), structural features of interest points (Ryoo and Aggarwal 2009), trajectories (Wang et al. 2013; Raptis and Soatto 2010; Ni et al. 2015), holistic body shape (Liu et al. 2008), or key poses (Raptis and Sigal 2013). The temporal evolution of human actions was captured in Fernando et al. (2015) and Kong et al. (2014). The bag-of-words (BoW) model is a common method for encoding these features in a vector format. The BoW model first detects local or global features [e.g., spatiotemporal interest points (Dollar et al. 2005; Laptev 2005) or histograms of oriented gradients (Dalal and Triggs 2005)] from videos. Then a clustering method such as k-means is employed to quantize these features into so-called visual words. After that, a histogram of the words contained in a video is computed and used as the vector-format representation of the video. All these approaches use hand-crafted features, which normally require expensive human labor and expert knowledge to design extraction algorithms.

Thanks to recent deep learning techniques, human actions can be effectively learned from low-level observations (Ji et al. 2013; Karpathy et al. 2014; Simonyan and Zisserman 2014; Vondrick et al. 2016; Ma et al. 2016; Fernando et al. 2016). Specifically, these approaches use convolutional neural networks (CNNs) to perform convolution operations on images and build representations for video frames in a vector format. However, they generally require a large number of training samples, as the CNNs they use are complex models with a large number of parameters to be learned. Another line of research attempts to describe human actions using high-level semantics (Liu et al. 2011; Kong et al. 2014), i.e., action attributes. These binary action attributes indicate whether a particular motion pattern is observed in a video, such as “arm raise up” and “leg move forward”.

Human interaction recognition (Kong et al. 2014; Lan et al. 2012; Ryoo and Aggarwal 2009; Marszałek et al. 2009) and human-object interaction (Zhou et al. 2015) were also explored in recent years. Previous studies Ryoo and Aggarwal (2009) and Marszałek et al. (2009) recognized interactions in the same way as single-person action recognition approaches (Dollar et al. 2005; Laptev 2005). Specifically, an interaction was represented as a motion descriptor covering all the people in a video, and then an action classifier was adopted to classify the interaction. Context information was exploited in Kong et al. (2014) and Lan et al. (2012) in order to capture the motion relationships between people. The context information between a pair of motion attributes was captured in Kong et al. (2014); they described human interactions by this context information, which was called interactive phrases in their work. Action context between individuals was modeled in Lan et al. (2012); their model can automatically determine which two individuals are interacting.

2.3 Feature Learning

Feature learning methods (Kobayashi 2014; Pirsiavash et al. 2009; Argyriou et al. 2008; Xu et al. 2014) have been proposed to learn better feature representations for the recognition task. The methods in Pirsiavash et al. (2009) and Kobayashi (2014) adopt linear projections to learn better features in a matrix form for classification. They reduce the degree of freedom of the model parameter matrix by decomposing it into two parts and enforcing a hard restriction on their rank. Different from them, we elegantly use features from two modalities for recognition. In addition, we use an effective information measure to produce more compact cross-modal features. The work in Argyriou et al. (2008) learned a few common features across tasks using a regularizer, which couples the tasks and enforces sparsity.

Multimodal approaches (Jia et al. 2014; Xie and Xing 2013; Xu et al. 2014) attempt to discover common features between features of various modalities. The work in Jia et al. (2014) treated RGB and depth videos as two modalities. They used a cross-modality regularizer to link the two modalities in order to deal with the missing modality problem. A multimodal metric learning method in Xie and Xing (2013) embeds data of arbitrary modalities into a single latent space. The optimal distance metric is learned to better measure the similarity between data of different modalities. The method in Xu et al. (2014) extends information bottleneck (Tishby et al. 1999) to a multi-view model. Multiple information sources are filtered through a “bottleneck”, and then a margin maximization approach is used to strengthen the discrimination of the model.

Deep models have received considerable attention in recent years, and various deep models have been developed for multi-modal learning (Ngiam et al. 2011; Andrew et al. 2013; Srivastava and Salakhutdinov 2014; Wang et al. 2015). The SplitAE method in Ngiam et al. (2011) assumes that a shared representation can be extracted from a single view and can be used to reconstruct all views. Deep canonical correlation analysis (DCCA) was proposed in Andrew et al. (2013) to learn the correlations between two views using a deep architecture of nonlinear transformations. A multimodal deep Boltzmann machine (DBM) was presented in Srivastava and Salakhutdinov (2014). Their model uses modality-specific DBMs to build a layer of joint representation that fuses features from the two modalities. The work in Wang et al. (2015) combined SplitAE and DCCA, and proposed three extensions for multimodal learning.

Fig. 3

Examples of HOG features computed on (a) a color image and (b) a depth image, and optical flow computed on (c) a color image and (d) a depth image

Fig. 4

Feature matrix of size \(n_{xyt}\times n_f\) is constructed from features (e.g., HOG) computed on all the frames. \(n_{xyt}\) is the total number of pixels in all the feature frames, and \(n_f\) is the dimensionality of each local feature

Table 1 List of mathematical symbols

3 Max-Margin Heterogeneous Information Machine

The goal of this work is to utilize heterogeneous features from RGB-D action videos, and to learn compact yet discriminative features for action recognition. Denote the N RGB-D action videos for training by \(\{X_i,y_i\}_{i=1}^N\), where \(X_i=\{X_i^{[v]}, X_i^{[z]}\}\in {\mathcal {X}}\) contains an RGB visual feature matrix \(X_i^{[v]}\in {\mathcal {X}}_v\) and a depth feature matrix \(X_i^{[z]}\in {\mathcal {X}}_z\) extracted from RGB-D data, and \(y_i\in {\mathcal {Y}}\) is the corresponding action label. Each feature matrix contains both HOG and HOF feature descriptors (see Fig. 3). Note that \(X_i^{[v]}\) and \(X_i^{[z]}\) in our work are defined as feature matrices of size \(n_{xyt}\times n_f\) (see Fig. 4), different from the feature vectors (such as bag-of-words representations) containing \(n_{xyt}\times n_f\) elements that are popularly used in the computer vision community. In this work, features \(X_i^{[v]}\) and \(X_i^{[z]}\) are extracted from a spatiotemporal grid of \(n_{xyt}=n_x\times n_y\times n_t\), and \(n_f\) is the dimensionality of each local feature. HOG and HOF features are concatenated horizontally, i.e., \(n_f\) is the dimensionality of the concatenated features. Action representation in a matrix form allows us to capture the inherent structure of features, such as spatiotemporal relationships; these relationships are collapsed in a vector-form feature representation. Note that one could pull out a dimension other than the feature dimension in \(X_i^{[v]}\) and \(X_i^{[z]}\), but then the structure of the \(n_{xyt}\) pixels in the feature matrices would not be conserved by the proposed model.
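The construction of such a feature matrix can be sketched as follows, assuming the per-cell HOG and HOF descriptors have already been computed on the \(n_x\times n_y\times n_t\) grid; the array names and shapes are illustrative, and only the resulting layout (rows index spatiotemporal cells, columns index concatenated HOG/HOF dimensions) follows the text.

```python
import numpy as np

def build_feature_matrix(hog_cells, hof_cells):
    """Stack per-cell HOG and HOF descriptors into an (n_xyt, n_f) matrix.

    hog_cells: array of shape (n_x, n_y, n_t, d_hog)
    hof_cells: array of shape (n_x, n_y, n_t, d_hof)
    """
    n_x, n_y, n_t, d_hog = hog_cells.shape
    d_hof = hof_cells.shape[-1]
    n_xyt = n_x * n_y * n_t
    # Flatten the spatiotemporal grid into rows and concatenate the two
    # descriptors horizontally, so n_f = d_hog + d_hof.
    return np.concatenate([hog_cells.reshape(n_xyt, d_hog),
                           hof_cells.reshape(n_xyt, d_hof)], axis=1)
```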

RGB-D action data \(X_i\) contain two modalities, visual features \(X_i^{[v]}\) and depth features \(X_i^{[z]}\). The major challenge in effectively using the two-modality features is that they come from different distributions, and thus their similarities cannot be measured directly. To solve this problem, we learn two projection functions \(W_o^{[v]}\) and \(W_o^{[z]}\) for the visual features \(X_i^{[v]}\) and the depth features \(X_i^{[z]}\), respectively. Each projection function maps the corresponding features to a space \({\mathcal {O}}\) shared between the two modalities: \(W_o^{[v]}:{\mathcal {X}}_v\rightarrow {\mathcal {O}}\), and \(W_o^{[z]}:{\mathcal {X}}_z\rightarrow {\mathcal {O}}\). After learning the projection functions, a classification model \(W_w\) can be learned to perform classification given the learned shared features \(O\in {\mathcal {O}}\).

The learned shared features O may not capture all the discriminative information for classification, since some important cues are not shared between modalities. We take this into account by introducing private features for each modality. Two modality-specific projection matrices \(W_q^{[v]}\) and \(W_q^{[z]}\) are adopted to learn private features \(Q^{[v]}\in {\mathcal {Q}}_v\) and \(Q^{[z]}\in {\mathcal {Q}}_z\) from the original visual and depth features, respectively: \(W_q^{[v]}:{\mathcal {X}}_v\rightarrow {\mathcal {Q}}_v\), and \(W_q^{[z]}:{\mathcal {X}}_z\rightarrow {\mathcal {Q}}_z\). Classification models \(W_w^{[v]}\) and \(W_w^{[z]}\) can also be learned given the learned private features \(Q^{[v]}\) and \(Q^{[z]}\).

In this work, we learn all the projection matrices and classification models simultaneously; therefore, the learned projections are optimized for classification. We focus on learning a discriminant function \(F:{\mathcal {X}}\times {\mathcal {Y}}\rightarrow {\mathcal {R}}\) that scores each training sample \((X_i,y_i)\). The function F computes the compatibility among the input RGB-D features \(X_i\), the learned cross-modal features O, the private features Q, and the action label \(y_i\). A list of the mathematical symbols used in this paper is given in Table 1.

3.1 Model Formulation

Suppose we are given M types of modalities \(X_i^{[m]}|_{m=1}^M\). Here, m is the index of modality, which can be either visual (\(m=1\) or \(m=v\)) or depth (\(m=2\) or \(m=z\)). We represent the features of both modalities in a matrix form in order to capture their inherent spatiotemporal structure. In this paper, we are interested in a binary linear discriminant function \(F(X_i,y|W)={{\mathrm{Tr}}}(W^{\mathrm {T}}X_i)=\sum _{m=1}^M{{\mathrm{Tr}}}(W^{[m]\mathrm {T}}X^{[m]}_i)\) parameterized by a model weight matrix W. In this work, we learn both shared features and private features from the visual and depth modalities in order to capture rich discriminative information for classification. The graphical illustration of our model is shown in Fig. 5. The one-vs-one scheme is adopted to extend our binary classifiers to a multi-class classifier.


Fig. 5

Graphical illustration of the proposed MMHIM model. Parameter matrix \(W_o^{[m]}\) (\(m=1,\cdots ,M\)) projects the m-th modality data, \(X^{[m]}\), into a learned shared space, and \(W_q^{[m]}\) (\(m=1,\cdots ,M\)) projects the data \(X^{[m]}\) into private spaces. Classification is performed using both the learned shared features and private features

Shared information between the visual and depth modalities captures complex correlations between them. Model parameter \(W^{[m]}_O\) is used to extract shared information and classify actions. One of the challenges in RGB-D action recognition is that the two modalities, visual features and depth features, are in different feature spaces, and thus their similarities cannot be directly computed. We solve this problem by decomposing the model parameter \(W^{[m]}_O\) into two components, \(W_w\) and \(W_o^{[m]}\): \(W^{[m]}_O=W_wW_o^{[m]\mathrm {T}}\), which induces a bilinear model (Pirsiavash et al. 2009). Parameter matrix \(W_o^{[m]}\in {\mathcal {R}}^{n_f\times d_o}\) (\(m=1,\cdots ,M\)) projects the m-th modality data, \(X^{[m]}\), onto a learned shared space, and parameter matrix \(W_w\in {\mathcal {R}}^{n_{xyt}\times d_o}\) is applied to classify the projected data regardless of the modality. \(W_w\) is a spatiotemporal template defined over \(d_o\) features at each spatiotemporal location. By construction, the rank of the model parameter matrix \(W^{[m]}_O\) is at most \(d_o\).

In addition to the shared features, each modality may also contain discriminative information that cannot be shared with the other modality. We capture such private features of modality m for classification using model parameter \(W^{[m]}_Q\). Similar to \(W^{[m]}_O\) for the shared features, \(W^{[m]}_Q\) is also decomposed into two components, \(W_w^{[m]}\) and \(W_q^{[m]}\). Parameter matrix \(W_q^{[m]}\in {\mathcal {R}}^{n_f\times d_q}\) projects the original low-level data of modality m to a low-dimensional space, and \(W_w^{[m]}\in {\mathcal {R}}^{n_{xyt}\times d_q}\) is used to classify the projected data; it is essentially a spatiotemporal template over the projected data of \(d_q\) features. The rank of the model parameter \(W^{[m]}_Q\) is enforced to be at most \(d_q\).

Once the optimal model parameter matrix W is learned from training data, the action label \(y_i^*\) of a sample \(X_{i}\) can be computed by

$$\begin{aligned} \begin{aligned} y_i^*&={{\mathrm{sign}}}\Big [{{\mathrm{Tr}}}(W^{\mathrm {T}}X_i)\Big ]={{\mathrm{sign}}}\Big [{{\mathrm{Tr}}}(W_O^{\mathrm {T}}X_i+W_Q^{\mathrm {T}}X_i)\Big ]\\&={{\mathrm{sign}}}\Big [\sum _m{{\mathrm{Tr}}}(\underbrace{W_o^{[m]}W_w^{\mathrm {T}}X_i^{[m]}}_{ \textstyle {\text {shared information}}}\\&\quad +\,\underbrace{W_q^{[m]}W_w^{[m]\mathrm {T}}X_i^{[m]}}_{\textstyle {\text {private information}}})\Big ], \end{aligned} \end{aligned}$$
(1)

where \({{\mathrm{sign}}}(\cdot )\) is the sign function.
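For concreteness, the decision rule of Eq. (1) can be transcribed directly into NumPy. The sketch below assumes the learned parameter matrices are given and stored per modality in dictionaries; the variable names are our own.

```python
import numpy as np

def predict_label(X, W_o, W_q, W_w, W_w_m):
    """Binary decision rule of Eq. (1) for one RGB-D sample.

    X, W_o, W_q, W_w_m are dicts keyed by modality ('v' for RGB, 'z' for depth);
    W_w is the shared classification template.
    """
    score = 0.0
    for m in ('v', 'z'):
        shared = np.trace(W_o[m] @ W_w.T @ X[m])        # shared-information term
        private = np.trace(W_q[m] @ W_w_m[m].T @ X[m])  # private-information term
        score += shared + private
    return np.sign(score)
```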

We train the MMHIM in Eq. (1) in a max-margin framework. Based on the empirical risk minimization principle, we formulate our learning problem as

$$\begin{aligned}&\begin{aligned}&\min _{W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]}} \lambda \cdot r(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})\Big |_{m}\\&\quad +\,\phi (W_o^{[m]},W_q^{[m]})\Big |_{m}+\eta \cdot l(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})\Big |_{m}. \end{aligned}\nonumber \\ \end{aligned}$$
(2)

For succinctness, “\(|_m\)” indicates that the parameters of the two modalities (\(m=v\) and \(m=z\)) are jointly optimized. \(\phi (\cdot )\) is a regularizer for reducing noise in the projected data, \(r(\cdot )\) is an additional regularization term related to the margin of our bilinear model, and \(l(\cdot )\) computes training loss for the two-modality data. \(\lambda \) and \(\eta \) are trade-off parameters balancing the importance of the corresponding terms.

3.1.1 Margin Regularizer \(r(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_m\)

Regularizer \(r(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_{m\in \{v,z\}}\) is used to measure the margin of the bilinear classifier. Minimizing \(r(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_{m\in \{v,z\}}\) is equivalent to maximizing the margin of the bilinear model, thereby improving the generalization power. The margin regularizer \(r(\cdot )\) is defined as

$$\begin{aligned}&r(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})\big |_m \nonumber \\&\quad =\frac{1}{2}\Big [\underbrace{r_o(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})\big |_{m}}_{\textstyle {\text {margin for shared features}}}\nonumber \\&\qquad +\,\underbrace{r_q(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})\big |_{m}}_{\textstyle {\text {margin for private features}}}\Big ], \end{aligned}$$
(3)
$$\begin{aligned}&r_o(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_m \nonumber \\&\quad =\frac{1}{2}\Big [{{\mathrm{Tr}}}(W_wW^{[v]\mathrm {T}}_oW_o^{[v]}W_w^{\mathrm {T}})\nonumber \\&\qquad +\,{{\mathrm{Tr}}}(W_wW^{[z]\mathrm {T}}_oW_o^{[z]}W_w^{\mathrm {T}})\Big ], \end{aligned}$$
(4)
$$\begin{aligned}&r_q(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_m=\frac{1}{2}\nonumber \\&\qquad \times \, \Big [{{\mathrm{Tr}}}(W_w^{[v]}W^{[v]\mathrm {T}}_qW_q^{[v]}W_w^{[v]\mathrm {T}})\nonumber \\&\qquad +\,{{\mathrm{Tr}}}(W_w^{[z]}W^{[z]\mathrm {T}}_qW_q^{[z]}W_w^{[z]\mathrm {T}})\Big ]. \end{aligned}$$
(5)

The regularization term \(r(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_m\) considers the classifier margins of the models for both shared features and private features, \(r_o(\cdot )\) and \(r_q(\cdot )\). It naturally induces low-rank classifiers with the maximum rank of \(d_o\) and \(d_q\) for the two types of features, respectively. This restricts the degree of freedom of model parameter matrices. As shown in Wolf et al. (2007), the VC-dimension of low-rank classification models was proven to be less than that of the concatenated linear models.

Regularizer \(r(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_m\) is minimized to extract discriminative information from both cross-modal features O and private features Q for action recognition. It works together with \(\phi (W_o^{[m]},W_q^{[m]})|_m\) in Eq. (6) to extract discriminative information and filter out noise for the recognition task.
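As a reference for implementation, the margin regularizer of Eqs. (3)-(5) is a direct sum of trace terms. The sketch below mirrors those equations; the parameter names follow the earlier prediction sketch and are our own.

```python
import numpy as np

def margin_regularizer(W_w, W_w_m, W_o, W_q):
    """Margin regularizer r(.) of Eqs. (3)-(5) over the modalities 'v' and 'z'."""
    r_o = 0.0  # margin term for the shared features
    r_q = 0.0  # margin term for the private features
    for m in ('v', 'z'):
        r_o += np.trace(W_w @ W_o[m].T @ W_o[m] @ W_w.T)
        r_q += np.trace(W_w_m[m] @ W_q[m].T @ W_q[m] @ W_w_m[m].T)
    return 0.5 * (0.5 * r_o + 0.5 * r_q)
```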

3.1.2 Projection Regularizer \(\phi (W_o^{[m]},W_q^{[m]})|_m\)

Regularizer \(\phi (W_o^{[m]},W_q^{[m]})|_m\) is a function that attempts to summarize and compress the original two-modality data. Since the raw RGB and depth data may not be in the same space, we use this term to compress the data, and discover shared and modality-specific knowledge between the two modalities. We define this term as

$$\begin{aligned} \phi (W_o^{[m]},W_q^{[m]})|_m= & {} \Big [\underbrace{\phi _o(W_o^{[m]},W_q^{[m]})\big |_m}_{\textstyle {\text {shared information}}}\nonumber \\&+\,\underbrace{\phi _q(W_o^{[m]},W_q^{[m]})\big |_m}_{\textstyle {\text {private information}}}\nonumber \\&+\,\underbrace{\phi _r(W_o^{[m]},W_q^{[m]})\big |_m}_{\textstyle {\text {redundant information}}}\Big ], \end{aligned}$$
(6)
$$\begin{aligned} \phi _o(W_o^{[m]},W_q^{[m]})|_m= & {} I(X^{[v]},O)+I(X^{[z]},O), \end{aligned}$$
(7)
$$\begin{aligned} \phi _q(W_o^{[m]},W_q^{[m]})|_m= & {} I(X^{[v]},Q^{[v]})+I(X^{[z]},Q^{[z]}), \end{aligned}$$
(8)
$$\begin{aligned} \phi _r(W_o^{[m]},W_q^{[m]})|_m= & {} I(Q^{[v]},O)+I(Q^{[z]},O), \end{aligned}$$
(9)

where \(I(\cdot ,\cdot )\) computes mutual information between two variables J and K:

$$\begin{aligned} I(J,K)=\sum _j\sum _kp(j)p(k|j)\log \frac{p(k|j)}{p(k)}. \end{aligned}$$
(10)

\(X^{[m]}=\{X^{[m]}_i\}_{i=1}^N\) (\(m=v\) or \(m=z\)) denotes the set of all training samples in the m-th modality, \(O=\frac{1}{2}(X^{[v]}W_o^{[v]}+X^{[z]}W_o^{[z]})\in {\mathcal {O}}\) denotes the learned low-dimensional cross-modal features in the shared space, and \(Q^{[m]}=X^{[m]}W_q^{[m]}\) denotes the private features of the m-th modality.
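The shared and private features themselves are simple linear maps of the input feature matrices. A minimal sketch for a single sample, using the same dictionary convention as in the earlier snippets:

```python
import numpy as np

def project_features(X, W_o, W_q):
    """Build shared features O and private features Q^[m] for one sample.

    Follows O = (X^[v] W_o^[v] + X^[z] W_o^[z]) / 2 and Q^[m] = X^[m] W_q^[m];
    X, W_o, W_q are dicts keyed by modality ('v', 'z').
    """
    O = 0.5 * (X['v'] @ W_o['v'] + X['z'] @ W_o['z'])   # shape (n_xyt, d_o)
    Q = {m: X[m] @ W_q[m] for m in ('v', 'z')}          # each of shape (n_xyt, d_q)
    return O, Q
```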

Equations (7) and (8) are utilized to introduce cross-modal and private knowledge to the model through the learning of the shared features O and the private features Q. Cross-modal features O inherit information from both RGB and depth data, including motion, structure, and spatiotemporal relationship information. Private features, on the other hand, capture information that is not sharable, such as the distance cue in the depth data. We show in the experiments that the learned features play an important role in the recognition of RGB-D actions and in the case where one modality is missing in the training or testing phase. Equation (9) aims to reduce redundancy between the shared and private features.

In addition, the term \(\phi (W_o^{[m]},W_q^{[m]})|_m\) helps to reduce noise and produce compact representations for the cross-modal features O and the private features Q. In the learning of the features O and Q, a large amount of noise irrelevant to the action labels would also be introduced into the low-dimensional spaces, degrading the recognition performance. By minimizing \(\phi (W_o^{[m]},W_q^{[m]})|_m\), both the noisy and the discriminative information in O and Q are reduced, but the latter is well captured by the regularizer \(r(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_m\) in Eq. (3). The parameter \(\lambda \) on the regularizer \(r(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_m\) balances the importance of this noise filtering in MMHIM.

3.1.3 Loss Function \(l(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_m\)

Loss function \(l(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_m\) computes training loss given the learned model parameter matrices. We consider binary classifiers in this work, and define a hinge loss function \(h(y,f(x))=\max (0,1-yf(x))\) for each modality, which is similar to the one in the binary SVM:

$$\begin{aligned}&l(W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]})|_m\end{aligned}$$
(11)
$$\begin{aligned}&\quad =\sum _i\Big [h\big (y_i,{{\mathrm{Tr}}}(W_o^{[v]}W_w^{\mathrm {T}}X_i^{[v]})\big )+h\big (y_i,{{\mathrm{Tr}}}(W_o^{[z]}W_w^{\mathrm {T}}X_i^{[z]})\big )\end{aligned}$$
(12)
$$\begin{aligned}&\qquad +h\big (y_i,{{\mathrm{Tr}}}(W_q^{[v]}W_w^{[v]\mathrm {T}}X_i^{[v]})\big )+h\big (y_i,{{\mathrm{Tr}}}(W_q^{[z]}W_w^{[z]\mathrm {T}}X_i^{[z]})\big )\Big ]. \end{aligned}$$
(13)

Here, the losses in Eq. (12) are incurred by the shared features O, and the ones in Eq. (13) are incurred by the private features Q.
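The full training loss of Eqs. (11)-(13) is then a sum of hinge terms over samples, modalities, and feature types. A sketch, with parameter names as in the earlier snippets:

```python
import numpy as np

def hinge(y, f):
    """Hinge loss h(y, f) = max(0, 1 - y*f)."""
    return max(0.0, 1.0 - y * f)

def training_loss(samples, W_o, W_q, W_w, W_w_m):
    """Loss of Eqs. (11)-(13); `samples` iterates over (X_i, y_i) pairs,
    where X_i is a dict keyed by modality and y_i is a +/-1 label."""
    loss = 0.0
    for X, y in samples:
        for m in ('v', 'z'):
            loss += hinge(y, np.trace(W_o[m] @ W_w.T @ X[m]))       # shared-feature loss
            loss += hinge(y, np.trace(W_q[m] @ W_w_m[m].T @ X[m]))  # private-feature loss
    return loss
```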

3.1.4 Learning Formulation

Plugging Eqs. (3), (6), and (11) into Eq. (2), optimal parameter matrices \(\{W_o^{[m]},W_q^{[m]},W_w^{[m]},W_w\}|_m\) can be learned by the following constrained optimization problem:

$$\begin{aligned}&\begin{aligned}&\min _{W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]}}\ \sum _m\Big \{I(X^{[m]},O)+I(X^{[m]},Q^{[m]})\\&\quad +\,I(Q^{[m]},O)+\frac{1}{4}\lambda \Big [\mathrm {Tr}(W_wW^{[m]\mathrm {T}}_oW_o^{[m]}W_w^{\mathrm {T}})\\&\quad +\,\mathrm {Tr}(W_w^{[m]}W^{[m]\mathrm {T}}_qW_q^{[m]}W_w^{[m]\mathrm {T}})\Big ]+\eta \cdot \sum _i(\xi _i^{[m]}+\epsilon _i^{[m]})\Big \},\\&\text {s.t.}\qquad y_i\mathrm {Tr}(W_o^{[m]}W_w^{\mathrm {T}}X_i^{[m]})\geqslant 1-\xi _i^{[m]},\quad \forall i, \forall m,\\&\quad \quad \quad \ y_i\mathrm {Tr}(W_q^{[m]}W_w^{[m]\mathrm {T}}X_i^{[m]})\geqslant 1-\epsilon _i^{[m]},\quad \forall i, \forall m,\\&\quad \quad \quad \ \xi _i^{[m]}\geqslant 0,\quad \forall i, \forall m,\\&\quad \quad \quad \ \epsilon _i^{[m]}\geqslant 0,\quad \forall i, \forall m, \end{aligned}\nonumber \\ \end{aligned}$$
(14)

where \(\xi _i^{[m]}\) and \(\epsilon _i^{[m]}\) are slack variables for the shared features and the private features of the m-th modality in the i-th RGB-D video, respectively.

3.2 Model Learning

The above constrained optimization problem can be solved by a coordinate descent algorithm that solves for one set of parameter matrices at each step with the others fixed. Each step in the algorithm is a regularized risk minimization problem, which can be solved using a bundle method (Teo et al. 2007; Do and Artieres 2009). In a nutshell, the bundle algorithm iteratively builds an increasingly accurate piecewise quadratic lower bound of the objective function. We adopt the bundle method as the inner problem solver due to its efficiency and good convergence.

We first reformulate the optimization problem (14) as an unconstrained regularized risk minimization problem:

$$\begin{aligned} \min _{W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]}}\ \eta \cdot \sum _i\sum _mL_i^{[m]}+\sum _mR^{[m]}, \end{aligned}$$
(15)

where

$$\begin{aligned}&\begin{aligned} L_i^{[m]}&=\max \left( 0,1-y_i\mathrm {Tr}(W_o^{[m]}W_w^{\mathrm {T}}X_i^{[m]})\right) \\&\quad +\,\max \left( 0,1-y_i\mathrm {Tr}(W_q^{[m]}W_w^{[m]\mathrm {T}}X_i^{[m]})\right) \\ R^{[m]}&=I(X^{[m]},O)+I(X^{[m]},Q^{[m]})+I(Q^{[m]},O)+\frac{1}{2}\lambda \cdot \\&\quad \left[ \mathrm {Tr}(W_wW^{[m]\mathrm {T}}_oW_o^{[m]}W_w^{\mathrm {T}})\right. \\&\quad \left. +\,\mathrm {Tr}(W_w^{[m]}W^{[m]\mathrm {T}}_qW_q^{[m]}W_w^{[m]\mathrm {T}})\right] , \end{aligned}\nonumber \\ \end{aligned}$$
(16)

are the empirical loss and the regularizers, respectively. We solve this optimization problem using a coordinate descent algorithm that iteratively updates one block of variables at a time.

Update \(W_w\). Specifically, if \(\{W_w^{[m]},W_o^{[m]},W_q^{[m]}\}\) are fixed, the optimization problem is

$$\begin{aligned} \begin{aligned}&\min _{W_w}\frac{1}{2}\lambda \sum _m\mathrm {Tr}(W_wW^{[m]\mathrm {T}}_oW_o^{[m]}W_w^{\mathrm {T}})\\&\quad +\,\eta \sum _i\sum _m\max \left( 0,1-y_i\mathrm {Tr}(W_o^{[m]}W_w^{\mathrm {T}}X_i^{[m]})\right) . \end{aligned} \end{aligned}$$
(17)

To efficiently solve this problem, we define \(A=\sum _mW_o^{[m]\mathrm {T}}W_o^{[m]}\), and define two auxiliary variables \({\widehat{W}}_w=W_wA^\frac{1}{2}\) and \({\widehat{X}}_i^{[m]}=X_i^{[m]}W_o^{[m]}A^{-\frac{1}{2}}\). Note that A is a matrix of size \(d_o\times d_o\) that is in general invertible for small \(d_o\). Then the problem (17) can be equivalently rewritten as

$$\begin{aligned}&\begin{aligned} \min _{{\widehat{W}}_w}\frac{1}{2}\lambda \mathrm {Tr}({\widehat{W}}_w^{\mathrm {T}}{\widehat{W}}_w)+\eta \sum _i\sum _m\max \left( 0,1-y_i\mathrm {Tr}({\widehat{W}}_w^{\mathrm {T}}{\widehat{X}}_i^{[m]})\right) . \end{aligned}\nonumber \\ \end{aligned}$$
(18)

This is an unconstrained regularized risk minimization problem equivalent to linear SVM if \({\widehat{W}}_w\) and \({\widehat{X}}_i^{[m]}\) are vectorized. We solve this problem using a bundle method. After learning \({\widehat{W}}_w\), the original parameter matrix \(W_w\) can be reconstructed by \(W_w={\widehat{W}}_wA^{-\frac{1}{2}}\).
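The reparameterization behind this step is easy to miss in the notation, so we sketch it below. The inner solver here is a plain subgradient method standing in for the bundle solver used in the paper, and the step size and iteration count are illustrative assumptions.

```python
import numpy as np

def psd_power(A, p, eps=1e-8):
    """A**p for a symmetric positive semidefinite matrix, via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, eps, None)
    return (V * w**p) @ V.T

def update_W_w(X, y, W_o, W_w, lam=1.0, eta=1.0, lr=1e-3, iters=200):
    """One coordinate-descent step for W_w (problems (17)-(18)).

    X: list of dicts keyed by modality ('v', 'z'); y: list of +/-1 labels.
    """
    A = sum(W_o[m].T @ W_o[m] for m in ('v', 'z'))
    A_half, A_neg_half = psd_power(A, 0.5), psd_power(A, -0.5)
    W_hat = W_w @ A_half                                              # change of variables
    X_hat = [{m: Xi[m] @ W_o[m] @ A_neg_half for m in ('v', 'z')} for Xi in X]
    for _ in range(iters):
        grad = lam * W_hat                                            # from 0.5*lam*Tr(W^T W)
        for Xi_hat, yi in zip(X_hat, y):
            for m in ('v', 'z'):
                if yi * np.trace(W_hat.T @ Xi_hat[m]) < 1:            # active hinge term
                    grad -= eta * yi * Xi_hat[m]
        W_hat -= lr * grad
    return W_hat @ A_neg_half                                         # recover W_w
```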

Update \(W_w^{[m]}\). We fix \(\{W_w,W_o^{[m]},W_q^{[m]}\}\), and solve

$$\begin{aligned} \begin{aligned}&\min _{W_w^{[m]}}\frac{1}{2}\lambda \sum _m\mathrm {Tr}(W_w^{[m]}W^{[m]\mathrm {T}}_qW_q^{[m]}W_w^{[m]\mathrm {T}})\\&\quad +\,\eta \sum _i\sum _m\max \left( 0,1-y_i\mathrm {Tr}(W_q^{[m]}W_w^{[m]\mathrm {T}}X_i^{[m]})\right) . \end{aligned} \end{aligned}$$
(19)

Similar to the optimization procedure of parameter matrix \(W_w\), we also define \(B= W_q^{[m]\mathrm {T}}W_q^{[m]}\), and introduce two auxiliary variables \({\overline{W}}_w^{[m]}=W_w^{[m]}B^{\frac{1}{2}}\) and \({\overline{X}}_i^{[m]}=X_i^{[m]}W_q^{[m]}B^{-\frac{1}{2}}\). Then the optimization problem (19) can be equivalently given by

$$\begin{aligned} \begin{aligned}&\min _{{\overline{W}}_w^{[m]}}\frac{1}{2}\lambda \sum _m\mathrm {Tr}\left( {\overline{W}}_w^{[m]^{\mathrm {T}}}{\overline{W}}_w^{[m]}\right) \\&\quad +\,\eta \sum _i\sum _m\max \left( 0,1-y_i\mathrm {Tr}({\overline{W}}_w^{[m]\mathrm {T}}{\overline{X}}_i^{[m]})\right) . \end{aligned} \end{aligned}$$
(20)

This is also an unconstrained regularized risk minimization problem. If \({\overline{W}}_w^{[m]}\) and \({\overline{X}}_i^{[m]}\) are vectorized, this problem can be solved using standard linear SVM solver. After learning \({\overline{W}}_w^{[m]}\), the original parameter matrix \(W_w^{[m]}\) can be reconstructed by \(W_w^{[m]}={\overline{W}}_w^{[m]}B^{-\frac{1}{2}}\).

Update \(W_o^{[m]}\). When \(\{W_w,W_w^{[m]},W_q^{[m]}\}\) are fixed, \(W_o^{[m]}\) for each modality can be optimized in a form similar to Eqs. (15) and (16) but with \(W_w\) held constant. We define \(C=W_w^{\mathrm {T}}W_w\), and further define two auxiliary variables, \({\widetilde{W}}_o\) and \({\widetilde{X}}_i\), as \({\widetilde{W}}_o^{[m]}=W_o^{[m]}C^\frac{1}{2}\) and \({\widetilde{X}}_i^{[m]}=X_i^{[m]\mathrm {T}}W_wC^{-\frac{1}{2}}\). Then, the parameter matrix \({\widetilde{W}}_o^{[m]}\) for each modality can be optimized independently by

$$\begin{aligned} \begin{aligned}&\min _{{\widetilde{W}}_o^{[m]}}\frac{1}{2}\mathrm {Tr}\left( {\widetilde{W}}_o^{[m]\mathrm {T}}{\widetilde{W}}_o^{[m]}\right) +\lambda I({\widetilde{X}}^{[m]},{\widetilde{O}})\\&\quad +\,\eta \sum _i\max \left( 0,1-y_i\mathrm {Tr}({\widetilde{W}}_o^{[m]\mathrm {T}}{\widetilde{X}}_i^{[m]})\right) , \end{aligned} \end{aligned}$$
(21)

with the assumption that the conditional distribution \(p(W_w,C^{-\frac{1}{2}}|X^{[m]},O)\) is a uniform distribution. This is also an unconstrained regularized risk minimization problem and can be solved by a bundle algorithm if \({\widetilde{W}}_o^{[m]}\) and \({\widetilde{X}}_i^{[m]}\) are unfolded into vectors. We solve this step twice, once with the visual features \(X_i^{[v]}\) and once with the depth features \(X_i^{[z]}\). After optimizing \({\widetilde{W}}_o^{[m]}\), \(W_o^{[m]}\) can be recovered by \(W_o^{[m]}={\widetilde{W}}_o^{[m]}C^{-\frac{1}{2}}\).

Update \(W_q^{[m]}\). When \(\{W_w,W_w^{[m]},W_o^{[m]}\}\) are fixed, \(W_q^{[m]}\) can be optimized by

$$\begin{aligned} \begin{aligned}&\min _{W_q^{[m]}}\frac{1}{2}\mathrm {Tr}\left( W_w^{[m]}W^{[m]\mathrm {T}}_qW_q^{[m]}W_w^{[m]\mathrm {T}}\right) +\lambda I(X^{[m]},Q^{[m]})\\&\quad +\,\lambda I(Q^{[m]},O)\\&\quad +\,\eta \sum _i\max \left( 0,1-y_i\mathrm {Tr}(W_q^{[m]}W_w^{[m]\mathrm {T}}X_i^{[m]})\right) , \end{aligned} \end{aligned}$$
(22)

We use a method similar to the one used in learning \(W_o^{[m]}\). We define \(D=W_w^{[m]\mathrm {T}}W_w^{[m]}\), and further introduce two auxiliary variables \({\widehat{W}}_q^{[m]}=W_q^{[m]}D^\frac{1}{2}\) and \({\widehat{X}}_i^{[m]}=X_i^{[m]\mathrm {T}}W_w^{[m]}D^{-\frac{1}{2}}\). Then the parameter matrix \(W_q^{[m]}\) for modality m can be learned by

$$\begin{aligned}&\begin{aligned}&\min _{{\widehat{W}}_q^{[m]}}\frac{1}{2}\mathrm {Tr}\left( {\widehat{W}}^{[m]\mathrm {T}}_q{\widehat{W}}_q^{[m]}\right) +\lambda I\left( {\widehat{X}}^{[m]},{\widehat{Q}}^{[m]}\right) \\&\quad +\,\lambda I({\widehat{Q}}^{[m]},{\widehat{O}})+\eta \sum _i\max \left( 0,1-y_i\mathrm {Tr}({\widehat{W}}_q^{[m]\mathrm {T}}{\widehat{X}}_i^{[m]})\right) , \end{aligned}\nonumber \\ \end{aligned}$$
(23)

The learning of parameter \({\widehat{W}}_q^{[m]}\) can be solved using the bundle algorithm. One of the key steps in the bundle algorithm is computing the subgradient of the mutual information term \(I({\widehat{O}},{\widehat{Q}}^{[m]})\) in the objective function in Eq. (23) with respect to the model parameter \({\widehat{W}}_q^{[m]}\). In this work, the subgradient with respect to the (ij)-th element in the model parameter \({\widehat{W}}_q^{[m]}\) can be computed by

$$\begin{aligned}&\begin{aligned} \frac{\partial I({\widehat{O}},{\widehat{Q}}^{[m]})}{\partial {\widehat{W}}_{q(i,j)}^{[m]}}&={{\mathrm{Tr}}}\Big \{\Big [p({\widehat{O}})\Big (\log \frac{{\widehat{W}}_o^{[m]+}{\widehat{W}}_q^{[m]}}{p({\widehat{Q}}^{[m]})}+1\Big )\Big ]^{\mathrm {T}}\\&\quad \cdot {\widehat{W}}_o^{[m]+}\cdot I_{ij}\Big \}, \end{aligned}\nonumber \\ \end{aligned}$$
(24)

where \({\widehat{W}}_o^{[m]+}\) denotes the pseudo-inverse of \({\widehat{W}}_o^{[m]}\): \({\widehat{W}}_o^{[m]+}={\widehat{W}}_o^{[m]\mathrm {T}}({\widehat{W}}_o^{[m]}{\widehat{W}}_o^{[m]\mathrm {T}})^{-1}\). \(I_{ij}\) is a matrix (of size \(n_f\times d_q\)) with all zeros except a 1 at position (i,j).

We solve the optimization problem (23) twice, once with the visual features \(X_i^{[v]}\) and once with the depth features \(X_i^{[z]}\). After optimizing \({\widehat{W}}_q^{[m]}\), \(W_q^{[m]}\) can be recovered by \(W_q^{[m]}={\widehat{W}}_q^{[m]}D^{-\frac{1}{2}}\).

The proposed MMHIM is solved by iteratively optimizing problems (18), (20), (21), and (23) until convergence. This is a biconvex problem, since optimizing one parameter matrix with the others held fixed is a convex problem. The algorithm converges because optimizing each of the model parameter matrices reduces the objective function value.
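The overall training procedure is thus a simple alternation between four convex subproblems. The skeleton below only fixes the control flow; the per-block solvers and the objective of Eq. (15) are assumed to be supplied by the caller.

```python
def fit_mmhim(params, updates, objective, max_iter=50, tol=1e-4):
    """Coordinate-descent loop for MMHIM.

    params:    dict of parameter matrices ('W_w', 'W_w_m', 'W_o', 'W_q').
    updates:   dict mapping a parameter name to a callable solving the
               corresponding subproblem (18), (20), (21), or (23).
    objective: callable evaluating Eq. (15) on `params`.
    """
    prev = objective(params)
    for _ in range(max_iter):
        for name in ('W_w', 'W_w_m', 'W_o', 'W_q'):   # one block at a time
            params[name] = updates[name](params)      # each step is a convex subproblem
        cur = objective(params)
        if prev - cur < tol:                          # objective is non-increasing
            break
        prev = cur
    return params
```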

3.3 Using Feature Vectors for Classification

The proposed approach takes feature matrices as input in order to capture the spatiotemporal structures of human body parts. However, compared with features in a vector format, relatively few existing features are represented in a matrix form. In order to utilize existing vector-based features [such as skeleton feature vectors (Du et al. 2015; Wang et al. 2012b) or normal vectors (Oreifej and Liu 2013; Yang et al. 2014)], we propose a generalized framework that utilizes features in both a vector format and a matrix format.

Skeleton features are popularly used in action recognition from RGB-D videos due to their high discriminative power (Du et al. 2015; Wang et al. 2012b; Xia and Aggarwal 2013). We extract skeleton features using the method of Du et al. (2015), represent them in a vector format, and use them for classification together with the visual and depth cues that are defined in a matrix format. A new linear potential function \(w^{\mathrm {T}}_sx^{[s]}\) is added to the discriminant function in Eq. (1). Here, \(w_s\) is a vector of model parameters, and \(x^{[s]}\) is a vector of skeleton features. The linear function \(w^{\mathrm {T}}_sx^{[s]}\) measures the compatibility between the skeleton features \(x^{[s]}\) and the action label \(\text {+1/-1}\). By adding \(w^{\mathrm {T}}_sx^{[s]}\), the new discriminant function is

$$\begin{aligned} y_i^*={{\mathrm{sign}}}\Big [{{\mathrm{Tr}}}(W_O^{\mathrm {T}}X_i+W_Q^{\mathrm {T}}X_i)+w^{\mathrm {T}}_sx^{[s]}\Big ]. \end{aligned}$$
(25)

Note that, different from the visual and depth features, we do not learn a shared feature space for the skeleton features in this work.

The model parameter \(w_s\) can be jointly learned with other model parameters \(\{W_o^{[m]},W_q^{[m]},W_w^{[m]},W_w\}|_m\) using the following optimization formulation similar to (15):

$$\begin{aligned} \min _{w_s,W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]}}\ \eta \cdot \sum _i\sum _mL_i^{[m]}+\sum _mR^{[m]}, \end{aligned}$$
(26)

where

$$\begin{aligned} \begin{aligned} L_i^{[m]}&=\max (0,1-y_iw^{\mathrm {T}}_sx^{[s]})\\&\quad +\,\max \left( 0,1-y_i\mathrm {Tr}(W_o^{[m]}W_w^{\mathrm {T}}X_i^{[m]})\right) \\&\quad +\,\max \left( 0,1-y_i\mathrm {Tr}(W_q^{[m]}W_w^{[m]\mathrm {T}}X_i^{[m]})\right) \\ R^{[m]}&=I(X^{[m]},O)+I(X^{[m]},Q^{[m]})+I(Q^{[m]},O)\\&\quad +\,\frac{1}{2}\lambda \Big [\Vert w_s\Vert ^2+\mathrm {Tr}(W_wW^{[m]\mathrm {T}}_oW_o^{[m]}W_w^{\mathrm {T}})\\&\quad +\mathrm {Tr}(W_w^{[m]}W^{[m]\mathrm {T}}_qW_q^{[m]}W_w^{[m]\mathrm {T}})\Big ]. \end{aligned} \end{aligned}$$
(27)

The optimization problem can be solved by a coordinate descent algorithm similar to the one we proposed in Sect. 3.2. If we update \(w_s\) and fix all the other parameters, the learning problem can be written as

$$\begin{aligned} \min _{w_s}\ \frac{1}{2}\lambda \Vert w_s\Vert ^2+\eta \sum _i\max (0,1-y_iw^{\mathrm {T}}_sx^{[s]}), \end{aligned}$$
(28)

which is a standard linear SVM optimization problem and can be solved using an off-the-shelf SVM solver.
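For reference, the \(w_s\) subproblem of Eq. (28) can be solved with any linear SVM routine; the sketch below uses a plain subgradient method as a stand-in for an off-the-shelf solver, with an illustrative step size and iteration count.

```python
import numpy as np

def update_w_s(x_s, y, lam=1.0, eta=1.0, lr=1e-3, iters=500):
    """Solve the w_s subproblem of Eq. (28) by subgradient descent.

    x_s: (N, d_s) array of per-video skeleton feature vectors;
    y:   (N,) array of labels in {-1, +1}.
    """
    w = np.zeros(x_s.shape[1])
    for _ in range(iters):
        grad = lam * w                                    # from 0.5*lam*||w_s||^2
        margins = y * (x_s @ w)
        active = margins < 1                              # samples with nonzero hinge loss
        grad -= eta * (y[active, None] * x_s[active]).sum(axis=0)
        w -= lr * grad
    return w
```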

3.4 Structured Prediction Model

The main limitation of the above MMHIM model is that it cannot be used in structured prediction problems. In this work, we further extend MMHIM to a structured prediction model that can capture the correlations between multiple outputs. We consider a special case of learning a multi-class MMHIM \({{\mathrm{Tr}}}[W^{\mathrm {T}}{\varPhi }(X,y)]\) for \(n_c\) action categories. Here, W is the model parameter matrix and \({\varPhi }(X,y)\) is a feature function that models the agreement between the low-level features X and the action label y. Various structured prediction models can be developed based on Structured MMHIM by using more complex model structures (e.g., a sequence of video frames, multiple body parts in part-based models). This allows us to jointly learn shared features, private features, and structured labels.

The key in the structured prediction model is the feature function \({\varPhi }(X_i,y_i)\). In this work, we define the feature function as

$$\begin{aligned} W^{\mathrm {T}}{\varPhi }(X_i,y_i)=W^{\mathrm {T}}_O{\varPhi }_O(X_i,y_{i})+W^{\mathrm {T}}_Q{\varPhi }_Q(X_i,y_{i}). \end{aligned}$$
(29)

Similar to the ones in Eq. (1), both \(W^{\mathrm {T}}_O\) and \(W^{\mathrm {T}}_Q\) can be decomposed into two components, a classification component and a projection component. These components in \(W^{\mathrm {T}}_O\) or \(W^{\mathrm {T}}_Q\) are used to score the visual modality and depth modality in the sample \(X_i\).

We define the potential functions \(W^{\mathrm {T}}_O{\varPhi }_O(X,y)\) and \(W^{\mathrm {T}}_Q{\varPhi }_Q(X,y)\) as

$$\begin{aligned}&\begin{aligned} {{\mathrm{Tr}}}[W^{\mathrm {T}}_O{\varPhi }_O(X,y)]&=\sum _m\sum _{a\in {\mathcal {Y}}} W_{w,a}^{\mathrm {T}} \cdot 1(y=a)\cdot X^{[m]} W_o^{[m]} ,\\ {{\mathrm{Tr}}}[W^{\mathrm {T}}_Q{\varPhi }_Q(X,y)]&=\sum _m\sum _{a\in {\mathcal {Y}}} W_{w,a}^\mathrm {[m]T} \cdot 1(y=a)\cdot X^{[m]} W_q^{[m]}, \end{aligned}\nonumber \\ \end{aligned}$$
(30)

where a is an index for action labels and \(1(\cdot )\) is an indicator function. Note that as we are considering a multi-class classification problem, the classification parameters \(W_w\) in \(W_O\) and \(W_w^{[m]}\) in \(W_Q\) contain \(n_c\) classification templates, respectively: \(W_w=[W_{w1},W_{w2},\cdots ,W_{wn_c}]\) and \(W_w^{[m]}=[W_{w1}^{[m]},W_{w2}^{[m]},\cdots ,W_{wn_c}^{[m]}]\), where each \(W_{wt}\) or \(W^{[m]}_{wt}\) can be regarded as a classifier for action category \(t\in \{1,\cdots , n_c\}\). Feature function \({\varPhi }_O(X_i,y_{i})\) can be explicitly expressed as

$$\begin{aligned} {\varPhi }_O(X_i,y_{i})=({\mathbf {0}},{\mathbf {0}},\cdots ,X_{i},\cdots ,{\mathbf {0}},{\mathbf {0}})^{T}\in {\mathcal {R}}^{n_{c}\times d}. \end{aligned}$$
(31)

Here, \({\varPhi }_O(X_i,y_{i})\) is a matrix with \(n_{c}\) rows. \(X_{i}\) is located in the \(y_{i}\)-th (out of \(n_{c}\)) row of \({\varPhi }_O(X_i,y_{i})\), and d is the length of \(X_{i}\). Similarly, feature function \({\varPhi }_Q(X_i,y_{i})\) can be given by

$$\begin{aligned} {\varPhi }_Q(X_i,y_{i})=({\mathbf {0}},{\mathbf {0}},\cdots ,X_{i},\cdots ,{\mathbf {0}},{\mathbf {0}})^{T}\in {\mathcal {R}}^{n_{c}\times d}. \end{aligned}$$
(32)

Compared to MMHIM, Structured MMHIM only takes feature vectors as input and thus does not capture spatiotemporal structure information. The feature functions \({\varPhi }_O(\cdot )\) and \({\varPhi }_Q(\cdot )\) in Structured MMHIM are matrices corresponding to the feature matrix X in MMHIM, and are projected using the projection matrices \(W_o^{[m]}\) and \(W_q^{[m]}\), respectively. A straightforward way to design the feature functions \({\varPhi }_O(\cdot )\) and \({\varPhi }_Q(\cdot )\) in Structured MMHIM is to follow the structured support vector machine (SSVM) (Joachims et al. 2009), where feature vectors are used to compute these feature functions. Following Joachims et al. (2009), the feature vector \(X_{i}\) is placed in the \(y_{i}\)-th row of \({\varPhi }_O(X_i,y_{i})\) (Eq. 31), as sketched below. Extending Structured MMHIM to use feature matrices is feasible, but it requires redesigning the projection matrices and the classification parameter matrices.
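The block-sparse feature function of Eqs. (31)-(32) is straightforward to construct; a minimal sketch follows (with a 0-based label index, which is our own convention):

```python
import numpy as np

def feature_function(x, y, n_c):
    """Phi(x, y): place the length-d feature vector x in the y-th of n_c rows,
    with zeros everywhere else."""
    Phi = np.zeros((n_c, x.shape[0]))
    Phi[y] = x
    return Phi
```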

We learn the Structured MMHIM using a formulation similar to the one proposed in (14), which is

$$\begin{aligned}&\min _{W_w,W_w^{[m]},W_o^{[m]},W_q^{[m]}}\ \sum _m\Big \{I\left( {\varPhi }(X^{[m]},y),{\varPhi }(O,y)\right) \nonumber \\&\quad +\,I\left( {\varPhi }(X^{[m]},y),{\varPhi }(Q^{[m]},y)\right) +I\left( {\varPhi }(Q^{[m]},y),{\varPhi }(O,y)\right) \nonumber \\&\quad +\,\frac{1}{2}\lambda \Big [\mathrm {Tr}\left( W_wW^{[m]\mathrm {T}}_oW_o^{[m]}W_w^{\mathrm {T}}\right) \nonumber \\&\quad +\,\mathrm {Tr}\left( W_w^{[m]}W^{[m]\mathrm {T}}_qW_q^{[m]}W_w^{[m]\mathrm {T}}\right) \Big ]+\eta \sum _i(\xi _i^{[m]}+\epsilon _i^{[m]})\Big \},\nonumber \\&\text {s.t.}\qquad \mathrm {Tr}\left( W_o^{[m]}W_w^{\mathrm {T}}{\varDelta }{\varPhi }(X_i^{[m]},y_i)\right) \geqslant 1-\xi _i^{[m]},\quad \forall i, \forall m,\nonumber \\&\quad \quad \quad \ \mathrm {Tr}\left( W_q^{[m]}W_w^{[m]\mathrm {T}}{\varDelta }{\varPhi }(X_i^{[m]},y_i)\right) \geqslant 1-\epsilon _i^{[m]},\quad \forall i, \forall m,\nonumber \\&\quad \quad \quad \ \xi _i^{[m]}\geqslant 0,\quad \forall i, \forall m,\nonumber \\&\quad \quad \quad \ \epsilon _i^{[m]}\geqslant 0,\quad \forall i, \forall m, \end{aligned}$$
(33)

where \({\varDelta }{\varPhi }(X_i^{[m]},y_i)={\varPhi }(X_i^{[m]},y_i)-{\varPhi }(X_i^{[m]},y)\).

It should be noted that the key components of Structured MMHIM are the feature functions \({\varPhi }_O(\cdot )\) and \({\varPhi }_Q(\cdot )\). Structured MMHIM is flexible in its graph structure and is capable of predicting structured labels, while MMHIM is not. In the following, we design a Structured MMHIM for modeling temporal sequences of frames of varying length, which MMHIM cannot handle.

Example We use Structured MMHIM to model the temporal frames of a video in this example. Consider a graph \(G=\{V,E\}\), where V is a set of nodes and E is a set of edges. Here, a node corresponds to a (RGB or depth) video frame and an edge links two successive frames. Structured MMHIM projects the t-th RGB and depth video frames (\(X_{t}^{[v]}\) and \(X_{t}^{[z]}\)) onto a shared feature space to learn shared features, and projects the two modalities onto independent private spaces to learn modality-specific features as well. Structured MMHIM predicts an action label \(y_{t}\in {\mathcal {Y}}\) for each frame in a video using both shared and private features. A majority voting scheme over all the frames in a video is then adopted to infer the label of the video, as sketched below.
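The majority-voting step used to turn per-frame predictions into a video-level label is sketched below; the frame labels are assumed to be integer class indices produced by Structured MMHIM-2.

```python
import numpy as np

def video_label_by_majority(frame_labels):
    """Infer the video label as the most frequent per-frame prediction."""
    labels, counts = np.unique(np.asarray(frame_labels), return_counts=True)
    return labels[np.argmax(counts)]
```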

Suppose a video has T frames, then the potential functions \(W_O^{\mathrm {T}}{\varPhi }(X,y)\) and \(W_Q^{\mathrm {T}}{\varPhi }(X,y)\) in the Structured MMHIM can be defined as

$$\begin{aligned}&{{\mathrm{Tr}}}[W^{\mathrm {T}}_O{\varPhi }_O(X,y)] \nonumber \\&\quad =\sum _m\Big \{\underbrace{\sum _{t=1}^{T}\sum _{a\in {\mathcal {Y}}} [W_{w,a}^{\mathrm {T}} \cdot 1(y_{t}=a)\cdot X_{t}^{[m]} W_o^{[m]}]}_{\textstyle {\text {unary potential}}} \nonumber \\&\qquad +\,\underbrace{\sum _{t=1}^{T-1}\sum _{a\in {\mathcal {Y}}}\sum _{b\in {\mathcal {Y}}}[W_{w,a,b}\cdot 1(y_{t}=a)\cdot 1(y_{t+1}=b)]\Big \}}_{\textstyle {\text {pairwise potential}}}, \nonumber \\ \end{aligned}$$
(34)
$$\begin{aligned}&{{\mathrm{Tr}}}[W^{\mathrm {T}}_Q{\varPhi }_Q(X,y)] \nonumber \\&\quad =\sum _m\Big \{\underbrace{\sum _{t=1}^{T}\sum _{a\in {\mathcal {Y}}} [W_{w,a}^{[m]\mathrm {T}} \cdot 1(y_{t}=a)\cdot X_{t}^{[m]} W_q^{[m]}]}_{\textstyle {\text {unary potential}}}\nonumber \\&\qquad +\,\underbrace{\sum _{t=1}^{T-1}\sum _{a\in {\mathcal {Y}}}\sum _{b\in {\mathcal {Y}}}[W_{w,a,b}^{[m]}\cdot 1(y_{t}=a)\cdot 1(y_{t+1}=b)]}_{\textstyle {\text {pairwise potential}}}\Big \}.\nonumber \\ \end{aligned}$$
(35)

Here, \(W_{w}=\{W_{w,a}\}|_{a=1,\cdots ,n_{c}}\) and \(W_{w}^{[m]}=\{W_{w,a}^{[m]}\}|_{a=1,\cdots ,n_{c}}\) are classification matrices, \(\{W_{w,a,b}\}|_{a,b=1,\cdots ,n_{c}}\) and \(\{W_{w,a,b}^{[m]}\}|_{a,b=1,\cdots ,n_{c}}\) are pairwise classification parameters, and \(W_o^{[m]}\) and \(W_q^{[m]}\) are projection matrices for shared and private features, respectively.

The first terms in the above two equations are unary potentials and the second terms are pairwise potentials. The unary potentials model the compatibility between the low-level projected frame features \(X_t^{[m]}W_o^{[m]}\) (or \(X_t^{[m]}W_q^{[m]}\)) and the classification template \(W_{w,a}^{\mathrm {T}}\) (or \(W_{w,a}^{[m]\mathrm {T}}\)), while the pairwise potentials capture the compatibility between successive frames. We refer to this model as Structured MMHIM-2 and to the original Structured MMHIM in Eqs. (30) and (33) as Structured MMHIM-1.
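
The paper does not spell out the inference procedure for this chain model; one standard choice is exact dynamic-programming (Viterbi-style) decoding over the unary and pairwise potentials. The sketch below assumes the potential values have already been computed from the projected features and the classification parameters in Eqs. (34)-(35); all names are illustrative.

import numpy as np

def viterbi_decode(unary, pairwise):
    # unary: (T, C) array, unary[t, a] = unary potential of label a at frame t
    # pairwise: (C, C) array, pairwise[a, b] = pairwise potential of label a followed by b
    # returns the label sequence maximizing the total chain potential
    T, C = unary.shape
    score = unary[0].copy()                 # best score of any labeling ending in label a at frame 0
    backptr = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        # candidate[a, b]: best score ending in label a at frame t-1, then label b at frame t
        candidate = score[:, None] + pairwise + unary[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    labels = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        labels.append(int(backptr[t, labels[-1]]))
    return labels[::-1]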

Comparison Structured MMHIM-1 is a general framework for structured prediction; it is not defined on a specific graph structure. By extending its potential functions in Eq. (30) to Eqs. (34) and (35), the new structured model, Structured MMHIM-2, defines a graph for modeling frame sequences. Eq. (34) projects all the depth and RGB frames onto a shared space, Eq. (35) learns modality-specific features, and the pairwise potentials in both equations model the correlations between successive frames in a video. Both Structured MMHIM-1 and MMHIM have to sample a fixed number of frames (10 frames in this work) in order to fix the size of the input, whereas Structured MMHIM-2 is not restricted to a particular number of frames per video. Neither Structured MMHIM-1 nor MMHIM captures the correlations between successive frames in a video, whereas Structured MMHIM-2 does. MMHIM takes feature matrices as input, while Structured MMHIM-1 and Structured MMHIM-2 take feature vectors as input.

3.5 Model Properties

We would like to discuss key properties of the proposed MMHIM here.

Matrix format feature representation The matrix-form features used in this work naturally preserve spatiotemporal motion structure. Recall that the feature matrix X is of size \(n_{xyt}\times n_f\); it pulls the feature dimension apart from the collapsed spatiotemporal dimension (x-y-t). In such a representation, the spatiotemporal structure is kept by the \(n_{xyt}\) pixels in the feature matrix X, and the motion relationships of body parts are encoded in the rows of X. After projection using \(W_o\) or \(W_q\), the structure of the \(n_{xyt}\) pixels in X and the motion relationships are still preserved in the projected feature matrix \(XW_o\) or \(XW_q\), since \(W_o\) or \(W_q\) operates only on the columns of X.

However, if we use a feature vector \({\mathbf {x}}\) instead of a feature matrix X, the spatiotemporal structure and the motion relationships between body parts are not preserved in the projected features \(W_o^{\mathrm {T}}{\mathbf {x}}\) or \(W_q^{\mathrm {T}}{\mathbf {x}}\), because all elements of \({\mathbf {x}}\) are involved in the projection. Even if the feature vector \({\mathbf {x}}\) itself captures structure information, for example via a spatiotemporal pyramid, that information collapses after projection due to the involvement of the other elements of \({\mathbf {x}}\).
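
The contrast can be made concrete with a small numerical sketch; the dimensions follow the NEU-UB features described in Sect. 4.2, and the variable names are ours.

import numpy as np

rng = np.random.default_rng(0)
n_xyt, n_f, d_o = 3000, 93, 10

X = rng.standard_normal((n_xyt, n_f))    # feature matrix: one row per spatiotemporal location
W_o = rng.standard_normal((n_f, d_o))    # projection acting only on the feature dimension

O = X @ W_o                              # (n_xyt, d_o): row i still corresponds to location i,
                                         # so the x-y-t structure survives the projection

x = X.reshape(-1)                        # vectorized features collapse the x-y-t layout
W_vec = rng.standard_normal((n_xyt * n_f, d_o))
o_vec = W_vec.T @ x                      # (d_o,): every location contributes to every element,
                                         # so the spatiotemporal structure is lost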

The matrix-form representation used in this work differs from the 4th-order tensor format in Pirsiavash et al. (2009). Their method captures the width, height, temporal extent, and feature dimension of a spatiotemporal window, and the rank restriction in their work forces a spatiotemporal template to be separable along the x, y, and t axes. By comparison, our representation considers spatiotemporal structure jointly: we keep the spatiotemporal dimensions together and pull out only the feature dimension.

Low-rank bilinear model MMHIM naturally models feature matrices using two model parameter matrices, \(W_o\) (or \(W_q\)) and \(W_w\). The rank of the proposed model is minimized to provide better generalization power (Wolf et al. 2007). We show in the experiments that such a bilinear model can learn complex mappings, and its performance is even better than that of deep models (Liu and Shao 2013).
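
To see the low-rank structure explicitly, note that the two parameter matrices act like a single effective classifier whose rank is bounded by the shared-space dimensionality, and a trace regularizer of the form appearing in Eq. (33) equals its squared Frobenius norm. The shapes below are our own assumptions for illustration.

import numpy as np

rng = np.random.default_rng(1)
n_f, d_o, n_c = 93, 4, 6                 # feature dim, shared-space dim, number of classes
W_o = rng.standard_normal((n_f, d_o))    # projection matrix
W_w = rng.standard_normal((n_c, d_o))    # classification parameters (shape assumed)

W = W_o @ W_w.T                          # effective (n_f, n_c) classifier; rank at most d_o
print(np.linalg.matrix_rank(W))          # prints 4

# the trace regularizer equals the squared Frobenius norm of the effective classifier
lhs = np.trace(W_w @ W_o.T @ W_o @ W_w.T)
print(np.isclose(lhs, np.linalg.norm(W, 'fro') ** 2))  # True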

Information measure The information measure is computed during data projection in order to compress the data and reduce noise in the learned space. We validate its effectiveness in the experiments.

Cross-modal features Our MMHIM learns cross-modal features from RGB and depth data. The cross-modal features are discriminative for classification as they capture implicit correlations between RGB and depth data and inherit their characteristics, including motion, 3D structural, and spatiotemporal correlation information.

Knowledge transfer The learned projection matrices \(W_o^{[m]}\) and \(W_q^{[m]}\) transfer information from the original data \(X^{[m]}\) to the learned shared features O and private features Q. This helps exploit cross-modal knowledge if one modality is missing in testing.

Structured prediction Structured MMHIM is capable of predicting structured outputs. This allows us to fuse heterogeneous cues and capture relationships between multiple outputs at the same time. Structured MMHIMs can be applied to temporal sequences, whereas MMHIM cannot.

The third modality Our MMHIM uses two modalities and can be extended to use a third modality. For Kinect sensors, the third modality could be skeleton features, which capture motion information from body joints. However, existing skeleton features are generally represented in a vector format (Du et al. 2015; Wang et al. 2012b). In order to project skeleton features into the feature space shared with the visual and depth modalities, the raw skeleton features need to be represented in a matrix format so that they can be fed into MMHIM. A possible modification for converting vector-based skeleton features into matrix-based ones is to: (1) fix the number of spatiotemporal locations of the features; and (2) pull out the feature dimension. The modified skeleton features are then in matrix format and can be shared with the other two sources in the learned space, as sketched below. Nevertheless, this requires designing a new feature representation for skeleton features, which is beyond the scope of this paper.
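
A minimal sketch of such a conversion is given below. The (frame, joint, feature) ordering of the input vector and the function name are our assumptions; the actual skeleton descriptor of Du et al. (2015) may be organized differently.

import numpy as np

def skeleton_vector_to_matrix(skel_vec, n_frames, n_joints):
    # skel_vec: 1-D concatenated skeleton descriptor, assumed ordered by (frame, joint, feature)
    # returns a matrix with one row per spatiotemporal location (a joint at a frame)
    # and the feature dimension pulled out, matching the matrix format used by MMHIM
    skel_vec = np.asarray(skel_vec)
    n_f = skel_vec.size // (n_frames * n_joints)
    return skel_vec.reshape(n_frames * n_joints, n_f)

# e.g., 10 frames x 20 joints x 3 coordinates -> a 200 x 3 feature matrix
X_skel = skeleton_vector_to_matrix(np.arange(10 * 20 * 3, dtype=float), 10, 20)
print(X_skel.shape)  # (200, 3)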

Fig. 6

Example frames of the NEU-UB dataset. The six action classes, from left to right, are a “Lateral Bend”, b “Left Leg Lift”, c “Right Leg Lift”, d “Squat”, e “Step Backward”, and f “Step Forward”

Fig. 7

Confusion matrices of a the RGB-only model (overall accuracy\(\,=\,80.00\%\)), b the depth-only model (overall accuracy\(\,=\,58.25\%\)), and c the full MMHIM method (overall accuracy\(\,=\,83.51\%\)) on the NEU-UB dataset. Misclassification examples are also shown on the right

4 Experiments

4.1 Datasets and Settings

We collect a new RGB-D action dataset, named the NEU-UB dataset, to test our approach. The example frames in Fig. 6 show that the depth videos in this dataset are extremely noisy; it is therefore challenging for methods that use only the depth modality. The dataset was captured by three types of devices: a Kinect sensor, Vicon sensors, and IMU sensors. We use the RGB and depth data generated by the Kinect sensor in this work. The NEU-UB dataset contains 6 action categories: “Lateral Bend”, “Left Leg Lift”, “Right Leg Lift”, “Squat”, “Step Backward”, and “Step Forward”. Each action is performed by 20 subjects, and each subject repeats an action 5 times, giving a total of 600 RGB-D videos. Visually similar temporal segments frequently occur across different action categories in this dataset, for example, “Step Backward” and “Step Forward”.

The proposed method is also evaluated on the MSR Action Pairs dataset (Oreifej and Liu 2013) and the MSR Daily Activity dataset (Wang et al. 2012b). The MSR Action Pairs dataset is an indoor RGB-D action dataset containing 12 types of activities performed by 10 subjects, with both RGB and depth videos. Each actor repeats an action three times, giving a total of 360 videos for each of the RGB and depth modalities. The MSR Daily Activity dataset contains 16 types of activities performed by 10 subjects. Each actor repeats an action twice, giving a total of 320 videos for each of the RGB and depth channels.

4.2 NEU-UB Action Dataset

Videos in this dataset are temporally normalized to 10 frames with a spatial resolution of \(120\times 160\). Histograms of oriented gradients (HOG) and histograms of oriented flow (HOF) features are extracted from both the color and depth videos. A total of \(n_{xyt}=3000\) patches are extracted from each video, with a feature dimensionality of \(n_f=93\). A cross-validation training strategy is adopted for this dataset: the videos of the first 10 subjects are used for training, the videos of 4 subjects are used for cross-validation, and the videos of the remaining subjects are used for testing.
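
For reference, a minimal sketch of this subject-wise split is shown below; we assume the 4 validation subjects are simply the next four subject indices after the training ones, which the text does not state explicitly.

import numpy as np

def neu_ub_split(subject_ids):
    # subject_ids: per-video subject indices in 1..20
    # subjects 1-10 -> training, 11-14 -> cross-validation (assumed), 15-20 -> testing
    subject_ids = np.asarray(subject_ids)
    train = subject_ids <= 10
    val = (subject_ids > 10) & (subject_ids <= 14)
    test = subject_ids > 14
    return train, val, test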

4.2.1 Classification Performance

Confusion matrices of the proposed MMHIM using RGB data, depth data, and the full RGB-D data on the NEU-UB dataset are shown in Fig. 7. MMHIM achieves 83.51% accuracy in classifying actions in RGB-D videos. Misclassifications are mainly due to visually similar movements, for example, between “Step Backward” and “Step Forward”, and between “Step Backward” and “Right Leg Lift” (shown on the right-hand side). 13% of “Lateral Bend” videos are misclassified as “Squat”, mainly because of similar temporal segments: “Lateral Bend” videos contain long durations with the person standing still, which closely resemble “Squat”. 13% of “Step Backward” videos are misclassified as “Step Forward” due to motion similarities; the two actions mainly differ in how the distance from the human subject to the camera changes over time, which is not very clear in color videos. 7% and 9% of “Step Backward” videos are also misclassified as “Left Leg Lift” and “Right Leg Lift”, respectively. The underlying reason is that in these misclassified videos, the subjects perform other, similar movements (lifting a leg) during their action executions, which confuses MMHIM.

Fig. 8

Visualizations of a raw RGB features \(X^{[v]}\), b raw depth features \(X^{[z]}\), c the projected visual features \(X^{[v]}W_o^{[v]}\), d the projected depth features \(X^{[z]}W_o^{[z]}\), e the learned cross-modal features O in the shared space, f the projected visual features \(X^{[v]}W_q^{[v]}\), and g the projected depth features \(X^{[z]}W_q^{[z]}\)

4.2.2 Comparison Results

We use Du et al. (2015) to extract skeleton features and represent them in a vector format. The MMHIM+Skeleton method is built using the formulation in Sect. 3.3. We compare the MMHIM+Skeleton method, MMHIM, Structured MMHIM-1, and Structured MMHIM-2 with existing methods (Pirsiavash et al. 2009; Jia et al. 2014; Oreifej and Liu 2013; Yang et al. 2014; Kong and Fu 2015) on the NEU-UB dataset. Linear SVM is adopted as a baseline. Comparison results are shown in Table 2.

Table 2 Recognition accuracy of comparison methods on NEU-UB dataset

Results show that the MMHIM+Skeleton method outperforms all the comparison approaches. It achieves 87.78% accuracy, outperforming MMHIM by 4.27% thanks to the use of skeleton features. Skeleton features naturally capture the motion of 20 body joints, which is helpful in discriminating visually similar action categories. Structured MMHIM-2 outperforms MMHIM by 2.05% on the NEU-UB dataset as it captures the correlation between successive frames using the pairwise potentials in Eqs. (34) and (35); by contrast, MMHIM only samples 10 frames per video and does not capture this correlation information. MMHIM achieves 83.51% accuracy, 2.29% higher than BHIM with the same features, which demonstrates the benefit of learning private features in MMHIM. Given RGB-D data, MMHIM learns two extra pieces of information, i.e., private features for the RGB data and private features for the depth data. By encouraging minimum redundancy with the learned shared features, the learned private features capture unique information that is not sharable with the other modality, for example, a 2D edge cue in the RGB modality or a distance cue in the depth modality. Such information is discriminative for classification, and using it further improves the classification performance of MMHIM. MMHIM outperforms linear SVM owing to the learning of better feature representations: linear SVM simply concatenates visual and depth features into a long vector, which may not capture the complex correlations between the visual and depth modalities, whereas MMHIM finds the optimal space for fusing the two modalities and thus improves the performance. Similar to MMHIM, bilinear SVM also learns a shared feature space for the two modalities; nevertheless, MMHIM filters out redundant information and reduces noise in the learned features using the information measure \(\phi (W_o^{[m]},W_q^{[m]})|_m\) in Eq. (6), which improves the discriminative power of the features. Compared with Jia et al. (2014), the learned features in MMHIM are optimized for classification and are therefore better able to distinguish visually similar actions. The methods of Oreifej and Liu (2013) and Yang et al. (2014) achieve surprisingly low performance compared with MMHIM due to the noisy depth data; thanks to its ability to effectively use the visual data, MMHIM extracts discriminative features from RGB videos and thus outperforms them. In addition, MMHIM represents features in a matrix form, which allows us to construct a low-rank bilinear model that improves the generalization power. The learned features and parameter matrices of MMHIM are visualized in Fig. 8.

4.2.3 Single Modality

We evaluate the performance of MMHIM in the single-modality scenario and investigate the contribution of each modality to recognition. In this experiment, we train MMHIM using data from a single modality (RGB or depth) and test it on the same modality. MMHIM and Structured MMHIM-1 are compared with linear SVM (with HOG, HOF, and HOGHOF features), bilinear SVM, and BHIM (Kong and Fu 2015). In addition, we independently build two MMHIMs, each fed with either the RGB or the depth modality, and average their prediction scores to predict action labels. We refer to this method as MMHIM-ind. Results are shown in Table 3.
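
A minimal sketch of the MMHIM-ind fusion step is given below, assuming each independently trained model outputs one score per class per video; the array names are ours.

import numpy as np

def mmhim_ind_predict(scores_rgb, scores_depth):
    # scores_rgb, scores_depth: (n_videos, n_classes) prediction scores of the two
    # independently trained MMHIMs; labels are predicted from the averaged scores
    fused = 0.5 * (scores_rgb + scores_depth)
    return fused.argmax(axis=1)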

Table 3 Comparison results on NEU-UB Dataset given depth-only, RGB-only, and RGB-D data

The recognition accuracy in Table 3 shows that MMHIM achieves superior performance in a majority of cases. Although HOGHOF+linear SVM achieves better results than the HOG- or HOF-based methods, these linear SVM methods do not fully exploit the rich correlations between multiple features. MMHIM, on the contrary, finds better feature spaces that suppress noise and extract discriminative features, and thus achieves significantly better performance. MMHIM outperforms bilinear SVM by 7.20% given depth data; compared with bilinear SVM, MMHIM utilizes the information measure to compress the data and elegantly reduce its redundancy. MMHIM also outperforms BHIM owing to the learning of discriminative private features. Given a single RGB or depth modality, the “shared features” O and the private features \(Q^{[m]}\) in MMHIM may capture similar discriminative information; the private features then add little beyond the shared features, and the improvement of MMHIM over BHIM is only around 1%. The improvement increases to 2.29% when both modalities are provided: the shared features are then compromised in order to capture the information that overlaps between the two modalities and may not capture all the discriminative information within one modality, so the private features act as complementary features to the shared features, and the improvement increases. The results also clearly show that RGB information contributes more to classification than depth information on this dataset, possibly because the motion variations of depth videos across categories are subtle and the depth videos in the NEU-UB dataset are much noisier than the RGB videos.

The results also show that MMHIM significantly outperforms MMHIM-ind by 15.26% given RGB-D (Depth+RGB) data. MMHIM-ind achieves low performance because the depth data are noisy and lower the overall performance if we simply combine the prediction results on the RGB and depth data. MMHIM-ind trains on RGB and depth data independently and does not leverage the correlations between the two modalities to improve discriminability. We observe in the experiments that the predictions of MMHIM-ind on RGB data are always inconsistent with its predictions on the noisy depth data; if we simply average the prediction scores of the two modalities, the accurate results on RGB data are compromised and the overall performance is low (68.25% in Table 3). On the contrary, MMHIM learns shared and private features from RGB-D data jointly. Even though the depth data are noisy, MMHIM projects the data onto a subspace where the discriminative information of the depth and RGB data can be jointly discovered. This enhances the discriminative power of the features learned from the depth data, and thus MMHIM outperforms MMHIM-ind.

Table 4 Knowledge transfer results on NEU-UB dataset

4.2.4 Knowledge Transfer

We evaluate the performance of our MMHIM in missing-modality scenarios, where one modality is missing in training or testing. MMHIM is tested in the following four scenarios: depth data missing in testing (RGB-D \(\rightarrow \) RGB), RGB data missing in testing (RGB-D \(\rightarrow \) Depth), depth data missing in training (RGB \(\rightarrow \) RGB-D), and RGB data missing in training (Depth \(\rightarrow \) RGB-D). We compare MMHIM with linear SVM, bilinear SVM, BHIM (Kong and Fu 2015), and Structured MMHIM-1, and investigate how the knowledge transferred from the observed modality influences the performance of the five methods. We also build a baseline, MMHIM-ind, which trains an independent classifier for each modality and uses the relevant classifier in the prediction phase according to which modality is present.
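
To illustrate how the learned projections are reused when one modality is missing at test time, a rough sketch is given below. The shapes of the class templates and the exact scoring rule are our assumptions, chosen to mirror the trace form of the potentials in Eqs. (33)-(35); they are not taken verbatim from the paper.

import numpy as np

def score_classes_single_modality(X, W_o_m, W_q_m, W_w, W_w_m):
    # X:     (n_xyt, n_f) feature matrix of the observed modality m
    # W_o_m: (n_f, d_o) shared-space projection of modality m
    # W_q_m: (n_f, d_q) private-space projection of modality m
    # W_w:   list of per-class templates for shared features, each (n_xyt, d_o) (assumed shapes)
    # W_w_m: list of per-class templates for private features, each (n_xyt, d_q)
    O_hat = X @ W_o_m        # shared features recovered from the observed modality alone
    Q_hat = X @ W_q_m        # private features of the observed modality
    return [float(np.trace(Wa.T @ O_hat) + np.trace(Wam.T @ Q_hat))
            for Wa, Wam in zip(W_w, W_w_m)]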

The recognition results in Table 4 show that MMHIM significantly outperforms linear and bilinear SVM owing to the matrix-form feature representation and the learned cross-modal and private features. MMHIM outperforms MMHIM-ind because the information shared between the depth and RGB modalities is discovered and used in training MMHIM. The performance gap is significant when the RGB modality is missing in testing (RGB-D \(\rightarrow \) Depth), as RGB data are more discriminative on the NEU-UB dataset; without the RGB modality in testing, MMHIM-ind is confused by the noisy depth data. The effectiveness of the private features in MMHIM can be clearly seen from the performance gap between MMHIM and BHIM, especially in the RGB-D \(\rightarrow \) Depth scenario, where MMHIM outperforms BHIM by 22.11%. By extracting private features from the test depth videos, MMHIM can use extra discriminative information for classification, while BHIM can only use the raw noisy depth features.

4.2.5 Sensitivity to Dimensionality

We also evaluate the sensitivity of MMHIM to the dimensionality \(d_o\) of the shared feature space and the dimensionality \(d_q\) of the private feature space. The performance of MMHIM for different values of \(d_o\) and \(d_q\) is visualized in Fig. 9. Both \(d_o\) and \(d_q\) range from 5 to 90 with a step of 5 (\(d_o\) and \(d_q\) must be less than \(n_f=93\)). The results show that MMHIM is not sensitive to either parameter: the performance variation with respect to \(d_o\) is only 1.4%, and it is only 1.75% with respect to \(d_q\).

Fig. 9

RGB-D action recognition results of our MMHIM on the NEU-UB dataset with different a dimensionality \(d_o\) of the shared feature space and b dimensionality \(d_q\) of the private feature space

4.3 MSR Action Pairs Dataset

We follow the same preprocessing scheme as for the previous dataset. The same training split as in Oreifej and Liu (2013) is adopted, where the RGB-D videos of the first 5 subjects are used as training data.

4.3.1 Comparison Experiment

We compare with existing methods (Oreifej and Liu 2013; Wang et al. 2012b; Yang et al. 2012, 2014; Jia et al. 2014; Kong and Fu 2015) and use linear SVM as a baseline. We also extend the bilinear SVM (Pirsiavash et al. 2009) to handle two-modality data and use it as an additional baseline.

Results in Table 5 show that MMHIM outperforms all the comparison approaches. MMHIM achieves 100% accuracy, significantly outperforming linear SVM by 12.22%. Linear SVM simply concatenates RGB and depth features into a long vector, while MMHIM finds the optimal space for fusing the two types of features. Both bilinear SVM and MMHIM learn a shared feature space for feature fusion; however, MMHIM gains 3.89% over bilinear SVM, demonstrating the benefit of noise reduction in MMHIM. To reduce noise during data projection, MMHIM uses the information measure \(\phi (W_o^{[m]},W_q^{[m]})|_m\) in Eq. (6) to compress the data and remove information irrelevant to the recognition task. MMHIM also outperforms the methods of Oreifej and Liu (2013), Wang et al. (2012b), and Yang et al. (2012, 2014), which shows the benefits of effectively utilizing both visual and depth data and of representing features in a matrix form. The matrix-form feature representation allows us to construct a low-rank bilinear model that improves the generalization power; its effectiveness can also be clearly seen from the performance difference between MMHIM and Structured MMHIM-1. The learned features and parameter matrices of MMHIM are visualized in Fig. 10.

Table 5 Recognition accuracy of comparison methods on MSR Action Pairs dataset

4.3.2 Sensitivity to Parameters

In this experiment, we investigate the sensitivity of MMHIM to the maximum rank \(d_o\), the parameter \(\eta \), and the parameter \(\lambda \) in Eq. (2). \(d_q\) is fixed to 10.

Table 6 Comparison results with various dimensionality \(d_o\) of the feature space

We first test the sensitivity of MMHIM to the maximum rank \(d_o\). MMHIM is compared with linear SVM and bilinear SVM for various values of \(d_o\). Note that there are a total of \(n_{xyt}\cdot d_o\) elements in the shared space for each modality in MMHIM and bilinear SVM. For a fair comparison, we use PCA to reduce the dimensionality of each modality's feature vectors to \(n_{xyt}\cdot d_o\) for linear SVM, so that all five methods have the same number of elements in their low-dimensional features. The projected visual and depth features are concatenated into a long vector and fed to the linear SVM. In bilinear SVM and MMHIM, the original feature matrix \(X^{[m]}\) is projected by \(W_o^{[m]}\). The rank parameter \(d_o\) is set to 1, 5, and 31, respectively.
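
A sketch of this dimensionality-matching step for the linear SVM baseline is shown below, using scikit-learn's PCA; note that PCA caps the number of components at the number of training samples, a practical detail the text does not discuss.

from sklearn.decomposition import PCA

def reduce_for_linear_svm(X_train, X_test, n_xyt, d_o):
    # X_train, X_test: (n_videos, n_xyt * n_f) vectorized features of one modality
    # project to n_xyt * d_o dimensions so the baseline matches the number of
    # elements in the shared space used by bilinear SVM and MMHIM
    n_components = min(n_xyt * d_o, X_train.shape[0], X_train.shape[1])
    pca = PCA(n_components=n_components)
    return pca.fit_transform(X_train), pca.transform(X_test)

# the reduced visual and depth features are then concatenated and fed to a linear SVM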

Fig. 10

Visualizations of a the projected visual features \(X^{[v]}W_q^{[v]}\), b the projected depth features \(X^{[z]}W_q^{[z]}\) in the private space, and c the learned cross-modal features O in the shared space

Table 7 Knowledge transfer results on MSR Action Pairs dataset

The performance of the five methods on depth features, RGB features, and RGB-D features is shown in Table 6. The results indicate that MMHIM achieves higher performance in most cases given low-dimensional features, and its performance on RGB-D data is not sensitive to the parameter \(d_o\). When \(d_o=1\), the projected feature matrices may lose a certain amount of information; however, the structural information is preserved in MMHIM, resulting in significantly higher performance than linear SVM. In addition, the learned shared space in MMHIM is optimized for classification, which is not the case for PCA. Compared with bilinear SVM, noisy information is reduced in MMHIM, and thus it achieves superior performance, for example, an increase of 26.66% given RGB-D data.

Even though linear SVM captures the full information of the visual and depth features when \(d_o=31\), it simply concatenates the depth and RGB features, so the similarities between the two types of features are compared directly. This may not be appropriate since they come from different distributions. In contrast, our MMHIM addresses this problem by projecting the two types of features into a shared feature space, which allows us to use them effectively for recognition. MMHIM outperforms bilinear SVM especially on RGB data, where the performance gain is 18.89%.

MMHIM achieves superior results over BHIM when \(d_o=5\) due to the learning of private features; an improvement of 8.33% can be observed given RGB data. The cross-modal features learned in BHIM lose too much discriminative information through the information measure \(\phi (W_o^{[m]},W_q^{[m]})\), whereas MMHIM uses extra private features to retain rich discriminative information.

Figure 11 illustrates the RGB-D action recognition performance of MMHIM on the MSR Action Pairs dataset for different values of the parameters \(\lambda \) and \(\eta \). Parameter \(\lambda \) is chosen from \(\{0.00001,0.00005,0.0001,0.0005,0.001, 0.005,0.01,0.05,0.1,0.5,1,5,10,50,100\}\), and parameter \(\eta \) from \(\{0.0001,0.0005,0.001,0.005,0.01,0.05,0.1, 0.5,1\}\). Figure 11a shows that MMHIM is insensitive to \(\lambda \) if \(\lambda \leqslant 50\); the largest performance variation is only 2.22%. However, the performance drops to 77.78% when \(\lambda =100\), as MMHIM then focuses on the margin regularizer and pays less attention to the training loss. Figure 11b shows that MMHIM is insensitive to \(\eta \) when \(\eta \leqslant 0.1\); the variation is only 2.22%. The performance drops to 86.11% when \(\eta =0.5\) and further to 67.22% when \(\eta =1\). This is because a relatively large \(\eta \) (\(\eta =1\), for example) makes MMHIM prefer very complex decision boundaries that minimize the training loss, which hurts its generalization and results in low recognition performance.

Fig. 11

RGB-D action recognition results of our MMHIM on MSR Action Pairs dataset with different values of a parameter \(\lambda \) and b parameter \(\eta \)

4.3.3 Knowledge Transfer

We evaluate the performance of our MMHIM and investigate the effectiveness of the cross-modal features, the private features, and the information measure when one modality is missing in training or testing. MMHIM is tested in the same four scenarios as for the previous dataset: RGB-D \(\rightarrow \) RGB, RGB-D \(\rightarrow \) Depth, RGB \(\rightarrow \) RGB-D, and Depth \(\rightarrow \) RGB-D. We compare MMHIM with linear SVM, bilinear SVM, and BHIM (Kong and Fu 2015), and verify their performance in knowledge transfer.

The recognition results in Table 7 show that MMHIM markedly outperforms BHIM, linear SVM, and bilinear SVM in this knowledge transfer experiment. MMHIM performs better than BHIM especially when RGB data are missing, i.e., RGB-D \(\rightarrow \) Depth and Depth \(\rightarrow \) RGB-D. The underlying reason is that private features capture extra discriminative information for classification: if the RGB modality is missing, the depth modality itself is noisy and can hurt the performance, but with private features the depth modality is projected onto two spaces (one for learning shared features and the other for private features), so extra discriminative information can be used for classification. The performance gain over linear SVM is noticeable, indicating the effectiveness of the matrix-form feature representation and of the cross-modal and private features in MMHIM. The improvement of MMHIM over bilinear SVM is also significant, especially in RGB \(\rightarrow \) RGB-D, where the increase is 17.22%.

4.4 MSR Daily Activity Dataset

The RGB and depth sequences in this dataset are spatially and temporally normalized, and the people of interest are extracted from these sequences. We follow the same training protocol as in Wang et al. (2012b). The skeleton feature vectors from Sect. 3.3 are adopted to build an MMHIM+Skeleton method, where skeleton features are extracted using Du et al. (2015). Our methods are first compared with existing approaches (Yang et al. 2012; Liu and Shao 2013; Zanfir et al. 2013; Oreifej and Liu 2013; Wang et al. 2012b; Yang et al. 2014; Kong and Fu 2015; Hu et al. 2015) on this dataset, and then evaluated given RGB, depth, and RGB-D data, respectively. Linear SVM and bilinear SVM are used as baselines.

Table 8 Recognition accuracy of comparison methods on MSR Daily Activity Dataset
Table 9 Comparison results on MSR Daily Activity Dataset given depth-only, RGB-only, and RGB-D data

Comparison results are shown in Table 8. MMHIM+Skeleton achieves superior performance over state-of-the-art methods owing to the effective use of visual, depth, and skeleton data. Its performance gain over MMHIM (without skeleton data) is 8.10%, suggesting the benefit of using skeleton feature vectors. MMHIM+Skeleton significantly outperforms BHIM thanks to the fusion of three types of features and the ability to use private features. Structured MMHIM-2 outperforms MMHIM by 1.88% on the MSR Daily Activity dataset, indicating the importance of modeling the correlations between frames on this dataset. MMHIM notably outperforms linear SVM by 22.50% owing to the learning of shared and private feature spaces for the two types of features and to a matrix-form representation that naturally captures spatiotemporal structural information. The recognition accuracy of MMHIM is also higher than that of bilinear SVM thanks to the information measure, which helps suppress useless information and noise. MMHIM also outperforms recent surface-normal-based approaches (Oreifej and Liu 2013; Yang et al. 2014). Although these approaches capture structural information in the feature design stage, they focus only on depth sequences and do not exploit valuable visual information; in addition, they use the full-length feature vectors and do not learn a better feature space for classification. MMHIM achieves better performance than the actionlet ensemble approach (Wang et al. 2012b) since we effectively use both visual and depth information, compress the informative cues, and remove noise before classification. We also compare with two deep models, the CNN and DBN in Liu and Shao (2013). These two models do not perform very well (even worse than the hand-crafted HON4D method with a linear SVM classifier, and the bilinear SVM method), possibly due to the small size of the training data. This also demonstrates that linear and bilinear models are able to learn complex mappings comparable to deep models, especially when the training data are scarce.

The performance of the proposed MMHIM on the RGB-only, depth-only, and RGB-D data of the MSR Daily Activity dataset is also reported. The recognition accuracy in Table 9 shows that MMHIM achieves satisfactory results even when only one modality is provided. MMHIM outperforms BHIM by 1.88% given RGB data owing to the use of private features. The RGB cue is not as discriminative as the depth cue on the MSR Daily Activity dataset, since the motions of various actions are similar and the objects involved are relatively small; however, the private features capture additional discriminative information that may be lost when learning the shared features. The improvement of MMHIM over linear and bilinear SVM is remarkable: 14.72% and 10.00% given depth data, respectively. This is mainly attributed to the private features and the information measure in MMHIM: the private features provide extra discriminative information, and the information measure removes information unhelpful for classification.

4.5 Discussions

The three datasets, the NEU-UB Action dataset, the MSR Action Pairs dataset, and the MSR Daily Activity dataset, vary in terms of actions and noise. The NEU-UB dataset contains 6 actions with basic limb movements: “Lateral Bend”, “Left Leg Lift”, “Right Leg Lift”, “Squat”, “Step Backward”, and “Step Forward”. Actions in the MSR Action Pairs dataset are human-object interactions, such as “pick up/put down box”, “lift/place box”, “push/pull chair”, “wear/take off hat”, “put on/off backpack”, and “stick/remove poster”. The MSR Daily Activity dataset consists of actions that are often seen in daily life, such as “drink”, “call cellphone”, and “lay down on sofa”.

The performance variation across the three datasets is mainly due to the noise level of the depth videos, and to the visual similarity and motion-object relationships in the RGB videos. Compared with the MSR Action Pairs and MSR Daily Activity datasets, the depth videos in the NEU-UB dataset are noisier, which can be clearly seen from the performance differences on depth videos across the datasets. Given the depth modality in training and testing, MMHIM achieves a recognition accuracy of only 58.25% on the NEU-UB dataset, 24.25% lower than its result on the MSR Daily Activity dataset.

Some RGB videos of different categories in the NEU-UB dataset are visually similar, for example, “Step Forward” and “Step Backward”. The major difference between them is the change in the apparent size of the human body over time, which is difficult to capture; therefore, MMHIM achieves only 80.00% accuracy with the RGB modality. By comparison, the major difference between actions in the MSR Action Pairs dataset is the motion-object shape relationship. This relationship is easy to capture, as a particular human motion is performed to manipulate an object and the objects differ in size, which is why a majority of approaches achieve high performance on this dataset. RGB action videos in the MSR Daily Activity dataset usually involve objects of small size, such as a cup, a book, or a cellphone. This makes the motion-object relationships difficult to capture, and MMHIM achieves its lowest performance (79.38%) among the three datasets.

MMHIM is implemented in Matlab and trained on a 3.5 GHz machine with 64 GB of memory. Training on the NEU-UB dataset, the MSR Action Pairs dataset, and the MSR Daily Activity dataset takes about 4, 7, and 6 min, respectively, and the corresponding memory usage is about 3, 2, and 1.5 GB, respectively.

5 Conclusion

We have proposed a max-margin heterogeneous information machine (MMHIM) for action recognition from RGB-D sequences. MMHIM effectively utilizes RGB visual data and depth data for classification. The two modalities are used to learn cross-modal features that capture their correlations and private features that capture their modality-specific characteristics. Features are represented in a matrix form in MMHIM, which essentially characterizes the spatiotemporal relationships of human body parts. This naturally allows us to construct a low-rank bilinear model over these feature matrices and to learn the cross-modal and private features. Both types of features are denoised using the information measure, and classification is performed using the denoised shared and private features. We learn a low-rank MMHIM by directly minimizing the rank of the model in order to increase its generalization power, and we propose an efficient optimization algorithm that uses an off-the-shelf SVM solver as the inner optimization solver. MMHIM is extensively evaluated on three public RGB-D action datasets and outperforms state-of-the-art approaches.