
1 Introduction

Closed-Circuit Television (CCTV) systems are now widely deployed in many security-sensitive places such as private residences, museums, and banks. Owing to economic or privacy constraints, there are always non-overlapping regions between different camera views. Re-identifying pedestrians across different camera views is therefore a critical and fundamental problem for intelligent video surveillance tasks such as cross-camera person search and tracking. This problem is called person re-identification (re-id).

Fig. 1. Illustration of RGB and depth images. Row 1 shows RGB images in a bright environment. Row 2 shows RGB images in a dark environment. Row 3 shows depth images in a dark environment.

Currently, most CCTV systems are based on RGB cameras, so the corresponding re-id approaches rely mainly on appearance features. However, in dark environments, appearance features may be unreliable because RGB cameras perceive only limited information. Hence, it is necessary to deploy other sensing devices in such environments. One alternative is to use depth cameras such as Kinect [1, 8, 9]. Depth cameras provide depth information and body skeleton joints, both of which are invariant to illumination changes (see Fig. 1); that is, depth cameras remain effective in the dark. Depth information is closely related to human body shape and is thus beneficial to re-id in dark environments [9]. RGB cameras and depth cameras therefore form a heterogeneous surveillance network. Previous work on person re-id focuses on either RGB camera networks [2,3,4,5,6,7] or depth camera networks [8, 9], yet none of it addresses a heterogeneous camera network that contains both RGB and depth cameras. In this paper, we focus on how to match pedestrians across depth and RGB cameras in such a heterogeneous surveillance network, a problem that has not been studied before.

In line with the traditional person re-id framework [3,4,5,6, 9], our cross-modality re-id system contains two phases: feature extraction and similarity measurement. Unlike [3,4,5,6, 9], the key idea behind our approach is to mine the correlation between the two modalities. In the feature extraction phase, since color and texture features [3,4,5,6,7] are not available from depth cameras, we propose to extract body shape information, which intrinsically exists in both RGB and depth images. Specifically, for RGB images we extract two kinds of typical edge gradient features: the classic Histogram of Oriented Gradients (HOG) [7] and the recently proposed Scale Invariant Local Ternary Patterns (SILTP) [5], both widely used as shape descriptors. For depth images, we extract the Eigen-depth feature recently proposed for Kinect-based person re-id [9]. Both the edge gradient features and the Eigen-depth feature describe human body shape and thus reduce the discrepancy between the two kinds of different-modality features. This shared characteristic alone, however, is far from sufficient for cross-modality matching, so we need to mine more correlation between the two modalities. We therefore propose a dictionary learning based algorithm that transforms the edge gradient and Eigen-depth features into sparse codes that share a common space, so that their similarity can be measured on the learned sparse codes. Figure 2 shows an overview of our approach.

In this paper, we identify the dark-environment problem in person re-id, in which RGB cameras provide only unreliable and limited information, and address it through a novel cross-modality matching approach. To summarize, our contributions include:

  • We make a new attempt at the re-id task across the depth and RGB modalities by proposing a dictionary learning based method that encodes different-modality body shape features (edge gradient features and the Eigen-depth feature) into a common space.

  • To enforce the discriminability of the learned dictionary pair, we design an explicit constraint term for dictionary learning so that our approach is more discriminative than several contemporary dictionary learning methods.

  • Experiments on two heterogeneous person re-id benchmark datasets show the effectiveness of our approach.

Fig. 2. Overview of our proposed approach. In the training phase, labeled image pairs from RGB and depth cameras are used to jointly learn discriminative coupled dictionaries, optimized with a correlation structure and an explicit constraint term. In the testing phase, features of the two modalities are encoded by the coupled dictionaries into new representations for matching.

2 Proposed Method

2.1 Problem Specification

For the training phase, let \(F_1 = [ f_{11}, f_{21}, \ldots , f_{n1} ] \) and \(F_2 = [ f_{12}, f_{22}, \ldots , f_{n2} ]\) denote the gallery and probe descriptor matrices, respectively, where \(f_{ij}\) is the feature vector of the \(i\)-th training sample in view \(j\). \(F_1\) and \(F_2\) come from two heterogeneous cameras (a depth camera and an RGB camera) and thus belong to two different modalities with different dimensions, \(d_1\) and \(d_2\). The goal is to jointly learn the dictionaries \( D_1\in { \mathbb {R}^{d_1\times k}}\) and \(D_2\in {\mathbb {R}^{d_2\times k}}\), where k is the dimension of the sparse codes. Let \(C_1 = [ c_{11}, c_{21}, \ldots , c_{n1} ]\) and \(C_2 = [ c_{12}, c_{22}, \ldots , c_{n2} ]\) denote the sparse codes of \(F_1\) and \(F_2\), where each column \(c_{ij}\in \mathbb {R}^k\) is the sparse code of the \(i\)-th sample.

For the testing phase, \(F_G = [ f^G_1, f^G_2, \ldots , f^G_m ]\) and \(F_P =[ f^P_1, f^P_2, \ldots , f^P_m ]\) denote the feature matrices extracted from the gallery and the probe, and the corresponding sparse codes are \(C_G =[ c^G_1, c^G_2, \ldots , c^G_m ]\) and \(C_P =[ c^P_1, c^P_2, \ldots , c^P_m ]\), respectively.

2.2 Correlative Dictionary Learning

In the traditional dictionary learning problem [10], a smaller reconstruction error yields a better dictionary. Hence, we learn a dictionary pair by minimizing two sets of reconstruction errors. Besides, we constrain the sparse codes \(C_1\) and \(C_2\) with \(L_1\) regularization, as in sparse representation [10]. To prevent overfitting, we additionally impose \(L_2\) regularization on the dictionaries and formulate the optimization problem as:

$$\begin{aligned} \begin{aligned} \underset{D_1,D_2,C_1,C_2}{\arg \min } \{ {\left\| F_1-D_1C_1\right\| }_F^2 + {\left\| F_2-D_2C_2\right\| }_F^2 +\\ \lambda _C{\left\| C_1\right\| }_1 +\lambda _C{\left\| C_2\right\| }_1 +\lambda _D{\left\| D_1\right\| }_F^2 +\lambda _D{\left\| D_2\right\| }_F^2 \} \end{aligned} \end{aligned}$$
(1)

where \(\lambda _C\) and \(\lambda _D\) are regularization parameters to balance the terms.

According to Least Square Semi-Coupled Dictionary Learning (LSSCDL) [11], \(L_1\) regularization on sparse codes is likely to destroy the correlation structure of the features, and it is suggested to replace it with \(L_2\) regularization. Many studies [11,12,13,14] have shown that \(L_2\) regularization can achieve an effect similar to sparse representation. Therefore, in this paper we also use \(L_2\) regularization on the sparse codes to improve Eq. (1).

Because of the discrepancy between RGB-based and depth-based features, direct matching results are always unsatisfactory. We therefore capture the cross-modality correlation between codes of the same person by minimizing the Euclidean distance between the two sparse code matrices, namely \({\left\| C_2 - C_1\right\| }_F^2\), which couples the two dictionaries. The objective function is then given by:

$$\begin{aligned} \begin{aligned} \underset{D_1,D_2,C_1,C_2}{\arg \min } \{ {\left\| F_1-D_1C_1\right\| }_F^2 + {\left\| F_2-D_2C_2\right\| }_F^2 +\lambda {\left\| C_2-C_1\right\| }_F^2+\\ \lambda _C{\left\| C_1\right\| }_F^2 +\lambda _C{\left\| C_2\right\| }_F^2 +\lambda _D{\left\| D_1\right\| }_F^2 +\lambda _D{\left\| D_2\right\| }_F^2 \} \end{aligned} \end{aligned}$$
(2)

where \(\lambda \) is a positive parameter that controls the tradeoff between the reconstruction errors and the distance between the sparse coding matrices.
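To make the formulation concrete, the following minimal NumPy sketch (our own illustration; the function name and matrix shapes are assumptions) evaluates the objective of Eq. (2), which is useful for monitoring convergence once the model is optimized as described below:

```python
import numpy as np

def objective_eq2(F1, F2, D1, D2, C1, C2, lam, lam_C, lam_D):
    """Evaluate the coupled objective of Eq. (2).

    F1, F2: feature matrices (d1 x n, d2 x n); D1, D2: dictionaries
    (d1 x k, d2 x k); C1, C2: sparse codes (k x n).
    """
    fro2 = lambda M: np.linalg.norm(M, 'fro') ** 2
    return (fro2(F1 - D1 @ C1) + fro2(F2 - D2 @ C2)   # reconstruction errors
            + lam * fro2(C2 - C1)                     # cross-modality coupling
            + lam_C * (fro2(C1) + fro2(C2))           # L2 on sparse codes
            + lam_D * (fro2(D1) + fro2(D2)))          # L2 on dictionaries
```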

In our model, we seek a discriminative dictionary pair, one that distinguishes matched pairs from mismatched pairs. We achieve this by enforcing constraints on the sparse coefficients corresponding to the learned dictionaries. Let \(d_{ii} = {\left\| c_{i1}-c_{i2}\right\| }_2\) denote the Euclidean distance between the sparse coefficients of the gallery and the probe for the same person i, and \(d_{ij} = {\left\| c_{i1}-c_{j2}\right\| }_2\) the corresponding distance for different persons i and j. Specifically, we optimize our model so that the distance for the same person is much smaller than that for different persons, namely,

$$\begin{aligned} \begin{aligned} d_{ii}<d_{ij},\forall j\ne i,\forall i \end{aligned} \end{aligned}$$
(3)

Thus we impose Eq. (3) on the objective function as an explicit constraint term:

$$\begin{aligned} \begin{aligned} \text {s.t.}{\left\| c_{i1}-c_{i2}\right\| }_2^2&<{\left\| c_{i1}-c_{j2}\right\| }_2^2\\ \forall j\ne&i,\forall i \end{aligned} \end{aligned}$$
(4)

To keep the optimization problem convex and simplify its solution, the constraint is relaxed to:

$$\begin{aligned} \begin{aligned} \text {s.t.}&{\left\| c_{i1}-c_{i2}\right\| }_2^2<s_1,\forall i\\&{\left\| c_{i1}-c_{j2}\right\| }_2^2<s_2,\forall j\ne i,\forall i \end{aligned} \end{aligned}$$
(5)

where \(s_1\) and \(s_2\) are two constants with \(s_1\ll s_2\), which bound the within-person and between-person code distances, respectively.

In summary, the optimization problem of dictionary learning is described as:

$$\begin{aligned} \begin{aligned} \underset{D_1,D_2,C_1,C_2}{\arg \min }\sum _{i=1}^n \{ {\left\| f_{i1}-D_1c_{i1}\right\| }_2^2&+ {\left\| f_{i2}-D_2c_{i2}\right\| }_2^2 +\lambda {\left\| c_{i2}-c_{i1}\right\| }_2^2\\ +\lambda _C{\left\| c_{i1}\right\| }_2^2 +\lambda _C{\left\| c_{i2}\right\| }_2^2&+\lambda _D{\left\| D_1\right\| }_F^2 +\lambda _D{\left\| D_2\right\| }_F^2 \}\\ \text {s.t.}{\left\| c_{i1}-c_{i2}\right\| }_2^2&<s_1,\forall i\\ {\left\| c_{i1}-c_{j2}\right\| }_2^2&<s_2,\forall j\ne i,\forall i \end{aligned} \end{aligned}$$
(6)

We employ an alternating optimization algorithm to solve Eq. (6). Specifically, we optimize over \(D_1\), \(D_2\), \(C_1\) and \(C_2\) one at a time while fixing the other three. First, we fix \(D_1\), \(D_2\), \(C_2\) and use CVX [20] to optimize each column \(c_{i1}\) of \(C_1\); \(C_2\) is optimized in the same way. Then, with the codes fixed, setting the gradient of Eq. (6) with respect to the dictionaries to zero gives the closed-form updates

$$\begin{aligned} \begin{aligned} D_1=(F_1C_1^T)(C_1C_1^T+ \lambda _DI)^{-1}\\ \end{aligned} \end{aligned}$$
(7)
$$\begin{aligned} \begin{aligned} D_2=(F_2C_2^T)(C_2C_2^T+ \lambda _DI)^{-1} \end{aligned} \end{aligned}$$
(8)

where I is a \(k\times k\) identity matrix.
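For reference, a minimal NumPy sketch of the closed-form updates in Eqs. (7) and (8) might look as follows (the function name is ours; the same routine serves both dictionaries):

```python
import numpy as np

def update_dictionary(F, C, lam_D):
    """Closed-form ridge update of Eqs. (7)-(8):
    D = (F C^T)(C C^T + lam_D I)^{-1}, with I the k x k identity."""
    k = C.shape[0]
    return (F @ C.T) @ np.linalg.inv(C @ C.T + lam_D * np.eye(k))
```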

In this way, we alternately optimize over \(D_1\), \(D_2\), \(C_1\) and \(C_2\) until convergence. The training procedure is summarized in Algorithm 1.

[Algorithm 1: training procedure]
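The per-column code update is a small convex quadratically constrained program. The paper solves it with CVX (MATLAB) [20]; the sketch below is a cvxpy analogue under our own naming, with the strict inequalities of Eq. (6) relaxed to non-strict ones as convex solvers require:

```python
import cvxpy as cp

def solve_code_column(D, f_i, C_other, i, lam, lam_C, s1, s2):
    """Solve for one code column c_{i1} (or c_{i2}) with the dictionaries
    and the other view's codes fixed, cf. Eq. (6)."""
    c = cp.Variable(D.shape[1])
    obj = (cp.sum_squares(f_i - D @ c)                  # reconstruction
           + lam * cp.sum_squares(C_other[:, i] - c)    # coupling term
           + lam_C * cp.sum_squares(c))                 # L2 regularization
    cons = [cp.sum_squares(c - C_other[:, i]) <= s1]    # same person
    cons += [cp.sum_squares(c - C_other[:, j]) <= s2    # different persons
             for j in range(C_other.shape[1]) if j != i]
    cp.Problem(cp.Minimize(obj), cons).solve()
    return c.value
```

One outer iteration of Algorithm 1 then updates every column of \(C_1\) and \(C_2\) with this routine and refreshes \(D_1\) and \(D_2\) with the closed-form updates above.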

2.3 Person Re-identification by Our Framework

Using the correlative dictionary pair \(D_1\) and \(D_2\), we can obtain the sparse representations of the gallery and the probe. According to Eq. (6), the sparse codes \(C_G =[ c^G_1, c^G_2, \ldots , c^G_m ]\) and \(C_P =[ c^P_1, c^P_2, \ldots , c^P_m ]\) can be respectively obtained by

$$\begin{aligned} \begin{aligned} \underset{c^G_i}{\arg \min }{\left\| f^G_i-D_1c^G_i\right\| }_2^2 +\lambda _G{\left\| c^G_i\right\| }_2^2 , \forall i\\ \end{aligned} \end{aligned}$$
(9)
$$\begin{aligned} \begin{aligned} \underset{c^P_i}{\arg \min }{\left\| f^P_i-D_2c^P_i\right\| }_2^2 +\lambda _P{\left\| c^P_i\right\| }_2^2 , \forall i\\ \end{aligned} \end{aligned}$$
(10)

where \(\lambda _G\) and \(\lambda _P\) are regularization parameters to balance the terms for the gallery and the probe, respectively.
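Since Eqs. (9) and (10) are unconstrained ridge problems, each column also admits the closed form \(c = (D^TD + \lambda I)^{-1}D^Tf\); a minimal NumPy sketch (our own illustration; the paper instead solves these problems with CVX) is:

```python
import numpy as np

def encode(D, F, lam):
    """Test-time coding for Eqs. (9)-(10), applied to all columns of F
    at once: each column c solves min ||f - D c||^2 + lam ||c||^2."""
    k = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ F)
```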

We use CVX to solve the problems in Eqs. (9) and (10). The testing procedure is summarized in Algorithm 2. Finally, the learned sparse codes \(C_G\) and \(C_P\) are taken as correlative reconstructive features for identity matching, with similarity computed by the Euclidean distance. In this way, the computational efficiency of identity matching is the same as that of standard sparse representation in person re-id [21, 22].

[Algorithm 2: testing procedure]
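The final matching step is plain nearest-neighbor search on the codes; a short sketch (names are ours) using SciPy:

```python
import numpy as np
from scipy.spatial.distance import cdist

def rank_gallery(C_G, C_P):
    """For each probe code (column of C_P), rank gallery codes
    (columns of C_G) by Euclidean distance, best match first."""
    dists = cdist(C_P.T, C_G.T)        # shape: (num_probe, num_gallery)
    return np.argsort(dists, axis=1)   # ranked gallery indices per probe
```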

3 Experiment

3.1 Datasets and Features

Datasets. We evaluate our approach on two RGB-D person re-id datasets, RGBD-ID [19] and BIWI RGBD-ID [15], both collected with Kinect cameras.

BIWI RGBD-ID [15] has three groups, “Training”, “Still” and “Walking”, which contain 50, 28 and 28 people, respectively, in different clothing. Each person has 300 frames of RGB images, depth images and skeletons. We use the complete “Training” and “Still” sets, giving 78 samples in total, and select one frame (RGB and depth) for each sample. Following convention, we randomly choose about half of the samples (40 pedestrians) for training and the rest for testing.

RGBD-ID [19] contains 79 identities, each with five RGB images, five point clouds and skeletons. Since the groups “Walking1” and “Walking2” contain the same people in different frontal views, we randomly sample approximately half of the identities (41) in “Walking1” for training and use the rest for testing. For each person, only one frame with complete information is randomly selected for the experiments.

Features. In preprocessing, the torso and head are segmented from each image and divided into 6 \(\times \) 2 rectangular patches. The overall feature of each image is obtained by concatenating the local features of all patches. To test how well our model adapts to different representations, we consider two kinds of representative edge features, HOG [7] and SILTP [5], as the RGB-based features in our experiments. For each depth image, we combine the Eigen-depth feature with skeleton information to form a complete depth-based representation [9]. Both the HOG and SILTP features capture local human body shape, as do the depth-based features. Note that the RGB-based and depth-based features belong to different modalities.
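For illustration, a hypothetical sketch of the patch-wise RGB descriptor using scikit-image's HOG is given below; the HOG parameters are generic defaults rather than the paper's exact settings, and SILTP and the Eigen-depth feature [9] would be extracted patch-wise in the same fashion:

```python
import numpy as np
from skimage.feature import hog

def rgb_shape_descriptor(gray_body, grid=(6, 2)):
    """Concatenate per-patch HOG features over a 6 x 2 grid of the
    segmented torso-and-head region (grayscale image). Illustrative only;
    patches must be large enough for the chosen HOG cell/block sizes."""
    H, W = gray_body.shape
    ph, pw = H // grid[0], W // grid[1]
    feats = [hog(gray_body[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw],
                 orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2))
             for r in range(grid[0]) for c in range(grid[1])]
    return np.concatenate(feats)
```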

3.2 Experiment Settings

Methods for Comparison. To evaluate the effectiveness of our approach, we compare our method with Least Square Semi-Coupled Dictionary Learning (LSSCDL) [11] and Canonical Correlation Analysis (CCA) [16]. We also include a baseline that matches RGB-based and depth-based features directly, without learning any connection between them. CCA is a coherent subspace learning algorithm that projects two sets of variables into a correlated space so as to maximize the correlation between the projected variables. LSSCDL is a related dictionary learning algorithm that efficiently learns a pair of dictionaries and a mapping function to investigate the intrinsic relationship between feature patterns. Both have recently been applied to re-id across disjoint camera views, including multi-view and multi-modality tasks, and can address the multi-modality matching problem because they provide a connection between otherwise uncorrelated variables.

Evaluation Metrics. Recognition rates at selected ranks and histograms are used to evaluate performance. The rank-n rate is the probability of finding the correct match within the top n matches [17]; the rank-1 rate is the most important indicator of re-id performance. To ensure a fair comparison, the same training and testing samples are used for all methods, and each experiment is repeated 10 times to obtain average results.
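For reference, the rank-n rate can be computed with a generic sketch like the following (our own helper; `ranked_ids[p]` is assumed to list gallery identity labels for probe p, best match first):

```python
import numpy as np

def rank_n_rate(ranked_ids, probe_ids, n):
    """Fraction of probes whose true identity appears among the
    top-n ranked gallery identities (the rank-n recognition rate)."""
    hits = [probe_ids[p] in ranked_ids[p][:n] for p in range(len(probe_ids))]
    return float(np.mean(hits))
```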

Parameter Settings. In the following experiments, we set k = 100, \(\lambda \) = 0.1, \(\lambda _C\) = \(\lambda _D\) = 0.001, \(\lambda _G\) = \(\lambda _P\) = 0.01, \(s_1\) = 0.1, \(s_2\) = 100 for our method. All parameters of other methods are set as suggested in their papers [11, 16].

Table 1. Recognition rates (%) for matching HOG/SILTP and depth features on the BIWI and RGBD-ID datasets. \(F_g\): Gallery; \(F_p\): Probe; D: Depth; H: HOG; S: SILTP.
Fig. 3. Histogram and rank-1 rate on BIWI in two cases.

Fig. 4. Histogram and rank-1 rate on RGBD-ID in two cases.

3.3 Experiment Results

Result on BIWI. To demonstrate the general applicability of our approach, we match each of the two typical RGB-based features, HOG and SILTP, against the depth-based features. Each experiment is carried out in two cases: depth-based features as the gallery with RGB-based features as the probe, and the reverse. The results are shown in Table 1 and Fig. 3. Our method largely outperforms the baseline, which shows its effectiveness for the multi-modality matching problem. Our method also establishes a closer cross-modality connection than CCA; the main reason is that sparse representation allows our method to select vital information and reduce the influence of invalid elements, which CCA cannot do. Our method generally outperforms LSSCDL as well, which demonstrates that the explicit constraint term enforces the discriminability of the learned dictionary pair.

Result on RGBD-ID. In RGBD-ID, each person's head is blurred in the RGB images, which makes the problem more challenging. Following the protocol of the BIWI experiments, we compare with the methods in [11, 16] using the same features in both cases. Table 1 and Fig. 4 show that our method achieves the best rank-1 rate. Note that the margins between the proposed model, CCA, and LSSCDL are small: the blurred images may significantly degrade the discriminability of the edge gradient features and thus weaken the correlation between the two heterogeneous modalities, and with such weak correlation the margins between these models cannot be large.

3.4 Effect of Feature Dimensions

We further evaluate the effect of the dimension of the reconstructive features by varying it on the BIWI dataset. In particular, we change the dimension from 50 to 500 and observe the performance of CCA, LSSCDL and our method. The results in Fig. 5 show that (1) reconstructive features of low dimension generally outperform those of high dimension, which may be because high-dimensional features are more prone to overfitting when the number of training samples is small [18]; and (2) the explicit constraint term in Eq. (5) mines more discriminative features, making our method more stable and effective than CCA and LSSCDL across dimensions.

Fig. 5. Rank-1 rate on BIWI with varying dimensions of the reconstructive features in two cases, and the mean rank-1 rate.

4 Conclusion

In this paper, we have extended the traditional RGB-based person re-identification problem to a cross-modality matching problem between RGB and depth. Such a problem is critical when video analysis is needed in a heterogeneous camera network. To the best of our knowledge, this is the first attempt in person re-id to handle matching across the RGB and depth modalities. We have also proposed an effective approach to this cross-modality matching problem: it jointly learns coupled dictionaries for the RGB and depth camera views, which are linked by requiring the two dictionaries to be both representative and discriminative. In the testing phase, sparse codes are used to match person images across the RGB and depth modalities. Experimental results on two benchmark heterogeneous person re-id datasets show the effectiveness and superiority of the proposed approach for the multi-modality re-id problem.

In the future, we will carefully integrate correlative dictionary learning into a deep convolutional neural network to jointly learn more robust feature representations and a cross-modality distance metric in an end-to-end way.