1 Introduction

An object often has very different representations in different modalities. For instance, the printed and hand-written forms of the same character can look very different, as can the face photo and face sketch of the same person. Humans have little difficulty recognizing objects across different modalities (e.g., matching face sketches to face photos). In contrast, conventional machine learning methods, such as k-NN classifiers, perform poorly in cross-modality pattern recognition since they assume both the training data and the test patterns are randomly sampled from the same distribution, which is not the case in cross-modality pattern recognition (Tenenbaum and Freeman 2000).

There exist a number of research studies in the literature targeting cross-modality pattern recognition, which can be roughly classified into three main approaches. The first approach transforms one modality into another in a preprocessing step (Zhou et al. 2012; Blanz et al. 2005). The second approach extracts modality-invariant features to represent an object (Lowe 2004; Zhang et al. 2011). A major limitation of these two approaches is that the resulting methods are usually tailor-made for the particular modality pair involved in each recognition task. The third approach finds an underlying latent common space shared between different modalities (Tenenbaum and Freeman 2000; Knutsson et al. 1997; Lin and Tang 2006; Sun et al. 2008; Prince et al. 2008). Unlike the first two approaches, the third does not depend on task-dependent knowledge; methods based on it are therefore general frameworks that can be applied to different applications. Existing methods of the third approach often require absolutely-paired observations as training data; we refer to them as Absolutely-Paired Space Analysis (APSA). These methods assume the projections of paired observations to be dependent in the latent space, and can only represent a binary relationship between observations (i.e., observations are either paired or non-paired).

In many application scenarios, however, it is more suitable to consider relatively-paired observations (i.e., two observations from different modalities are more-likely-paired than another two) than absolutely-paired ones. For instance, given an input text query, an image search engine (such as Google) returns a list of the most probable images. The images selected by the user are not absolutely-paired with the input text, but are more-likely-paired with it than the other returned images. In fact, relative-pairing is a general pairing relationship that subsumes absolute-pairing: one can safely consider two absolutely-paired observations as being more-likely-paired than any non-paired observations. Another advantage of considering relatively-paired training data is that label information of the observations can easily be integrated to boost recognition performance. It is reasonable to assume that observations with the same label are more-likely-paired than those with different labels. This strategy can be used to reduce the within-class scatter while maximizing the between-class scatter in the latent common space, as well as to increase the minimum distance between observations with different labels in that space.

In this paper, we introduce a general framework named Relatively-Paired Space Analysis (RPSA) which works on relatively-paired observations. Note that RPSA is not a trivial extension of APSA, as the two are based on completely different models. APSA methods are often based on generative models (Knutsson et al. 1997; Bach and Jordan 2005; Prince et al. 2008) which explicitly or implicitly assume distributions for the model parameters and noise (e.g., Gaussian). The final estimation becomes unreliable when real data do not fit these assumptions. In contrast, our method is based on a discriminative model that makes no distributional assumption. Besides, APSA methods learn a projection function for each modality by exploiting the statistical dependence of the projections of absolutely-paired observations in the latent common space. This one-to-one absolute-pairing requirement makes them unsuitable for relatively-paired observations. In our proposed framework, we compute the projection functions by preserving the relative proximities of observations in the latent common space.

Figure 1 illustrates the principle of the proposed method using the Wiki Text-Image data set (Rasiwasia et al. 2010) from our experiments. The data set has two modalities, an image modality and a text modality. We select three images \(a\), \(b\) and \(c\) and one text article \(d\) from it: \(a\), \(b\) and \(c\) show a soccer player, a baseball player, and a building, respectively, while \(d\) describes the soccer team “Chelsea” and its members. Obviously, \(d\) is highly relevant to \(a\), slightly relevant to \(b\) (since both \(b\) and \(d\) involve the concept of sports), and hardly relevant to \(c\). Therefore, \(a\) is more likely paired with \(d\) than \(b\) is, and \(b\) is more likely paired with \(d\) than \(c\) is. Let \(p_a\), \(p_b\), \(p_c\) and \(p_d\) be the projections of \(a\), \(b\), \(c\) and \(d\) in the latent common space, respectively. The proposed method attempts to learn one projection function per modality so that the distance between \(p_a\) and \(p_d\) is shorter than that between \(p_b\) and \(p_d\), which in turn is shorter than that between \(p_c\) and \(p_d\).

Fig. 1 Illustration of the proposed method. a shows the relative-pairing relationships between observations from image and text modalities (an arrow indicates being more likely paired). b shows the distances between the projections of observations in the latent common space

We first learn the model parameters of RPSA via the alternating variable method (Shen et al. 2011), and find that the training time increases dramatically with the number of training triplets. To address this, we reformulate the RPSA problem as learning a structural model (Tsochantaridis et al. 2004), and propose a scalable approach based on the cutting plane algorithm to solve it.

We validate our RPSA framework by applying it to feature fusion, cross-pose face recognition, text-image retrieval and attribute-image retrieval. Experimental results demonstrate that our proposed framework outperforms other state-of-the-art approaches. The main contributions of this paper are:

  1. We propose a general framework called Relatively-Paired Space Analysis (RPSA) for automatically learning a latent common space between different modalities from relatively-paired observations, which, to the best of our knowledge, has not been explored before.

  2. We propose a scalable optimization approach based on the cutting plane algorithm to learn the model parameters of RPSA.

  3. We apply our proposed RPSA framework to feature fusion, cross-pose face recognition, text-image retrieval and attribute-image retrieval. RPSA achieves significant improvement in recognition and retrieval performance compared with other state-of-the-art methods.

Preliminary results of this work were published in the proceedings of the British Machine Vision Conference 2013 in Bristol, UK (Kuang and Wong 2013). The differences between this version and the previous one are as follows:

  1. A more detailed and up-to-date survey of multi-modality analysis is included in Sect. 2.

  2. A term which measures the sum of the distances between points and their corresponding target neighbors is added to the proposed objective energy function to boost the performance of RPSA.

  3. The RPSA problem is reformulated as a structural learning model, and a scalable approach based on the cutting plane algorithm is proposed to solve it.

  4. New experiments on the Wiki Text-Image data set (Rasiwasia et al. 2010) and the Public Figures Face Database (Parikh and Grauman 2011; Kumar et al. 2009) have been carried out, and the results are compared with state-of-the-art techniques.

2 Related Work

There exist a large number of research studies on cross-modality pattern recognition in the literature. Due to page limitations, however, we focus our discussion only on the work most relevant to ours, namely methods that automatically learn a latent common space between different modalities. Knutsson et al. (1997) proposed Canonical Correlation Analysis (CCA), which finds a latent common space by maximizing the correlation of the projections of cross-modality observations. Sun et al. (2008) extended CCA by maximizing within-class correlations and minimizing between-class correlations. Torre and Black (2001) developed Asymmetric Coupled Component Analysis (ACCA) to explicitly learn the dependence of projections in a latent common space. Similarly, Lin and Tang (2005) explored the coupled space by alternately maximizing the correlation of the projections of cross-modality observations and finding the relations between these projections. Different from CCA, Partial Least Squares (PLS) (Prince et al. 2008; Rosipal and Krämer 2006) chooses linear mappings such that the covariance between the projections of cross-modality observations in the latent common space is maximized. The Bilinear Model (BLM) (Tenenbaum and Freeman 2000) was proposed to separate style and content: observations with different styles (from different modalities) of an object are encouraged to map to the same content in a latent common space by solving two-factor tasks. Recently, Sharma and Kumar (2012) proposed a General Multi-view Analysis (GMA) approach which learns a latent common space by solving a generalized eigenvalue problem. Kan et al. (2012) introduced a Multi-view Discriminant Analysis (MvDA) method that seeks a projection function for each modality by optimizing a generalized Rayleigh quotient. Besides, researchers have proposed advanced nonlinear methods based on the Gaussian Process Latent Variable Model (GPLVM) (Shon et al. 2006; Navaratnam et al. 2007; Ek et al. 2008). All the above methods require absolutely-paired observations as training data. Recently, Lampert and Krömer (2010) learned a latent space from weakly-paired data (i.e., subsets of observations of one modality are paired with those of another modality) by alternately finding element pairs and maximizing the covariance of the projections of cross-modality observations. Different from previous work, our proposed framework depends on neither prior distribution assumptions nor statistics computations, and learns a latent common space by preserving the relative proximities of the relatively-paired training data in that space.

Metric learning can be interpreted as finding a latent space for a single-modality observation space via linear projection. Xing et al. (2002) proposed to minimize the distances between samples from a similar set while keeping the distances of those from a dissimilar set above a threshold. Goldberger et al. (2004) directly maximized a stochastic variant of the leave-one-out k-NN score on the training set. Since then, many other methods (Weinberger et al. 2006; Davis et al. 2007; Shen et al. 2009; Zheng et al. 2013) have been proposed to achieve a similar goal. In particular, Zheng et al. (2013) proposed a metric learning approach named Relative Distance Comparison (RDC) for re-identification; RDC is formulated to maximize the likelihood that a pair of true matches has a relatively smaller distance than a wrongly matched pair, in a soft discriminant manner. However, these methods focus on a single modality. For cross-modality pattern recognition problems, as studied in this paper, observations from different modalities are heterogeneous, and metric learning approaches cannot achieve good results (Knutsson et al. 1997; Sun et al. 2008; Torre and Black 2001; Wu et al. 2010); our experiments on cross-pose face recognition support this conclusion. In some cases, observations from different modalities even have different numbers of dimensions (our experiments on feature fusion and image-text retrieval are examples), in which case metric learning approaches cannot be used at all, since metrics such as the Mahalanobis distance require observations of the same dimension. Our work is not a trivial extension of metric learning. First, the relative-pairing information, which encodes the relationship between observations from different modalities, is novel. Second, the scalable optimization method based on structural learning to speed up multi-modality analysis has not been explored before. Recently, Quadrianto and Lampert (2011) extended metric learning to multiple modalities by explicitly modeling linear projections. Their objective function is non-convex, so the optimum obtained depends on initialization; moreover, their method requires the dimension of the latent common space to be known a priori. In contrast, our model is convex, which guarantees a global optimum, and can find a latent common space of any dimension in a single optimization.

Exploiting latent spaces can also be found in related research areas, such as local metric learning (Andrea et al. 2007), hashing (Bronstein and Bronstein 2010), multi-task learning (Parameswaran and Weinberger 2010), domain adaptation (Saenko et al. 2010) and ranking (Wang et al. 2009). However, their goals are very different from ours.

3 Relatively-Paired Space Analysis

In this section, we describe our RPSA framework for learning a latent common space from relatively-paired observations. The goal is to find linear mappings that project observations from different modalities into a latent common space in which the relative proximities of the relatively-paired observations are preserved.

3.1 Preliminaries

Let us define some notation first. We use boldface uppercase, lowercase and calligraphic letters (e.g., \(\mathbf{X},\mathbf{x}\) and \(\mathcal {X}\)) to denote matrices, vectors and sets, respectively. \(X_{ij}\) denotes the \((i,j)\)th entry of \(\mathbf{X}\), \(x_i\) denotes the \(i\)th entry of \(\mathbf{x}\), and \(x_{ij}\) denotes the \(j\)th entry of \(\mathbf{x}_i\). \(\mathbf{X}\succeq 0\) denotes \(\mathbf{X}\) being a positive semi-definite matrix. Let \(\mathrm{Tr}(\mathbf{X})\) denote the trace of \(\mathbf{X}\) and \(\mathbf{X}^\mathrm{T}\) its transpose, and the inner product of two matrices \(\langle \mathbf{X},\mathbf{Y}\rangle \) can then be represented by \(\mathrm{Tr}(\mathbf{X}^\mathrm{T}\mathbf{Y})\). For a symmetric matrix \(\mathbf{X}\), its eigenvalue decomposition is given by \(\mathbf{X}=\mathbf{U}\varLambda \mathbf{U}^\mathrm{T}\) with \(\mathbf{U}\) being an orthogonal matrix. The positive part of the matrix \(\mathbf{X}\) is defined as

$$\begin{aligned} (\mathbf{X})_+=\mathbf{U}\max (\varLambda ,\mathbf{0})\mathbf{U}^\mathrm{T}, \end{aligned}$$
(1)

and the negative part as

$$\begin{aligned} (\mathbf{X})_-=\mathbf{U}\min (\varLambda ,\mathbf{0})\mathbf{U}^\mathrm{T}. \end{aligned}$$
(2)

Clearly, \(\mathbf{X}=(\mathbf{X})_++(\mathbf{X})_-\) always holds true.
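For concreteness, both parts can be obtained from a single symmetric eigendecomposition. The following minimal NumPy sketch (our illustration, not the authors' code) computes (1) and (2):

```python
import numpy as np

def matrix_parts(X):
    """Positive and negative parts of a symmetric matrix X, Eqs. (1)-(2).
    Their sum reconstructs X exactly."""
    w, U = np.linalg.eigh(X)                  # X = U diag(w) U^T
    pos = (U * np.maximum(w, 0.0)) @ U.T      # (X)_+ = U max(Lambda, 0) U^T
    neg = (U * np.minimum(w, 0.0)) @ U.T      # (X)_- = U min(Lambda, 0) U^T
    return pos, neg
```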

3.2 The RPSA Model

Consider a set of \(M\) modalities \(\{\varOmega _1, \varOmega _2, \dots , \varOmega _M\}\) with dimensions \(\{d_1, d_2, \dots , d_M\}\), respectively, and a training data set of \(N\) observations \(\{\mathbf{x}_1,\mathbf{x}_2,\dots ,\mathbf{x}_{N}\}\) with a corresponding flag set \(\{t_1, t_2, \dots , t_N\}\) such that \(t_i\in \{1,\ldots ,M\}\) indicates that \(\mathbf{x}_i\) comes from \(\varOmega _{t_i}\). Let the relative-pairing knowledge of the observations be represented by a set of triplets \(\mathcal {T}=\{(i, j, k)\}\), where each triplet \((i, j, k)\) encodes that \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are more-likely-paired than \(\mathbf{x}_i\) and \(\mathbf{x}_k\). Note that \(\mathbf{x}_i\), \(\mathbf{x}_j\) and \(\mathbf{x}_k\) can come from either the same or different modalities. When they are from the same modality, “being more-likely-paired” means “being more similar”.

To learn a latent common space \(Z\) with dimension \(d_z\), we seek a \(d_z \times d_m\) linear projection matrix \(\mathbf{W}_{\varOmega _m}\) for each modality \(\varOmega _m\) such that the relative proximities of the projections of the relatively-paired observations are preserved in \(Z\), i.e.,

$$\begin{aligned} d(i,j) \le d(i,k)\quad \forall (i, j, k)\in \mathcal {T}, \end{aligned}$$
(3)

where

$$\begin{aligned} d(i, j) = \Vert \mathbf{W}_{\varOmega _{t_i}}\mathbf{x}_i - \mathbf{W}_{\varOmega _{t_j}}\mathbf{x}_j\Vert ^2 \end{aligned}$$
(4)

denotes the squared Euclidean distance between the projections of \(\mathbf{x}_i\) and \(\mathbf{x}_j\) in \(Z\). Let \(\mathbf{W} = [\mathbf{W}_{\varOmega _1}\ \dots \ \mathbf{W}_{\varOmega _M}]\) and \(\mathbf{S}_{\varOmega _m}\) be a \((\sum _n d_n) \times d_m\) matrix with all elements being zero except for rows \((\sum _{n<m} d_n) + 1\) to \(\sum _{n\le m} d_n\), which form an identity matrix, so that \(\mathbf{W}_{\varOmega _m} = \mathbf{WS}_{\varOmega _m}\). Substituting this into (4) gives

$$\begin{aligned} d(i, j)&= (\mathbf{S}_{\varOmega _{t_i}}\mathbf{x}_i - \mathbf{S}_{\varOmega _{t_j}}\mathbf{x}_j)^\mathrm{T}\mathbf{W}^\mathrm{T}\mathbf{W}(\mathbf{S}_{\varOmega _{t_i}}\mathbf{x}_i - \mathbf{S}_{\varOmega _{t_j}}\mathbf{x}_j) \nonumber \\&= \mathrm{Tr}(\mathbf{AC}_{i, j}), \end{aligned}$$
(5)

where \(\mathbf{A} = \mathbf{W}^\mathrm{T}\mathbf{W}\) and

$$\begin{aligned} \mathbf{C}_{i, j} = (\mathbf{S}_{\varOmega _{t_i}}\mathbf{x}_i - \mathbf{S}_{\varOmega _{t_j}}\mathbf{x}_j)(\mathbf{S}_{\varOmega _{t_i}}\mathbf{x}_i - \mathbf{S}_{\varOmega _{t_j}}\mathbf{x}_j)^\mathrm{T}. \end{aligned}$$
(6)

Substituting (5) into (3) gives

$$\begin{aligned} \mathrm{Tr}(\mathbf{AC}_{i, k}) - \mathrm{Tr}(\mathbf{AC}_{i, j}) \ge 0 \quad \forall (i, j, k) \in \mathcal {T}. \end{aligned}$$
(7)

(7) defines the relative proximity constraints on \(\mathbf{A}\), which encodes \(\mathbf{W}\) (i.e., the set of projection matrices). Since \(t_i\), \(t_j\), \(t_k\) \(\in \{1,\ldots ,M\}\), there are \(M^3\) possible modality configurations for a triplet \((i, j, k)\). When \(t_i=t_j=t_k\), \(\mathbf{x}_i\), \(\mathbf{x}_j\) and \(\mathbf{x}_k\) come from the same modality, and (7) provides constraints within one modality, as in metric learning. To learn the latent common space, we then find a positive-semidefinite matrix \(\mathbf{A}\) (i.e., \(\mathbf{A}\succeq 0\)) which fulfills (7). Note that if \(\mathbf{A}^*\) is a solution, multiplying it by any positive scalar also gives a solution. To specify a unique solution, we let \(\mathbf{C}_{i,j,k} = \mathbf{C}_{i,k}-\mathbf{C}_{i,j}\) and optimize an SVM-style energy function, given by

$$\begin{aligned} \min \quad&\frac{1}{2}\left\| \mathbf{A} \right\| _\mathrm{F}^2 + \gamma _1 \sum \xi _{i,j,k}+\gamma _2 \sum \mathrm{Tr}(\mathbf{AC}_{\overline{i}, \overline{j}})\nonumber \\ s.t. \quad&\mathrm{Tr}(\mathbf{AC}_{i,j,k}) \ge 1 - \xi _{i,j,k},\;\;\mathbf{A}\succeq 0\;\; \mathrm{and} \;\; \nonumber \\&\qquad \xi _{i,j,k} \ge 0, \;\; \forall (i,j,k) \in \mathcal {T},\;\; \forall (\overline{i}, \overline{j}) \in \mathcal {P}, \end{aligned}$$
(8)

where both \(\gamma _1\) and \(\gamma _2\) are non-negative weights, and \(\mathcal {P}\) is a set of pairs \((\overline{i}, \overline{j})\) indicating that \(\mathbf{x}_{\overline{j}}\) is a target neighbor of \(\mathbf{x}_{\overline{i}}\). We discuss how to set \(\gamma _1\), \(\gamma _2\), \(\mathcal {T}\) and \(\mathcal {P}\) in Sect. 5.2. The first term in (8) is a regularization term which controls the complexity of the learned model. The second term is the standard hinge loss, which penalizes any violated constraint defined in (7); minimizing it is equivalent to maximizing a distance margin, which makes the learned model robust against noise. The third term encourages the Euclidean distance between the projections of \(\mathbf{x}_{\overline{i}}\) and \(\mathbf{x}_{\overline{j}}\) in the latent common space (i.e., \(d(\overline{i},\overline{j})\)) to be as short as possible. The effects of optimizing (8) are illustrated in Fig. 2.
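To make the construction concrete, the following sketch (a minimal NumPy illustration under our notation, not the authors' implementation) builds the selection matrices \(\mathbf{S}_{\varOmega _m}\), the matrices \(\mathbf{C}_{i,j}\) of (6), and the margin \(\mathrm{Tr}(\mathbf{AC}_{i,j,k})\) appearing in the constraints of (8):

```python
import numpy as np

def selection_matrices(dims):
    """S_m: a (sum_n d_n) x d_m zero matrix whose rows sum_{n<m} d_n + 1
    through sum_{n<=m} d_n form an identity block, so that W_m = W S_m."""
    D = sum(dims)
    offsets = np.cumsum([0] + list(dims[:-1]))
    S = []
    for off, d in zip(offsets, dims):
        Sm = np.zeros((D, d))
        Sm[off:off + d, :] = np.eye(d)
        S.append(Sm)
    return S

def C_pair(x_i, t_i, x_j, t_j, S):
    """C_{i,j} of Eq. (6): outer product of the zero-padded difference vector."""
    v = S[t_i] @ x_i - S[t_j] @ x_j
    return np.outer(v, v)

def triplet_margin(A, x, t, S, i, j, k):
    """Tr(A C_{i,j,k}) = d(i,k) - d(i,j); the constraint in (8) asks this >= 1."""
    C_ijk = C_pair(x[i], t[i], x[k], t[k], S) - C_pair(x[i], t[i], x[j], t[j], S)
    return np.sum(A * C_ijk)                  # Tr(AC) for symmetric A and C
```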

Fig. 2 Illustration of the effects of optimizing (8). Given \(\mathbf{x}_h\) from different modalities, where \(h\in \{ i,j,k, \overline{i}, \overline{j}\}\) with \((i,j,k) \in \mathcal {T}\) and \((\overline{i}, \overline{j}) \in \mathcal {P}\), we attempt to learn projection matrices \(\mathbf{W}_{\varOmega _{t_h}}\) so that the distance margin \(d(i,k)-d(i,j)\) is maximized while the distance between points in the target neighborhood, \(d(\overline{i}, \overline{j})\), is minimized

3.3 Optimization

We consider the Lagrangian of (8):

$$\begin{aligned}&\mathcal {L}(\mathbf{A},\mathbf{\xi }, \mathbf{X}, \mathbf{u},\mathbf{p})=\frac{1}{2}\left\| \mathbf{A} \right\| _\mathrm{F}^2 + \gamma _1 \sum \xi _{i,j,k}\nonumber \\&\qquad +\,\gamma _2 \sum \mathrm{Tr}(\mathbf{AC}_{\overline{i}, \overline{j}})-\sum {u_{i,j,k}\mathrm{Tr}(\mathbf{AC}_{i,j,k})}\nonumber \\&\qquad +\sum {u_{i,j,k}}-\sum {u_{i,j,k}\xi _{i,j,k}} -\mathbf{p}^\mathrm{T}\mathbf{\xi }-\mathrm{Tr}(\mathbf{AX})\nonumber \\&\qquad s.t. \;\; \mathbf{X}\succeq 0,\;\; \mathrm{and} \;\; u_{i,j,k} \ge 0\;\; \mathrm{and} \;\;p_{i,j,k} \ge 0, \;\;\nonumber \\&\qquad \forall (i,j,k) \in \mathcal {T}, \forall (\overline{i},\overline{j}) \in \mathcal {P}, \end{aligned}$$
(9)

where \(\mathbf{X}\), \(u_{i,j,k}\) and \(\mathbf{p}\) are the Lagrangian multipliers for the primal variable \(\mathbf{A}\), the constraint corresponding to the training triplet \((i,j,k)\) in (8), and \(\mathbf{\xi }\), respectively. Setting the gradient of (9) with respect to the primal variables \(\mathbf{A}\) and \(\mathbf{\xi }\) to \(\mathbf{0}\) gives

$$\begin{aligned} \mathbf{A}^{*}=\mathbf{X}^{*}+\sum {u_{i,j,k}^{*}\mathbf{C}_{i,j,k}}-\gamma _2\sum {\mathbf{C}}_{\overline{i}, \overline{j}}, \end{aligned}$$
(10)

and \(u_{i,j,k}^{*}=\gamma _1-p_{i,j,k}\). Substituting the above expressions into the Lagrangian (9) gives the negative of the dual problem:

$$\begin{aligned} \min \quad&\frac{1}{2}\left\| \mathbf{X}-\hat{\mathbf{C}} \right\| _\mathrm{F}^2 - \sum u _{i,j,k}\nonumber \\ s.t. \;\;&\mathbf{X}\succeq 0\;\; \mathrm{and} \;\; \gamma _1 \ge u _{i,j,k} \ge 0, \;\;\nonumber \\&\qquad \forall (i,j,k) \in \mathcal {T}, \end{aligned}$$
(11)

where \(\hat{\mathbf{C}}=-\sum {u_{i,j,k}\mathbf{C}_{i,j,k}}+\mathbf{B}\) with the matrix \(\mathbf{B}\) being \(\gamma _2\sum {\mathbf{C}}_{\overline{i}, \overline{j}}\).

(11) has two variables, namely \(\mathbf{X}\) and \(\mathbf{u}\), and is optimized by the alternating variable method, in which one variable is optimized while the other is fixed. In each iteration, \(\mathbf{X}\) is first optimized while fixing \(\mathbf{u}\), and \(\mathbf{u}\) is then optimized while fixing \(\mathbf{X}\). Conveniently, while fixing \(\mathbf{u}\), (11) has a closed-form optimal solution:

$$\begin{aligned} \mathbf{X}^{*}=(\hat{\mathbf{C}})_+. \end{aligned}$$
(12)

Fixing \(\mathbf{X}\) leads to the following box-constrained quadratic program (QP) over \(\mathbf{u}\):

$$\begin{aligned} \min \quad&\frac{1}{2}\left\| \mathbf{X}^{*}-\mathbf{B}+\sum {u_{i,j,k}\mathbf{C}_{i,j,k}} \right\| _\mathrm{F}^2 - \sum u _{i,j,k}\nonumber \\ s.t. \;\;&\gamma _1 \ge u _{i,j,k} \ge 0, \;\; \forall (i,j,k) \in \mathcal {T}. \end{aligned}$$
(13)

The off-the-shelf quasi-Newton algorithm L-BFGS-B (Liu and Nocedal 1989) is employed to solve this QP. L-BFGS-B is an iterative algorithm that updates \(\mathbf{u}\) in each iteration until convergence.

In the standard alternating variable framework, one variable (e.g., \(\mathbf{X}\)) is updated only after the other variable (e.g., \(\mathbf{u}\)) stops changing. For faster convergence, we update \(\mathbf{X}\) whenever \(\mathbf{u}\) changes in each iteration of L-BFGS-B. The gradient of (13) is then given by

$$\begin{aligned} {G}(u_{i,j,k})&= \mathrm{Tr}((\mathbf{X}-\mathbf{B}+\sum {u_{i,j,k}\mathbf{C}_{i,j,k}})\mathbf{C}_{i,j,k})-1\nonumber \\&= \mathrm{Tr}(-(\hat{\mathbf{C}})_-\mathbf{C}_{i,j,k})-1. \end{aligned}$$
(14)

The overall optimization procedure is summarized in Algorithm 1. Its main body is the off-the-shelf algorithm L-BFGS-B. The computation of the objective value and gradient, and the update of \(\mathbf{X}\) (Lines 5–8), are implemented as callback functions; the update of \(\mathbf{u}\) and its approximated Hessian (Line 9) is provided internally by L-BFGS-B and thus needs no implementation.

Algorithm 1
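For illustration, the following sketch (ours; it assumes matrix_parts from Sect. 3.1, a list C_triplets of the matrices \(\mathbf{C}_{i,j,k}\), and \(\mathbf{B}=\gamma _2\sum {\mathbf{C}}_{\overline{i}, \overline{j}}\), with SciPy's L-BFGS-B standing in for the solver of Liu and Nocedal 1989) runs Algorithm 1 on the dual (11). It uses the fact that \(\mathbf{X}^{*}=(\hat{\mathbf{C}})_+\) implies \(\mathbf{X}^{*}-\hat{\mathbf{C}}=-(\hat{\mathbf{C}})_-\), so by (10) the optimum is \(\mathbf{A}^{*}=-(\hat{\mathbf{C}})_-\):

```python
import numpy as np
from scipy.optimize import minimize

def rpsa_train(C_triplets, B, gamma1):
    """Sketch of Algorithm 1: optimize the dual (11) over u with L-BFGS-B,
    refreshing X = (C_hat)_+ inside the callback (Lines 5-8)."""
    K = len(C_triplets)

    def fun_and_grad(u):
        C_hat = B - sum(ui * Ci for ui, Ci in zip(u, C_triplets))
        _, C_neg = matrix_parts(C_hat)             # X* - C_hat = -(C_hat)_-
        f = 0.5 * np.sum(C_neg ** 2) - u.sum()     # objective of (11)
        g = np.array([-np.sum(C_neg * Ci) - 1.0    # Eq. (14)
                      for Ci in C_triplets])
        return f, g

    res = minimize(fun_and_grad, np.zeros(K), jac=True, method='L-BFGS-B',
                   bounds=[(0.0, gamma1)] * K)     # box constraints of (13)
    C_hat = B - sum(ui * Ci for ui, Ci in zip(res.x, C_triplets))
    return -matrix_parts(C_hat)[1]                 # A* = -(C_hat)_-, from Eq. (10)
```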

After obtaining the optimum \(\mathbf{A^*}\), we obtain \(\mathbf{W}\) by minimizing \(\left\| \mathbf{A^*}-\mathbf{W}^\mathrm{T}\mathbf{W} \right\| _\mathrm{F}\). Supposing the rows of \(\mathbf{W}\) are orthogonal to each other, \(\mathbf{W}^\mathrm{T}\mathbf{W}\) is a positive-semidefinite matrix with rank \(d_z\) (i.e., the dimension of the latent common space \(Z\)). By the Eckart–Young theorem (Stewart 1993), \(\mathbf{W}^\mathrm{T}\mathbf{W}\) is then the rank-\(d_z\) approximation of \(\mathbf{A^*}\). We perform an eigenvalue decomposition of the positive-semidefinite matrix \(\mathbf{A^*}\), obtaining \(\mathbf{A^*}={\mathbf{U}\varLambda } \mathbf{U}^\mathrm{T}\) with \(\mathbf{U}\) an orthogonal matrix and \({ \varLambda }\) a real diagonal matrix with decreasing eigenvalues \(\sigma _1\ge \cdots \ge \sigma _{\sum d_m}\), and set \(\mathbf{W}={ \varLambda }'\mathbf{U}^\mathrm{T}\), where \({\varLambda }'\) is a diagonal matrix with diagonal values \(\sqrt{\sigma _1}, \sqrt{\sigma _2}, \dots , \sqrt{\sigma _{d_z}}, 0, \dots , 0\). Linear projections \(\mathbf{W}_{\varOmega _m}\) for different dimensions of \(Z\) can thus be obtained after optimizing (8) and a single eigenvalue decomposition. Note that the appropriate latent common space dimension \(d_z\) is application dependent and is determined by cross validation in this paper.
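The factorization step reads directly off the eigendecomposition; a minimal sketch (our illustration):

```python
import numpy as np

def split_projections(A_star, d_z, dims):
    """Rank-d_z factorization A* ~= W^T W, then splitting W into the
    per-modality blocks W_m of size d_z x d_m."""
    w, U = np.linalg.eigh(A_star)
    idx = np.argsort(w)[::-1]                     # decreasing eigenvalues
    w, U = w[idx], U[:, idx]
    W = np.sqrt(np.maximum(w[:d_z], 0.0))[:, None] * U[:, :d_z].T  # Lambda'^{1/2} U^T
    off = np.concatenate(([0], np.cumsum(dims)))
    return [W[:, off[m]:off[m + 1]] for m in range(len(dims))]
```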

3.4 Time Complexity

In this section, we discuss the time complexity of Algorithm 1. In each iteration, the time complexity of computing \(\hat{\mathbf{C}}\) is \(\mathcal {O}(Ks^2)\), where \(K\) is the number of training triplets (i.e., \(|\mathcal {T}|\)) and \(s\) is the average sparsity of \(\mathbf{S}_{\varOmega _{t_i}}\mathbf{x}_i - \mathbf{S}_{\varOmega _{t_j}}\mathbf{x}_j\) or of \(\mathbf{S}_{\varOmega _{t_i}}\mathbf{x}_i - \mathbf{S}_{\varOmega _{t_k}}\mathbf{x}_k\), whichever is bigger (Line 5). The eigenvalue decomposition of \({\hat{\mathbf{C}}}\) has \({\mathcal {O}(D^3)}\) time complexity with \(D=\sum {d_m}\) (Line 6). Computing the gradient has time complexity \(\mathcal {O}(Ks^2)\) (Line 7), and computing the objective value \(\mathcal {O}(D^2+K)\) (Line 8). The cost of updating \(\mathbf{u}\) and the approximated Hessian (Shen et al. 2011) is \(\mathcal {O}(rK)\) with \(r\) a constant (Line 9).

If \(K\) is not much greater than \(D\), the eigenvalue decomposition of \({\hat{\mathbf{C}}}\) dominates the computation in each iteration, and the optimization algorithm converges in a small number of iterations. In this case, the overall time complexity is \({\mathcal {O}(T_1 D^3)}\) with \(D=\sum {d_m}\) and \(T_1\) the number of iterations.

However, in real applications one prefers to collect a large-scale data set for learning stable models, so that the distribution of the training data converges to the true underlying data distribution. In general, a data set of size \(N\) yields \({\mathcal {O}(N^3)}\) training triplets if all possible combinations are enumerated. Although the number of training triplets can be cut down with heuristics as in the work of Rakotomamonjy (2004), it remains very large; for instance, the number of triplets in our experiments on the Wiki data set in Sect. 5 is as large as one million. In this case, \(K\) is much greater than \(D\), and the time spent on the eigenvalue decomposition can be ignored. The overall time complexity is then \(\mathcal {O}(T_1r_1K)\) with \(r_1\) a constant.

To validate the above discussion, an experiment was conducted on the Wiki data set to show the relationship between the training time and the number of training triplets (Fig. 3). We observe that the eigenvalue decomposition (Line 6) dominates the total training time when the number of training triplets is small, while computing the gradient (Line 7) dominates when it is large (see Fig. 3a). Figure 3b shows that the training time of Algorithm 1 is a linear function of the number of training triplets when that number is large. These observations are consistent with the discussion above.

Fig. 3 Training time of RPSA against the number of training triplets. a shows the training time ratio of each step in Algorithm 1 as the number of training triplets increases. b shows the training time of Algorithm 1 as the number of training triplets increases

4 Efficient Relatively-Paired Space Analysis

As discussed in the previous section, the optimization of (8) slows down dramatically as the number of training triplets increases. The underlying reason is that a large set of training triplets leads to a long vector \(\mathbf{u}\) and thus a large-scale QP problem (13). One possible way to speed up the optimization is to reduce the number of training triplets involved. In the literature, Stochastic Gradient Descent (SGD) (Bottou 2010) is usually employed to train models over large-scale training samples by randomly selecting a mini-batch at each step. Although SGD has been used successfully in many applications such as training SVMs (Shalev-Shwartz et al. 2007), there is no theoretical guarantee that it converges to an optimal solution, so its usefulness heavily depends on the user's parameter-tuning experience. Structural learning (Taskar 2004) can also be used to speed up training by selecting the most violated constraint (training sample) in each iteration. Joachims (2006) reformulated the linear SVM model as a structural SVM model solved by the cutting plane algorithm; the reformulated model is provably equivalent to the original SVM. Interestingly, this structural learning based approach is several orders of magnitude faster than decomposition methods when the feature vectors are highly sparse. The model solved in this paper differs from an SVM, but both involve constrained optimization, and it was not a priori clear whether structural learning could speed up our multi-modality analysis model.

To this end, we first reformulate the problem of relatively-paired space analysis as learning a structural model. The cutting plane algorithm (Tsochantaridis et al. 2004) is then used to solve this problem, with only a few training triplets involved in each iteration. Finally, the efficiency of Algorithm 1 and of the structural learning based approach is compared in terms of time complexity and empirical training time.

4.1 Reformulating RPSA into Structural Learning

By introducing a binary variable \(c_{i,j,k}\in \{0,1\}\) for each triplet \((i,j,k)\), the RPSA model can be reformulated as a structural learning problem, which can be learned by solving the following optimization problem:

$$\begin{aligned} \min \quad&\frac{1}{2}\left\| \mathbf{A} \right\| _\mathrm{F}^2 + \gamma _1 K \xi +\gamma _2 \sum \mathrm{Tr}(\mathbf{AC}_{\overline{i}, \overline{j}})\nonumber \\ s.t. \quad&\frac{1}{K}\sum c_{i,j,k}\mathrm{Tr}(\mathbf{AC}_{i,j,k}) \ge \frac{1}{K}\sum c_{i,j,k} - \xi , \quad \nonumber \\&\qquad \mathbf{A}\succeq 0\;\; \mathrm{and} \;\; \xi \ge 0, \forall \mathbf{c}\in \{0,1\}^K. \end{aligned}$$
(15)

While (15) has \(2^K\) constraints, one for each possible vector \(\mathbf{c}\in \{0,1\}^K\), it has only one slack variable \(\xi \), which is shared across all constraints. Interestingly, (15) and (8) are equivalent.

Theorem 1

Any solution \(\mathbf{A}^*\) of (15) is also a solution of (8) (and vice versa) with \(\xi ^*=\frac{1}{K}\sum \xi _{i,j,k}^* \).

Proof

The following derivation shows that, for any \(\mathbf{A}\), (8) and (15) have the same objective value. Given \(\mathbf{A}\), each \(\xi _{i,j,k}\) can be optimized individually so that the objective value of (8) is as small as possible, i.e., \(\xi _{i,j,k}^*=\max (0,1-\mathrm{Tr}(\mathbf{A}\mathbf{C}_{i,j,k}))\). For (15), we have

$$\begin{aligned} \gamma _1 K \xi ^*&= \gamma _1 K \max (0,\max _{\mathbf{c}}(\frac{1}{K}\sum {c_{i,j,k}}\nonumber \\&-\frac{1}{K}\sum {c_{i,j,k}}\mathrm{Tr}(\mathbf{A}\mathbf{C}_{i,j,k})))\end{aligned}$$
(16a)
$$\begin{aligned}&= \gamma _1 \max (0,\max _{\mathbf{c}}(\sum {c_{i,j,k}}\nonumber \\&-\sum {c_{i,j,k}}\mathrm{Tr}(\mathbf{A}\mathbf{C}_{i,j,k})))\end{aligned}$$
(16b)
$$\begin{aligned}&= \gamma _1 \max (0,\sum \max _{c_{i,j,k}}(c_{i,j,k}\nonumber \\&- c_{i,j,k}\mathrm{Tr}(\mathbf{A}\mathbf{C}_{i,j,k})))\end{aligned}$$
(16c)
$$\begin{aligned}&= \gamma _1 \sum \max (0,1-\mathrm{Tr}(\mathbf{A}\mathbf{C}_{i,j,k}))\end{aligned}$$
(16d)
$$\begin{aligned}&= \gamma _1 \sum \xi _{i,j,k}^* \end{aligned}$$
(16e)

(16a) follows directly from the definition of \(\xi ^*\); (16b) simply cancels the factor \(\frac{1}{K}\); (16c) holds because each element of \(\mathbf{c}\) is independent and can be optimized individually; (16c) and (16d) are equivalent because \(c_{i,j,k}\in \{0,1\}\); and (16e) again follows directly from the definition of \(\xi _{i,j,k}^*\). The above equations prove that the objective values of (8) and (15) are the same for any given \(\mathbf{A}\). Additionally, (8) and (15) have identical solution spaces.\(\square \)

4.2 Optimization

Theorem 1 guarantees that (8) and (15) have the same optima. One may notice that (15) has even more constraints (\(2^K\)) than (8) (\(K\)), and wonder what is gained by this reformulation. Note that there is only one slack variable, which upper-bounds the penalty of all possible constraints in (15). This suggests that only a small subset of the constraints is informative. Ignoring the other, non-informative constraints leads to a reduced problem which can be solved efficiently, as only a few constraints appear in its primal form, or only a few dual variables in its dual form. Joachims (2006) employed the cutting-plane algorithm to find a small set of most violated constraints to speed up the training of linear support vector machines; it has been proved that the SVM problem can be approximated by a reduced problem with a small constant number of constraints. Surprisingly, the number of constraints in the reduced problem is independent of that in the original SVM.

The cutting-plane algorithm is employed to solve (15). In each iteration, the most-violated constraint is first found:

$$\begin{aligned} \mathbf{c}^{*}= \mathrm{arg\,max }_{\mathbf{c}} \sum c_{i,j,k}-\sum c_{i,j,k}\mathrm{Tr}(\mathbf{AC}_{i,j,k}), \end{aligned}$$
(17)

and added to the most-violated constraint set \(\Gamma \), i.e., \(\Gamma =\Gamma \cup \{\mathbf{c}^{*}\}\). The \(c_{i,j,k}\) are independent and can thus be optimized individually: \(c^{*}_{i,j,k}=1\) if \(\mathrm{Tr}(\mathbf{AC}_{i,j,k})<1\), and 0 otherwise. \(\mathbf{A}\) is then updated by optimizing the following reduced problem:

$$\begin{aligned} \min \quad&\frac{1}{2}\left\| \mathbf{A} \right\| _\mathrm{F}^2 + \gamma _1 K \xi +\gamma _2 \sum \mathrm{Tr}(\mathbf{AC}_{\overline{i}, \overline{j}})\nonumber \\ s.t. \quad&\frac{1}{K}\sum c_{i,j,k}\mathrm{Tr}(\mathbf{AC}_{i,j,k}) \ge \frac{1}{K}\sum c_{i,j,k} - \xi ,\nonumber \\&\qquad \mathbf{A}\succeq 0\;\; \mathrm{and} \;\; \xi \ge 0,\forall \mathbf{c}\in \Gamma . \end{aligned}$$
(18)

Obviously, the number of constraints is \(|\Gamma |\), which is usually small. Let \(\overline{\mathbf{C}}_{\mathbf{c}}=\frac{1}{K}\sum c_{i,j,k} \mathbf{C}_{i,j,k}\) and \(\overline{w}_{\mathbf{c}}=\frac{1}{K}\sum c_{i,j,k}\). The negative dual problem of (18) is given by:

$$\begin{aligned}&\min \quad \frac{1}{2}\left\| \overline{\mathbf{X}}-\hat{\overline{\mathbf{C}}} \right\| _\mathrm{F}^2 - \sum \overline{ w}_{\mathbf{c}}\overline{u} _{\mathbf{c}}\nonumber \\&\qquad s.t. \;\; \sum \overline{u}_{\mathbf{c}}\le K\gamma _1,\;\;\mathrm{and} \;\; \overline{u}_{\mathbf{c}} \ge 0, \;\; \forall \mathbf{c}\in \Gamma , \end{aligned}$$
(19)

where \(\overline{\mathbf{X}}\) and \(\overline{u}_{\mathbf{c}}\) are the Lagrangian multipliers of \(\mathbf{A}\) and of the constraint corresponding to \(\mathbf{c}\), respectively, and \(\hat{\overline{\mathbf{C}}}=-\sum {\overline{u}_{\mathbf{c}}\overline{\mathbf{C}}_{\mathbf{c}}}+\mathbf{B}\). Like (11), (19) has two variables and can be optimized by the alternating variable method. Again, \(\overline{\mathbf{X}}\) has a closed-form optimal solution while fixing \(\overline{\mathbf{u}}\), i.e.,

$$\begin{aligned} \overline{\mathbf{X}}^{*}=(\hat{\overline{\mathbf{C}}})_+. \end{aligned}$$
(20)

Fixing \(\overline{\mathbf{X}}\) and optimizing over \(\overline{u}_{\mathbf{c}}\) gives a quadratic program (QP) with a sum constraint:

$$\begin{aligned} \min \quad&\frac{1}{2}\left\| \overline{\mathbf{X}}^{*}-\mathbf{B}+\sum {\overline{u}_{\mathbf{c}}\overline{\mathbf{C}}_{\mathbf{c}}} \right\| _\mathrm{F}^2 - \sum \overline{w}_{\mathbf{c}}\overline{u} _{\mathbf{c}}\nonumber \\ s.t. \;\;&\overline{\mathbf{X}}\succeq 0\;\; \mathrm{and} \;\; \sum \overline{u}_{\mathbf{c}}\le K\gamma _1,\;\; \overline{u}_{\mathbf{c}} \ge 0, \;\; \forall \mathbf{c}\in \Gamma . \nonumber \\ \end{aligned}$$
(21)

Since L-BFGS-B cannot solve QP problems with a sum constraint, the quadprog function with the interior-point option (Footnote 1) in Matlab is employed to solve it efficiently.

The optimization procedure of (15) is summarized in Algorithm 2. In Line 3, \(\mathbf{A}\) is initialized to the identity matrix \(\mathbf{I}\). In Line 7, we warm-start problem (21) with the previous \(\overline{\mathbf{u}}\) to speed up convergence.

Algorithm 2
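The outer loop of Algorithm 2 can be sketched as follows (our illustration; solve_reduced is a hypothetical helper wrapping the alternating solver of (19)–(21) and returning the updated \(\mathbf{A}\) and slack \(\xi \)):

```python
import numpy as np

def rpsa_cutting_plane(C_triplets, B, gamma1, tol=1e-3, max_iter=100):
    """Sketch of Algorithm 2: grow a working set Gamma of most violated
    constraint vectors c (Eq. (17)) and re-solve the reduced problem (18)."""
    K = len(C_triplets)
    A = np.eye(B.shape[0])                        # Line 3: A initialized to I
    Gamma, xi = [], 0.0
    for _ in range(max_iter):
        margins = np.array([np.sum(A * Ci) for Ci in C_triplets])
        c = (margins < 1.0).astype(float)         # Eq. (17), element-wise optimum
        violation = (c.sum() - (c * margins).sum()) / K
        if violation <= xi + tol:                 # nothing violated beyond xi: stop
            break
        Gamma.append(c)
        A, xi = solve_reduced(Gamma, C_triplets, B, gamma1)  # hypothetical (19)-(21)
    return A
```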

4.3 Time Complexity

In each outer iteration of Algorithm 2, the time complexity of computing the most violated constraint (Line 5) is \(\mathcal {O}(Ks^2)\). Lines 6–9 solve the reduced problem (18), which is itself an RPSA problem and can be optimized by Algorithm 1. Since it has very few constraints (i.e., \(|\Gamma |\) is small and thus \(\overline{\mathbf{u}}\) is a short vector), it can be solved with time complexity \({\mathcal {O}(T_1 D^3)}\) as discussed in Sect. 3.4. For large-scale training triplets, the time for computing the most violated constraint dominates each outer iteration, while that for solving the reduced problem is negligible. Therefore, the time complexity of Algorithm 2 is \(\mathcal {O}(T_2Ks^2)\) with the constant \(T_2\) being the number of outer iterations.

Both Algorithms 1 and 2 have linear time complexity per outer iteration when a huge number of training triplets is available. However, Algorithm 2 empirically converges much faster than Algorithm 1; theoretically, its number of outer iterations \(T_2\) does not depend on the number of training examples \(K\) (Joachims et al. 2009).

4.4 Efficiency Comparison

Algorithms 1 and 2 converge to an identical solution, as guaranteed by Theorem 1, so only their efficiency is of concern. We conducted two experiments on the Wiki Text-Image data set to compare them: the number of training triplets was set to \(4\times 10^4\) in the first experiment and to \(10^6\) in the second (other settings can be found in Sect. 5.5).

Figure 4 plots the energy of (8) against training time. Figure 4a shows that Algorithm 1 converges faster than Algorithm 2 when the number of triplets (\(K\)) is small. This is reasonable, since in this regime eigenvalue decomposition dominates the computation time and Algorithm 1 invokes fewer eigenvalue decompositions than Algorithm 2. Moreover, Algorithm 1 exhibits a more stable energy decrease; the underlying reason is that Algorithm 2 adds one most violated constraint in each outer iteration. Figure 4b shows that Algorithm 2 is typically several orders of magnitude faster than Algorithm 1 when the number of triplets (\(K\)) is huge.

Fig. 4 Efficiency comparison between Algorithms 1 and 2 on the Wiki Text-Image data set with different numbers of training triplets

4.5 Discussion

Like previous work (Tsochantaridis et al. 2004; Joachims 2006), Algorithm 2 uses a 1-slack energy function. However, there are two significant differences. First, Algorithm 2 is a reformulation of a semi-definite programming problem, whereas Tsochantaridis et al. (2004) present a general framework for structural learning and Joachims (2006) reformulates a linear SVM. Second, Algorithm 2 has very different properties from Joachims (2006)'s approach: our reformulation has the advantage over the original formulation when \(K\) is huge, while Joachims (2006)'s reformulation has the advantage when feature vectors are highly sparse. In summary, our main technical contribution is to seamlessly integrate semi-definite programming with the cutting plane algorithm; another is the detailed time-complexity analysis of Algorithms 1 and 2 and their empirical comparison.

5 Experiments

The performance of our proposed RPSA framework was evaluated by applying it to feature fusion, cross-pose face recognition, text-image retrieval and attribute-image retrieval.

5.1 Training Triplets and Pairs

Training triplets \((i, j, k)\in \mathcal {T}\) can be generated in an unsupervised or supervised fashion. Relatively-paired data can be collected from the clickthrough data of search engines or from prior knowledge about relative-pairing; this kind of data comes naturally. Triplets can also be generated from category labels, based on the principle that observations with the same label are expected to be more-likely-paired than those with different labels. Let \(l_i\) denote the label of an observation \(\mathbf{x}_i\). Given a pair of cross-modality observations \((\mathbf{x}_p, \mathbf{x}_q)\) (where \(t_p \ne t_q\)) for an object, we define four types of triplets to describe the relative-pairing knowledge (see Table 1). Each triplet \((i,j,k)\) indicates that \(\mathbf{x}_i\) is more-likely-paired with \(\mathbf{x}_j\) than with \(\mathbf{x}_k\). The Euclidean distance between two observations is used to define nearest neighbors in Table 1, and Figure 5 gives a graphical illustration of the four types of triplets. If the numbers of the four types of triplets are \(n_1\), \(n_2\), \(n_3\) and \(n_4\), respectively, for each given pair \((\mathbf{x}_p, \mathbf{x}_q)\), we say the training triplets have a structure of \((n_1,n_2,n_3,n_4)\). The total number of triplets is therefore \((n_1+n_2+n_3+n_4)\times N_p\), where \(N_p\) is the number of pairs. A hedged code sketch of the label-based generation follows Table 1.

Fig. 5 Four types of triplets defined for describing relative-pairing information of a given pair of observations (\(\mathbf{x}_p\), \(\mathbf{x}_q\)). The connecting symbol indicates that \(\mathbf{x}_p\) and \(\mathbf{x}_q\) are paired observations from different modalities. Grids on the same horizontal line contain cross-modality observations with the same label

Table 1 Four types of triplets defined for describing relative-pairing information of a given pair of observations (\(\mathbf{x}_p\), \(\mathbf{x}_q\))
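As an illustration of the label-based generation only (the precise four-type scheme is given in Table 1; this hedged sketch implements just the shared principle that same-label observations are more-likely-paired than different-label ones):

```python
import numpy as np

def label_triplets(labels, flags, n_neg=5, seed=0):
    """For each cross-modality pair (p, q) with the same label, emit triplets
    (p, q, k) stating that x_p is more-likely-paired with x_q than with any
    x_k carrying a different label. Table 1 further splits such triplets into
    four types according to the modalities of p, q and k."""
    rng = np.random.default_rng(seed)
    labels, flags = np.asarray(labels), np.asarray(flags)
    triplets = []
    for p in range(len(labels)):
        for q in range(p + 1, len(labels)):
            if flags[p] == flags[q] or labels[p] != labels[q]:
                continue                          # keep cross-modality, same-label pairs
            negs = np.flatnonzero(labels != labels[p])
            picks = rng.choice(negs, size=min(n_neg, len(negs)), replace=False)
            triplets.extend((p, q, int(k)) for k in picks)
    return triplets
```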

Similarly, we generate the training pair set \(\mathcal {P}\) from labels. Given a pair of cross-modality observations \((\mathbf{x}_p, \mathbf{x}_q)\) (where \(t_p \ne t_q\)) for an object, three types of pairs are defined to describe target neighborhoods (see Table 2). Each pair \((\overline{i}, \overline{j})\) indicates that \(\mathbf{x}_{\overline{i}}\) is a target neighbor of \(\mathbf{x}_{\overline{j}}\) and vice versa. Again, if the numbers of the three types of pairs are \(\overline{n}_1\), \(\overline{n}_2\) and \(\overline{n}_3\), respectively, for each given pair \((\mathbf{x}_p, \mathbf{x}_q)\), we say the training pairs have a structure of \((\overline{n}_1, \overline{n}_2, \overline{n}_3)\). We therefore have \((\overline{n}_1+ \overline{n}_2 + \overline{n}_3)\times N_p\) training pairs in total. Note that \(\overline{n}_1\) can only be \(0\) or \(1\).

5.2 Parameter Settings

There are two weights, \(\gamma _1\) and \(\gamma _2\), in our energy function (8). In our experiments, we found that \(\gamma _1\) is not sensitive to other settings, such as the number of training triplets; the underlying reason is that the hinge loss term only penalizes violated constraints in (7), no matter how many training triplets there are. We therefore fixed it to 1 in all our experiments. Because \(\gamma _2\) weights the sum of distances between points in a neighborhood, it is affected by the scale of the feature vectors; hence, we tuned \(\gamma _2\) individually for each data set using validation data. Detailed analyses can be found in the corresponding sections.

Theoretically, the more training triplets we use, the more constraint information and the better the performance we get. This has been confirmed in our experimental results (see Fig. 6). We therefore used as many triplets as possible in our experiments. Because the numbers of training data per category differ across data sets, we used different training triplets in different tasks; detailed triplet structures can be found in the corresponding sections.

Fig. 6 Performance of RPSA on validation data with different numbers of training triplets. RPSA is used to fuse the feature pair (Zer, Mor). \(n_1\), \(n_2\), \(n_3\) and \(n_4\) are set to \(n\) while the other parameters are tuned to maximize the performance for each specific \(n\)

The parameters regarding the training pairs \(\mathcal {P}\) are insensitive to other settings. If one expects the distance between the projections of \(\mathbf{x}_p\) and \(\mathbf{x}_q\) to be short, \(\overline{n}_1\) should be set to \(1\), and otherwise to \(0\). The second and third types of training pairs encourage small distances between projections of observations with the same label within each modality. We found that a small target neighborhood (i.e., setting \(\overline{n}_2\) and \(\overline{n}_3\) to a small number, e.g., 5) works well in our experiments. For feature fusion, diversity among the projections of exactly-paired observations from different modalities, together with small distances between projections of same-label observations in the target neighborhood within each modality, is desirable; therefore, we set \(\overline{n}_1=0\) and \(\overline{n}_2=\overline{n}_3=5\). For cross-modality pattern recognition tasks, namely cross-pose face recognition, text-image retrieval and attribute-image retrieval, similar projections of exactly-paired observations are desirable, and thus we set \(\overline{n}_1=1\) and \(\overline{n}_2=\overline{n}_3=0\).

To summarize, only \(\gamma _2\) needs to be tuned for each data set, while the other parameters are fixed in advance.

Table 2 Three types of pairs defined for describing target neighborhood of a given pair of observations (\(\mathbf{x}_p\), \(\mathbf{x}_q\))

5.3 Feature Fusion

When classifying patterns with different kinds of features stemming from different sources, a critical issue is how to efficiently utilize these cross-modality features. A common solution is feature fusion: cross-modality features are first projected into a latent common space to reduce dimension and suppress noise, and the paired projections are then added together to form the final feature vector. The fused feature for two modalities (Sun et al. 2008; Zhang and Zhang 2011) is usually given by

$$\begin{aligned} \mathbf{y} = \mathbf{W}_{ \varOmega _{t_i}}\mathbf{x}_i + \mathbf{W}_{\varOmega _{t_j}}\mathbf{x}_j, \end{aligned}$$
(22)

where \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are two feature vectors for different modalities of an object (i.e., \(t_i\ne t_j\)). The proposed method was used to fuse features of the UCI Multiple Features data set (Footnote 2). This data set consists of 2,000 instances of ten hand-written numerals (‘0’–‘9’). Each instance has six features, namely Fou, Fac, Kar, Pix, Zer and Mor, with dimensions 76, 216, 64, 240, 47, and 6, respectively. We considered each feature as one modality. In our experiment, any two kinds of features were selected for fusion, giving \(C_2^6=15\) combination pairs. In the training phase, for each feature pair, the number of training data per digit (\(N_t\)) was set to 100. The latent common space had a dimension of 25, except for feature pairs involving Mor, where it had a dimension of 6. In the testing phase, we find the nearest labeled training fused feature for each testing fused feature. The experiment was repeated 10 times, each time randomly selecting a fixed number of training data (i.e., \(N_t\times 10\), where \(10\) is the number of digit categories). We evaluated our method by mean recognition rates.
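A minimal sketch of this fusion-and-classification pipeline (our illustration; W is the list of per-modality projection matrices learned by RPSA):

```python
import numpy as np

def fuse(W, x_i, t_i, x_j, t_j):
    """Fused feature of Eq. (22): sum of the two latent projections."""
    return W[t_i] @ x_i + W[t_j] @ x_j

def classify_fused(y, train_fused, train_labels):
    """1-NN over the fused training features, as in the testing phase."""
    d = np.linalg.norm(train_fused - y, axis=1)
    return train_labels[int(np.argmin(d))]
```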

To determine \(\gamma _2\), we used \(\frac{1}{5}\) of the training data as validation data and the rest as “training data”. The structure of the training triplets was fixed to (79,79,79,79) (given a digit \(\mathbf{x}_p\), the number of digits with the same label as \(\mathbf{x}_p\) in the “training data” is 79). The modality pair (Zer, Mor) was fused with different values of \(\gamma _2\); the recognition rates are shown in Fig. 7. The proposed method achieves the best result with \(\gamma _2=10\), so we fixed \(\gamma _2\) to 10 in this experiment. After parameter tuning, RPSA was retrained on all training data with the fixed parameters.

Fig. 7 Performance of RPSA on validation data when fusing the modality pair (Zer, Mor) with different \(\gamma _2\)

The proposed method was compared with Canonical Correlation Analysis (CCA) (Hardoon et al. 2004), Discriminative Canonical Correlation Analysis (DCCA) (Sun et al. 2008), Partial Least Squares (PLS) (Prince et al. 2008; Rosipal and Krämer 2006), bagging CCA (bgCCA), bagging DCCA (bgDCCA), boosting CCA (bsCCA), boosting DCCA (bsDCCA) (Zhang and Zhang 2011) and Random Correlation Ensemble (RCE) (Zhang and Zhang 2011). For a fair comparison, all methods employ the nearest neighbor classifier. The competitors' results are from Table 2 in Zhang and Zhang (2011). As Table 3 shows, RPSA is clearly superior to CCA, DCCA, bgCCA, bgDCCA, bsCCA, bsDCCA and PLS. RPSA achieves better accuracy than RCE on 12 pairs, and identical accuracy on the remaining 3. Note that RCE is a sophisticated method which first finds random cross-view correlations between within-class examples and then boosts performance by ensemble learning.

Table 3 Recognition rates on multiple features data set

Our proposed RPSA is a general framework for multi-modality analysis, and it is natural to extend it to fuse features from three modalities. We conducted feature fusion over all possible triples of features and compared the results with those of Multi-View CCA (Rupnik and Shawe-Taylor 2010; Gong et al. 2014), giving \(C_3^6=20\) configurations. Due to space limitations, we report only the average of the mean recognition rates over the 20 configurations: \(0.97\) for RPSA versus \(0.93\) for Multi-View CCA. The average of the mean recognition rates of RPSA over two features is \(0.95\), which suggests that fusing more features produces better recognition results.

5.4 Cross-Pose Face Recognition

Faces observed under a particular pose can be considered as samples from one modality, so faces observed under different poses correspond to different modalities. RPSA can thus be used to recognize faces under different poses, with gallery faces in one pose and probe faces in another. Note that our method requires knowing the rough pose of each photo (i.e., to which modality it belongs), as in the work of Sharma and Jacobs (2011). The CMU PIE face database (Footnote 3) was used in our experiments. This data set consists of 68 subjects, each with face photos in 13 different poses (indexed by c27/05/29/37/11/07/09/02/14/22/34/25/31). Photos in the same pose were aligned by the eyes and mouth; all photos were cropped and down-sampled to \(48\times 40\), and each was reshaped into a column vector giving an observation \(\mathbf{x}_i\). In our experiments, subjects 1 to 34 were used as training data, and the rest as testing data. In the training phase, we learned one projection matrix for each modality; the learned latent common space had a dimension of 25. In the testing phase, the nearest gallery face of each probe face was found in the learned latent common space, and the recognition rates were reported.

To tune \(\gamma _2\), we randomly selected \(\frac{1}{5}\) of the training data as validation data and the rest as “training data”. The structure of the training triplets was fixed to (26,26,0,0) (given a face \(\mathbf{x}_p\), the number of faces with different labels is 26 while the number of faces with the same label as \(\mathbf{x}_p\) is 0). RPSA was used for cross-pose face recognition with c22 as gallery and c07 as query. Figure 8 plots the recognition rates for different \(\gamma _2\); RPSA is not sensitive to \(\gamma _2\) as long as \(\gamma _2>0.01\). We therefore fixed \(\gamma _2=10\) and retrained our model on all training data.

Fig. 8 Performance of RPSA over the validation set with different \(\gamma _2\)

In Table 4, we compare our method with methods using frontal faces (photos indexed by c27 in the CMU PIE data set) as gallery, in terms of mean recognition rates over different subsets of probe poses. The subsets of probe poses are the same as those in Sharma and Jacobs (2011), and the competitors' results are from Table 3 in Sharma and Jacobs (2011). RPSA outperforms all competitors. Note that TFA requires 14 user-elaborately-clicked points for photo alignment and Gabor filters for extracting complex features, whereas our method needs only 3 points (eyes and mouth) for photo alignment and directly employs the face image as a feature vector.

We also compare our method with PLS (Sharma and Jacobs 2011) and Multi-view Discriminant Analysis (MvDA) (Kan et al. 2012), which, to the best of our knowledge, report the best performance in the recent literature. Two arbitrary poses were selected as a gallery-probe pair, giving \(P^{13}_2=156\) configurations. The results are shown in Table 5; the PLS results are from Table 1 in Sharma and Jacobs (2011) and the MvDA results were obtained by running its publicly available code (Footnote 4). The proposed RPSA is much better than PLS and MvDA: it achieves the best average recognition rates for all galleries, obtaining the best results in 140 configurations and the second best in 16. RPSA is slightly worse than MvDA for the configurations (c22, c07) and (c07, c22) (c22 is a side view while c07 is a frontal view), which might be due to the big pose difference between c22 and c07.

The overall accuracy of the proposed RPSA over all gallery-probe pairs is 0.957, while those of PLS and MvDA are 0.901 and 0.922, respectively. Our method improves the state-of-the-art result by \(3.8~\%\).

In this experiment, observations from different modalities have the same dimensionality, so metric learning methods designed for a single modality can be applied to cross-pose face recognition without considering the modality difference. We evaluated the state-of-the-art metric learning method Information Theoretic Metric Learning (ITML) on this task. The average accuracies are \(0.409\), \(0.390\), \(0.547\), \(0.532\), \(0.542\), \(0.471\), \(0.583\), \(0.446\), \(0.449\), \(0.569\), \(0.463\), \(0.529\), and \(0.392\) when c34, c31, c14, c11, c29, c09, c27, c07, c05, c37, c25, c02 and c22 are used as gallery, respectively; the overall accuracy is only \(0.486\). Multi-modality analysis methods thus greatly outperform single-modality metric learning methods on cross-modality pattern recognition problems.

Table 4 Mean recognition rates for frontal faces (c27) gallery
Table 5 Recognition rates for different gallery-probe pose pairs on CMU PIE

5.5 Text-Image Retrieval

Text and image are two different modalities; using a text query to retrieve images, or an image query to retrieve texts, is a cross-modality problem that requires common representations. The proposed RPSA was validated by text-image retrieval on the Wiki Text-Image data set (Rasiwasia et al. 2010). The data set consists of 2,173 training and 693 testing image-text pairs from 10 different categories. The images are represented by 128-dimensional SIFT feature vectors, while the texts are encoded by 10-dimensional feature vectors based on a latent Dirichlet allocation model (Blei et al. 2003). In the training phase, we learned one projection matrix for each modality. In the testing phase, queries and probes were projected into the learned latent space of dimension 10, and text-image retrieval was conducted by finding the nearest neighbors of the projections of the queries. A retrieval is considered correct if the retrieved image (or text) has the same label as the query text (or image). As in Sharma and Kumar (2012), retrieval precision is evaluated at the 11 recall levels \(\{0, 0.1, 0.2, \dots , 1\}\), and the Mean Average Precision (mAP), given by \(\frac{1}{11}\sum _{i=1}^{11} precision_i\), is finally reported.
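For reference, the per-query quantity can be computed as sketched below (our illustration, assuming the standard interpolated precision at each recall level; mAP averages this value over all queries):

```python
import numpy as np

def eleven_point_ap(is_relevant, scores):
    """Average precision over recalls 0, 0.1, ..., 1 for one query, i.e.
    (1/11) * sum_i precision_i with interpolated precision."""
    order = np.argsort(-np.asarray(scores))       # rank probes by similarity
    rel = np.asarray(is_relevant, dtype=float)[order]
    hits = np.cumsum(rel)
    precision = hits / np.arange(1, len(rel) + 1)
    recall = hits / max(rel.sum(), 1.0)
    return np.mean([precision[recall >= r].max() if np.any(recall >= r) else 0.0
                    for r in np.linspace(0.0, 1.0, 11)])
```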

To tune \(\gamma _2\), \(\frac{1}{8}\) of the training data were selected as validation data and the rest served as the training set. The structure of the training triplets was fixed to be (119, 119, 119, 119) (given an image \(\mathbf{x}_p\), the number of images with the same label as \(\mathbf{x}_p\) is 119). We evaluated RPSA with text as the query on the validation data and found that the performance was best when \(\gamma _2=100\), so we fixed \(\gamma _2=100\) and retrained RPSA over all training data with the fixed parameters.
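For concreteness, the label-based triplet construction behind this structure can be sketched as follows (our own illustrative code; the exact enumeration used in the experiments may differ). An anchor is treated as more-likely-paired with same-label observations than with different-label ones:

    import numpy as np

    def label_triplets(labels_a, labels_b, rng=np.random.default_rng(0)):
        """Generate triplets (p, q, r): observation p (modality A) is
        more-likely-paired with q than with r (both modality B) whenever
        q shares p's label and r does not.  Sampling-based sketch; the
        paper fixes a structure such as (119, 119, 119, 119)."""
        triplets = []
        for p, lp in enumerate(labels_a):
            same = np.flatnonzero(labels_b == lp)
            diff = np.flatnonzero(labels_b != lp)
            for q in same:
                r = rng.choice(diff)      # one violating observation per q
                triplets.append((p, q, r))
        return triplets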

In our experiments, we found that overweighting the first type of training triplets (i.e., those where \(t_p\) is in the image modality) greatly boosts the performance of RPSA with image as the query, while overweighting the second type (i.e., those where \(t_p\) is in the text modality) boosts performance with text as the query. This might be because the learned model is too simple (the dimension of the latent common space is too low) to satisfy all training triplet constraints at the same time. The results reported in this section were obtained by overweighting the first and the second type of training triplets by a factor of 20 when image and text are used as queries respectively.
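One straightforward way to realize such overweighting is to scale each triplet's loss by a per-type weight. The sketch below uses a standard triplet hinge loss as a stand-in for the actual RPSA objective; the weight values, names, and margin are illustrative assumptions only:

    import numpy as np

    def weighted_triplet_loss(Za, Zb, triplets, types,
                              w_anchor=20.0, w_other=1.0, margin=1.0):
        """Sum of hinge losses over triplets (p, q, r): the projection of
        anchor p should be closer to q than to r by the margin.  Triplets
        whose anchor modality matches the query modality are up-weighted
        (e.g., 20x, mirroring the factor reported in the text)."""
        loss = 0.0
        for (p, q, r), t in zip(triplets, types):
            d_pq = np.sum((Za[p] - Zb[q]) ** 2)
            d_pr = np.sum((Za[p] - Zb[r]) ** 2)
            w = w_anchor if t == "query_modality" else w_other
            loss += w * max(0.0, margin + d_pq - d_pr)
        return loss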

The performance of RPSA is compared with that of the state-of-the-art multi-modality analysis techniques CCA, PLS, BLM, Semantic Matching (SM) (Rasiwasia et al. 2010), Semantic Correlation Matching (SCM) (Rasiwasia et al. 2010), Generalized Multi-view Marginal Fisher Analysis (GMMFA) (Sharma and Kumar 2012) and Generalized Multi-view LDA (GMLDA) (Sharma and Kumar 2012) in Table 6. The results of the competitors are from Table 3 in Sharma and Kumar (2012). The proposed method achieves the best mAP with image query, the best mAP with text query, and the best average mAP; specifically, RPSA improves the state-of-the-art result by \(4.7~\%\). Figure 9 shows the recall-precision curves of RPSA compared with those of CCA and PLS; RPSA performs better by a clear margin. Figure 10 shows the per-category mAP with text query obtained by CCA, PLS and RPSA; RPSA consistently outperforms its competitors in every category except sport.

Fig. 9 Comparison between recall-precision curves of CCA, PLS, and RPSA. a Recall-precision curves with image query. b Recall-precision curves with text query

Fig. 10 Comparison between the mAP of each category obtained by CCA, PLS and RPSA

Table 6 mAP on Wiki Text-Image data set

5.6 Attribute-Image Retrieval

In the previous experiments, all the training triplets were generated from category labels as shown in Table 1. In this section, we evaluate RPSA using natural relative-pairing information. The data set we used is the Public Figures Face Database (Kumar et al. 2009). In Parikh and Grauman (2011), a subset consisting of 241 training faces and 531 test faces, belonging to 8 persons, is selected to study relative attributes. The faces have 11 attributes, namely “Male”, “White”, “Young”, “Smiling”, “Chubby”, “Visible Forehead”, “Bushy Eyebrows”, “Narrow Eyes”, “Pointy Nose”, “Big Lips” and “Round Face”. Each face is encoded by a 542-d feature vector based on GIST and a color histogram extracted from its image, and also by an 11-d binary attribute vector; e.g., the attribute vector (1,1,0,0,0,0,0,0,0,0,0) indicates a white male face. In this experiment, the face image feature is considered the image modality and the attribute code the attribute modality. We learn an 11-d latent common space between the image modality and the attribute modality, and then retrieve images (attributes) with an attribute (image) as the query. The evaluation measure is the same as that for text-image retrieval in Sect. 5.5.

Unlike the text-image retrieval experiment, we used natural training triplets instead of triplets generated from category labels. In Parikh and Grauman (2011), each image is assigned a relative strength for each attribute. If face image \(a\) has a higher strength for the \(k\)th attribute than image \(b\), then image \(a\) is more likely paired with the attribute code \(c\) than image \(b\), where \(c\) is an 11-d zero vector except for the \(k\)th element, which is 1. Figure 11 shows an example, and a minimal code sketch of this construction follows the figure. We obtained 73,470 training triplets by enumerating all attribute comparisons given in the data set.

Fig. 11 Illustration of a training triplet generated with relative attributes; the arrow symbol in the figure indicates the more-likely-paired relation. Face \(a\) is more smiling than face \(b\); therefore, \(a\) is more likely paired with the smiling (fourth) attribute code \(c\) than \(b\) is
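The construction just described can be sketched as follows (our own illustration; the comparison data format is an assumption). Each annotated comparison "image \(a\) exceeds image \(b\) on attribute \(k\)" yields one triplet pairing \(a\) more strongly than \(b\) with the one-hot code for attribute \(k\):

    import numpy as np

    def attribute_triplets(comparisons, n_attributes=11):
        """comparisons: iterable of (a, b, k) meaning image a has higher
        strength than image b for the k-th attribute.  Each comparison
        yields one triplet: a is more likely paired than b with the
        one-hot attribute code c for attribute k."""
        triplets = []
        for a, b, k in comparisons:
            c = np.zeros(n_attributes)
            c[k] = 1.0                    # one-hot code for attribute k
            triplets.append((a, b, c))    # a ~ c preferred over b ~ c
        return triplets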

We set \(\overline{n}_1=1\) and \(\overline{n}_2=\overline{n}_3=0\), as attribute-image retrieval is a cross-modality retrieval problem. To tune \(\gamma _2\), \(\frac{1}{8}\) of the training data were selected as validation data and the rest served as the training set. We found that RPSA performed best with \(\gamma _2=100\), so \(\gamma _2\) was fixed to \(100\). In order to evaluate RPSA trained only on natural relative-pairing information, we also report the performance of RPSA without training pairs (denoted RPSA\(_n\)), i.e., with \(\mathcal {P}=\emptyset \).

RPSA is compared with CCA and PLS; the results are listed in Table 7. RPSA outperforms its competitors by a large margin, improving the results of CCA and PLS by \(129.7~\%\) and \(91.3~\%\) respectively in terms of the average mAP. RPSA\(_n\) is inferior to RPSA. The reason is that the training pairs used by RPSA encourage the projections of exactly-paired observations from different modalities to be identical, which is important in cross-modality pattern recognition and retrieval, as discussed in Sect. 5.2. Nevertheless, RPSA\(_n\) is still superior to CCA and PLS. Note that both RPSA and RPSA\(_n\) are trained without category labels.

Table 7 mAP on public figures face database

6 Conclusion and Future Work

In this paper, we have proposed a framework called Relatively-Paired Space Analysis (RPSA) which automatically learns a latent common space between multiple modalities from relatively-paired observations. Relative-pairing can capture more general semantic relationships between observations than absolute-pairing, and allows easy integration of label information. Theoretically, RPSA is a discriminative model that assumes no parameter or noise distribution, and it is a general framework applicable to any cross-modality pattern recognition task. We have evaluated the performance of RPSA by applying it to feature fusion, cross-pose face recognition, text-image retrieval and attribute-image retrieval. Experimental results show that RPSA outperforms other state-of-the-art techniques, some of which are tailored to the particular problems. In future work, we would like to extend RPSA to a nonlinear version.