
1 Introduction

Feeding our insatiable appetite for content, multiple data sources surround us. Data from a single source is often not rich enough, and users look for information across multiple sources and modalities. The research community has focused on data mining and analysis of single data sources, while limited work addresses the issues arising from the joint analysis of multiple data sources. This creates an open opportunity to develop formal frameworks for analyzing multiple data sources, exploiting their common properties to strengthen data analysis and mining. Discovering patterns from multiple data sources often reveals information, such as commonalities and differences, that is not accessible through isolated analysis. This information is valuable for various data mining and representation tasks.

As an example, consider social media. Entirely new genres of media have been created around the idea of participation, including wikis (e.g. Wikipedia), social networks (e.g. Facebook), media communities (e.g. YouTube), news aggregators (e.g. Digg), and blogs and micro-blogs (e.g. Blogspot, Twitter). These applications are significant because they are often ranked highest by traffic volume and attention. Modeling collective data across semantically similar yet disparate sources is critical for social media mining and retrieval tasks. Two open questions arise: how can we effectively analyze such disparate data sources together, exploiting their mutual strengths to improve data mining tasks? And can we establish the correspondence or similarity of items in one data source with items in other data sources?

This chapter attempts to address these questions and develops a framework to model multiple data sources jointly, exploiting their mutual strengths while retaining their individual knowledge. To analyze multiple disparate data sources jointly, a unifying piece of metadata, textual tags, is used. Although we use textual tags in this work, any other feature that unifies the disparate data sources can be used. Textual tags are rich in semantics [5, 15] as they are meant to provide a higher-level description of the data, and they are freely available for disparate data types, e.g. images, videos, blogs and news. However, these tags cannot be used directly to build useful applications due to their noisy characteristics. The lack of constraints during their creation is part of their appeal, but as a consequence they become ambiguous, incomplete and subjective [5, 15], leading to poor performance in data mining tasks. Work on tagging systems has mainly aimed at refining tags by determining their relevance and recommending additional tags [15, 18] to reduce ambiguity. But the performance of these techniques is bounded by the information content and noise characteristics of the tagging source in question, which can vary wildly depending on many factors, including the design of the tagging system and the uses to which its users put it. To reduce tag ambiguity, the use of auxiliary data sources alongside the domain of interest is suggested in Ref. [10]. The intuition behind the joint analysis of multiple data sources is that combined tagging knowledge tends to reduce the subjectivity of tags [7], as multiple related sources often provide complementary knowledge and strengthen one another.

Departing from single-source methods, we formulate a novel framework that leverages tags as the unifying metadata across multiple disparate data sources. The key idea is to model the data subspaces in a way that allows flexibility in representing the commonalities whilst retaining the individual differences. Retaining the individual differences of each data source is crucial when dealing with heterogeneous multiple data sources, as ignoring this aspect may lead to negative knowledge transfer [6]. Our proposed framework is based on nonnegative matrix factorization (NMF) [11] and provides shared and individual basis vectors wherein the tags of each media object can be represented by a linear combination of shared and individual topics. We extend NMF to enable joint modeling of multiple data sources, deriving common and individual subspaces.

Fig. 1

Some possible sharing configurations for \(n=3\) datasets. In this chapter, we consider a Chain sharing, b Pairwise sharing and c General sharing

Pairwise analysis using two data sources has been considered in our previous works [7, 8]. However, these works are limited and unusable for many real-world applications where one needs to include several auxiliary sources to achieve a more meaningful improvement in performance, as shown in this chapter. Furthermore, extension to multiple data sources requires efficient learning of arbitrarily shared subspaces, which is nontrivial and fundamentally different from Refs. [7, 8]. For example, consider three sources \(D_{1}\), \(D_{2}\) and \(D_{3}\); jointly modeling (\(D_{1},D_{2},D_{3}\)) is different from pairwise modeling (\(D_{1},D_{2}\)) or (\(D_{2},D_{3}\)). Figure 1 depicts examples of the possible sharing configurations (refer to Sect. 3) for three data sources. We note that the frameworks considered in Refs. [7, 8] cannot handle the sharing configuration of Fig. 1c. To demonstrate the effectiveness of our approach, we apply the proposed model to two real-world tasks, improving social media retrieval using auxiliary sources and cross-social media retrieval, using three disparate data sources (Flickr, YouTube and Blogspot). Our main contributions are:

  • A joint matrix factorization framework, along with an efficient algorithm, for the extraction of shared and individual subspaces across an arbitrary number of data sources. We provide a complexity analysis of the learning algorithm and guarantee its convergence via a proof (in Sect. 3 and the Appendix).

  • We further develop algorithms for social media retrieval in a multi-task learning setting and for cross-social media retrieval (in Sect. 4).

  • Two real-world demonstrations of the proposed framework using three representative social media sources: blogs (Blogspot.com), photos (Flickr) and videos (YouTube) (in Sect. 5).

By permitting differential amounts of sharing in the subspaces, our framework can transfer knowledge across multiple data sources and can thus be applied in a much wider context: it is appropriate wherever one needs to exploit knowledge across multiple related data sources while avoiding negative knowledge transfer. In the social media context, it provides an efficient means to mine multimedia data and to partly transcend the semantic gap by exploiting the diversity of rich tag metadata across many media domains.

2 Related Background

Previous work on shared subspace learning has mainly focused on supervised or semi-supervised learning. Ando and Zhang [1] propose structure learning to discover predictive structures shared by multiple classification problems, improving performance on the target data source in a transfer learning setting. Yan et al. [20] propose a multi-label learning algorithm called model-shared subspace boosting, which reduces information redundancy in learning by combining a number of base models across multiple labels. Ji et al. [9] learn a common subspace shared among multiple labels to extract shared structures for a multi-label classification task. In a transfer learning work [6], Gu and Zhou propose a framework for multi-task clustering that learns a common subspace among all tasks and uses it for transductive transfer classification. A limitation of their framework is that it learns a single subspace shared by all tasks, which often violates data faithfulness in many real-world scenarios. Si et al. [17] propose a family of transfer subspace learning algorithms based on a regularization that minimizes the Bregman divergence between the distributions of the training and test samples. Though fairly generic for the domain adaptation setting, this approach is not directly applicable to multi-task learning and does not model multiple data sources. In contrast to the above works, our proposed framework not only provides varying levels of sharing but is flexible enough to support arbitrary sharing configurations over any combination of multiple data sources (tasks).

Our proposed shared subspace learning method is formulated under the framework of NMF. NMF is chosen to model the tags (tags are essentially textual keywords) due to its success in text mining applications [3, 16, 19]. An important characteristic of NMF is that it yields a parts-based representation of the data.

Previous approaches to cross-media retrieval [21, 22] use the concept of a Multimedia Document (MMD), a set of co-occurring multimedia objects of different modalities carrying the same semantics. Two multimedia objects can be regarded as context for each other if they belong to the same MMD, and the combination of content and context is thus used to overcome the semantic gap. However, this line of research depends on co-occurring multimedia objects, which may not always be available.

3 Multiple Shared Subspace Learning

3.1 Problem Formulation

In this section, we describe a framework for learning the individual as well as arbitrarily shared subspaces of multiple data sources. Let \(\mathbf {X}_{1},\ldots ,\mathbf {X}_{n}\) represent the feature matrices constructed from a set of \(n\) data sources. For example, \(\mathbf {X}_{1},\ldots ,\mathbf {X}_{n}\) can be user-item rating matrices in a collaborative filtering application (each row corresponds to a user, each column to an item, and the features are user ratings) or term-document matrices in a tag-based social media retrieval application (each row corresponds to a tag, each column to an item, and the features are the usual tf-idf weights [2]). Given \(\mathbf {X}_{1},\ldots ,\mathbf {X}_{n}\), we decompose each data matrix \(\mathbf {X}_{i}\) as a product of two matrices \(\mathbf {X}_{i}=\mathbf {W}_{i}\cdot \mathbf {H}_{i}\) such that the subspace spanned by the columns of \(\mathbf {W}_{i}\) explicitly represents arbitrary sharing among the \(n\) data sources through shared subspaces while preserving the individual subspace of each source. For example, when \(n=2\), we create three subspaces: a shared subspace spanned by the matrix \(W_{12}\) and two individual subspaces spanned by the matrices \(W_{1},W_{2}\). Formally,

$$\begin{aligned} \mathbf {X}_{1}&=\underbrace{\left[ W_{12}\mid W_{1}\right] }_{\mathbf {W}_{1}}\underbrace{\left[ \begin{array}{c} H_{1,12}\\ \\ H_{1,1}\end{array}\right] }_{\mathbf {H}_{1}}=W_{12}\cdot H_{1,12}+W_{1}\cdot H_{1,1}\end{aligned}$$
(1)
$$\begin{aligned} \mathbf {X}_{2}&=\underbrace{\left[ W_{12}\mid W_{2}\right] }_{\mathbf {W}_{2}}\underbrace{\left[ \begin{array}{c} H_{2,12}\\ \\ H_{2,2}\end{array}\right] }_{\mathbf {H}_{2}}=W_{12}\cdot H_{2,12}+W_{2}\cdot H_{2,2} \end{aligned}$$
(2)

Notation-wise, we use bold capital letters \(\mathbf {W}\), \(\mathbf {H}\) to denote the decomposition at the data source level and normal capital letters \(W\), \(H\) to denote its constituent subspace blocks. In the above expressions, the shared basis vectors are contained in \(W_{12}\) while the individual basis vectors are captured in \(W_{1}\) and \(W_{2}\) respectively, giving rise to the full subspace representations \(\mathbf {W}_{1}=\left[ W_{12}\mid W_{1}\right] \) and \(\mathbf {W}_{2}=\left[ W_{12}\mid W_{2}\right] \) for the two data sources. Note, however, that the encoding coefficients of each data source in the shared subspace spanned by \(W_{12}\) are different; an extra subscript makes this explicit, as in \(H_{1,12}\) and \(H_{2,12}\).
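The block structure of Eqs. (1) and (2) is easy to see numerically. The following is a minimal sketch with made-up dimensions (all sizes and the random factors are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

M = 100                 # shared vocabulary size (rows: tags)
N1, N2 = 40, 60         # items in each source
K12, K1, K2 = 5, 3, 4   # shared and individual subspace dimensions (arbitrary)

# Nonnegative factors for the two-source case of Eqs. (1)-(2)
W12 = rng.random((M, K12))                            # shared basis
W1, W2 = rng.random((M, K1)), rng.random((M, K2))     # individual bases
H1_12, H1_1 = rng.random((K12, N1)), rng.random((K1, N1))
H2_12, H2_2 = rng.random((K12, N2)), rng.random((K2, N2))

# Full factors by block augmentation: W_1 = [W12 | W1], H_1 stacked vertically
W_1 = np.hstack([W12, W1])
H_1 = np.vstack([H1_12, H1_1])
X1 = W_1 @ H_1   # identical to the sum form W12 @ H1_12 + W1 @ H1_1
```

The product of the augmented blocks reproduces the sum form on the right-hand side of Eq. (1), which is the identity the factorization relies on.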

To generalize these expressions for arbitrary \(n\) datasets, we continue with this example (\(n=2\)) and consider the power set over \(\left\{ 1,2\right\} \) given as

$$\begin{aligned} S\left( 2\right)&=\left\{ \emptyset ,\left\{ 1\right\} ,\left\{ 2\right\} ,\left\{ 1,2\right\} \right\} \end{aligned}$$

We can use the power set \(S\left( 2\right) \) to create an index set for the subscripts ‘\(1\)’, ‘\(2\)’ and ‘\(12\)’ used in matrices of Eqs. (1) and (2). This helps in writing the factorization conveniently using a summation. We further use \(S\left( 2,i\right) \) to denote the subset of \(S\left( 2\right) \) in which only elements involving \(i\) are retained, i.e.

$$\begin{aligned} S\left( 2,1\right)&=\left\{ \{1\},\{1,2\}\right\} \text { and }S\left( 2,2\right) =\left\{ \{2\},\{1,2\}\right\} \end{aligned}$$

With a slight abuse of set notation, we rewrite these as \(S\left( 2,1\right) =\left\{ 1,12\right\} \) and \(S\left( 2,2\right) =\left\{ 2,12\right\} \). Now, using these sets, Eqs. (1) and (2) can be re-written as

$$\begin{aligned} \mathbf {X}_{1}=\sum _{\nu \in \left\{ 1,12\right\} }W_{\nu }\cdot H_{1,\nu }&\text { and }\mathbf {X}_{2}=\sum _{\nu \in \left\{ 2,12\right\} }W_{\nu }\cdot H_{2,\nu }\end{aligned}$$

For an arbitrary set of \(n\) datasets, let \(S\left( n\right) \) denote the power set of \(\left\{ 1,2,\ldots , n\right\} \) and for each \(i=1,\ldots ,n\), let the index set associated with the \(i\)-th data source be defined as \(S\left( n,i\right) =\left\{ \nu \in S\left( n\right) \mid i\in \nu \right\} \). Our proposed joint matrix factorization for \(n\) data sources can then be written as

$$\begin{aligned} \mathbf {X}_{i}&=\mathbf {W}_{i}\cdot \mathbf {H}_{i}=\sum _{\nu \in S\left( n,i\right) }W_{\nu }\cdot H_{i,\nu }\end{aligned}$$
(3)

Our above expression is in its most generic form, considering all possible sharing configurations. In fact, the total number of subspaces equals \(2^{n}-1\), the cardinality of the power set \(S\left( n\right) \) minus the empty set \(\emptyset \). We consider this generic form in this chapter. However, our framework also applies directly when the index set \(S\left( n,i\right) \) is customized to any combination of sharing one wishes to model. Figure 1 illustrates some of the possible scenarios when there are three data sources (\(n=3\)).
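The index sets \(S(n)\) and \(S(n,i)\) can be generated mechanically. A small sketch (the function names are ours, not the chapter's):

```python
from itertools import combinations

def sharing_index_sets(n):
    """All nonempty subsets of {1, ..., n}: one subspace W_nu per subset,
    2**n - 1 in total."""
    return [frozenset(c) for r in range(1, n + 1)
            for c in combinations(range(1, n + 1), r)]

def S(n, i):
    """Index set S(n, i) of Eq. (3): the subsets that contain source i."""
    return [nu for nu in sharing_index_sets(n) if i in nu]

# n = 2 reproduces S(2, 1) = {1, 12} and S(2, 2) = {2, 12};
# n = 3 yields the 2**3 - 1 = 7 subspaces of the general sharing of Fig. 1c.
```

Restricting a source's list to a subset of `S(n, i)` realizes the customized sharing configurations (e.g. chain or pairwise sharing) mentioned above.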

If we explicitly list the elements of \(S\left( n,i\right) \) as \(S\left( n,i\right) =\left\{ \nu _{1},\nu _{2},\ldots ,\nu _{Z}\right\} \) then \(\mathbf {W}_{i}\) and \(\mathbf {H}_{i}\) are

$$\begin{aligned} \mathbf {W}_{i}=\left[ W_{\nu _{1}}\mid W_{\nu _{2}}\mid \ldots \mid W_{\nu _{Z}}\right] \text {,}\quad \mathbf {H}_{i}=\left[ \begin{array}{c} H_{i,\nu _{1}}\\ \vdots \\ H_{i,\nu _{Z}}\end{array}\right] \end{aligned}$$
(4)

3.2 Learning and Optimization

Our goal is to achieve a sparse parts-based representation of the subspaces; therefore, we impose nonnegativity constraints on \(\left\{ \mathbf {W}_{i},\mathbf {H}_{i}\right\} _{i=1}^{n}\). We formulate an optimization problem to minimize the Frobenius norm of the joint decomposition error. The objective function, accumulating the normalized decomposition error across all data matrices, is given as

$$\begin{aligned} J\left( \mathbf {W},\mathbf {H}\right)&=\frac{1}{2}\left\{ \sum _{i=1}^{n}\lambda _{i}\left\| \mathbf {X}_{i}-\mathbf {W}_{i}\cdot \mathbf {H}_{i}\right\| _{F}^{2}\right\} \nonumber \\&=\frac{1}{2}\left\{ \sum _{i=1}^{n}\lambda _{i}\left\| \mathbf {X}_{i}-\sum _{\nu \in S\left( n,i\right) }W_{\nu }\cdot H_{i,\nu }\right\| _{F}^{2}\right\} \end{aligned}$$
(5)

where \(\left\| .\right\| _{F}\) is the Frobenius norm and \(\lambda _{i}\triangleq \left\| \mathbf {X}_{i}\right\| _{F}^{-2}\) is the normalizing factor for data \(\mathbf {X}_{i}\). Thus, the final optimization is given as

$$\begin{aligned} \text {minimize }J\,\left( \mathbf {W},\mathbf {H}\right) \,\text {subject to }W_{\nu },H_{i,\nu }\ge 0\text { for all }1\le i\le n\text { and }\nu \in S\left( n,i\right) \end{aligned}$$
(6)

where \(J\left( \mathbf {W},\mathbf {H}\right) \) is defined as in Eq. (5). A few approaches are available for solving this nonnegatively constrained optimization problem, such as gradient-descent-based multiplicative updates [11] or projected gradients [12]. We found that optimizing \(J\left( \mathbf {W},\mathbf {H}\right) \) using multiplicative updates provides a good trade-off between automatic selection of the gradient-descent step size and fast convergence on both synthetic and real datasets, and therefore they are used in this chapter. Expressing the objective function element-wise, we shall show that multiplicative update equations for \(W_{\nu }\) and \(H_{i,\nu }\) can be formulated as efficiently as in standard NMF [11]. Since the cost function of Eq. (5) is jointly non-convex in all \(W_{\nu }\) and \(H_{i,\nu }\), the multiplicative updates lead to a local minimum. However, unlike NMF, this problem is less ill-posed due to the constraints imposed by the common matrices in the joint factorization. The gradient of the cost function in Eq. (5) w.r.t. \(W_{\nu }\) is given by

$$\begin{aligned} \nabla _{W_{\nu }}J\left( \mathbf {W},\mathbf {H}\right)&=\sum _{i\in \nu }\lambda _{i}\left[ -\mathbf {X}_{i}H_{i,\nu }^{\mathsf {T}}+\mathbf {X}_{i}^{\left( t\right) }H_{i,\nu }^{\mathsf {T}}\right] \end{aligned}$$

where \(\mathbf {X}_{i}^{\left( t\right) }\) is defined as

$$\begin{aligned} \mathbf {X}_{i}^{\left( t\right) }&=\sum _{\nu \in S\left( n,i\right) }W_{\nu }\cdot H_{i,\nu }\end{aligned}$$
(7)

Using Gradient-Descent optimization, we update matrix \(W_{\nu }\) as the following

$$\begin{aligned} \left( W_{\nu }\right) _{lk}^{t+1}\leftarrow \left( W_{\nu }\right) _{lk}^{t}+\eta _{\left( W_{\nu }\right) _{lk}^{t}}\left( -\nabla _{\left( W_{\nu }\right) _{lk}^{t}}J\left( \mathbf {W},\mathbf {H}\right) \right) \end{aligned}$$
(8)

where \(\eta _{\left( W_{\nu }\right) _{lk}^{t}}\) is the optimization step-size and given by

$$\begin{aligned} \eta _{\left( W_{\nu }\right) _{lk}^{t}}=\frac{\left( W_{\nu }\right) _{lk}^{t}}{\sum \limits _{i\in \nu }\lambda _{i}\left( \mathbf {X}_{i}^{\left( t\right) }H_{i,\nu }^{\mathsf {T}}\right) _{lk}^{t}} \end{aligned}$$
(9)

In Appendix, we prove that the updates in Eq. (8) when combined with step-size of Eq. (9), converge to provide a locally optimum solution of the optimization problem (6). Plugging the value of \(\eta _{\left( W_{\nu }\right) _{lk}^{t}}\) from Eq. (9) in Eq. (8), we obtain the following multiplicative update equation for \(W_{\nu }\)

$$\begin{aligned} \left( W_{\nu }\right) _{lk}\leftarrow \left( W_{\nu }\right) _{lk}\frac{\left( \sum \limits _{i\in \nu }\lambda _{i}\mathbf {X}_{i}\cdot H_{i,\nu }^{\mathsf {T}}\right) _{lk}}{\left( \sum \limits _{i\in \nu }\lambda _{i}\mathbf {X}_{i}^{(t)}\cdot H_{i,\nu }^{\mathsf {T}}\right) _{lk}} \end{aligned}$$
(10)

Multiplicative updates for \(H_{i,\nu }\) can be obtained similarly and given by

$$\begin{aligned} \left( H_{i,\nu }\right) _{km}\leftarrow \left( H_{i,\nu }\right) _{km}\frac{\left( W_{\nu }^{\mathsf {T}}\cdot \mathbf {X}_{i}\right) _{km}}{\left( W_{\nu }^{\mathsf {T}}\cdot \mathbf {X}_{i}^{(t)}\right) _{km}} \end{aligned}$$
(11)

As an example, for the case of \(n=2\) data sources mentioned earlier, the update equations for the shared subspace \(W_{12}\) (corresponding to \(\nu =\left\{ 1,2\right\} \)) reduce to

$$\begin{aligned} \left( W_{12}\right) _{lk}&\leftarrow \left( W_{12}\right) _{lk}\frac{\left( \lambda _{1}\mathbf {X}_{1}\cdot H_{1,12}^{\mathsf {T}}+\lambda _{2}\mathbf {X}_{2}\cdot H_{2,12}^{\mathsf {T}}\right) _{lk}}{\left( \lambda _{1}\mathbf {X}_{1}^{\left( t\right) }\cdot H_{1,12}^{\mathsf {T}}+\lambda _{2}\mathbf {X}_{2}^{\left( t\right) }\cdot H_{2,12}^{\mathsf {T}}\right) _{lk}} \end{aligned}$$
(12)

and the update equations for the individual subspaces \(W_{1}\) (when \(\nu =\left\{ 1\right\} \)) and \(W_{2}\) (when \(\nu =\left\{ 2\right\} \)) become:

$$\begin{aligned} \left( W_{1}\right) _{lk}&\leftarrow \left( W_{1}\right) _{lk}\frac{\left( \mathbf {X}_{1}\cdot H_{1,1}^{\mathsf {T}}\right) _{lk}}{\left( \mathbf {X}_{1}^{\left( t\right) }\cdot H_{1,1}^{\mathsf {T}}\right) _{lk}}\end{aligned}$$
(13)
$$\begin{aligned} \left( W_{2}\right) _{lk}&\leftarrow \left( W_{2}\right) _{lk}\frac{\left( \mathbf {X}_{2}\cdot H_{2,2}^{\mathsf {T}}\right) _{lk}}{\left( \mathbf {X}_{2}^{\left( t\right) }\cdot H_{2,2}^{\mathsf {T}}\right) _{lk}} \end{aligned}$$
(14)

We note the intuition carried in these update equations. First, it can be verified by inspection that at the ideal convergence point, when \(\mathbf {X}_{i}=\mathbf {X}_{i}^{(t)}\), the multiplicative factors (the second term on the RHS) in these equations become unity and no further updates occur. Second, updating a particular shared subspace \(W_{\nu }\) involves only the data sources relevant to that share (the sum runs over its index set \(i\in \nu \), cf. Eq. (10)). For example, updating \(W_{12}\) in Eq. (12) involves both \(\mathbf {X}_{1}\) and \(\mathbf {X}_{2}\), whereas updating \(W_{1}\) in Eq. (13) involves only \(\mathbf {X}_{1}\); the next iteration takes the joint decomposition into account and regularizes the parameters via Eq. (7). From this point onwards, we refer to our framework as Multiple Shared Nonnegative Matrix Factorization (MS-NMF).
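The update rules of Eqs. (7), (10) and (11) translate almost directly into code. Below is a minimal numpy sketch under our own naming (subspaces are keyed by frozensets); initialization, normalization and stopping criteria are deliberately simplified:

```python
import numpy as np

def ms_nmf(X, index_sets, K, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for MS-NMF, Eqs. (10)-(11).

    X          : dict {i: nonnegative (M, N_i) matrix}, i = 1..n
    index_sets : dict {i: list of frozensets nu with i in nu}, i.e. S(n, i)
    K          : dict {nu: subspace dimension K_nu}
    Returns W = {nu: (M, K_nu)} and H = {(i, nu): (K_nu, N_i)}.
    """
    rng = np.random.default_rng(seed)
    M = next(iter(X.values())).shape[0]
    # lambda_i = ||X_i||_F^{-2}, the normalizing factors of Eq. (5)
    lam = {i: 1.0 / np.linalg.norm(Xi, 'fro') ** 2 for i, Xi in X.items()}
    W = {nu: rng.random((M, K[nu]))
         for nus in index_sets.values() for nu in nus}
    H = {(i, nu): rng.random((K[nu], X[i].shape[1]))
         for i, nus in index_sets.items() for nu in nus}

    def recon(i):  # X_i^(t) of Eq. (7)
        return sum(W[nu] @ H[(i, nu)] for nu in index_sets[i])

    for _ in range(n_iter):
        for nu in list(W):                         # Eq. (10)
            srcs = [i for i in X if nu in index_sets[i]]
            num = sum(lam[i] * (X[i] @ H[(i, nu)].T) for i in srcs)
            den = sum(lam[i] * (recon(i) @ H[(i, nu)].T) for i in srcs)
            W[nu] *= num / (den + eps)
        for (i, nu) in H:                          # Eq. (11)
            H[(i, nu)] *= (W[nu].T @ X[i]) / (W[nu].T @ recon(i) + eps)
    return W, H
```

The `eps` term only guards against division by zero; as the chapter notes, a shared block `W[nu]` is pulled by every source whose index set contains `nu`, while an individual block sees only its own source.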

3.3 Subspace Dimensionality and Complexity Analysis

Let \(M\) be the number of rows of each \(\mathbf {X}_{i}\) (the \(\mathbf {X}_{i}\) usually have different vocabularies, but these can be merged into a common vocabulary of \(M\) words) and \(N_{i}\) the number of columns. Then the dimensions of \(\mathbf {W}_{i}\) and \(\mathbf {H}_{i}\) are \(M\times R_{i}\) and \(R_{i}\times N_{i}\) respectively, with \(R_{i}\) the reduced dimension. Since each \(\mathbf {W}_{i}\) is an augmentation of individual and shared subspace matrices \(W_{\nu }\), we further use \(K_{\nu }\) to denote the number of columns of \(W_{\nu }\). From Eq. (4) it follows that \(\sum _{\nu \in S\left( n,i\right) }K_{\nu }=R_{i}\). The value of \(K_{\nu }\) depends on the level of sharing among the involved data sources. A rule of thumb is to use \(K_{\nu }\approx \sqrt{M_{\nu }/2}\) according to Ref. [14], where \(M_{\nu }\) is the number of features common to the data configuration specified by \(\nu \). For example, if \(\nu =\{1,2\}\), \(M_{\nu }\) equals the number of common tags between source-\(1\) and source-\(2\).
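The rule of thumb can be applied mechanically once the tag vocabularies are known. A toy sketch (the function name and the tiny tag sets are made up for illustration):

```python
from itertools import combinations
import math

def subspace_dims(tag_sets):
    """Rule-of-thumb dimensions K_nu ~ sqrt(M_nu / 2), where M_nu is the
    number of tags common to every source in nu.
    tag_sets: dict {source index: set of tag strings}."""
    n = len(tag_sets)
    dims = {}
    for r in range(1, n + 1):
        for nu in combinations(sorted(tag_sets), r):
            common = set.intersection(*(tag_sets[i] for i in nu))
            dims[nu] = max(1, round(math.sqrt(len(common) / 2)))
    return dims

# Toy vocabularies: one character stands in for one tag
dims = subspace_dims({1: set('abcdefgh'),
                      2: set('efghijkl'),
                      3: set('ghijklmn')})
```

In practice these values serve only as an initialization and are then refined by cross-validation, as described in Sect. 5.2.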

Given the above notation, the computational complexity of the MS-NMF algorithm is \(\fancyscript{O}\left( M\times N_{\max }\times R_{\max }\right) \) per iteration, where \(N_{\max }=\max _{i\in [1,n]}\left\{ N_{i}\right\} \) and \(R_{\max }=\max _{i\in [1,n]}\left\{ R_{i}\right\} \). The standard NMF algorithm [11], applied to each matrix \(\mathbf {X}_{i}\) with parameter \(R_{i}\), has a complexity of \(\fancyscript{O}\left( M\times N_{i}\times R_{i}\right) \) and a total complexity of \(\fancyscript{O}\left( M\times N_{\max }\times R_{\max }\right) \) per iteration. Therefore, the computational complexity of MS-NMF remains equal to that of standard NMF.

4 Applications

Focusing on the social media domain, we show the usefulness of the MS-NMF framework through two applications:

  1. Improving social media retrieval in one medium (the target) with the help of other auxiliary social media sources.

  2. Retrieving items across multiple social media sources.

Our key intuition in the first application is to use MS-NMF to improve retrieval by leveraging the statistical strength of tag co-occurrences through shared subspace learning while retaining the knowledge of the target medium. Intuitively, improvement is expected when the auxiliary sources share underlying structures with the target medium. Such auxiliary sources can readily be found on the Web. For cross-media retrieval, the subspace shared among multiple media provides a common representation for each medium and enables us to compute cross-media similarity between items of different media.

4.1 Improving Social Media Retrieval with Auxiliary Sources

Let \(\mathbf {X}_{k}\) be the target medium for which retrieval is to be performed. Further, assume that we have auxiliary media sources \(\mathbf {X}_{j}\), \(j\ne k\), which share some underlying structure with the target medium. We use these auxiliary sources to improve retrieval precision in the target medium. Given a set of query keywords \(S_{Q}\), a vector \(q\) of length \(M\) (the vocabulary size) is constructed by placing the tf-idf value at each index where the vocabulary contains a word from the keyword set, and zero elsewhere. Next, we follow Algorithm 1 for retrieval using MS-NMF.
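Algorithm 1 itself is not reproduced in text here, but the steps just described can be sketched in code: build \(q\), project it nonnegatively onto the target's subspace, and rank items by cosine similarity. This is our own reading under assumed names; the multiplicative nonnegative projection is a standard choice, not necessarily the chapter's exact method:

```python
import numpy as np

def query_vector(keywords, vocab, tfidf_weights):
    """q of length M: the tf-idf weight where the vocabulary word is a
    query keyword, zero elsewhere. vocab is the list of M tags."""
    return np.array([tfidf_weights.get(w, 1.0) if w in keywords else 0.0
                     for w in vocab])

def project(q, W, n_iter=100, eps=1e-9):
    """Nonnegative encoding h of q in the subspace spanned by W,
    via NMF-style multiplicative updates."""
    h = np.full(W.shape[1], 1e-2)
    for _ in range(n_iter):
        h *= (W.T @ q) / (W.T @ (W @ h) + eps)
    return h

def rank_items(q, W_k, H_k):
    """Rank items of the target medium by cosine similarity between the
    projected query and the item encodings (columns of H_k)."""
    h = project(q, W_k)
    sims = (H_k.T @ h) / (np.linalg.norm(H_k, axis=0) * np.linalg.norm(h) + 1e-12)
    return np.argsort(-sims)  # item indices, most similar first
```

Here `W_k` and `H_k` would be the augmented factors \(\mathbf {W}_{k}\) and \(\mathbf {H}_{k}\) of the target medium obtained from the joint factorization.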

Algorithm 1 (presented as a figure in the original)

4.2 Cross-Social Media Retrieval and Correspondence

Social media users assign tags to their content (blogs, images and videos) to retrieve it later and share it with other users. Often this user-generated content is associated with real-world events, e.g., travel, sports, wedding receptions, etc. In such a scenario, when users search for items from one medium, they are also interested in semantically similar items from other media to obtain more information. For example, one might be interested in retrieving ‘olympics’-related blogs, images and videos at the same time (cross-media retrieval), as together they serve the user's information need better.

A naïve method of cross-media retrieval is to match the query keywords against the tag lists of items from different media. Performance of this method is usually poor due to the weak semantic indexing caused by noisy tags, polysemy and synonymy. Subspace methods such as LSI or NMF, although robust to these problems, do not support cross-media retrieval in their standard form. Interestingly, MS-NMF addresses both issues. First, being a subspace-based method, it is less affected by noisy tags, polysemy and synonymy; second, it is well suited to cross-media retrieval, as it represents items from each medium in a common subspace, enabling us to define a cross-media similarity.

To relate items from media \(i\) and \(j\), we use the common subspace spanned by \(\mathbf {W}_{ij}\). As an example, for the three data source case illustrated in Fig. 1c, \(\mathbf {W}_{12}=\left[ W_{12}\mid W_{123}\right] \), \(\mathbf {W}_{23}=\left[ W_{23}\mid W_{123}\right] \) and \(\mathbf {W}_{13}=\left[ W_{13}\mid W_{123}\right] \). More generally, if \(S\left( n,i,j\right) \) is the set of all subsets in \(S\left( n\right) \) involving both \(i\) and \(j\), i.e. \(S\left( n,i,j\right) \triangleq \left\{ \nu \in S\left( n\right) \mid i,j\in \nu \right\} \), the common subspace between the \(i\)th and \(j\)th media, \(\mathbf {W}_{ij}\), is given by horizontally augmenting all \(W_{\nu }\) such that \(\nu \in S\left( n,i,j\right) \). Similarly, the representation of \(\mathbf {X}_{i}\) (or \(\mathbf {X}_{j}\)) in this common subspace, i.e. \(\mathbf {H}_{i,ij}\) (or \(\mathbf {H}_{j,ij}\)), is given by vertically augmenting all \(H_{i,\nu }\) (or \(H_{j,\nu }\)) such that \(\nu \in S\left( n,i,j\right) \). For \(n=3\), \(\mathbf {H}_{1,12}^{\mathsf {T}}=\left[ H_{1,12}^{\mathsf {T}}|H_{1,123}^{\mathsf {T}}\right] \), \(\mathbf {H}_{2,12}^{\mathsf {T}}=\left[ H_{2,12}^{\mathsf {T}}|H_{2,123}^{\mathsf {T}}\right] \) and so on.

Given the set of query keywords \(S_{Q}\), we prepare the query vector \(q\) as described in Sect. 4.1. Given \(q\), we wish to retrieve relevant items not only from the \(i\)th medium but also from the \(j\)th. In the language of MS-NMF, this is done by projecting \(q\) onto the common subspace matrix \(\mathbf {W}_{ij}\) to obtain its representation \(h\) in the common subspace. Next, we compute the similarity between \(h\) and the columns of the matrices \(\mathbf {H}_{i,ij}\) and \(\mathbf {H}_{j,ij}\) (the representations of the media items in the common subspace) to find similar items from media \(i\) and \(j\) respectively, and the results are ranked by these similarity scores either individually or jointly (see Algorithm 2).
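The subspace augmentation just described is purely mechanical. A sketch under our naming, where `W` and `H` are the outputs of the joint factorization keyed by frozensets as in Eq. (3):

```python
import numpy as np

def common_subspace(W, H, i, j):
    """W_ij: horizontal augmentation of all W_nu with both i and j in nu;
    H_i_ij and H_j_ij: the matching vertical augmentations of the item
    encodings for media i and j."""
    shared = sorted((nu for nu in W if i in nu and j in nu), key=sorted)
    W_ij = np.hstack([W[nu] for nu in shared])
    H_i_ij = np.vstack([H[(i, nu)] for nu in shared])
    H_j_ij = np.vstack([H[(j, nu)] for nu in shared])
    return W_ij, H_i_ij, H_j_ij
```

Cross-media retrieval then projects \(q\) onto `W_ij` (as in Sect. 4.1) and ranks the columns of `H_i_ij` and `H_j_ij` by similarity to the projected query.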

Algorithm 2 (presented as a figure in the original)

5 Experiments

5.1 Datasets

We conduct our experiments on a cross-social media dataset consisting of the textual tags of three disparate media genres: text, image and video. To create the dataset, three popular social media websites, namely Blogspot, Flickr and YouTube, were used. To obtain the data, we first queried all three websites using common concepts: ‘Academy Awards’, ‘Australian Open’, ‘Olympic Games’, ‘US Election’. To induce pairwise sharing in the data, we additionally queried Blogspot and Flickr with the concept ‘Christmas’, YouTube and Flickr with the concept ‘Terror Attacks’, and Blogspot and YouTube with the concept ‘Earthquake’. Lastly, to include some individual data for each medium, we queried Blogspot, Flickr and YouTube with the concepts ‘Cricket World Cup’, ‘Holi’ and ‘Global Warming’ respectively. The total number of unique tags (\(M\)) combined across the three datasets was 3,740. Further details of the three datasets are provided in Table 1.

Table 1 Description of Blogspot, Flickr and YouTube data sets

5.2 Parameter Setting

We denote the YouTube, Flickr and Blogspot tf-idf weighted [2] tag-item matrices (analogous to the widely known term-document matrices, generated from the tag lists) by \(\mathbf {X}_{1}\), \(\mathbf {X}_{2}\) and \(\mathbf {X}_{3}\) respectively. For the MS-NMF factorization, recall the notation \(K_{\nu }\) for the dimensionality of the subspace spanned by \(W_{\nu }\); following this notation, we use individual subspace dimensions \(K_{1}=6,K_{2}=8,K_{3}=8\), pairwise shared subspace dimensions \(K_{12}=15,K_{23}=18,K_{13}=12\) and an all-sharing subspace dimension \(K_{123}=25\). To learn these parameters, we first initialize them using the heuristic described in Sect. 3.3, based on the numbers of common and individual tags, and then cross-validate on retrieval precision.

5.3 Experiment 1: Improving Social Media Retrieval Using Auxiliary Sources

To demonstrate the usefulness of MS-NMF for the social media retrieval application, we carry out our experiments in a multi-task learning setting. Focusing on a YouTube video retrieval task, we choose YouTube as the target dataset and Blogspot and Flickr as auxiliary datasets. To perform retrieval using MS-NMF, we follow Algorithm 1.

Baseline Methods and Evaluation Measures

  • The first baseline performs retrieval by matching the query with the tag-lists of videos (using vector-space model) without learning any subspace.

  • The second baseline is retrieval based on standard NMF. The retrieval algorithm using NMF is the same as retrieval using MS-NMF, since NMF is a special case of MS-NMF with no sharing, i.e. \(\mathbf {W}_{1}=W_{1}\), \(\mathbf {H}_{1}=H_{1,1}\) and \(R_{1}=56\).

  • The third baseline is the recently proposed JS-NMF [7], which learns shared and individual subspaces but allows only one auxiliary source at a time. Therefore, we use two instances of JS-NMF: (1) with Blogspot as the auxiliary source and (2) with Flickr as the auxiliary source. Following [7], we obtained the best performance with the parameter setting \(R_{Y}=56,R_{F}=65,R_{B}=62\) and \(K_{YB}=37,K_{YF}=40,K_{BF}=43\), where \(R_{Y},R_{F},R_{B}\) are the total subspace dimensionalities of YouTube, Flickr and Blogspot respectively and \(K_{YB},K_{YF},K_{BF}\) are the shared subspace dimensionalities.

To compare the above baselines with the proposed MS-NMF, we use precision-scope (P@N), mean average precision (MAP) and 11-point interpolated precision-recall [2]. The performance of MS-NMF is compared with the baselines by averaging the retrieval results over a query set of 20 concepts given by \(\mathbb {Q}=\) {‘beach’, ‘america’, ‘bomb’, ‘animal’, ‘bank’, ‘movie’, ‘river’, ‘cable’, ‘climate’, ‘federer’, ‘disaster’, ‘elephant’, ‘europe’, ‘fire’, ‘festival’, ‘ice’, ‘obama’, ‘phone’, ‘santa’, ‘tsunami’}.
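The evaluation measures are standard. For concreteness, P@N and (non-interpolated) average precision over a ranked binary relevance list can be computed as follows (a sketch; the toy relevance lists are made up):

```python
import numpy as np

def precision_at_n(ranked_relevance, n):
    """P@N: fraction of relevant items among the top N retrieved.
    ranked_relevance: list of 0/1 flags in rank order."""
    return sum(ranked_relevance[:n]) / n

def average_precision(ranked_relevance):
    """AP: mean of P@k over every rank k that holds a relevant item."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    n_rel = sum(ranked_relevance)
    return total / n_rel if n_rel else 0.0

# MAP over a query set: mean of the per-query AP values
queries = [[1, 0, 1, 0], [0, 1, 1]]          # toy relevance judgments
MAP = np.mean([average_precision(q) for q in queries])
```

The 11-point interpolated precision-recall curve used in Fig. 2 additionally takes, at each of the recall levels 0.0, 0.1, ..., 1.0, the maximum precision achieved at any recall at or above that level.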

Experimental Results

Figure 2 compares the retrieval performance of MS-NMF with the three baselines in terms of the evaluation criteria mentioned above. MS-NMF clearly outperforms the baselines on all three criteria. Since the tag-based matching method does not learn any subspaces, its performance suffers from the ‘polysemy’ and ‘synonymy’ problems prevalent in the tag space. NMF, being a subspace learning method, performs better than tag-based matching, but not as well as the shared subspace methods (JS-NMF and MS-NMF), as it is unable to exploit knowledge from the auxiliary sources. Comparing JS-NMF with MS-NMF, we see that MS-NMF clearly outperforms both settings of JS-NMF. This is because JS-NMF is limited to a single auxiliary source and cannot exploit the knowledge available in multiple data sources. Although JS-NMF, using one auxiliary source at a time, improves over NMF, the real strength of the three media sources is exploited by MS-NMF, which performs best among all methods. The better performance of MS-NMF can be attributed to the shared subspace model finding better term co-occurrences and reducing tag subjectivity by exploiting knowledge across all three data sources. Further insight into this improvement is provided through the entropy and impurity results in Sect. 5.5.

Fig. 2

YouTube retrieval results with Flickr and Blogspot as auxiliary sources a Precision-Scope and MAP b 11-point interpolated Precision-Recall; for tag-based matching (baseline 1), standard NMF (baseline 2), JS-NMF [7] with Blogspot (baseline 3a); with Flickr (baseline 3b) and proposed MS-NMF

5.4 Experiment 2: Cross-Social Media Retrieval

For the cross-media retrieval experiments, we use the same dataset as in the first experiment but choose more appropriate baselines and evaluation measures. Subspace learning using MS-NMF remains the same, as the factorization is carried out on the same dataset with the same parameter setting. We follow Algorithm 2, which utilizes the MS-NMF framework to return a ranked list of cross-media items.
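Algorithm 2 itself is not reproduced in this section; the sketch below illustrates the underlying idea under the assumption that items from different media are compared through the coordinates they receive on the *shared* basis vectors only, so that, e.g., a YouTube video can be matched against Flickr photos:

```python
# Hedged sketch of cross-media ranking: a query item from one medium is
# matched against items of another medium using only their coordinates on
# the shared subspace bases, which both media have in common.
import numpy as np

def cross_media_rank(h_query, H_shared_other, top_n=5):
    """h_query: shared-subspace coordinates of the query item (K_shared,);
    H_shared_other: K_shared x n_items shared coordinates of the other
    medium's items. Returns item indices ranked by cosine similarity."""
    H = H_shared_other / (np.linalg.norm(H_shared_other, axis=0, keepdims=True) + 1e-12)
    q = h_query / (np.linalg.norm(h_query) + 1e-12)
    return np.argsort(-(q @ H))[:top_n]
```

Restricting the comparison to shared bases is what makes the coordinates of the two media commensurable; individual bases capture medium-specific variation and are excluded.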

Baseline Methods and Evaluation Measures

To assess the effectiveness of MS-NMF for cross-media retrieval, the first baseline is tag-based matching performed in a typical vector-space model setting. The second baseline is the framework of Ref. [13], in which a single subspace is fully shared among the three media without retaining any individual subspaces; we denote this baseline by \(\mathrm {LIN\_ETAL09}\). We present cross-media results both pair-wise and across all three media. For the pair-wise results, we use JS-NMF [7] as a third baseline (its subspace learning remains the same as in the first experiment), applying it to each media pair.

To evaluate our cross-media algorithm, we again use the P@N, MAP and 11-point interpolated precision-recall measures. To state these measures explicitly for cross-media retrieval, we define precision and recall in the cross-media scenario. Consider a query term \(q\in \mathbb {Q}\) and let its ground truth set for the \(i\)th medium be \(G_{i}\). If a retrieval method, given query \(q\), returns an answer set \(A_{i}\) from the \(i\)th medium, the precision and recall measures across \(n\) media are defined as

$$\begin{aligned} \text {Precision}=\frac{\sum \limits _{i=1}^{n}\left| A_{i}\cap G_{i}\right| }{\sum \limits _{i=1}^{n}\left| A_{i}\right| },\quad \text {Recall}=\frac{\sum \limits _{i=1}^{n}\left| A_{i}\cap G_{i}\right| }{\sum \limits _{i=1}^{n}\left| G_{i}\right| } \end{aligned}$$
(15)
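Equation (15) pools the hits and set sizes over the \(n\) media before dividing, so a single query is scored jointly against the answer and ground-truth sets of every medium:

```python
# Cross-media precision and recall from Eq. (15): intersection counts and
# set sizes are summed over the n media before forming the ratios.
def cross_media_pr(answer_sets, truth_sets):
    """answer_sets, truth_sets: one set per medium, aligned by index."""
    hits = sum(len(set(A) & set(G)) for A, G in zip(answer_sets, truth_sets))
    n_ans = sum(len(A) for A in answer_sets)
    n_rel = sum(len(G) for G in truth_sets)
    precision = hits / n_ans if n_ans else 0.0
    recall = hits / n_rel if n_rel else 0.0
    return precision, recall
```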

Experimental Results

Cross-media retrieval results for media pairs are shown in Fig. 3, whereas those across all three media (Blogspot, Flickr and YouTube) are shown in Fig. 4. To generate the graphs, we average the retrieval results over the same query set \(\mathbb {Q}\) as defined for the YouTube retrieval task in Sect. 5.3. It can be seen from Fig. 3 that MS-NMF significantly outperforms all baselines, including JS-NMF, on the cross-media retrieval task for each media pair. This performance improvement is consistent across all three evaluation measures. Note that, to learn the subspaces, MS-NMF uses data from all three media, whereas JS-NMF uses data only from the media pair being considered. This ability to exploit knowledge from multiple media helps MS-NMF achieve better performance. When retrieval precision and recall are calculated across all three media domains, MS-NMF still performs better than both tag-based matching and \(\mathrm {LIN\_ETAL09}\). Note that JS-NMF cannot be applied to three media simultaneously.

Fig. 3

Pairwise cross-media retrieval results: Blogspot–Flickr (first row) (a, b), Blogspot–YouTube (second row) (c, d) and Flickr–YouTube (third row) (e, f); for tag-based matching (baseline 1), \(\mathrm {LIN\_ETAL09}\) [13] (baseline 2), JS-NMF [7] (baseline 3) and MS-NMF

5.5 Topical Analysis

To provide further insight into the benefits achieved by MS-NMF, we examine the results at the topical level. Every basis vector of the subspace, when normalized to sum to one, can be interpreted as a topic. We define a metric for measuring the impurity of a topic as

$$\begin{aligned} P\left( T\right) =\frac{1}{L\left( L-1\right) }\sum _{\underset{x\ne y}{x,y}}\text {NGD}\left( t_{x},t_{y}\right) \end{aligned}$$
(16)

where \(L\) denotes the number of tags in topic \(T\) whose corresponding basis vector elements are greater than a threshold, and \(\text {NGD}\left( t_{x},t_{y}\right) \) is the Normalized Google Distance [4] between tags \(t_{x}\) and \(t_{y}\).
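Equation (16) averages NGD over all ordered pairs of the topic's \(L\) retained tags. A minimal sketch, with a pluggable `ngd` function since real NGD values come from web search hit counts (which we cannot reproduce here); `ngd_from_counts` implements the standard NGD formula of Ref. [4] given such counts:

```python
# Topic impurity from Eq. (16): mean pairwise NGD over ordered tag pairs,
# with the 1/(L(L-1)) normalization.
import math
from itertools import permutations

def topic_impurity(topic_tags, ngd):
    """topic_tags: tags of topic T above the threshold; ngd: callable
    giving the distance between two tags."""
    L = len(topic_tags)
    if L < 2:
        return 0.0
    return sum(ngd(x, y) for x, y in permutations(topic_tags, 2)) / (L * (L - 1))

def ngd_from_counts(fx, fy, fxy, N):
    """Standard NGD [4] from page counts f(x), f(y), f(x,y) and index size N."""
    lfx, lfy, lfxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lfx, lfy) - lfxy) / (math.log(N) - min(lfx, lfy))
```

Lower impurity means the topic's tags tend to co-occur on the web, i.e. the topic is semantically coherent.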

Fig. 4

Cross-media retrieval results plotted across all three data sources (Blogspot, Flickr and YouTube) for tag-based matching (baseline 1), \(\mathrm {LIN\_ETAL09}\) [13] (baseline 2) and MS-NMF. a Precision–scope/MAP, b 11-point precision–recall curve

Fig. 5

A comparison of MS-NMF with NMF and \(\mathrm {LIN\_ETAL09}\) [13] in terms of entropy and impurity distributions. a Entropy distribution, b Impurity distribution

We compute the entropy and impurity of each subspace basis and plot their distributions as box-plots in Fig. 5. The figure shows that topics learnt by MS-NMF have, on average, lower entropy and impurity than their NMF and LIN_ETAL09 counterparts for all three datasets. Although LIN_ETAL09 can model multiple data sources, it uses a single subspace for every source without retaining their differences. As a consequence, the variabilities of the three sources get averaged out, increasing the entropy and impurity of the resulting topics. In contrast, MS-NMF, with its flexibility of partial sharing, averages the commonalities of the three data sources only up to their true extent of sharing, and thus yields purer and more compact (lower entropy) topics.
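The entropy in Fig. 5 treats each basis vector, once normalized to sum to one, as a distribution over tags; sharper (lower-entropy) topics concentrate their mass on fewer tags. A minimal sketch (the base-2 logarithm is an assumption, as the chapter does not state the base):

```python
# Shannon entropy of a topic: normalize the basis vector to a probability
# distribution over tags, then compute -sum(p * log2(p)).
import math

def topic_entropy(basis_vector):
    total = sum(basis_vector)
    probs = [v / total for v in basis_vector if v > 0]
    return -sum(p * math.log2(p) for p in probs)
```

A uniform basis vector attains the maximum entropy \(\log_{2} L\), while a vector with all mass on one tag has entropy zero.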

6 Conclusion and Future Work

We have presented a matrix factorization framework (MS-NMF) to learn individual and shared subspaces from multiple data sources, and demonstrated its application to two social media problems: improving social media retrieval by leveraging related data from auxiliary sources, and cross-media retrieval. We provided an efficient algorithm to learn the joint factorization and proved its convergence. The first application demonstrated that MS-NMF can improve retrieval in YouTube by transferring knowledge from the tags of Flickr and Blogspot; by outperforming JS-NMF [7], it justifies the need for a framework that can simultaneously model multiple data sources with arbitrary sharing. The second application showed the utility of MS-NMF for cross-media retrieval by demonstrating its superiority over existing methods on the Blogspot, Flickr and YouTube datasets. The proposed framework is quite generic and has potentially wider applicability in cross-domain data mining, e.g. cross-domain collaborative filtering and cross-domain sentiment analysis. In its current form, MS-NMF requires the shared and individual subspace dimensionalities to be obtained via cross-validation. As future work, we will formulate the joint factorization probabilistically, appealing to Bayesian nonparametric theory to infer these parameters automatically from the data.