
1 Introduction

With the rapid development of social networks, massive amounts of user-generated information (e.g., texts, images and videos) are emerging on various social media platforms. The activities people participate in and the content they produce play a significant role in analyzing their interests and preferences, which is of great importance for providing personalized recommendation and online retrieval. In particular, microblogging is now one of the most popular social media services, where people are keen on posting daily activities, sharing opinions and following hot and interesting topics. For example, Sina Weibo, a commonly used social media platform in China, has attracted a great number of users. According to the Sina Weibo Data Center, the number of monthly active users reached 222 million as of October 2015. Moreover, social media applications involve multi-modal data, where visual information is vital to strengthen the description of short texts.

In order to explore user attributes, prior works build topic models from users' previous behaviors and preferences. For example, Latent Dirichlet Allocation (LDA) [7] is a widely used generative probabilistic model for text corpora. By modifying LDA, other traditional topic models tackle the problem of short texts in social media data, such as the author topic model [19] and the twitter-user model [24]. Previous works also focus on dynamic topic models to analyze the change of topics in data streams, such as the Dynamic User Attribute Model (DUAM) [12], which models the dynamics using time windows, and the dynamic User Clustering Topic model (UCT) [27], which captures the dynamics of users' interests by integrating the interests at previous time periods with newly collected data in text streams. In addition, topic models have been proposed to explore the correlations among different modalities. For instance, mm-LDA [2] and corr-LDA [6] learn the correspondence between textual and visual information, and Cross-Media-LDA (CMLDA) [5] discovers the intrinsic correlations among multiple media types for social event summarization. Similar methods proposed by Bian et al. are presented in [3, 4].

Recently, there has been great interest in deep learning, which has succeeded in many applications. The Deep Belief Network (DBN) [11] and the Deep Boltzmann Machine (DBM) [20] are deep networks designed to model binary observations, whose hidden units are also typically restricted to be binary. Different from these conventional deep networks, the Poisson Gamma Belief Network (PGBN) [29] constructs a deep network architecture with nonnegative real hidden units and automatically tunes both the width of each layer and the depth of the network. Although PGBN learns representations of count observations, it is a unimodal network and is not applicable to the multi-modal short texts of social data. To deal with multi-modal social data, we propose a novel multi-modal User Attribute Model (mmUAM). Different from traditional user interest models that take into account only one layer of topic modeling, our model is designed to capture the correlations among multiple modalities. To facilitate this study, we collect a real dataset from Sina Weibo, on which extensive experiments show the superiority of our model over state-of-the-art methods.

The main contributions of our work are summarized as follows.

  1. We propose a novel multi-modal deep learning approach, named multi-modal User Attribute Model (mmUAM), through which we manage to automatically infer user attributes.

  2. The proposed mmUAM captures the semantic correlations between texts and images, which enables us to learn effective textual and visual representations for more comprehensive user profiling.

  3. We construct a Sina Weibo microblog dataset with multi-modal information. The promising results on this dataset demonstrate the efficacy of our proposed approach.

2 Related Work

Text-based User Profiling. With the tremendous growth of social networks, providing more accurate services for users is a tough challenge. Previous works explore user interests by extracting users' characteristics and preferences from user-generated texts on social media platforms. Generative topic models, such as LDA [7], provide an explicit representation of a document. However, such topic models fail to tackle the sparsity problem of short texts, and many variations of LDA have therefore been proposed. For example, He et al. [10] propose a modified topic model, named Bi-labeled LDA, which utilizes users' relationship information to learn interest tags. Rosen-Zvi et al. [19] extend LDA to propose the author-topic model, which models the content of documents including the author information, while Xu et al. [24] introduce a modified author-topic model, the twitter-user model, which outperforms both LDA and the author-topic model. Some other studies exploit external knowledge to enrich the semantics of short texts: Abel et al. [1] analyze Twitter activities in a semantic way by integrating Twitter posts with related news articles. Instead of introducing external knowledge, Cheng et al. [8] model the generation of word co-occurrence patterns for topic modeling in order to address the sparsity problem of short texts. However, the above topic models are mostly applied to text corpora only.

Image-based User Profiling. Deep Convolutional Neural Networks (CNNs) [13] have recently achieved great success in large-scale image feature learning. Consequently, many studies focus on building user profiles by extracting visual information. For instance, Geng et al. [9] propose a deep learning strategy to learn visual features for user profiling on Pinterest in the fashion domain. A Socially Embedded Visual Representation learning (SEVIR) approach [15] has been proposed to capture semantics and user intentions by learning image representations, which tackles the sparsity and unreliability problems. Li et al. [14] construct a Gaussian relational topic model that utilizes user-shared images to infer users' interests. Moreover, a pinboard recommendation system for Twitter users is presented in [25], which combines two different social media platforms in order to recommend more relevant and interesting topics to users. The way of exploiting user-tagged Web images for video indexing is studied in [26]. Although visual information is clearly significant for exploring user interests, more works should take multiple modalities into account.

Multi-modal User Profiling. As more and more social media data integrates texts, images and videos, many works have shifted their focus to dealing with multi-modal data. In [6], correspondence Latent Dirichlet Allocation (corr-LDA) is a hierarchical probabilistic mixture model that describes the correlations between images and annotations. Similarly, multi-modal Latent Dirichlet Allocation (mm-LDA) [2] learns the joint distribution of images and their associated texts, which is used for social relation mining. In addition, Bian et al. [5] present a novel probabilistic model, named Cross-Media-LDA (CMLDA), which explores intrinsic correlations between texts and images for multimedia microblog summarization.

Besides, some deep networks have been constructed to learn features across multiple modalities. Ngiam et al. [17] propose a cross-modality deep learning method based on Restricted Boltzmann Machines (RBMs). Subsequently, Srivastava et al. [23] propose a multi-modal Deep Boltzmann Machine (DBM) for images and texts, constructed from an image-specific two-layer DBM that uses Gaussian RBMs and a text-specific two-layer DBM that utilizes the Replicated Softmax model. Similarly, Pang et al. [18] use a multi-modal DBN to learn joint representations of the visual, auditory and textual features of user-generated web videos. The Deep Belief Network (DBN) presented in [22] also creates a joint representation for texts and images; it differs from the DBM in that the DBN is a directed model.

Nevertheless, the hidden units of DBMs and DBNs are typically restricted to be binary, and these multi-modal deep learning approaches have not been successfully applied to real datasets for social services. We instead learn features from large-scale real multi-modal data and construct multi-modal deep networks that tackle the sparsity of short texts, so as to explore more relevant interests that meet users' demands. To achieve better inference for our proposed deep topic model, we employ upward-downward Gibbs sampling.

3 The Proposed Model

3.1 Overview

We employ the conventional bag-of-words method on the texts and images to automatically infer user attributes. To construct our model, we utilize Sina Weibo data consisting of user-generated short texts and corresponding images. Firstly, we extract raw features of both texts and images as bags of words and bags of visual words, respectively. Formally, each document follows two different topic distributions, one per modality. Note that \( \varvec{\varTheta} \), a latent variable shared between the visual and textual modalities, is the concatenation of the textual hidden units \( \varvec{\theta_{w-j}} \) and the visual hidden units \( \varvec{\theta_{v-j}} \). Topics \( \varvec{\varPhi}_{w} \) are specific to the textual modality and topics \( \varvec{\varPhi}_{v} \) to the visual modality. We then build our proposed mmUAM as a deep network with five layers; the performance of multi-modal fusion at the five different layers is presented in Sect. 4.3. As we use probabilistic models, upward-downward Gibbs sampling [29] is adopted to infer the various parameters.

Fig. 1. The graphical illustration of the proposed mmUAM: (a) the mmUAM hierarchical model; (b) a representation of layer t = 1 in the mmUAM.

As we all know, microblogs usually consist of short texts and relevant images, where each text is restricted to 140 characters. Thus, each document is a piece of microblog composed of textual content, visual content, or a mixture of both. In particular, the observation of an image in the jth document is represented as a multivariate vector of visual-word counts, denoted \( v_j \); similarly, the observation of a text in the jth document is a word-count vector \( w_j \). The correlations between the \( K_0 \) features of \( (v_1, v_2,...,v_J) \) are captured by the columns of \( \varvec{\varPhi}_{v} \), and likewise the correlations between the \( K_0 \) features of \( (w_1, w_2,...,w_J) \) are captured by the columns of \( \varvec{\varPhi}_{w} \). Note that \( \varvec{\varTheta}_j\in R_{+}^{K_{t}} \) collects the \( K_t \) hidden units of the jth document. We use the Poisson likelihood to connect the observed textual content \( w_j^{(1)}\in Z^{K_0} \) (visual content \( v_j^{(1)}\in Z^{K_0} \)) to the product \( \varvec{\varPhi}_{w}^{(1)}\varvec{\theta}_{w-j}^{(1)} \) (\( \varvec{\varPhi}_{v}^{(1)}\varvec{\theta}_{v-j}^{(1)} \)) at layer one as follows

$$\begin{aligned} w_{j}^{(1)}\sim Pois \left( \varvec{\varPhi }_{w}^{(1)}\varvec{\theta }_{w-j}^{(1)} \right) , \quad v_{j}^{(1)}\sim Pois \left( \varvec{\varPhi }_{v}^{(1)}\varvec{\theta }_{v-j}^{(1)} \right) . \end{aligned}$$
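For concreteness, the following minimal NumPy sketch draws layer-one observations from this Poisson likelihood. It is our illustration rather than the authors' code; the vocabulary sizes, topic number and gamma priors are assumed placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
K0_text, K0_vis, K1 = 5000, 1000, 400          # assumed vocabulary/topic sizes

# Columns of Phi_w, Phi_v are topics (nonnegative, each summing to one).
Phi_w = rng.dirichlet(np.full(K0_text, 0.05), size=K1).T
Phi_v = rng.dirichlet(np.full(K0_vis, 0.05), size=K1).T

# Gamma-distributed hidden units of document j for each modality.
theta_w = rng.gamma(shape=1.0, scale=1.0, size=K1)
theta_v = rng.gamma(shape=1.0, scale=1.0, size=K1)

w_j = rng.poisson(Phi_w @ theta_w)             # observed word counts
v_j = rng.poisson(Phi_v @ theta_v)             # observed visual-word counts
```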

3.2 Multi-modal User Attribute Model

Our proposed model is based on PGBN [29], a deep network architecture designed only for text analysis. Zhou et al. [30] further propose augmentable gamma belief networks to learn multilayer deep representations of high-dimensional sparse count vectors and nonnegative real vectors. Nevertheless, augmentable gamma belief networks are not adapted to social data. We therefore propose a multi-modal user attribute model for multi-modal data on social media. For a microblogging document, we assume that the generated topics come from two domains: textual topics generated from microblog texts, and visual topics generated from posted images. In order to capture the correlations between these two modalities, we learn a shared representation of textual and visual information, using \( \varvec{\varTheta}_j \) to represent the shared gamma-distributed hidden units of the jth microblogging document. With T hidden layers, we give the example of our proposed mmUAM fused at the first hidden layer as follows

$$\begin{aligned} \begin{array}{c} \varvec{\varTheta }_{j}^{(T)}\sim Gam\left( \varvec{r}, 1/c_{j}^{(T+1)} \right) ,\\ ...\\ \varvec{\varTheta }_{j}^{(t)}\sim Gam\left( \left[ \begin{array}{l} \varvec{\varPhi }_{w}^{(t+1)}\varvec{\theta }_{w-j}^{(t+1)}\\ \varvec{\varPhi }_{v}^{(t+1)}\varvec{\theta }_{v-j}^{(t+1)} \end{array}\right] , 1/c_{j}^{(t+1)} \right) ,\\ ...\\ \varvec{\varTheta }_{j}^{(1)}\sim Gam\left( \left[ \begin{array}{l} \varvec{\varPhi }_{w}^{(2)}\varvec{\theta }_{w-j}^{(2)}\\ \varvec{\varPhi }_{v}^{(2)}\varvec{\theta }_{v-j}^{(2)} \end{array}\right] , p_{j}^{(2)}/(1-p_{j}^{(2)}) \right) . \end{array} \end{aligned}$$
(1)

The graphical representation of the mmUAM is depicted in Fig. 1. For \( t=1,2,...,T-1 \), the hidden units \( \varvec{\varTheta}_{j}^{(t)} \in R_{+}^{K_{t}} \) of layer t follow gamma distributions whose shape parameters factorize into the concatenation of \( \varvec{\varPhi}_{w}^{(t+1)}\varvec{\theta}_{w-j}^{(t+1)} \) and \( \varvec{\varPhi}_{v}^{(t+1)}\varvec{\theta}_{v-j}^{(t+1)} \). With \( c_{j}^{(2)}=(1-p_{j}^{(2)})/p_{j}^{(2)} \), the \( p_{j}^{(2)} \) and \( \{1/c_{j}^{(t)}\}_{t=3,...,T+1} \) are probability parameters and gamma scale parameters, respectively. For the top layer, the gamma shape parameters of the hidden units form the vector \( \varvec{r} = (r_1,...,r_{K_T})' \). The columns of \( \varvec{\varPhi}_{w}^{(t+1)} \) and \( \varvec{\varPhi}_{v}^{(t+1)} \) capture the correlations between the \( K_t \) latent features of \( (\varvec{\varTheta}_{1}^{(t)},...,\varvec{\varTheta}_{J}^{(t)}) \).
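To make the top-down generative process of Eq. (1) explicit, the sketch below draws the shared hidden units layer by layer for one document. It is a simplified illustration under assumed shapes and container types (`Phi_w`, `Phi_v` and `c` indexed by layer are our own conventions), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_hidden_units(Phi_w, Phi_v, r, c, p2, T):
    """Draw Theta^(T), ..., Theta^(1) for one document following Eq. (1).
    Phi_w[t], Phi_v[t]: loadings between layers t and t-1; shapes assumed."""
    theta = rng.gamma(r, 1.0 / c[T + 1])                 # Theta^(T) ~ Gam(r, 1/c^(T+1))
    thetas = {T: theta}
    for t in range(T - 1, 0, -1):
        kw = Phi_w[t + 1].shape[1]                       # width of the textual block
        theta_w, theta_v = theta[:kw], theta[kw:]        # split the shared units
        shape = np.concatenate([Phi_w[t + 1] @ theta_w,  # concatenated shape params
                                Phi_v[t + 1] @ theta_v])
        scale = p2 / (1.0 - p2) if t == 1 else 1.0 / c[t + 1]
        theta = rng.gamma(shape, scale)                  # Theta^(t)
        thetas[t] = theta
    return thetas
```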

In order to simplify parameter inference, we impose the constraint that every column of \( \varvec{\varPhi}_{w}^{(t)} \) and \( \varvec{\varPhi}_{v}^{(t)} \) has a unit \( L_1 \) norm. Thus, for \( t \in \left\{ 1,...,T-1\right\} \), the hierarchical model is completed as follows

$$\begin{aligned}\begin{gathered} \phi _{w-k}^{(t)}\sim Dir(\eta ^{(t)},...,\eta ^{(t)}) ,\quad \phi _{v-k}^{(t)}\sim Dir(\xi ^{(t)},...,\xi ^{(t)}),\\ c_0\sim Gam(e_0,1/f_0),\quad \gamma _0 \sim Gam(a_0,1/b_0), \quad r_{k}\sim Gam(\gamma _0/K_T,1/c_0). \end{gathered}\end{aligned}$$

For \( t \in \left\{ 3,...,T+1\right\} \), we have

$$\begin{aligned} p_{j}^{(2)}\sim Beta(a_0,b_0), \quad c_{j}^{(t)}\sim Gam(e_0,1/f_0). \end{aligned}$$
(2)

We divide the T hidden layers into T related subproblems, so that every subproblem can be solved in a similar way.

Lemma 1

(augment-and-conquer the mmUAM). With \( p_{j}^{(1)}=1-e^{-1} \) and

$$\begin{aligned} p_{j}^{(t+1)}=-ln(1-p_{j}^{(t)})/\left[ c_{j}^{(t+1)}-ln(1-p_{j}^{(t)}) \right] . \end{aligned}$$
(3)

For \( t \in \left\{ 1,...,T \right\} \), the observed (if \( t=1 \)) or latent (if \( t\ge 2 \)) textual counts \( w_j^{(t)} \in Z^{K_{t-1}} \) can be connected through the Poisson distribution to the product \( \varvec{\varPhi}_w^{(t)}\varvec{\theta}_{w-j}^{(t)} \), and the observed (if \( t=1 \)) or latent (if \( t\ge 2 \)) visual word counts \( v_j^{(t)} \in Z^{K_{t-1}} \) to the product \( \varvec{\varPhi}_v^{(t)}\varvec{\theta}_{v-j}^{(t)} \):

$$\begin{aligned} w_{j}^{(t)}\sim Pois \left[ - \varvec{\varPhi }_{w}^{(t)}\varvec{\theta }_{w-j}^{(t)} ln\left( 1-p_{j}^{(t)} \right) \right] , \end{aligned}$$
(4)
$$\begin{aligned} v_{j}^{(t)}\sim Pois \left[ - \varvec{\varPhi }_{v}^{(t)}\varvec{\theta }_{v-j}^{(t)} ln\left( 1-p_{j}^{(t)} \right) \right] . \end{aligned}$$
(5)

Proof

Definitions (4) and (5) hold for layer one by construction. Now assume that (4) and (5) are true for layer t. Then each textual count \( w_{ij}^{(t)} \) and each visual count \( v_{ij}^{(t)} \) can be augmented into a summation of \( K_t \) latent textual and visual counts, respectively:

$$\begin{aligned}\begin{gathered} w_{ij}^{(t)}= \sum _{k=1}^{K_t}w_{ijk}^{(t)}, \quad w_{ijk}^{(t)}\sim Pois\left[ -\phi _{w-ik}^{(t)}\theta _{w-kj}^{(t)}ln(1-p_{j}^{(t)}) \right] ,\\ v_{ij}^{(t)}= \sum _{k=1}^{K_t}v_{ijk}^{(t)}, \quad v_{ijk}^{(t)}\sim Pois\left[ -\phi _{v-ik}^{(t)}\theta _{v-kj}^{(t)}ln(1-p_{j}^{(t)}) \right] . \end{gathered}\end{aligned}$$

where \(i\in \left\{ 1,...,K_{t-1}\right\} \). Then, we have

$$\begin{aligned}\begin{gathered} m_{kj}^{(t)(t+1)}=w_{\cdot jk}^{(t)}=\sum _{i=1}^{K_{t-1}}w_{ijk}^{(t)}, m_{j}^{(t)(t+1)}=\left( w_{\cdot j1}^{(t)},...,w_{\cdot jK_t}^{(t)}\right) ',\\ n_{kj}^{(t)(t+1)}=v_{\cdot jk}^{(t)}=\sum _{i=1}^{K_{t-1}}v_{ijk}^{(t)}, n_{j}^{(t)(t+1)}=\left( v_{\cdot j1}^{(t)},...,v_{\cdot jK_t}^{(t)}\right) '. \end{gathered}\end{aligned}$$

\( m_{kj}^{(t)(t+1)} \) denotes the count with which factor \( k \in \left\{ 1,...,K_t\right\} \) appears in document j at layer t for the textual modality, and \( n_{kj}^{(t)(t+1)} \) the corresponding count for the visual modality. Since \( \sum _{i=1}^{K_{t-1}}\phi _{w-ik}^{(t)}=1 \) and \( \sum _{i=1}^{K_{t-1}}\phi _{v-ik}^{(t)}=1 \), we utilize the method in [31] to marginalize out \( \varvec{\varPhi}_w^{(t)}\) and \( \varvec{\varPhi}_v^{(t)}\). As a result,

$$\begin{aligned} m_{j}^{(t)(t+1)} \sim Pois\left[ -\varvec{\theta }_{w-j}^{(t)}ln(1-p_{j}^{(t)}) \right] ,\quad n_{j}^{(t)(t+1)} \sim Pois\left[ -\varvec{\theta }_{v-j}^{(t)}ln(1-p_{j}^{(t)}) \right] . \end{aligned}$$

Then, employing the above Poisson likelihoods, we further marginalize out \( \varvec{\theta }_{w-j}^{(t)} \) and \( \varvec{\theta }_{v-j}^{(t)} \), which follow gamma distributions:

$$\begin{aligned} m_{j}^{(t)(t+1)} \sim NB\left( \varvec{\varPhi }_w^{(t+1)}\varvec{\theta }_{w-j}^{(t+1)},p_{j}^{(t+1)} \right) , \end{aligned}$$
(6)
$$\begin{aligned} n_{j}^{(t)(t+1)} \sim NB\left( \varvec{\varPhi }_v^{(t+1)}\varvec{\theta }_{v-j}^{(t+1)},p_{j}^{(t+1)} \right) . \end{aligned}$$
(7)

As demonstrated in [28], (6) and (7) can also be generated from a compound Poisson distribution as

$$\begin{aligned} m_{kj}^{(t)(t+1)}=\sum _{x=1}^{w_{kj}^{(t+1)}}u_x, \quad u_x \sim Log(p_j^{(t+1)}), \quad w_{kj}^{(t+1)} \sim Pois \left[ -\varvec{\phi }_{w-k:}^{(t+1)}\varvec{\theta }_{w-j}^{(t+1)}ln(1-p_{j}^{(t+1)}) \right] , \end{aligned}$$
$$\begin{aligned} n_{kj}^{(t)(t+1)}=\sum _{y=1}^{v_{kj}^{(t+1)}}u_y, \quad u_y \sim Log(p_j^{(t+1)}), \quad v_{kj}^{(t+1)} \sim Pois \left[ -\varvec{\phi }_{v-k:}^{(t+1)}\varvec{\theta }_{v-j}^{(t+1)}ln(1-p_{j}^{(t+1)}) \right] . \end{aligned}$$

Hence, if (4) and (5) are true for layer t, they are also true for layer \( t+1 \), which completes the proof.
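The negative binomial/compound-Poisson equivalence used in this proof can be checked numerically. The snippet below is an assumption-level illustration with SciPy, not part of the paper; `logser` is SciPy's logarithmic-series distribution, and `nbinom(lam, 1 - p)` matches the NB parameterization above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
lam, p, n = 3.0, 0.4, 100_000

# m = sum of L logarithmic draws, with L ~ Pois(-lam * ln(1 - p)).
L = rng.poisson(-lam * np.log1p(-p), size=n)
m = np.array([stats.logser(p).rvs(size=k, random_state=rng).sum() if k else 0
              for k in L])
nb = stats.nbinom(lam, 1.0 - p).rvs(size=n, random_state=rng)
print(m.mean(), nb.mean())        # both approximate lam * p / (1 - p) = 2.0
```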

Inspired by the lemmas and theorems in [28, 31], we propagate the latent textual counts \( w_{ij}^{(t)} \) and visual counts \( v_{ij}^{(t)} \) of layer t upward to layer \( t+1 \) as

$$\begin{aligned} \begin{array}{c} \left\{ \left( w_{ij1}^{(t)},...,w_{ijK_t}^{(t)} \right) |w_{ij}^{(t)},\varvec{\phi }_{w-i:}^{(t)},\varvec{\theta }_{w-j}^{(t)} \right\} \\ \sim Mult \left( w_{ij}^{(t)},\frac{\phi _{w-i1}^{(t)}\theta _{w-1j}^{(t)}}{\sum _{k=1}^{K_t}\phi _{w-ik}^{(t)}\theta _{w-kj}^{(t)}},...,\frac{\phi _{w-iK_t}^{(t)}\theta _{w-K_tj}^{(t)}}{\sum _{k=1}^{K_t}\phi _{w-ik}^{(t)}\theta _{w-kj}^{(t)}} \right) , \end{array} \end{aligned}$$
(8)
$$\begin{aligned} \left( w_{kj}^{(t+1)}|m_{kj}^{(t)(t+1)},\varvec{\phi }_{w-k:}^{(t+1)},\varvec{\theta }_{w-j}^{(t+1)} \right) \sim CRT \left( m_{kj}^{(t)(t+1)}, \varvec{\phi }_{w-k:}^{(t+1)}\varvec{\theta }_{w-j}^{(t+1)} \right) , \end{aligned}$$
(9)
$$\begin{aligned} \begin{array}{c} \left\{ \left( v_{ij1}^{(t)},...,v_{ijK_t}^{(t)} \right) |v_{ij}^{(t)},\varvec{\phi }_{v-i:}^{(t)},\varvec{\theta }_{v-j}^{(t)} \right\} \\ \sim Mult \left( v_{ij}^{(t)},\frac{\phi _{v-i1}^{(t)}\theta _{v-1j}^{(t)}}{\sum _{k=1}^{K_t}\phi _{v-ik}^{(t)}\theta _{v-kj}^{(t)}},...,\frac{\phi _{v-iK_t}^{(t)}\theta _{v-K_tj}^{(t)}}{\sum _{k=1}^{K_t}\phi _{v-ik}^{(t)}\theta _{v-kj}^{(t)}} \right) , \end{array} \end{aligned}$$
(10)
$$\begin{aligned} \left( v_{kj}^{(t+1)}|n_{kj}^{(t)(t+1)},\varvec{\phi }_{v-k:}^{(t+1)},\varvec{\theta }_{v-j}^{(t+1)} \right) \sim CRT \left( n_{kj}^{(t)(t+1)}, \varvec{\phi }_{v-k:}^{(t+1)}\varvec{\theta }_{v-j}^{(t+1)} \right) . \end{aligned}$$
(11)
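The two upward moves have simple sampling routines: a multinomial split of each observed count (Eqs. (8) and (10)) and a Chinese Restaurant Table (CRT) draw (Eqs. (9) and (11)). The following NumPy sketch (our own, with assumed array shapes) shows both; the textual and visual modalities use the same code with their respective \( \varPhi \) and \( \theta \).

```python
import numpy as np

rng = np.random.default_rng(3)

def augment_counts(counts, Phi, Theta):
    """Split counts[i, j] into K_t latent counts as in Eqs. (8)/(10).
    Phi: (K_{t-1}, K_t) loadings; Theta: (K_t, J) hidden units."""
    I, J = counts.shape
    K = Phi.shape[1]
    out = np.zeros((I, J, K), dtype=np.int64)
    for i in range(I):
        for j in range(J):
            prob = Phi[i] * Theta[:, j]
            out[i, j] = rng.multinomial(counts[i, j], prob / prob.sum())
    return out

def crt(m, r):
    """CRT draw of Eqs. (9)/(11): tables occupied by m customers, concentration r."""
    return int(sum(rng.random() < r / (r + i) for i in range(int(m))))
```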

3.3 Parameter Inference

In conventional topic models, variational inference and collapsed Gibbs sampling are often used for parameter inference. To estimate the latent variables given the multivariate observations, we utilize upward-downward Gibbs sampling [29], with the width of the first layer restricted to \( K_{1max} \). The sampling process of the mmUAM is as follows.

Sample \( w_{ijk}^{(t)} \) and \( v_{ijk}^{(t)} \). For all layers, we can use (8) to sample \( w_{ijk}^{(t)} \) and (10) to sample \( v_{ijk}^{(t)} \). For the first hidden layer, however, the observed count \( w_{ij}^{(1)} \) is treated as the number of word tokens of the ith term in the jth document, where the size of the textual vocabulary is denoted \( I=K_0 \), and the observed count \( v_{ij}^{(1)} \) is treated as the number of visual-word tokens of the ith term (with visual vocabulary size \( I'=K_0' \)) in the jth document. We define \( z_{w-js} \) and \( z_{v-js^{'}} \) as the topic indices of tokens \( i_{js} \) and \( i_{js^{'}} \) (\( s \in \left\{ 1,...,w_{\cdot j}^{(1)}\right\} \), \( s^{'} \in \left\{ 1,...,v_{\cdot j}^{(1)}\right\} \)):

$$\begin{aligned} P\left( z_{w-js}=k|- \right) \propto \frac{\eta ^{(1)}+(w_{i_{js}\cdot k}^{(1)})_{-js}}{I\eta ^{(1)}+(w_{\cdot \cdot k}^{(1)})_{-js}} \left( (w_{\cdot jk}^{(1)})_{-js}+\phi _{w-k:}^{(2)}\theta _{w-j}^{(2)} \right) , \end{aligned}$$
(12)
$$\begin{aligned} P\left( z_{v-js^{'}}=k|- \right) \propto \frac{\xi ^{(1)}+(v_{i_{js^{'}}\cdot k}^{(1)})_{-js^{'}}}{I'\xi ^{(1)}+(v_{\cdot \cdot k}^{(1)})_{-js^{'}}}\left( (v_{\cdot jk}^{(1)})_{-js^{'}}+\phi _{v-k:}^{(2)}\theta _{v-j}^{(2)} \right) . \end{aligned}$$
(13)

where \( k\in \left\{ 1,...,K_{1max}\right\} \). We let \( w_{ijk}^{(1)}=\sum _{s}\delta \left( i_{js}=i,z_{w-js}=k \right) \) and \( v_{ijk}^{(1)}=\sum _{s^{'}}\delta \left( i_{js^{'}}=i,z_{v-js^{'}}=k \right) \), so that \( w_{ijk}^{(1)}\) and \( v_{ijk}^{(1)}\) count the number of times term i is assigned to topic k in document j. The subscript \( -js \) (\( -js^{'} \)) denotes the corresponding count with the current token excluded. In the special case \( T=1 \), we use Poisson Factor Analysis (PFA) with the gamma-negative binomial process [28], replacing \( \phi _{w-k:}^{(2)}\theta _{w-j}^{(2)} \) and \( \phi _{v-k:}^{(2)}\theta _{v-j}^{(2)} \) with \( r_k \); for simplicity, we then use \( K_{1max} \) factors and let \( r_{k}\sim Gam(\gamma _0/K_{1max},1/c_0) \).
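As an illustration of this collapsed, token-level step, the sketch below draws one topic assignment according to Eq. (12). The excluded-token count arrays and the prior vector \( \phi _{w-k:}^{(2)}\theta _{w-j}^{(2)} \) are assumed to be maintained by the surrounding sampler; the visual draw of Eq. (13) is identical with \( \xi \), \( I' \) and the visual counts.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_token_topic(term_topic, topic_total, doc_topic, prior, eta, I):
    """One draw of z_{w-js} per Eq. (12). All counts exclude the current token:
    term_topic[k] = (w_{i.k})_{-js}, topic_total[k] = (w_{..k})_{-js},
    doc_topic[k] = (w_{.jk})_{-js}; prior[k] = phi_{w-k:}^(2) theta_{w-j}^(2)."""
    p = (eta + term_topic) / (I * eta + topic_total) * (doc_topic + prior)
    p /= p.sum()
    return rng.choice(len(p), p=p)
```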

Sample \( \varvec{\phi }_{w-k}^{(t)} \). We sample the textual topic \( \varvec{\phi }_{w-k}^{(t)} \) as

$$\begin{aligned} \left( \varvec{\phi }_{w-k}^{(t)} | -\right) \sim Dir\left( \eta ^{(t)} + w_{1\cdot k}^{(t)},...,\eta ^{(t)} + w_{K_{t-1}\cdot k}^{(t)}\right) . \end{aligned}$$

Sample \( \varvec{\phi }_{v-k}^{(t)} \). In the same way, we sample the visual topic \( \varvec{\phi }_{v-k}^{(t)} \) as

$$\begin{aligned} \left( \varvec{\phi }_{v-k}^{(t)} | -\right) \sim Dir\left( \xi ^{(t)} + v_{1\cdot k}^{(t)},...,\xi ^{(t)} + v_{K_{t-1}\cdot k}^{(t)}\right) . \end{aligned}$$

Sample \( w_{kj}^{(t+1)} \) and \( v_{kj}^{(t+1)} \). We sample \( w_{kj}^{(t+1)} \) and \( v_{kj}^{(t+1)} \) separately using (9) and (11).

Sample \( \varvec{r} \). \( c_0 \) and \( \gamma _0 \) are sampled as detailed in [28]; we then sample each \( r_k \) as

$$\begin{aligned} \left( r_k | -\right) \sim Gam\left( \gamma _0/K_T + w_{k\cdot }^{(T+1)} + v_{k\cdot }^{(T+1)}, \left[ c_0-\sum _j ln\left( 1-p_{j}^{(T+1)} \right) \right] ^{-1}\right) . \end{aligned}$$

Sample \( \varvec{\varTheta }_j^{(t)} \). Using the latent counts propagated upward and the gamma-Poisson conjugacy, we sample the hidden units \( \varvec{\varTheta }_j^{(t)} \) downward as

$$\begin{aligned} \left( \varvec{\varTheta }_j^{(t)} | -\right) \sim Gam\left( \left[ \begin{array}{l} \varvec{\varPhi }_{w}^{(t+1)}\varvec{\theta }_{w-j}^{(t+1)}\\ \varvec{\varPhi }_{v}^{(t+1)}\varvec{\theta }_{v-j}^{(t+1)} \end{array}\right] +\left[ \begin{array}{l} m_{j}^{(t)(t+1)}\\ n_{j}^{(t)(t+1)} \end{array}\right] , \left[ c_{j}^{(t+1)}-ln\left( 1-p_{j}^{(t)} \right) \right] ^{-1}\right) . \end{aligned}$$

Sample \( p_j^{(2)} \) and \( c_j^{(t)} \). We compute \( p_j^{(t)} \) (\( t\ge 3 \)) via the recursion in (3) and \( c_j^{(2)} \) from \( c_{j}^{(2)}=(1-p_{j}^{(2)})/p_{j}^{(2)} \), and sample \( p_{j}^{(2)} \) and \( c_{j}^{(t)} \) (\( t\ge 3 \)) as

$$\begin{aligned}\begin{gathered} \left( p_{j}^{(2)}|- \right) \sim Beta\left( a_0 +\left[ \begin{array}{l} m_{\cdot j}^{(1)(2)}\\ n_{\cdot j}^{(1)(2)} \end{array}\right] , b_0+\varTheta _{\cdot j}^{(2)} \right) ,\\ \left( c_{j}^{(t)}|- \right) \sim Gam\left( e_0 + \varTheta _{\cdot j}^{(t)},\left[ f_0+\varTheta _{\cdot j}^{(t-1)} \right] ^{-1} \right) . \end{gathered}\end{aligned}$$
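Putting the downward steps together, the sketch below updates the probability parameters via the recursion of Eq. (3) and then draws the hidden units from their gamma posteriors. This is a per-document outline under assumed shapes and our own helper names, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

def update_p(p2, c, T):
    """p_j^(1), p_j^(2), then p_j^(t) for t >= 3 via Eq. (3); c[t] holds c_j^(t)."""
    p = {1: 1.0 - np.exp(-1.0), 2: p2}
    c[2] = (1.0 - p2) / p2
    for t in range(2, T + 1):
        p[t + 1] = -np.log1p(-p[t]) / (c[t + 1] - np.log1p(-p[t]))
    return p

def sample_theta(shape_prior, counts, c_next, p_t):
    """Gamma posterior of Theta_j^(t): shape_prior concatenates
    Phi_w^(t+1) theta_w^(t+1) and Phi_v^(t+1) theta_v^(t+1); counts
    concatenates m_j^(t)(t+1) and n_j^(t)(t+1)."""
    return rng.gamma(shape_prior + counts, 1.0 / (c_next - np.log1p(-p_t)))
```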
Fig. 2. The generated visual words and textual keywords for mmUAM-1 at five different layers.

4 Experiments

4.1 Dataset Construction

As we know, Sina Weibo is one of the most popular social media platforms in China, from which we collect a dataset and conduct our experiments on real data. We crawl 1,349 users, including their basic information and the microblogs they posted from January 2015 to December 2016, where each microblog contains both texts and images. After filtering out inactive users and splitting each user's microblogs into two documents according to posting time, we obtain 193,798 documents. In order to comprehensively evaluate the generated user attributes, we utilize both the crawled tags and the posted microblogs to manually label the users' interests.

To preprocess the textual data, we first utilize the jieba Chinese word segmentation tool to segment the Chinese words, and then eliminate non-Chinese characters, stop words, and low-frequency words that appear fewer than five times. For visual feature description, the Scale-Invariant Feature Transform (SIFT) [16] is used to extract discriminative local features of images, generating 128-dimensional SIFT descriptors. To construct a codebook of visual words, we run k-means on the descriptors, with each cluster center quantized into a visual word. As a result, each image is represented by the counts of the codebook's visual words it contains.
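A minimal sketch of this bag-of-visual-words pipeline is given below, assuming OpenCV with SIFT support and scikit-learn; the codebook size of 1,000 is an illustrative placeholder, since the paper does not report it here.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(image_paths, n_words=1000):
    sift = cv2.SIFT_create()
    descriptors = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, d = sift.detectAndCompute(img, None)        # 128-dim SIFT descriptors
        if d is not None:
            descriptors.append(d)
    return KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(descriptors))

def visual_word_counts(image_path, codebook):
    sift = cv2.SIFT_create()
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, d = sift.detectAndCompute(img, None)
    words = codebook.predict(d)                        # quantize to nearest center
    return np.bincount(words, minlength=codebook.n_clusters)
```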

4.2 Evaluation Metrics and Compared Methods

The standard classification evaluation measures, such as precision, recall and F1-measure, are not sufficient to assess performance on multi-label problems. Thus, we adopt the following evaluation measures proposed in [21], where \( x_i \) denotes the ith document, \( Y_i \) its set of ground-truth attributes, \( \overline{Y_i} \) the complementary set of attributes, and f the learned ranking function over attributes.

Average Precision. Average precision evaluates the average rank position of the ground-truth attributes within the predicted ranking, where N is the number of documents:

$$\begin{aligned} ap\left( f \right) =\frac{1}{N}\sum _{i=1}^{N}\frac{1}{\left| Y_i \right| }\sum _{y\in Y_i}\frac{\left| \left\{ y^{'}\in Y_i|rank_f\left( x_i,y^{'} \right) \le rank_f\left( x_i,y \right) \right\} \right| }{rank_{f}\left( x_i,y \right) }. \end{aligned}$$
(14)

One Error. This measure evaluates how often the top-ranked predicted attribute is not in the set of ground-truth attributes \( Y_i \); we denote the one-error of a hypothesis f as \( oe(f) \):

$$\begin{aligned} oe\left( f \right) =\frac{1}{N}\sum _{i=1}^{N}\left\{ \left[ argmax_{y\in Y}f\left( x_i,y \right) \right] \notin Y_i\right\} . \end{aligned}$$
(15)

Ranking Loss. This measure computes the average fraction of relevant-irrelevant attribute pairs that are misordered:

$$\begin{aligned} rl\left( f \right) =\frac{1}{N}\sum _{i=1}^{N}\frac{1}{\left| Y_i \right| \left| \overline{Y_i} \right| }\left| \left\{ \left( y,y^{'} \right) |f\left( x_i,y \right) \le f\left( x_i,y^{'} \right) ,\left( y,y^{'} \right) \in Y_i\times \overline{Y_i} \right\} \right| . \end{aligned}$$
(16)

Coverage. Coverage measures how far, on average, one must go down the ranked list of predicted attributes in order to cover all the attributes assigned to a document:

$$\begin{aligned} co\left( f \right) =\frac{1}{N}\sum _{i=1}^{N}\max _{y\in Y_i} rank_{f}\left( x_i,y \right) -1. \end{aligned}$$
(17)
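For reference, the four metrics can be computed as in the sketch below, which follows our reading of Eqs. (14)-(17); here `F[i]` is the score vector over all attributes for document i and `Y[i]` its ground-truth label set. It is an illustration, not the evaluation code of [21].

```python
import numpy as np

def rank(scores, y):
    """1-based rank of attribute y when attributes are sorted by descending score."""
    return int(np.where(np.argsort(-scores) == y)[0][0]) + 1

def average_precision(F, Y):
    """Eq. (14): mean precision at the ranks of the ground-truth attributes."""
    ap = 0.0
    for f, ys in zip(F, Y):
        ranks = sorted(rank(f, y) for y in ys)
        ap += np.mean([(idx + 1) / r for idx, r in enumerate(ranks)])
    return ap / len(F)

def one_error(F, Y):
    """Eq. (15): fraction of documents whose top-ranked attribute is wrong."""
    return float(np.mean([int(np.argmax(f)) not in ys for f, ys in zip(F, Y)]))

def ranking_loss(F, Y):
    """Eq. (16): average fraction of misordered (relevant, irrelevant) pairs."""
    rl = 0.0
    for f, ys in zip(F, Y):
        neg = set(range(len(f))) - set(ys)
        rl += sum(f[y] <= f[yn] for y in ys for yn in neg) / (len(ys) * len(neg))
    return rl / len(F)

def coverage(F, Y):
    """Eq. (17): average depth needed to cover all ground-truth attributes."""
    return float(np.mean([max(rank(f, y) for y in ys) - 1 for f, ys in zip(F, Y)]))
```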

To demonstrate the effectiveness of our proposed mmUAM, we compare our algorithm with the following methods.

Poisson Gamma Belief Network (PGBN). The PGBN is applied to the short texts of our crawled Sina Weibo dataset.

Multi-modal User Attribute Model (mmUAM). We apply multi-modal fusion at each of the five layers separately; the five variants are denoted mmUAM-1, mmUAM-2, mmUAM-3, mmUAM-4 and mmUAM-5. Specifically, mmUAM-1 learns the shared representation of texts and images at the first layer, and the other mmUAMs are defined analogously.

4.3 Experimental Results

In this paper, we employ the layer-wise training method for the mmUAMs, with a fixed budget on the width of layer one \( K_{1max}=400 \) and the depth of the network \( T=5 \). Besides, we set the hyper-parameters as \( e_0=f_0=1 \), \( a_0=b_0=0.01 \), and \( \eta ^{(t)}=\xi ^{(t)}=0.05 \) for all layers.
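Collected as code, these settings look as follows (a configuration sketch with our own variable names, not taken from the authors' code):

```python
config = dict(
    K1_max=400,         # width budget of layer one
    T=5,                # depth of the network
    e0=1.0, f0=1.0,     # gamma hyper-parameters for c
    a0=0.01, b0=0.01,   # beta hyper-parameters for p_j^(2)
    eta=0.05, xi=0.05,  # Dirichlet hyper-parameters for all layers
)
```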

As the mmUAM learns a shared representation of textual and visual information, we perform a qualitative analysis to show the effectiveness of our proposed model. We randomly select two users and analyze the topics generated by mmUAM-1 at five different layers. As an example, we choose 3 generated visual words and the top 6 textual words of a specific topic, shown in Fig. 2. We can see that the visual and textual representations strengthen the description of user attributes; evidently, the mmUAM can effectively model the semantic correlations of multi-modal social data.

We compare our mmUAM with PGBN and across the different layer-fusion strategies to evaluate the quality of the generated user attributes. Figure 3 displays the performance in terms of average precision, one error, ranking loss and coverage. On the crawled Sina Weibo dataset, all five multi-modal fusion variants of our proposed mmUAM perform better than PGBN. The results also confirm that multi-modal topic modeling works better than unimodal approaches. Among the mmUAMs, the differences on the four evaluation metrics are slight. In particular, mmUAM-1 achieves the best average precision and one error, while mmUAM-5 achieves the lowest ranking loss and coverage, indicating the best classification quality on those measures. Consequently, first-layer fusion and last-layer fusion capture the correlations between texts and images better than fusion at the other layers.

Fig. 3. Performance evaluated by average precision, one error, ranking loss and coverage, respectively.

5 Conclusion

In this paper, we proposed a novel multi-modal User Attribute Model (mmUAM) that automatically generates user interests from multi-modal social media data by capturing the correlations between textual and visual information. In particular, we improved the PGBN to extract topics of interest better in line with users' characteristics. We conducted experiments on our crawled microblog dataset, where the results demonstrated the superiority of our mmUAM compared to state-of-the-art methods.