1 Introduction

Social influence (SI) is a common feature of everyday life: we either try to influence others or are influenced by them many times each day. For example, colleagues have a strong influence on one’s work, while friends have a strong influence on one’s daily life. This influence can be somewhat banal such as what type of shoes to buy—or more significant—such as whether to vote for one candidate versus others. Rashotte (2007) defined SI as a change in an individual’s thoughts, feelings, attitudes, and behaviors that results from interaction with another individual or a group. Influence is an invisible, complex, and subtle phenomenon that governs social dynamics and user behaviors. Besides, SI takes many different forms and can be seen in processes of conformity, socialization, peer pressure, obedience, leadership, persuasion, minority influence, among others (Han and Li 2018).

A social network (SN) is a social structure made up of individuals or organizations, which are connected by one or more specific types of interdependency, such as friendship, kinship, common interest, likes/dislikes, or relationships of beliefs, knowledge or prestige (Travers and Milgram 1967). As the Internet evolved, online social networks (OSNs) emerged such as Twitter, LinkedIn, and Facebook. They have attracted a lot of attention since they allow users to share ideas, activities, events, and interests within their networks. In contrast to traditional (offline) social networks, OSNs store a register of the interaction between users based on their content shared between them, and how this content is propagated on the Internet, which is a result of SI.

Many researchers have tried to test or examine whether there is an influence, and how people influence each other in SN. Understanding how users influence each other and quantifying it can benefit various applications. In the field of data mining, some recognized applications include viral marketing (or targeted advertising in general) (Freeman 1978; Brown and Reingen 1987), recommender systems (Pálovics et al. 2014), analysis of information diffusion in social media like Twitter and Facebook (Tang et al. 2009), events detection (Weng et al. 2010), experts finding (Franks et al. 2013), link prediction (Qiu et al. 2018), and ranking of feeds.

Other disciplines outside social psychology, such as computation, politics, marketing recognize that social influence is key to their concerns. However, they are building models of influence largely without recognizing the extensive conceptual and empirical work done by social psychologists. Typical models in SIA such (Albert and Barabási 2002; Faloutsos et al. 1999; Newman 2003; Strogatz 2001; Tang et al. 2009; Hu et al. 2013, 2015; Cai et al. 2017; Qiu et al. 2018) focus on macro-level models such as degree distributions, diameter, clustering coefficient, and communities or fail to capture the complexity in the SI process. However, in online real situations, the spread of influence occurs through populations over a span of time with each individual serving as both a source and a target showing special properties.

Therefore, novel methods are required to characterize and quantify the process of the SI that can be extended to OSNs, which incorporate social theories to create a model that portrays this phenomenon as close to reality. Thus, the general objective of this research is to design and implement a novel model that characterizes the social influence process based on sociological theories and probabilistic reasoning theory for quantifying each user over the other users inside an online social network. The following specific objectives are defined for this work: First, to analyze the social influence process from the social science point of view in online social networks for selecting a set of involved factors; Second, to formalize the concept of social influence and sociological theories about social influence, such as user’s behavior, user’s profile, user’s temporal variation interest, and temporal evolution of relationships; Third, to design a mathematical model that integrates the formalized factors to represent the social influence process; Finally, to generate a probabilistic reasoning system with the designed model and an existing inference algorithm to ask queries to the model.

In the purse of these objectives, we present a novel algorithm called PRIN, a generative model describing the dynamic of the social influence process. Thus, PRIN allows measuring topical user-level social influence strength through the modeling of the previous objectives. Our final aim is to bring together large volumes of unstructured data such as content shared, and heterogeneous information such as structural link and diffusion links in this model. Additionally, research was performed in social sciences such as physiology, sociology, and anthropology. Therefore, PRIN reflects a broader and deeper knowledge about human behavior in its mathematical inception. We conducted an extensive set of experiments to evaluate the effectiveness of our model to discover user influence. The final results suggest that our model can improve the prediction performance, thus allowing to mine additional information inside social networks.

The rest of this paper is organized as follows: Sect. 2 summarizes related work about social influence analysis. Section 3 formulates social influence problem. Section 4 introduces the proposed model in detail. Section 5 presents the proposed social influence analysis architecture. Sections 6 and  7 present the experimental results and discussion of the validation of the model. Finally, Sect. 8 presents the conclusion and future work of this work.

2 Related work

Social influence analysis has been widely studied in the literature. We have explored related work in two main dimensions (March 1955). The first dimension is the explanatory models, and the second is predictive models for social influence in online social networks. We briefly discuss them, respectively, in this section.

2.1 Explanatory models

The explanatory models aim to examine the social influence process and elucidate the factors that affect it in an attempt to explain this phenomenon. Yang and Leskovec (2010) addressed influence as a form of information diffusion with temporal dynamics. Wen and Lin (2010) show that combining different social media improves the social influence measure. Crandall et al. (2008) studied the correlation between social similarity and influence. However, they focus on qualitative identification of influence existence, and they do not provide a quantitative measure of the influential strength. Xiang et al. (2010) and Goyal et al. (2010) further investigate how to learn the influence probabilities from the history of user actions. However, these methods either do not consider the influence at the topic-level or ignore the influence propagation. Kempe et al. (2003) focused on how influence propagated across a network, assuming an influential user is one whose initial adoption would eventually result in the most number of total adoptions by all users. However, they did not consider the topic-level influence. Hu et al. (2015) modeled topics and communities in a unified latent framework and extract inter-community influence dynamics. Cai et al. (2017) formalized the concepts of community profiling by its internal content profile and external diffusion profile. Besides the global effect of influence, many efforts have been made to estimate the concrete influence strength between individual nodes. Bi et al. (2014) integrate both content topic discovery and social influence analysis based on structural links in the same generative process at the user level.

2.2 Predictive models

The predictive models are used to predict the future social influence process in social networks based on certain factors. Researchers have addressed the problem of measuring social influence by predicting a user’s ability to disseminate information, such as retweet behavior on Twitter. Yang et al. (2010) analyzed the effects of different factors on retweeting behavior. Based on their observations, they proposed a semisupervised framework on a factor graph model to predict users’ retweet behaviors. Pezzoni et al. (2013) discussed how structural factors and retweet behavior affect information diffusion. Considering the retweeting behavior as atomic behavior, they proposed an agent-based information propagation model to generate a cascade. Fei et al. (2011) were the first to adopt microeconomics methods for social media behavior prediction. Saito et al. (2008) focused on the user-level mechanism in social influence where its near neighbors only influence each user. Tang et al. (2009) proposed a topic-specific influence measure. However, this article assumes topic distribution.

Although previous researches have applied many models to analyze social influence, our work is very different. The main contributions are summarized as follows: (1) another approach and formalization of the social influence in online social networks; (2) a novel probabilistic graphical model that characterizes the social influence process applies to online social networks based on sociological theories and previous attend for modeling this phenomenon; (3) a probabilistic reasoning system for quantifying social influence in probabilistic term; (4) a new architecture for social influence analysis focus on diffusion prediction between users inside social networks.

3 Problem formulation

The problem of social influence modeling has been open for a long time since it was proposed. It is because social influence is a relatively subjective concept, and it lacks a universally acknowledged definition. People are frequently confused by the concepts and the measuring methods of social influence. For different social networks, social influence is modeled quite differently. For example, in an electronic commerce network, the most influential users are the ones who can recommend most customers to purchase-specific products successfully (Li et al. 2017). After analyzing the characteristic of social influence and the typical characteristics of OSNs, we introduce a series of formal definitions about the concept of social influence. Then, we formulate the problem of topical mining influence in OSNs.

3.1 Concept definitions

The notation used in this paper is listed in Table 1.

Table 1 Notation used in the model

Definition 1

Social network. A social network is 3-tuple \({\mathcal {G}} = ({\mathcal {U}}, {\mathcal {M}}, {\mathcal {E}})\), where \(u \in {\mathcal {U}}\) is an user and \(m \in {\mathcal {M}}\) is a user published message, and \(e \in {\mathcal {E}}\) is a link. The link set \({\mathcal {E}}\) denotes interaction between users, and it can be derived from different types of user interactions, which are application dependent.

Definition 2

Topic. A topic \(k \in {\mathcal {K}}\) is a |W|-dimensional multinomial distribution \(\omega _k\) over a vocabulary, where each component \(\omega _{k,w}\) is the probability of a word \(w\in W\) belonging to k.

Definition 3

Message. A message \(m \in {\mathcal {M}}_u\) contains a set of words w from a given vocabulary, along with a time stamp \(t_{u,i}\), meaning that \(m_{u,i}\) is the ith message of the user u.

A link is a directed connection \(e_{u,v}\in {\mathcal {E}}\), representing a communication from user u to v. We define two types of links in \({\mathcal {E}}\): structural links between users and diffusion links between users’ messages.

Definition 4

Structural link. Let \(s_{u,v} \in {\mathcal {S}}\) be a structural link from user u to user v defined by a directed edge in a social graph. For each possible edge \(s_{u,v} \in {\mathcal {S}}\), if there exists an edge between u and v, \(s_{u,v} = 1\); otherwise \(s_{u,v} = 0\). Every user has a unique preference to generate structural links with other users based on the interest that a user u has in the published content or non-content of user v. As a result, each \(s_{u,v}\) is associated with:

  • a Bernoulli distribution \(\mu _{u,v}\), which characterizes the user u preference for following v based on the v shared content or non-content.

  • a binary variable \(y_{u,s}\). When \(y_{u,s} = 1\) indicates that the link creation is based on the user u content, whereas \(y_{u,s} = 0\) means that content has nothing to do with the link.

Definition 5

Diffusion link. Let \(d_{m_{u,i},m_{v,j}} \in {\mathcal {D}}\) be a diffusion link, where message i of user u is a broadcast of message j of user v. Each d is associated with a Bernoulli distribution \(\tau _{m,k,t}(u,v)\) characterizing diffusion probability of message \(m_{v,*}\) about topic k at time t from user u. This link represents a change in the state of u due to the influenced exerted by v.

Definition 6

Profiling vector. The profiling vector \(\mathbf {p_u}\) of user \(u \in {\mathcal {U}}\) is defined as an augmenting vector, where each dimension represents some encoded feature associated with a user, such as a gender, location, topic interest, among others.

Definition 7

Users similarity. Let u and v be two users with its associated profiling vectors \(\mathbf {p_u}\) and \(\mathbf {p_v}\). The user similarity of u and v is defined as the cosine similarity of theirs profiling vectors:

$$\begin{aligned} \rho (u,v) = \frac{\mathbf {p_u} \cdot \mathbf {p_v}}{\Vert \mathbf {p_u} \Vert \Vert \mathbf {p_v} \Vert }. \end{aligned}$$
(1)

Due to homophily, users tend to form communities (Rashotte 2007). Community is a collection of users with similar behaviors and more intense interactions among its members than the rest of the global network (Bi et al. 2014). It can be characterized not only by interaction link structures but also by the content generated by its members.

Definition 8

Community. A community \(c \in {\mathcal {C}}\) is associate with:

  • a \(|{\mathcal {K}}|\)-dimensional multinomial distribution \(\theta _c\) over topics, where each component \(\theta _{c,k}\) represents the probability of c discussing topic k. This topic distribution represents the community different topical interest.

  • a \(|{\mathcal {C}}|\)-dimensional multinomial distribution \(\psi _{k,c}\) over time specific to topic k and community c (Hu et al. 2015).

In social networks, users have multiple roles and are influenced by different community contexts (Xie et al. 2013). Therefore, we employ the mixed-membership approach for a user definition.

Definition 9

User. A user \(u \in {\mathcal {U}}\) is associated with:

  • A set of messages \({\mathcal {M}}_u = \{m_{u,1}, \ldots , m_{u,\mid M_u \mid }\}\) generated by user u.

  • A set of structural links \({\mathcal {S}}_u = \{ s_{u,v} \mid v \in {\mathcal {U}} \}\). Each link \(s_{u,v} \in {\mathcal {S}}_u\) represents a social relationship between user u and user v.

  • A set of diffusion links \({\mathcal {D}}_u = \{ d_{m_{u,i},m_{v,j}} \mid t_{u,i} > t_{v,j}\), where \(v \in {\mathcal {U}}\). Each link \(d_{m_{u,i},m_{v,j}} \in {\mathcal {D}}_u\) represents the diffusion of v’s message by u.

  • A user similarity distribution function \(\rho (u, v)\) over user \(v \in {\mathcal {U}}\), which is defined by the user similarity between u and other users.

  • A \(\mid {\mathcal {C}}\mid\)-dimensional multinomial distribution \(\pi _u\) over communities, where each component \(\pi _{u,c}\) represents the probability of u belonging to community c.

  • A \(\mid {\mathcal {U}} \mid\)-dimensional multinomial distribution over users \(\chi _u\), represents the probability of the user u to be considered as important by the other users.

Furthermore, we define user–user influence based on the dynamic process of user behaviors. For example, after user v posts a tweet, one of his followers u reads the tweet and chooses to retweet it. In this case, we assume that user u is influenced by user v, which is reflected as the retweet action performed by u. We model influence in terms of a diffusion probability when a user changes his actual state to another for diffusing specific messages at a particular time.

Definition 10

User–user topical influence strength. For a message m about topic k, at time t the user–user topical influence strength is represented as the diffusion probabilities between any two users from u to v, denoted as \(\tau _{k,t}(u,v)\).

Please note that the influence is asymmetric, i.e., \(\tau (u,v) \ne \tau (v,u)\). Further, we can define the concepts of user influence based on the influence between any pair of nodes.

Definition 11

User influence. Let u be a user in a social network; we denote \(\tau _t(u)\) as the user influence, which represents the user global influential strength of user u in time t in the social network.

3.2 Problem definition

We have as input a social network, with its observed and latent behavior. As observed variables, there is a set of users \({\mathcal {U}}\), per user publishes messages \({\mathcal {M}}\), per user profiling vector \(\rho\), per user’s structural links \({\mathcal {S}}\), and per user’s diffusion links \({\mathcal {D}}\). As a latent behavior, we have factors that according to sociology theories (Rashotte 2007; Granovetter 1977; Liu et al. 2012) constitute the reasons why users are influenced: The per-message topics assignment z, per message community assignment c, per communities’ interest \(\theta\), per user community membership \(\pi\), per user preference for building a structural link \(\mu\), per user importance in the network \(\chi\). These latent factors must be integrated to characterize the social influence process to reason about the social influence process and calculate \(\tau _{k,t}(u,v)\) at the end of our observation window.

4 Probabilistic reasoning system for social influence analysis (PRIN)

4.1 Overview of PRIN

Fig. 1
figure 1

Overview of our probabilistic reasoning system for social influence analysis

Figure 1 shows an overview of the social influence problem. Our goal is to infer topical social influence strength between users. In Fig. 1, we show the input for PRIN: a set of users, each of whom publishes messages; users are connected by structural links and interact with each other by diffusion links, e.g., on Twitter, each user posts tweets, users are connected by followership links, and they retweet each other to diffuse information. Then, for each user, we output a user influence profile as the model’s parameters (e.g., user’s community membership, user’s interest, user’s topical influence), and we enable inference about the diffusion probability between users. Finally, we allow applications such as diffusion prediction.

4.2 Probabilistic model design

To model the social influence dynamic, we decided to use a probabilistic Bayesian approach. Bayesian networks are graphical models capable of displaying relationships clearly and intuitively. Further, they handle uncertainty through the established theory of probability. Additionally, they are directional, thus being capable of representing cause-effect relationships. Not only that, they can encode all variables making it possible to handle missing data entries successfully, and they facilitate the use of prior knowledge that we already have about the phenomenon. Besides, the use of Bayesian networks allows us to think of the model in terms of a “generative story” that tells how social influence is created.

Although recent successful attempts inspire some of its building blocks in modeling the dynamics of social networks, including FLDA (Bi et al. 2014), COLD (Hu et al. 2015), and CPD (Cai et al. 2017), PRIN significantly goes beyond those on more comprehensive input features and powerful modeling ability. The social influence process has three main components.

  • User component Users in a social network usually have multiple community memberships. We associate each user u with a community membership vector \(\pi _u\). We consider a user u to publish a message \(m_{u,i}\) of topic z, due to her community assignment \(c_{u,i}\) and the community content profile \(\theta _{c_{ui}}\). We assign each post to one community \(c_{u,i}\) denoting the community membership of user i when he writes a message.

  • Content component Each message \(m_{u,i}\) shared by a user u contains a bag of words \({w_{u,i,1}, \ldots , w_{u,i,l}}\), where \(l = \mid m_{u,i}\mid\) denotes the length of the message. In traditional topic models such as Latent Dirichlet Allocation (LDA) (Nallapati et al. 2008), a document is associated with a mixture of topics, and each word has a topic label. This is reasonable for long documents such as academic paper. However, on social media like microblog, a message is usually short and thus is most likely to be about a single topic (Diao et al. 2012). We therefore associate each \(m_{u,i}\) to a single latent topic variable \(z_{u,i}\) drawn from \(\theta _{c_{u,i}}\) to indicate its topic. The words are then generated from the corresponding word distribution \(\omega _{z_{u,i}}\).

  • Network component Every user has a unique preference to form a structural link with others based on content or non-content reason (Bi et al. 2014). For example, A follows B because they are similar, and they both tweet about related topics (content reason), and sometimes A follows C because C is a very famous person (non-content reason). The Bernoulli distribution \(\mu _u\) characterizes this per user preference. As a result, for the \(s_{u,v}\), we first consult \(\mu _u\) to decide on the value of the binary variable \(y_{u,v}\)

    • \(y_{u,v} = 1\) indicates that the link creation is based on the user’s content. In this case, we consider that a user u form a structural link with user v, due to their similarity \(\rho _{u,v}\).

    • \(y_{u,v} = 0\) indicates that the link creation has nothing to do with the content. In this case, we use \(\chi _v\) to capture this probability.

Finally, we consider the influence of user v on user u as the probability to form a diffusion link \(d_{m_{u,i},m_{v,j}}\) where time for user message \(m_{u,i}\) is greater than message \(m_{v,j}\), \(t_{u,i} > t_{v,j}\). This is drawn from the users’ communities membership \(\pi\), theirs communities’ topical interest \(\theta\) and their structural link probability \(s_{u,v}\). The graphical model for PRIN is depicted in Fig. 2, while its generative process is summarized in Algorithm 1.

Fig. 2
figure 2

Plate notation of PRIN model. Network-aware component in pink color, user-aware component in green and content-aware component in blue (Color figure online)

figure a

4.3 Inference task

In the inference task, we aim to infer the latent variables from the observed variables. Therefore, we use collapsed Gibbs sampling for such inference (Guille et al. 2013). Gibbs sampling is an algorithm to approximate the joint distribution of multiple variables by drawing a sequence of samples, which iteratively updates each latent variable under the condition of fixing the remaining variables. Here, we describe the inference algorithm for PRIN based on collapsed Gibbs sampling.

Given a social network \({\mathcal {G}} = ({\mathcal {U}}, {\mathcal {M}}, {\mathcal {E}})\) and the pre-defined hyper parameters \(\alpha , \beta , \gamma , \delta , \epsilon\), PRIN specifies the following full posterior distribution:

$$\begin{aligned}&P(\omega , \psi , \pi , \rho , \theta , \mu , \chi , w, z, c,s, y,\nonumber \\&\quad d \mid {\mathcal {U}}, {\mathcal {M}}, {\mathcal {E}}, \alpha , \beta , \gamma , \delta , \epsilon ). \end{aligned}$$
(2)

The joint distribution can be decomposed into a product of several factors:

$$\begin{aligned}&P(\omega , \psi , \pi , \rho , \theta , \mu , \chi , w, z, c, s, y, d \mid {\mathcal {U}}, {\mathcal {M}}, {\mathcal {E}}, \alpha , \beta , \gamma , \delta , \epsilon ) \nonumber \\&\quad \propto \big ( P(\omega \mid \beta ) P(\pi \mid \gamma ) P (\theta \mid \alpha ) P(\mu \mid \delta ) P(\psi \mid \epsilon )&\nonumber \\&\qquad P(w \mid \omega , z) P(z \mid \theta , c) P(y \mid \mu ) P(c \mid \pi ) \end{aligned}$$
(3)
$$\begin{aligned}&\qquad P(d \mid z, c, s) P(s \mid \rho , \chi , y) P(t \mid psi, c, z) P(\rho ), P(\chi )\big ). \end{aligned}$$
(4)

The task of posterior inference for PRIN is to determine the probability distribution of the hidden variables zcsd given the observed words, timestamps, and network. However, exact inference is intractable due to the difficulty of calculating the normalizing constant in the posterior distribution. We use collapsed Gibbs sampling for approximate inference. Because the model uses only conjugate priors (Doucet et al. 2000), we can reduce the number of parameters in the model by integrating out the multinomial distribution \(\varphi = {\mu , \pi , \theta , \omega , \psi }\). This has the effect of taking all possible values of \(\varphi\) into account in our sampler, without representing it as a variable explicitly and having to sample it at every iteration:

$$\begin{aligned} \begin{aligned}&P(\omega , \psi , \pi , \rho , \theta , \mu , \chi , d, w, z, c,s,\\&\quad y \mid {\mathcal {U}}, {\mathcal {M}}, {\mathcal {S}}, {\mathcal {D}}, \alpha , \beta , \gamma , \epsilon , \sigma ) \\&\quad \propto \int _{\mu } P(y \mid \mu ) P(\mu \mid \sigma ) \mathrm{d}\mu \\&\qquad \int _{\omega } P(w \mid \omega , z) P(\omega \mid \beta ) \mathrm{d}\omega \\&\qquad \int _{\pi } P(c \mid \pi ) P(\pi \mid \gamma ) \mathrm{d} \pi \\&\qquad \int _{\theta } P (\theta \mid \alpha ) P(z \mid \theta , c) \mathrm{d} \theta \\&\qquad \int _{\psi } P (\psi \mid \epsilon ) P(t \mid \psi , c, z) \mathrm{d} \psi \\&\qquad P(d \mid z, c, s) P(s \mid \rho , \chi , y) P(\rho ), P(\chi ). \end{aligned} \end{aligned}$$
(5)

We are going to focus on \(\int _{\pi } P(c_{u,n} \mid \pi _u) P(\pi _u \mid \gamma _u) \mathrm{d} \pi\), this apply for \(i \in \varphi\):

$$\begin{aligned} \begin{aligned}&\int _{\pi } P(c_{u,n} \mid \pi _u) P(\pi _u \mid \gamma _u) \mathrm{d} \pi \\&\quad = \int _{\pi } \frac{\Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u)}{\prod ^{{\mathcal {C}}}_{i=1} \Gamma (\gamma _u)} \prod ^{{\mathcal {C}}}_{i=1} \pi _{u,c}^{\gamma _u - 1 + N_i} \mathrm{d} \pi \\&\quad = \frac{\Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u)}{\prod ^{{\mathcal {C}}}_{i=1} \Gamma (\gamma _u)} \int _{\pi } \prod ^{{\mathcal {C}}}_{i=1} \pi _{u,c}^{\gamma _u - 1 +N_i} \mathrm{d} \pi \\&\quad = \frac{\Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u)}{\prod ^{{\mathcal {C}}}_{i=1} \Gamma (\gamma _u)} \frac{\prod ^{{\mathcal {C}}}_{i=1} \Gamma (\gamma _u + N_i)}{\Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u + N_i)}. \end{aligned} \end{aligned}$$
(6)

The definition of a Gibbs sampler specifies that in each iteration, we assign a new value to variable \(c_{u,j}\) by sampling from the conditional distribution. Intuitively, at the start of an iterations t, we have the collection of all current information at this point in the sampling process. When we want to sample the new value of \(c_{u,j}\), we temporarily remove all information about this community from the collection of information \(c_{-u,j}\).

$$\begin{aligned} \begin{aligned}&P(c_{u,j} =c \mid c_{-u,j}, \gamma _u)= \frac{P(c_{u,j}, c_{-u,j}, \gamma _u)}{P(c_{-u,j}, \gamma _u)} \\&\quad = \frac{ \frac{\Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u)}{\prod ^{{\mathcal {C}}}_{i=1} \Gamma (\gamma _u)} \frac{\prod ^{{\mathcal {C}}}_{i=1} \Gamma (\gamma _u + N_i)}{\Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u + N_i)} }{\frac{\Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u)}{\prod ^{{\mathcal {C}}}_{i=1} \Gamma (\gamma _u)} \frac{\prod ^{{\mathcal {C}}}_{i=1} \Gamma (\gamma _u + N_{i, -j})}{\Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u + N_{i,-j})}}\\&\quad =\frac{\prod ^{{\mathcal {C}}}_{i=1} \Gamma (\gamma _u + N_i) \Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u + N_{i,-j})}{\prod ^{{\mathcal {C}}}_{i=1} \Gamma (\gamma _u + N_{i,-j}) \Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u + N_i)}\\&\quad = \frac{\Gamma (\gamma _u + N^{(c_u,j)}) \Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u + N_{i,-j}^{(c)})}{\Gamma (\gamma _u + N_{i,-j}^{(c_{u,j})}) \Gamma (\sum ^{{\mathcal {C}}}_{i=1} \gamma _u + N_{i}^{(c)})}\\&\quad = \frac{\gamma _u + N_{i,-j}^{(c_{u,j})}}{\sum ^{{\mathcal {C}}}_{i=1} \gamma _u + N_{i,-j}}, \end{aligned} \end{aligned}$$
(7)

where \(N_{i,-j}\) is the number message assigned to community i excluding message j.

The same process apply for every element in \(\varphi\). After sufficient number of sampling iterations. We list the update equations for each variable as below:

$$\begin{aligned}&P(c_{u,i} = c \mid z_{u,i} = k, t_{u,i} =t, c_{-u,i}, z_{-u,i}, t_{-u,i},.) \nonumber \\&\quad \propto \frac{N_u^{(c)} + \gamma }{N_u^{(.)} + \mid C \mid \gamma } \cdot \frac{N_c^{(k)} + \alpha }{N_c^{(.)} + \mid K \mid \alpha } \cdot \nonumber \\&\qquad \frac{N_{c,k}^{(t)} + \epsilon }{N_{c,k}^{(.)} + \mid T \mid \epsilon } \end{aligned}$$
(8)
$$\begin{aligned}&P(z_{u,i} = k \mid c_{u,i} = c, t_{u,i} =t, c_{-u,i}, z_{-u,i}, t_{-u,i}, w, .) \nonumber \\&\quad \propto \frac{N_c^{(k)} + \alpha }{N_c^{(.)} + \mid K \mid \alpha } \cdot \frac{N_{c,k}^{(t)} + \epsilon }{N_{c,k}^{(.)} + \mid T \mid \epsilon } \cdot \nonumber \\&\qquad \frac{\prod _{v = 1}^{M} \prod _{q=0}^{N_{u,i}^{(v)}-1} (N_k^{(v)} + q + \beta )}{ \prod _{q=0}^{N_{u,i}^{(.)}-1} (N_k^{(.)} + q + \mid M \mid \beta )} \end{aligned}$$
(9)
$$\begin{aligned}&P(d_{m_{u,i},m_{v,j}} = 1,u,v \mid z_{u,i} = k, z_{v,j} = k, c_{u,i} = c, \nonumber \\&\quad c_{v,j} = c, s_u,v = s, c_{-u,i}, z_{-u,i}, c_{-v,j}, z_{-v,j}, .) \nonumber \\&\quad \propto \frac{N_c^{(k)} + \alpha }{N_c^{(.)} + \mid K \mid \alpha } \cdot \frac{N_u^{(c)} + \alpha }{N_u^{(.)} + \mid C \mid \gamma } \cdot \nonumber \\&\qquad \frac{N_v^{(c)} + \alpha }{N_v^{(.)} + \mid C \mid \gamma } \cdot \frac{N_{u,v}^{(s)}}{N_{u,v}^{(.)} + \mid C \mid \gamma } \cdot \end{aligned}$$
(10)

Furthermore, we can estimate the user–user topical influence strength. Suppose \(\tau _{k,t}(u,v)\) represent the influence strength from user u to user v on topic k, which satisfy that \(\tau _{t}(u,v) = \sum _{k=1}^{\mid K\mid } \tau _{k,t}(u,v)\). Thus, user–user topical influence strength can be estimated by

$$\begin{aligned} \tau _{k,t}(u,v)= & {} \sum _{m_{u,i} = 1}^{M_{u,t}} P(d_{m_{u,i},m_{v,j}} = 1,u,v \mid z_{u,i} \nonumber \\=& k, z_{v,j} = k, c_{u,i} = c, c_{v,j} = c,\nonumber \\&\quad s_u,v = s, c_{-u,i}, z_{-u,i}, c_{-v,j}, z_{-v,j}, .) \end{aligned}$$
(11)

where \(N_u^{(c)}\) denotes the number of messages of user u assigned to community c; \(N_c^{(k)}\) is the number of posts assigned to community c and generated by topic k; \(N_{c,k}^{(t)}\) denotes the number of times that time stamp t is generated by community c and topic k; \(N_{u,i}^{(v)}\) is the number of times word v occurs in the message \(m_{u,i}\); \(N_k^{(v)}\) denotes the number of times words v is assigned to topic k. Marginal counts are represented with dots, e.g., \(N_{c,k}^{(.)}\) denotes the total number of time stamp generated by community c and topic k.

4.4 Time complexity

We now analyze the time complexity of the inference algorithm. In each iteration, the latent variables are sampled. Updates are performed by iterative sampling each latent variable from its conditional distribution, given the current values of the other variables. Since all the counters (e.g., \(N_c^{(k)}\)) involved in the previous equations can be caught and updated in constant time for each \(c_{u,i}\) being sampled. Thus, sampling all c takes linear time w.r.t. the number of messages. Then, sampling all z is linear in the number of words. Overall, the inference algorithm takes linear time in the amount of data.

5 PRIN architecture

Fig. 3
figure 3

Architecture of the social influence analysis proposed

We propose a component-based architecture for the SIA performed. Figure 3 contains the diagram depicting the architecture of the system. We apply the discovered influence on user behavior prediction. The system consists of three main components:

  • Data collection This module is in charge of compiling the user’s data from a social networking service for getting the input to our system. The data depends on the social network such as structural information (friendship links in case of Facebook, following links in case of Twitter or Instagram, connections between users in case of LinkedIn), diffusion information (posts, tweets, messages). It continuously crawls a social network. Then, it stores the information gathered by the crawler in a Graph Database (Miller 2013).

  • Machine learning pipeline This module generates a Bayesian network that describes the social influence process inside a social network. The stages are described below:

    1. 1.

      Data ingestion The social network data is obtained and imported from the Graph Database, and this is called raw data.

    2. 2.

      Data preparation A series of transformations in the raw data is performed to obtain it in the correct form. The transformations involve filling missing values or removing duplicate records or normalizing. This is where the feature extraction, construction, and selection takes place too to get the data model.

    3. 3.

      Code the model The structure of the Bayesian network is built from a social network. It takes as input the number of users, number of topics, number of communities.

    4. 4.

      Train, evaluate, and tune the model The data model is split into subsets of data to train the model, test it, and further validate how it performs against new data. The train set is used to calculate the parameters (probability distributions of the random variables). Then, the performance of the model is evaluated using the test and validation subsets of data to understand how accurate the prediction is. This is an iterative process until the model is calibrated.

    5. 5.

      Model deployment Once the chosen model is produced, it is stored and embedded in decision-making frameworks as a part of an analytics solution.

    6. 6.

      Manage the model and version The model is continuously monitored to observe how it behaved and calibrated accordingly to new data.

  • User interface The user interface is composed of an inference algorithm that uses the model to answer queries and a visualization tool for the results.

6 Experimental evaluation

We performed extensive experiments with real-world data sets to quantitatively evaluate the performance of PRIN. The empirical evaluation is divided into multiple stages. First, we evaluated the PRIN’s content component capability to extract topics. Moreover, we tested the PRIN’s user component effectiveness for community detection. Then, we evaluated the PRIN’s network component, where diffusion-links prediction performance was reported. At last, we performed a parameter analysis over the whole system for diffusion-links predicting.

6.1 Evaluation metrics

We used case studies to demonstrate further the effectiveness of our proposed model in real social networks using the following performance metrics.

  • Log-Likelihood The log-likelihood is, as the term suggests, the natural logarithm of the likelihood. The likelihood measures the probability of the observed data, given the model, i.e., how well a model fits the observed data. The highest the likelihood, the better the model for the given data. For many applications, the log-likelihood is more convenient to work with (Koller et al. 2009). This is because we are generally interested in where the likelihood reaches its maximum value: the logarithm is a strictly increasing function, so the logarithm of a function achieves its maximum value at the same points as the function itself, and hence the log-likelihood can be used in place of the likelihood in maximum likelihood estimation and related techniques.

  • Perplexity It is also a measure of model quality, and Natural Language Processing is often used as “perplexity per number of words.” It describes how well a model predicts a sample, i.e., how much it is “perplexed” by a sample from the observed data. A lower perplexity score indicates better generalization performance. The perplexity of a discrete probability distribution (Brown et al. 1992) is calculated as follows:

    $$\begin{aligned} 2^{H(P)} = 2^{\sum _x P(x) \log _2 P(x)}, \end{aligned}$$
    (12)

    where H(P) is the entropy of the distribution P(x), and x is a random variable over all possible events.

  • AUC (Area Under the Curve) ROC (Receiver Operating Characteristics) curve AUC–ROC curve is a performance measurement for the classification problem at various threshold settings. ROC is a probability curve, and AUC represents the degree or measure of separability. It tells how much model is capable of distinguishing between classes. The curve is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR). The area under the curve is used to give a score to the model. If the area under the curve is 0.5, then the TPR is equal to the FPR, and the model is doing no better than random guessing.

Generally, a model with higher log-likelihood and lower perplexity is considered to be good. Then, a rough guide for evaluating a model with the AUC value is the traditional academic point system:

  • .90–1 means the model is Excellent.

  • .80–.90 means the model is Good.

  • .70–.80 means the model is Fair.

  • .60–.70 means the model is Poor.

  • .50–.60 means the model Fails.

6.2 Baseline algorithms

We compared the results of our approach with previous work to evaluate PRIN. We choose baseline based on the following guidelines (1) They are state of the art to model heterogeneous networks; (2) They model diffusion prediction at the task level. Finally, we selected the baselines below and list our differences with them in Table 2.

  • Poisson Mixed-Topic Link Model (PMTLM) proposed by Zhu et al. (2013) defines a generative process for both text and links between users.

  • Followship-LDA (FLDA) proposed by Zhu et al. (2013) integrates both content topic discovery and social influence analysis in the same generative process. FLDA is a Bayesian generative model that extends Latent Dirichlet Allocation (LDA).

  • COmmunity Level Diffusion Model (COLD) proposed by Hu et al. (2015) probabilistic generative model capturing influence between communities at different topics.

  • Community Profiling and Detection (CPD) proposed by Cai et al. (2017) offers a generative model to formalize the concept of community profiling. They characterize a community in terms of its member users and both its internal content profile and external diffusion profile.

  • Deepinf proposed by Qiu et al. (2018) deep neural network used for predicting social influence.

  • Support Vector Machine (SVM). We use the support vector machine with a linear kernel as a classifications model.

  • Logistic Regression (LR). We also use the logistic regression (LR) to train a classification model.

Table 2 Differences with baselines

6.3 Case study: Content component for topic modeling from Amazon reviews

The goal of this case study is to evaluate the performance of the PRIN content component for topic modeling with labeled data. We selected log-likelihood and perplexity as evaluation metrics.

6.3.1 Methodology

  • Data set For evaluation of the PRIN’s content component, we used Amazon products reviews data set. The data set is a list of 34,660 consumer reviews for Amazon products like the Kindle, Fire TV Stick, and more provided by Datafiniti’s Product DatabaseFootnote 1. The data set includes basic product information, rating, review text, and more for each product. We decided to use this data set because it is labeled meaning by product; therefore, we can use this information for comparing our results. Besides, it is composed of short text like the ones that are shared in most social networks (Diao et al. 2012).

  • Data set preparation For the aim of this study we extracted the following features: categories and reviews.text. Then, we only keep reviews with less than 200 words. The remained reviews followed these characteristics:

    • The average number of words in a review is 14.64.

    • The minimum number of words in a review is 5.

    • The maximum number of words in a review is 80.

    Figure 4 presents the word cloud of the corpus, and Fig. 5 shows the review distribution per category (38 total categories). It can be observed that there is an imbalanced problem in the data set; for solving this problem, we use the oversampling technique. Table 3 shows the name of the categories presented in Fig. 5. We processed the above data set following the practice in existing work for cleaning the data (Vega and Mendez-Vazquez 2016). However, we use a Bag of word (Goldberg 2017) approach for feature representation.

  • Experimental setup Table 4 shows the values of the used parameters; batch size is the number of reviews to be used in each training chunk, alpha determines how often the model parameters should be updated, and epochs are the total number of training passes. After all, we were interested in determining what topic a given review is about, and we assigned this topic base on the highest percentage contribution of the topic in that review.

Fig. 4
figure 4

Word cloud of the Amazon review corpus

Fig. 5
figure 5

Amazon reviews per categories

Table 3 Amazon reviews categories
Table 4 Parameter setting for Case study 6.3

6.3.2 Results

Figure 6 shows the log-likelihood value under varying number of topics, and Fig. 7 shows the perplexity values under varying number of topics. Therefore, it can be observed that with \(K=15\), the model has a balance between perplexity and log-likelihood.

Fig. 6
figure 6

Log-likelihood scores of PRIN’s content component (the high, the better) from Case study 6.3

Fig. 7
figure 7

Perplexity scores of PRIN’s content component (the less, the better) from Case study 6.3

6.4 Case study: Content component for topic modeling from Twitter

The goal of this case study is to evaluate the PRIN content component performance for topic modeling in Tweets against other algorithms. However, in this case study, we do not have label data. Therefore, we evaluated the performance qualitatively following the practice of Chang et al. (2009) and quantitative using perplexity as a metric.

6.4.1 Methodology

  • Data set The data set was crawled from Twitter. We started by choosing top popular users in specific topics such as Katty Perry, Donald Trump, Bill Gates, and Cristiano Ronaldo. Using these users as seed users, we crawled a network with about 3661 active users (Huberman et al. 2008). We extracted all the information posted by them from 12-14-2018 to 12-18-2018, which gave rise to 44209 following relationships, 56918 tweets, and 24649 retweets.

  • Data set preparation For the aim of this study, we used only the users’ tweets. Then, we processed the data set following the practice in existing work for cleaning the data (Vega and Mendez-Vazquez 2016). However, we use a Bag of words (Goldberg 2017) approach for feature representation.

  • Experimental setup We used the same parameter setting as in Case study 6.3 presented in Table 4.

6.4.2 Results

We reported the PRIN’s content component capacity in extracting latent topics in the Twitter data set. For qualitative evaluation, we presented four example word clouds in Fig. 8.

Fig. 8
figure 8

Word clouds of extracted topics

For quantitative evaluation, Fig. 9 shows the perplexity values under varying number of topics. It revealed that PRIN (\(K = 100\)) has the lowest perplexity, indicating the best topic discovery performance among all the competitors. Perplexity scores for COLD and PRIN were close, and both significantly outperform PMTLM.

Fig. 9
figure 9

Perplexity scores from Case study 6.4

6.5 Case of study: User component for community detection in Twitter

The goal of this case study is to evaluate the performance of the PRIN user component for community detection. For quantitative evaluation, due to the lack of ground truth for communities inside the network, we use link prediction, a widely used quantitative measurement in the mixed-membership community setting without community labels (Biswas and Biswas 2017). Link prediction is defined to estimate the probability of a link between two users. Moreover, as there is not a predefined threshold for link prediction, we used AUC as the evaluation metric.

6.5.1 Methodology

  • Data set The Higgs Boson Twitter data set (De Domenico et al. 2013) was built after monitoring the spreading processes on Twitter before, during, and after the announcement of the discovery of a new particle with the features of the elusive Higgs Boson on July 4th, 2012. The data set contains three types of social interaction mentioning, retweeting, replying to existing “Higgs” related tweets and friends/followers social relationships among users involved in the above activities. The data set statistics are presented in Table 5. It is worth remarking that the user IDs have been anonymized, and the same user ID is used for all networks.

  • Data set preparation For the aim of this study, we used only the followers’ social relationships among users. We processed the data set as follows: for every pair of users, we generated a positive instance if there is a follower relationship; otherwise, we created a negative instance. Our target is to distinguish positive instances from negative ones.

  • Experimental setup Table 6 shows the values of the used parameters.

Table 5 Data set statistics of Higgs Twitter Data set
Table 6 Parameter setting for Case study 6.5

6.5.2 Results

Figure 10 shows the AUC values for the models. PRIN (\(C= 100\), \(K= 100\)) outperformed all other methods. Moreover, COLD and PRIN were significantly better than PMTLM.

Fig. 10
figure 10

Structural link prediction performance

6.6 Case study: Network component for influence prediction in Twitter

The goal of this case study is to evaluate the PRIN network performance for diffusion link prediction (retweets) on Twitter. For quantitative evaluation, we selected AUC as an evaluation metric, since there is not a predefined threshold for link prediction.

6.6.1 Methodology

  • Data set We used the Hibbs Boson Twitter network detailed in Case study 6.5.

  • Data set preparation We processed the data set following the practice in existing work (Zhang et al. 2013, 2015). Furthermore, for a user v who was influenced to perform a social action a at some timestamp t, we generate a positive instance. Moreover, if a user v was never observed to be active in our observation window, we create a negative instance. For instance, analyzing observation windows of 5 consecutive days, for a tweet a generated at timestamp t by a user v, we analyze if it is a retweet from another tweet. In this case, we generate a positive instance; otherwise, it will be considered as a negative instance in that observation window. Our target is to distinguish positive instances from negative ones in the observation window.

  • Experimental setup We performed nested cross-validation over a windows time per user in the sampled social network of 100 users. The window size used for the experiments was of 5 days. The historical validation consists of the following:

    1. 1.

      Extract all the timeline information (observed variables) from the set of users: message, structural links, and diffusion links.

    2. 2.

      Split the timeline into K “folds” (sets) of the size according to the window size: the most recent information will be used as a test set, and the oldest will be used as a train set.

    3. 3.

      Calculate the CPDs using the train set for the K-fold.

    4. 4.

      Calculate the predictions from the test set, removing the feature of diffusion and set the rest of the variables as evidence.

    5. 5.

      Compare the prediction with the real labels calculating the AUC.

    6. 6.

      Repeat Step 3 to 5 for every fold, merging the previous train set K with the new one \(K+1\).

    7. 7.

      Get the average AUC for all the folds.

6.6.2 Results

We compared the prediction performance of all methods across the Higgs Boson data set; Fig. 11 presents the results. PRIN achieved significantly better performance over baselines in terms of AUC, demonstrating the effectiveness of our model.

Fig. 11
figure 11

Diffusion link prediction performance from Case study 6.6

6.7 Case of study: Parameter analysis

PRIN model is mostly affected by two parameters, i.e., a number of communities C and a number of topics K. We investigated how the prediction performance varies with these parameters.

6.7.1 Methodology

  • Data set We used same Twitter data set as in Case study 6.4.

  • Data set preparation We applied the same data preparation as in Case study 6.6.

  • Experimental setup We used the same parameter values as in Case study 6.4 presented in Table 4.

6.7.2 Results

Figure 12 shows the AUC values of diffusion link prediction under different settings, such as a number of topics and a number of communities.

Fig. 12
figure 12

The impact of model parameters C and K in the task of diffusion link prediction

7 Discussion

This research was focused on the social influence’s characterization problem. We proposed a probabilistic model called PRIN. PRIN was designed to reveal the rich spectrum of the online social influence process. Then, we validated each component with case studies. We discussed their main findings in this section.

In Case study 6.3, we evaluated content component with a labeled data set. We were expected the best performed with 38 numbers of topics, due to this value was the real number of categories labeled by Amazon. However, we discovered that the model performed well with 15 topics, with a low perplexity and high log-likelihood. This is due to the fact that some of the categories overlap each other, as follows:

  • eBook Readers, Kindle E-readers, Computers and Tablets, E-Readers and Accessories, E-Readers.

  • Electronics, eBook Readers and Accessories, Covers, Kindle Store, Amazon Device Accessories, Kindle E-Reader Accessories, Kindle (5th Generation) Accessories, Kindle (5th Generation) Covers.

  • Electronics, iPad and Tablets, All Tablets, Computers/Tablets, and Networking, Tablets and eBook Readers, Computers and Tablets, E-Readers and Accessories, E-Readers, Used:Computers Accessories, Used:Tablets, Computers, iPads Tablets, Kindle E-readers, Electronics Features.

  • Electronics, iPad and Tablets, All Tablets, Computers and Tablets, Tablets, eBook Readers.

Moreover, the results of the Content component’s performance in Case study 6.4 were the lowest perplexity, against the other algorithms for topic modeling in Tweets. This means that the component is doing well compared with the state-of-the-art algorithms. Furthermore, Case study 6.5 presented the AUC values for structural link prediction. PRIN (\(C= 100\), \(K= 100\)) outperformed the others algorithms with a sample of 100 users. This can be attributed to their lack of representation of the user’s profile. Besides, as shown in Fig. 11 from Case study 6.6, our model consistently outperforms all the baselines, thanks to our modeling various diffusion factors and heterogeneous user links, in contrast with the baselines in Table 2. Then, in Case study 6.7 is observed that the model is highly dependent on the values of the K and C parameters. They together exert influence on diffusion prediction accuracy. The performance is stable under a broad range of parameter settings, indicating little tuning is required in actual deployment.

Generally, the quality of the predictions depends on two things: the degree to which the original model accurately reflects real-world situations, and the amount of data you provide. One limitation of our research was the sample size, which was too small for large-scale social networks. However, from the results of those limited number of users, it seems that the model can reflect the social influence process with high values of AUC.

Moreover, our research only focuses on one type of social networks such as TwitterFootnote 2, the results can be generalized to any other social networks such as DiggsFootnote 3, RedditFootnote 4. On Twitter, the user action is defined as whether a user posts/re-tweets a blog on a specific topic. On Digg and Reddit, the action is defined as whether a user submits/votes a story on a topic.

8 Conclusion and future work

In this paper, we study the social influence modeling problem. We introduce a Probabilistic Reasoning System for Social Influence Analysis (PRIN) to describe the problem using a graphical probabilistic model. The model leverages both heterogeneous link information, time, and content associated with each user in the network to mine topic-level influence strength in online social networks.

The extensive experimental results demonstrate the effectiveness of our model. PRIN achieved the best performance in the tasks of link prediction and text perplexity among several competitors.

The existing models fail to give a complete view of social influence interaction dynamics. The proposed model provides a visual and more in-depth understanding of user interactions, which can intuitively capture the process that drives social influence.

For future work, it is known that users’ behaviors are distributed in different networks, we are interested in merging the information from various social networks and leverage the correlation between them to better performance of the influence learning. Then, since the proposed model is generative, this means that arbitrary queries can be answered, not only about the probability of a diffusion. It will be interesting to explore other questions such as community memberships and interest detection in a topic. Another interesting issue is to employ semisupervised learning to incorporate user feedback into our approach. Then, the content component could be replaced for a supervised algorithm which could improve the accuracy of the predictions. Besides, since social networks are growing in size, it will be interesting to explore a distributed implementation of the proposed architecture to handle large-scale social networks.