Keywords

1 Introduction

The recent rapid increasing popularity of online social networks (OSN) and new communication technologies availability (smartphone, tablet, etc.) provided the world population with a great facility to share and to communicate with others throughout the whole world and to share and exchange information, opinions, products, ideas, and services. The rich content of OSN that can be in diverse formats (text, image, video, audio, etc.) has been attracting researchers to study and analyze large-scale social structures and users’ behaviors in order to – among other purposes – understand the flow of information through social networks. A lot of real-world applications like public opinion monitoring, recommender systems, and political campaigns often make use of OSNs for influence diffusion (Fu et al. 2016; Can and Alatas 2019; Saxena and Saxena 2020). Many existing models for influence diffusion have been proposed for various applications. In this chapter, we look into influence diffusion in the setting of viral marketing.

In essence, viral marketing is an efficient solution to advertisement for commercial companies through OSN where companies try to promote their products and services through word-of-mouth propagation among friends or followers. One of the fundamental objective of viral marketing is to find a set of users with the maximum influence in the network where the output is called K-seed set users with K as the optimal number of users chosen for influencing other users in the network.

The goals of this chapter are to highlight some of the most used and most recent work that has been done in social influence diffusion and influence maximization model for viral marketing and to call attention to the topic of deep learning for social influence. To the best of our knowledge, this work is a literature review of influence analysis for viral marketing in online social networks. Firstly, we give a better understanding of the preliminary knowledge concerning social influence analysis, and we illustrate the categories of relevant research works on influence analysis in the context of viral marketing. After that, we categorize and compare a number of relevant research works on influence maximization algorithms in social networks and the identification of influential spreaders.

In the following section, we describe the main concepts that will be addressed throughout the chapter, namely, online social networks, social network analysis, social influence analysis, viral marketing, and the definition problem of influence maximization. Section 3 presents related work to diffusion influence modeling, identification of influential spreaders, and influence maximization. Section 4 exhibits the current methodology using deep learning for influence spreading modeling. Finally, we conclude the chapter in Sect. 5.

2 Background

In this section, we begin with an overview of the basic terminology.

2.1 Online Social Networks

OSNs can be defined within the context of systems, but in general, they can be defined as a network of interactions or relationships, where nodes consist of actors or persons and the links consist of the relationships or the interactions between the actors through the network. According to (Kemi 2016), an OSN alternatively referred to as a virtual community or profile site, a social network is a website on the Internet that brings people together in a central location to talk, share ideas and interests, or make new friends. With the emergence of the World Wide Web (WWW), OSNs have dramatically expanded in popularity around the world. For further details about online social networks and their features, we recommend readers to check (Fu et al. 2016; Boyd and Ellison 2007).

2.2 Social Network Analysis

Social network analysis (SNA) or social network mining is the research of social relations between nodes or people. The study of social network mining technologies focuses on the level of individuals, groups, organizations, and whole networks. It is the drawing and determining associations and drifts among users and other interlinked information entities using networks and graph theory (Otte and Rousseau 2002). Many research applications benefit from SNA techniques such as community or group detection (Cai et al. 2016; Dabaghi-Zarandi and Rafsanjani 2019), expert finding (Yuan et al. 2020), link prediction (Daud et al. 2020), recommender systems (Pourhojjati-Sabet and Rabiee 2020), predicting trust and distrust among individuals (Towhidi et al. 2020; Girdhar et al. 2019), influence propagation (Ortiz-Gaona et al. 2020; Abd Al-Azim et al. 2020; Singh et al. 2019; Gulati and Eirinaki 2018), etc.

2.3 Social Influence Analysis

According to (Sun and Tang 2011), influence is usually reflected in changes in social action patterns (i.e., user behavior) in a social network. Typically, it refers to the phenomenon that an individual’s emotions, opinions, or behaviors are affected by others (Qiu et al. 2018). This means that people are influenced by their social circle and tend to imitate the actions of those in their immediate vicinity. As a result of its wide range of real-world applications, social influence analysis has received considerable attention in the past (Peng et al. 2016) such as domain expert finding (Al-Taie et al. 2018), personal recommendation (Cheng et al. 2016), emotion prediction (Qiyao et al. 2016), and viral marketing (Talukder et al. 2017; Bhattacharya et al. 2019; Menta and Singh 2017) for which it became an important strategy.

2.4 Viral Marketing

Viral marketing (VM) is one of the various real-world applications of social influence analysis. According to (Wang and Street 2018), VM is a process of influence diffusion over social networks. It is a relatively recent solution to advertisement within online social networks. It has been applied to business-to-consumer transactions.

Using a social network to spread the word about a product or service is a form of viral marketing. In other words, it employs customers in a market to promote a product (McKay et al. 2019). For example, if a company has a certain number of new products, they could hand them out to a customer, and then the influence model maximization can predict optimal objective to get these products in order to spread the product’s influence over a specific network. There is a dearth of new diffusion methods in the literature, particularly for dynamic and massive networks. Additionally, it provides information on the various mining techniques that can be used for viral marketing.

On the other hand, the goal of viral marketing is to minimize marketing cost while maximizing the profit. The main idea of viral marketing is to find a set of customers for giving free samples within the budget B to maximize the expected total sales of the product, in other words use the K-seed set users for influencing other users in the network. Because the majority of the promotional work is done by customers, this type of “word-of-mouth” advertising can be far more cost-effective than more traditional ones. Friends’ recommendations are more trustworthy than those made by a company selling the product (Richardson and Domingos 2002). Figure 1 shows the viral marketing process.

Fig. 1
figure 1

Viral marketing via social network process

2.5 Influence Maximization Problem

A social network is depicted as a graph G (V, E) with a set of nodes V that represents individuals and a set of edges E that represents the relationship shared among the nodes in the graph. The influence maximization problem takes as input a graph G (V, E). The goal of this problem is to identify a subset of users in graph G who have the greatest amount of influence. An initial influence on this problem should yield the maximum number of nodes in the graph G that are influenced by a K-sized seed set, which is the solution to this problem’s problem (Du et al. 2019). Figure 2 shows the input and output for influence maximization problem. The important objective of the problem of influence maximization is to find a set of users with that maximum influence in a graph.

Fig. 2
figure 2

Input and output of influence maximization problem

3 State of the Art

In this section, we review the most important and most cited existing approaches for influence diffusion, influence maximization, and identification of influential users.

3.1 Influence Diffusion Model

Influence diffusion has become an important technique for viral marketing. The Oxford Dictionary defines diffusion as “the spread of something.” In social network analysis, diffusion is the process of information diffusion via the network. The majority of current research in SNA focuses on information and influence diffusion in online social networks (Guille et al. 2013; Arnaboldi et al. 2014; Xu et al. 2014; Sun et al. 2019; Dhamal et al. 2016; Gaeta 2018; Kong et al. 2020). According to (AlSuwaidan and Ykhlef 2016), diffusion models were originally used in social networks to simulate the process of information and influence propagation in the network. Many models and algorithms for influence diffusion have been proposed (More and Lingam 2019; Toalombo et al. 2020; Li and Liu 2019; Pan et al. 2020). In these models, each node is either active or inactive over iterations. An inactive node becomes active as more of its neighbors became active. In (Richardson and Domingos 2002), the authors provide the first algorithmic treatment to deal with the influence propagation problem. They built probabilistic models and used these models to choose the best viral marketing plan. Then the authors in (Kempe et al. 2003) studied influence propagation by focusing on the modeling influence by two fundamental stochastic influence cascade graph-based models, named independent cascade model (ICM) and linear threshold model (LTM). These models are based on directed graphs where each node can be activated or not with a monotonicity assumption (i.e., activated nodes cannot be deactivated). They formulated the problem as a network G = (V, E, p), where V is the set of nodes and E is the set of edges between nodes and p is the probability that node v can successfully activate node u, denoted by p (u,v). If a node accepts data from other nodes, it is considered active; otherwise, it is considered inactive (Du et al. 2019).

3.1.1 Linear Threshold Model (LTM)

In the linear threshold model (LTM), each edge or link e(u,v) is associated with a weight W(u, v), such that the sum of the weights of incoming neighbors of node v is less than or equal to 1 and each node v is also associated with a threshold θv. The linear threshold model starts with some active nodes with all other nodes being inactive and a random choice of thresholds θ.

The LTM samples the value of v of each user v uniformly at random probability from [0,1]. In step 0, it sets the status of nodes in S as active and others as inactive. Then, it updates the status of each user iteratively. In step t, all nodes that were active in step t-1 remain active, and any user v that was inactive in step t-1 switches to active. The influence spread of seed set S under the LT model (i.e., σ(S)) is the expected number of activated nodes when S is initially activated.

3.1.2 Independent Cascade Model (ICM)

In the independent cascade model (ICM), a probability p(u,v) is associated with each edge e(u,v), whereas u and v are two nodes in the graph. p(u,v) is the probability of the ability that u succeeds in activating v. In this model, a node v is activated by each of its incoming neighbors independently by introducing an influence probability p(u,v) to each edge e(u,v). This model’s diffusion instance unfolds in discrete steps according to influence probabilities and seed sets S at time step 0. Each active node u at step t will activate each of its outgoing neighbor v that is inactive in step t-1 with probability p(u,v). The activation process can be considered as flipping a coin with head probability p(u,v): if the result is heads, then v is activated; otherwise, v stays inactive. When there are no more nodes that can be activated, the diffusion instance ends. The expected number of activated nodes when S is used as the initial active node set and the above stochastic activation process is applied is the influence spread of seed set S under the ICM.

3.1.3 Epidemic Model

An epidemic model is a perfect tool for a simplified description of the diffusion strategy that contagious diseases follow in a population. As an epidemic spreads from an infected individual to another healthy one (i.e., non-infected before), the information can also spread from one individual to another through the same network that interconnects them. Epidemic models assume the existence of an implicit network between individuals (i.e., no explicit connections) and assume that exposure to infection (information being diffused) is enough to become infected (informed) and potentially transmit the infection to someone else. The underlying principles of those techniques are the basis of the models used in marketing for the prediction of new product adoption in the communities (Loucif 2016). One of the most popular model SIR or the model of Kermack and McKendrick (Cano 2020) is a mathematical approach created particularly for studying the plague disease that broke out in Bombay. This mathematical model is built upon a set of hypotheses, namely:

  1. (a)

    In the population, all individuals are sensitive equally to the infection.

  2. (b)

    The infection leads either to death or to a permanent immunity.

  3. (c)

    When healthy and infected individuals are living together, there will be a number of healthy individuals who will become infected.

In this model, the individuals can be found in three different states:

Suspected (S)

An individual is said susceptible, which means that he is very capable to be infected with the disease. Generally, infections can originate from outside of the population in which the disease spreads (e.g., by genetic mutation, contact with an animal, etc.). Denote S (t) the number of individuals who may be infected with the disease at time t and Sf (t) the current fraction of the population that is susceptible.

$$ Sf\ (t)=S\ (t)/N $$
(1)

Infected (I)

After an individual is affected by the disease, he becomes infectious, i.e., has the ability to infect other susceptible individuals in the population. Let I (t) be the number of infected individuals at time t. If (t) refers accordingly to the infected fraction of the population.

$$ If\ (t)=I(t)/N $$
(2)

Recovered (R)

It refers to the individuals who are either cured of the disease and acquired a full or partial immunity against the infection (can no longer be infected) or removed after being killed by the infection. Let Rf(t) be the fraction of the healed (or withdrawn) population where R(t) refers to its size.

$$ Rf\ (t)=R(t)/N $$
(3)

The diffusion of the disease within the population is dynamic: the fractions of susceptible, infectious, and healed individuals evolve over time with respect to the contacts through which the disease passes from infected individuals to healthy ones.

It is worth mentioning that at every moment (Zafarani et al. 2014),

$$ 1= Sf\ (t)+ If(t)+ Rf(t) $$
(4)

3.2 Identification of Influential Spreaders

A challenging issue in viral marketing is effectively identifying a set of influential users. By sending the advertising messages to this set, one can reach out to the largest area of the network. It is now much easier to identify the network’s most influential spreaders (Bhat et al. 2020). In this part, we have selected the most recent work in this field.

The authors in (Okamoto et al. 2008) combine existing methods on calculating exact values and approximate values of closeness centrality and presented a new algorithm to rank the top-k nodes in the network with the highest closeness centrality.

In (Bae and Kim 2014), in an effort to better understand the spread of a node’s influence in a network, the researchers developed a new measure called coreness centrality. For their experiment, they used unweighted undirected graph for both real and artificial networks. Their approach is based on the idea that a powerful spreader has more connections to the nodes that reside in the core of network. The K-shell indices of nodes neighbors could be good indicators of its spreading ability. To evaluate their proposed measure, they applied the (SIR) model for investigating an epidemic spreading process. They evaluated the performance of the ranking measures in 12 real networks with different sizes as shown in Table 1.

Table 1 The comparison of performance for related identification of influential spreader models

The study in (Basaras et al. 2013) introduced a new centrality measure; it is a combination of coreness and betweenness centrality. To evaluate their technique’s accuracy, they compared it to K-shell decomposition and a baseline measure based solely on the node degree on a large number of complex networks. They used the susceptible-infected-recovered model for an infection originating from both a single spreader and multiple spreaders to investigate the spreading process.

The authors in (Zeng and Zhang 2013) proposed a mixed degree decomposition (MDD) procedure in which both the residual degree and the exhausted degree are considered. By simulating the epidemic spreading process on real networks (Dolphins, Jazz, NetSci, Email, HEP, PGP, TAP, Y2H, Power, Internet, E. coli, C. elegans, AstroPh), they used the K-shell to generate the influence.

The authors in (Liu et al. 2018) proposed a local h-index centrality (LH-index) method for identifying and ranking the top influential spreaders in networks by calculating the h-indices of the node. The new proposed local h-index (LH-index) method simultaneously considers two factors: the h-index value of the node itself and the h-index values of its neighbors. On the one hand, the h-index of one node indicates the direct influences exerted by its nearest influential neighbors. On the other hand, the h-index values of its neighbors indicate the two-hop indirect influences exerted by further influential neighbors. The performance of their method showed its superiority in both real-world and simulated networks. They adopted the SIR model to evaluate the real spreading ability of the ranking nodes.

The work presented in (Belfin and Bródka 2018) uses multiple measures of centrality to look for overlapping communities and combine them to find a suitable superior seed set. They used degree, eigenvector centrality, and clustering coefficient. The basic idea of the strategy is to find out a fraction of superior nodes of the input network, called superior seed set, around which local communities can be computed.

Finally, the authors of (Bhat et al. 2020) presented the Improved Hybrid Rank algorithm, which combines two centralities, namely, the extended neighborhood coreness centrality and the h-index centrality. For the simulation of their proposed method, they used the SIR (susceptible-infected-recovered) model for both undirected and directed real-world networks. They have tested their algorithm based on various performance matrices like Kendall-Tau’s correlation coefficient, spreader’s location diversity, and infected scale.

The comparison of identification of influential spreader models is listed in Table1.

3.3 Influence Maximization

In this section, we present influence maximization-related research in viral marketing, and we review the important available research progress.

Domingos and Richardson (2001) were among the first to model customers’ network value, they used markov random field for modeling the influence between customers such as nodes representing the customers. After that, they extended their previous techniques, achieving a large reduction in computational cost, and applied them to data from a knowledge-sharing site. They founded optimal marketing plan, and they used continuously valued marketing actions and reduce computational cost (Richardson and Domingos 2002). The most cited papers on the matter, maximizing the spread of influence through a social network written by (Kempe et al. 2003), in which the random degree-based and distance centrality algorithms are used as baselines, which led to the development of the greedy algorithm for influence maximization. More generally, they developed an algorithm for selecting the optimal seed set S from nodes in the graph. They proved that the optimization problem is NP-hard under LTM and ICM, and they presented a greedy algorithm that guarantees that the influence spread is within (1–1/e − ɛ) of the optimal influence spread, where e is the base of natural logarithm and ɛ depends on the accuracy of their Monte Carlo estimate of the influence spread given a seed set. A series of more efficient studies have been done, because the greedy algorithm is infeasible even for medium-sized networks of tens of thousands of nodes and edges (Wang and Street 2018).

The work presented in (Leskovec et al. 2007) exploited submodularity to create an algorithm that can handle large-scale problems, achieve near-optimal placements, and be 700 times faster than a simple greedy algorithm. They proposed the Cost-Effective Lazy Forward (CELF) algorithm based on a “lazy forward” optimization. The obtained solutions are guaranteed to achieve at least a fraction of 1/2(11/e) of the optimal solution. They evaluated their algorithm on several large-scale real-world problems, including a model of a water distribution network and real blog data.

Authors in (Goyal et al. 2011) introduced an algorithm called CELF++ that further optimizes CELF by exploiting submodularity property. By avoiding redundant re-calculations of CELF’s marginal gains, the algorithm CELF++ has an advantage.

The authors in (Chen et al. 2009) studied the efficient influence maximization in social networks from two complementary directions. One is to improve the original greedy algorithm in (Kempe et al. 2003) and its improvement in (Leskovec et al. 2007) to further reduce running time for the greedy algorithm. After that, they proposed a new degree discount heuristic derived from the independent cascade model that improves influence. They also proposed an algorithm called new greedy algorithm for the influence maximization.

The study in (Chen et al. 2010a) proposed a new heuristic algorithm that is easily scalable to millions of nodes and edges in their experiments. An easy-to-use tunable parameter allows users to balance the running time and the spread of the algorithm’s influence in the general ICM. For the experiments, they used four real-world network and a synthetic dataset (NetHEPT, DBLP, Epinions, and Amazon).

The work in (Chen et al. 2010b) shows that to compute the expected influence spread for a given set is P-hard. But it can be expressed as a submodular monotone function of S, which can be used to guarantee the results using a simple greedy algorithm. As an added bonus, we now have the LDAG algorithm, the first-ever scalable heuristic algorithm designed specifically for LT model influence maximization. They firstly disclosed that the computation of influence in directed acyclic graphs (DAGs) can be done in linear time. Based on that, they created a local DAG for every node of the network and restricted the influence of the node in this local area. Gap-filling seed selection was used to update the nodes’ incremental influence spread after the DAGs were built, along with an accelerated solution.

The authors in (Barbieri and Bonchi 2014) modeled the viral marketing process of product adoption based on social influence and the feature of the products. They proposed feature-aware propagation model F-TM, and they defined the influence maximization with viral product design (MAXINF-VPD) problem and the study of its properties under F-TM propagation model, using two real-world semantically rich datasets from the domain of social music consumption (Last.fm) and social movie consumption (Flixster).

Squillero and Burelli (2016) explored the problem of influence maximization using a genetic algorithm (GA), which makes use of simple genetic operators commonly found in discrete optimization. They evaluated the genetic algorithm on two large, real-world network datasets from the Stanford Network Analysis Platform (SNAP) repository.

The authors in (Wang and Street 2018) proposed a model in which they quantified influence and tracked its diffusion and aggregation. (MAT) multiple-path asynchronous threshold, for viral marketing on social network. The MAT model captures not only direct influence but also indirect influence passed along messengers and they developed an efficient heuristic IV-greedy to tackle the influence maximization problem. The experiments of the MAT and the IV-greedy were conducted on four real-life networks, and they illustrated an important performance in terms of influence spread and time efficiency.

Saxena and Saxena (2020) proposed an influence maximization model by combining a node connection and its actual past activity pattern. Firstly, they proposed a diffusion model, namely, HAC-Rank algorithm, for the selection of initial adopters. Furthermore, they proposed a new Hurst-based influence maximization for studying the influence spread of seed nodes, wherein the activation of a node depends upon its connections and the self-similarity trend shown by its past activity. The performance of the HAC-Rank has been evaluated under IC, and the HBIM diffusion model achieved an average influence spread of 20.3%. Under the proposed HBIM model, HAC-Rank achieved 49.8% average influence spread in comparison to other state-of-the-art algorithms.

The comparison of maximization influence models is listed in Table 2.

Table 2 The comparison of performance for related influence maximization models

4 Deep Learning Approach for Influence Analysis

Recent work in social network analysis and influence analysis has been applied using the deep learning (Najafabadi et al. 2015; Hayat et al. 2019; Gao et al. 2020; Wang et al. 2019; Keikha et al. 2020; Wu et al. 2019; Zhang et al. 2020). A fundamental step for social network analysis using the deep learning is to encode network data into a low-dimensional representation (Tan et al. 2019).

The work presented in (Luceri et al. 2019) studied the impact of social influence on offline dynamics to study human real-life behavior. They used the deep learning technique for modeling social influence and predicting human behavior on real-world activities. They proposed a social influence deep learning framework that combines deep learning with network science for modeling and forecasting social influence on real-life activities, their social influence deep learning (SIDL) framework based on DNNs.

Authors in (McKay et al. 2019) proposed a newer solution for the influence maximization problem using machine learning. Their objective was to create a deep learning model to solve the influence maximization problem for viral marketing in a faster, more efficient, and more current leading algorithm; they are comparing the result of their model named learning algorithm with three main algorithms: the random selector, sum of edge, and greedy algorithm. For their study they built an artificial neural network in order to test their model against the order to other algorithm following they used the real network (DBLP a computer science bibliography website) for obtained the real results on real data. For the comparison of their model and the main other model (algorithm) existed, they compare with two measures the time efficiency and influence spread in the network. Their model reduce the time running and maximize the number of nodes activated. Their model activates 25.46% of the network’s threshold, the greedy, sum of edge and the best random algorithm activates more than 4% (2.98%, 2.87%, and 3.78%), these results of smaller network. The result of their model in the large scale network is 48.18%, sum of edge 26.42% and 32.58%. The result experiment proved that their model is more efficient in the total amount of influence spread and in time efficiency.

In (Tian et al. 2020) motivated by the application of viral marketing, first they proposed two topic-aware social influence propagation models based on IC and LT models. Second, they proposed a new graph-embedding network, called Diffusion2Vec, which can extract features for each user in social network automatically. It’s also important to note that they came up with a method for calculating the influence of a candidate user based on their embeddings. Finally, they adopted an algorithm of reinforcement learning, called double DQN with prioritized experience replay to train models. For the experiments, they used real-world social network from Twitter.

5 Conclusion

Online social networks are popular services that have been studied heavily in recent years, as more and more people communicate with friends, colleagues, and family through different existing social networks. We found several interesting papers surveying aspects of social networks. In this chapter, we reviewed the major aspects of OSNs, namely, online social network, social network analysis, social influence analysis, and viral marketing, and we defined the problem space in the social influence analysis. We also surveyed the main existing models and algorithms for influence modeling, influence maximization, and influential spreaders for viral marketing. The main objective of this chapter was to summarize the most important algorithms and models of influence analysis for viral marketing for beginners in this area. As we have learned in this chapter, there are many new problems and challenges on social influence analysis in our future work; we aim at proposing a new model of influence spreading or a new method for the identification of influential nodes in online social network for viral marketing. We hope this chapter will be very useful in clarifying this exciting area of research and serve as a solid foundation for readers interested in this field.