Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

User activity within online socio-technical systems is exploding. Social and micro-blogging networks such as Facebook, Twitter, and Google Plus host the information sharing activity of billions of users every day. Using these network platforms, people communicate ideas, opinions, videos, and photos among their circles of friends and followers across the world. These interactions generate an unprecedented amount of data that can be used as a social observatory, providing a unique opportunity to study the mechanisms behind human interactions with a quantitative approach [14].

Research on human online interactions revolves around two main themes: social link formation and online communication. People can build virtual connections with others to subscribe to their messages (i.e., following on Twitter and Google Plus) or to claim a mutual friendships between them (i.e., friending on Facebook). The established social links enable people to easily communicate with friends, sharing and spreading information on top of the social network.

Most work on social link formation frames the problem as network evolution modeling, in which each new link creation is driven by predefined mechanisms that resemble the observations from real-world data. Meanwhile, many algorithms have been presented to predict or recommend missing links in a given network.

Communication dynamics is a long-lasting research topic in the social sciences. Many theories of how people interact, exchange ideas, and influence each other have been proposed decades ago. With the availability of big data and advanced computational power, researchers can apply, verify, and enrich classic hypotheses on human behaviors, leveraging the capacity to collect and analyze data on a large scale to reveal patterns of human interactions [3]. New interdisciplinary research fields, namely computational social science and human dynamics, have emerged in such a scenario [3, 5].

1.1 Social Link Formation

Understanding the formation of online social links is a key ingredient for modeling the evolution of online social networks, as the rules for creating new links determine the network structure in time. Various models were introduced to capture the growth and evolution of network topology, as well as different characteristics of complex networks. Most such models focused on defining basic mechanisms that drive link creation [610].

1.1.1 Classic Network Evolution Models

The random network model is the oldest attempt at characterizing a non-regular network. Although, strictly speaking, it is not an evolutionary model, it can be regarded as such when links are added sequentially to the network. It is characterized by the fact that each link exists independently with the same probability [11]. The random network inspired many subsequent studies in network science, but it was not thought to reproduce several crucial properties of social networks [12], such as the small-world effect [1315], high clustering coefficient [14, 16, 17], temporal dynamics [18, 19], information propagation [20], and heterogeneous distributions in connectivity patterns [2125]. Many such characteristics were indeed yet undiscovered at the time when the random network model was proposed.

The small-world effect, also known as “six degrees of separation,” originated from the Milgram experiment [13], in which the average length of communication chains between two random individuals was found to be around six—smaller than what expected in a regular network, such as a lattice. The Watts-Strogatz model was designed to reproduce the small-world phenomenon by rewiring each link in a regular network with a small probability [14].

A scale-free network has a power-law degree distribution, commonly seen in many real-world social networks such as the film actor network, the scientific collaboration network, and the citation network [6, 8, 21]. A highly skewed degree distribution in a social network indicates that, although the majority of nodes is poorly connected, there is a consistent group of them (compared to what happens in a random graph) that is extremely well connected, and whose collective connections account for a relevant portion of the entire set of links in the network. In a large group of people, only a few are extremely popular and most others do not have many contacts. Many models have been proposed to reproduce the heterogeneous distribution in connectivity [12, 2225]. The Barabási-Albert model generates a scale-free network by continuously adding new nodes into the system (“growth”) and connecting them with other nodes with preference to high-degree ones (“preferential attachment”) [21]. Motivated by the structure of the Web graph, the copying model adds a new node into the network at a time and links it to a random existing node or its neighbors [24, 25]. Another model proposed by Newman, Watts, and Strogatz aimed to build up a random graph with arbitrary degree settings [12]. The ranking model grows the network according to a rank of the nodes by any given prestige measure, reproducing arbitrary power-law degree distributions [23].

1.1.2 Models with Social Components

The preferential attachment mechanisms in the Barabási-Albert and ranking models have a clear rationale in the social context: people prefer to form edges with well-connected individuals, such as celebrities. However, this prescription alone is not sufficient to reproduce several other important features of real networks. Other models have been put forth to fill the gap, including ingredients such as homophily [2629], triadic closure [1517, 30, 31], hierarchical structure [32], and information diffusion [20, 33].

Homophily can be regarded as people’s propensity for linking with similar others [26, 28, 29]. The triadic closure mechanism is based on the intuition that two individuals with mutual friends have a higher probability to establish a new contact [30, 31]. This tendency was observed in both undirected and directed online social networks and incorporated into several network growth models [1517]. In particular, Leskovec et al. tested triadic closure against many other mechanisms in four different large-scale social networks. By using maximum likelihood estimation (MLE) [34], they identified triadic closure as the best rule, among those considered, to explain link creation and to reproduce the clustering coefficient and the degree distribution of the real networks under study [16].

1.1.3 Link Prediction

Missing link prediction algorithms aim at inferring new social connections that may happen in the near future given a current snapshot of the network structure. The prediction has practical applications in the online systems; one of the most popular use cases is to provide recommendations for new contacts (i.e., “People You May Know” on Facebook and LinkedIn). Common approaches consider link prediction as a classification task or ranking problem, using node similarity [35, 36], the hierarchical structure of the network [32], random walks [16], supervised random walks [37], graphical models [38], and user profile features [39].

One class of link prediction algorithms is designed on the premise that only the network topology is known. Liben-Nowell and Kleinberg [36] examined and compared a rich set of metrics for quantifying the similarity between a pair of nodes in the social network, where high similarity implies high likelihood of being connected on the basis of homophily. Tested metrics were built upon different network topological features associated with each node, including overlap between neighbor sets, preferential attachment, shortest path distance, and PageRank hitting time. The analysis identified Adamic-Adar similarity [35] as the metric providing the best performance. Clauset, Moore, and Newman [32] explored the observation that real-world networks often exhibit community structure and hierarchical organization. They proposed a link prediction method that uses knowledge about the network’s hierarchical structure.

Other link prediction algorithms depend on additional attributes of existing individuals or connections, as well as the topology of the network. Supervised random walks can incorporate the knowledge of nodes and links so that a random walker is guided to follow preferred paths with higher probability [37]. Attributes may include, for example, the number of co-authored papers in a collaboration network and the frequency of interactions between a pair of friends on Facebook. In online systems like Flickr and Last.fm, users can annotate content revealing their topical interests. Schifanella et al. found that users with similar interests in these networks are more likely to be friends and proposed to use the similarity between user annotation metadata as a predictor of missing social links [39].

1.2 Communication Dynamics

Early models concerning communication dynamics were inspired by studies of epidemics, assuming that a piece of information could pass from one individual to another through social contacts [4042]. Recently, starting from observations and theories in social sciences, a wealth of computational models have been proposed to describe human communication.

1.2.1 Threshold Model

One class of models is based on the idea of a threshold: people tend to follow the same trends as most of their friends do [43, 44]. A threshold can be defined as the number or fraction of others who must make a decision before a given actor does the same. Many empirical studies have demonstrated the existence of such a threshold in social and behavioral contagion online [4548]. Threshold models have been widely applied to understand the diffusion of rumors, norms, strikes, voting, educational attainment, migration, and other human behaviors [43, 49, 50], and extended to study the role of competition for finite attention [51].

1.2.2 Homophily

The principle of homophily states that similar people are more likely to have contact than dissimilar ones [27, 52, 53]. The existence of homophily in social groups has been supported by various empirical observations and experiments in online settings [26, 29, 39, 54, 55]. Crandall et al. proposed a homophily-based model to predict a user’s future activity and interactions with others according to user similarities [56].

A feedback loop has been claimed to result in increasing similarity among users: people grow to resemble their friends because of social (peer) influence, while being more likely to form links with similar people (homophily) [29, 56]. Such a feedback loop could lead to the so-called echo-chamber effect, by which people are exposed to limited diversity of opinions in online social networks [57, 58]. Though it is hard to fully distinguish between peer influence and homophily [59], the latter effect contributes to promoting behavioral contagion [54].

1.2.3 Weak Tie Hypothesis

Friendships vary in their intensity and intimacy. The concept of tie strength has been introduced to capture this variation: strong ties are our closest confidants and supporters, while weak ties, to whom we feel less close, comprise the majority of our personal networks. Granovetter defined the strength of social ties proportionally to the size of shared social circles and proposed the weak-tie hypothesis [31, 60], according to which weak ties do not carry as much communication as strong ties, but act as bridges between communities and thus as important channels for novel information.

Following up on Granovetter’s work, many empirical studies have tested the weak-tie hypothesis [6169]. Brown and Reingen found an important bridging function of weak ties in word-of-month referral behavior, allowing information to travel from one distinct subgroup of referral actors to another [62]. Gilbert and Karahalios tested several dimensions of tie strength on social media, revealing that both intensity of communication and intimate language are strong indicators of relationship closeness [69]. Onnela et al. analyzed a mobile call network and showed that individuals in clusters tend to communicate more, while ties between clusters have less traffic [68]. Bakshy et al. compared individual adoption rates on Facebook when an external URL shared by friends is or is not included in the newsfeed and found that although stronger ties are individually more influential in persuading others to adopt and spread information, more abundant weak ties are responsible for the propagation of novel information [61].

In summary, strong ties are believed to provide greater emotional support [69, 70] and to be more influential [61, 62, 71], while weak ties provide novel information and connect us to opportunities outside our immediate circles [31, 68, 72].

1.2.4 Limited Attention

People have limited attention during communication. This constraint may be related to a cognitive limit on the number of stable social relationships that one can sustain, as postulated by Dunbar [73] and later supported by analyses of Twitter data [74, 75]. Huberman, Romero, and Wu defined friends of a Twitter user as those who have been mentioned at least twice. They found that most users have a very small number of friends compared to a large number of followers, and the friend network is more influential than the follower network in driving Twitter usage [74]. Wu and Huberman analyzed the dynamics of collective attention on Digg.com and modeled the delay of collective attention with a single novelty factor. Their measurements indicated that novelty within groups decays with a stretched-exponential law, suggesting the existence of a natural time scale over which attention fades [76].

1.2.5 Communication Dynamics on Evolving Networks

The large majority of studies on communication dynamics consider a static underlying social network, under the assumption that the network evolves on a slower time scale than that characteristic of the information spread. Recent research has addressed the modeling of cases in which the time scales of communication dynamics and network evolution are comparable. These approaches consider the two processes as either independent [19, 77] or coupled [33, 78, 79]. In particular, the studies focused on the former case considered mainly epidemic processes in which links are deleted or rewired according to the disease status of each node [78, 79].

2 Case Study: Traffic-Based Social Link Formation

We probe into the effects of information diffusion in shaping the evolution of the social network structure. As a case study, we present a longitudinal analysis of micro-blogging data to better understand the strategies employed by users when expanding their social circles. While the network structure affects the spread of information, the network is, in turn, shaped by this communication activity. This leads us to hypothesize a mechanism whereby people tend to follow others after seeing many messages from them. Interestingly, the coupling of social link formation and information sharing allows to depict a more accurate and comprehensive view of the network evolution [33].

We analyzed a dataset collected from Yahoo! Meme,Footnote 1 including the entire history of the system from April 2009 until March 2010. A user j following a user i is represented in the follower network by a directed edge  = (i, j), indicating j can receive messages posted by i. We adopt this notation, in which the link creator is the target, to emphasize the direction of information flow. In our notation, the in-degree of a node j is the number of people followed by j. Users can repost received messages, which become visible to their followers. When user j reposts content from i, we infer a flow of information from i to j. Each link is weighted by the numbers of messages from i that are reposted or seen by j. At the end of the observation period, the Yahoo! Meme follower network consisted of 128,199 users with at least one edge, connected by a total of 3,485,361 directed edges.

Social micro-blogging networks, such as Twitter, Google Plus, Sina Weibo, and Yahoo! Meme, are designed for information sharing. As illustrated in Fig. 6.1, the dynamics on the network directly affects the dynamics of the networks, and vice versa. In this case study, we investigate the individual strategies that lead to the creation of new social links. We characterize link creation processes with a set of parameters associated with different link creation strategies, estimated by a Maximum-Likelihood approach [34]. This analysis will show that triadic closure does have a strong effect on link formation, but shortcuts based on traffic are another indispensable factor in interpreting network evolution.

Fig. 6.1
figure 1

The dynamics of and on the network are strongly coupled. The bottom layer illustrates the social network structure, where the blue arrows represent “follow” relationships with the direction of information flow. The dashed red arrow marks a newly created link. The upper layer depicts the flow of information between people in the same group, leading to the creation of the new link. The social network structure constrains communication patterns, but information propagated through the network also affect how agents behave and ultimately how the network changes and grows

2.1 Link Creation Mechanisms

When users post or repost messages, all their followers can see these posts and might decide to repost them, generating spreading paths that, when taken together, form cascade networks. When receiving a reposted message, a Meme user in such a path can see both the grandparent (G, the user two steps ahead in the path) and the origin (O, original source). A user may decide to follow a grandparent or origin, receiving their future messages directly. These new links create shortcuts connecting users at any distance in the network. A triadic closure occurs when a user follows a triadic node (Δ, the user two steps away in the follower network). The definitions of different types of link creation mechanisms are illustrated in Fig. 6.2.

Fig. 6.2
figure 2

Illustration of link creation mechanisms. A grandparent node is a special case of triadic node, from which or through which information has reached the target user. Therefore traffic-based shortcuts to grandparent nodes are a subset of triadic closures

2.1.1 Statistical Analyses of Shortcuts

To quantify the statistical tendency of users to create shortcuts, let us consider every single link creation in the data as an independent event. We test the null hypotheses that links to grandparents, origins, and triadic nodes are generated by choosing targets at random among the users not already followed by the creator.

We label each link by its creation order, 1 ≤  ≤ L, where L is the total number of links. For each link, we can compute the likelihood of following a grandparent by chance:

$$\displaystyle{ p_{G}(\ell) = \frac{N_{G}(\ell)} {N(\ell) - k(\ell) - 1}, }$$
(6.1)

where N G () is the number of distinct grandparents seen by the creator of at the moment when is about to be created; N() is the number of available users in the system when is to be created; k() is the in-degree of ’s creator at the same moment; and the denominator is the number of potential candidates to be followed.

The indicator function for each link denotes whether the link connects with a grandparent or not in the real data:

$$\displaystyle\begin{array}{rcl} \mathbf{1}_{G}(\ell)& =& \left \{\begin{array}{l l} 1&\mbox{ if $\ell$ links to a grandparent in the data}\\ 0 &\text{otherwise.}\\ \end{array} \right.{}\end{array}$$
(6.2)

The expected number of links to grandparents according to the null hypothesis can be then computed as:

$$\displaystyle{ E_{G} =\sum _{ \ell=1}^{L}p_{ G}(\ell) }$$
(6.3)

and its variance is given by:

$$\displaystyle{ \sigma _{G}^{2} =\sum _{ \ell =1}^{L}p_{ G}\left (\ell\right )\left (1 - p_{G}\left (\ell\right )\right ) }$$
(6.4)

while the corresponding empirical number is:

$$\displaystyle{ S_{G} =\sum _{ \ell=1}^{L}\mathbf{1}_{ G}(\ell). }$$
(6.5)

According to the Lyapunov central limit theorem,Footnote 2 the variable \(z_{G} = (S_{G} - E_{G})/\sigma _{G}\) is distributed according to a standard normal \(\mathcal{N}(0,1)\). For linking to origins (O) or triadic nodes (Δ), we can define z O and z Δ similarly. In all three cases, using a z-test, we can reject the null hypotheses with high confidence (p < 10−10). We conclude that links established by following grandparents, origins or triadic nodes happen much more frequently than by random connection. These link creation mechanisms have important roles in the evolution of the social network.

2.1.2 User Preference

The variables z G , z O , and z Δ , as defined above, measure how much more likely links of a given type are formed than by chance—in other words, how strong individual preferences are for following grandparents, origins or triadic nodes. To study the dependence of the link formation tendencies on the different stages of an individual’s lifetime, let us compute z G k, z O k, and z Δ k for links created by users with in-degree k, that is, those who are following k users at the time when the link is created. Figure 6.3 shows that the principle of triadic closure dominates user behavior when one follows a small number of users (k < 75). In the early stages, one does not receive much traffic, so it is natural to follow people based on local social circles, consistently with triadic closure. However, users who have been active for a long time and have followed many people (k > 75) have more channels through which they monitor traffic. This creates an opportunity to follow others from whom they have seen messages in the past.

Fig. 6.3
figure 3

Individual preferences for following grandparents (red circles), origins (blue squares), and triadic nodes (green triangles) change with the in-degree of the link creator

2.1.3 Link Efficiency

In information diffusion networks like Twitter and Yahoo! Meme, social links may have a key efficiency function of shortening the distance between information creators and consumers. An efficient link should be able to convey more information to the follower compared to less efficient links. Hence we define the efficiency of link as the average number of posts seen or reposted through during one time unit after its creation:

$$\displaystyle{ \eta _{\mathrm{seen}} = \frac{w_{\mathrm{seen}}(\ell)} {T - t(\ell)},\quad \eta _{\mathrm{repost}} = \frac{w_{\mathrm{repost}}(\ell)} {T - t(\ell)}, }$$
(6.6)

where w() is the number of messages seen or reposted through ; t() is the time when was created; and T is the time of the last action recorded in our dataset. Both seen and reposted messages are considered, as they represent different types of traffic; the former are what is visible to a user, and the latter are what a user is willing to share. We compute the link efficiency of every grandparent, origin, and triadic closure link. As shown in Fig. 6.4, both grandparent and origin links exhibit higher efficiency than triadic closure links, irrespective of the type of traffic. By shortening the paths of information flows, more posts from the content generators reach the consumers.

Fig. 6.4
figure 4

Efficiency of links created according to different mechanisms, or average number of messages (a) seen or (b) reposted per time unit. Each box shows data within lower and upper quartile. Whiskers represent the 99th percentile. The triangle and line in a box represent the mean and median, respectively. The black line and grey area across the entire figure mark the median and interquartile range of the measure across all links, respectively

2.2 Rules of Network Evolution

To infer the different link creation strategies from the observed data, we characterize users with a set of probabilities associated with different actions, and approximate these parameters by MLE [34]. For each link , we know the actual creator and the target; we can thus compute the likelihood \(f(\ell\vert \varGamma,\varTheta )\) of the target being followed by the creator according to a particular strategy Γ, given the network configuration \(\varTheta\) at the time when is created. The likelihoods associated with different strategies can be mixed according to the parameters to obtain a model of link creation behavior. Finally, assuming that link creation events are independent, we can derive the likelihood of obtaining the empirical network from the model by the product of likelihoods associated with every link. The higher the value of the likelihood function, the more accurate the model.

2.2.1 Simple Strategy

We consider five link creation mechanisms and their combinations:

  • Random (Rand): follow a randomly selected user who is not yet followed

  • Triadic closure (Δ): follow a randomly selected triadic node

  • Grandparent (G): follow a randomly selected grandparent

  • Origin (O): follow a randomly selected origin

  • Traffic shortcut (\(G \cup O\)): follow a randomly selected grandparent or origin

Other mechanisms for link creation could be similarly incorporated, such as social balance [81] and preferential attachment [21]. However, preferential attachment is built on the assumption that everyone knows the global connectivity of everyone else, which is not very realistic. The strategies considered here essentially reproduce and extend the copy model [25], approximating preferential attachment with only local knowledge.

To model link creation with a single strategy, we can use a parameter p for the probability of using that strategy, while a random user is followed with probability 1 − p. The calculation of maximum likelihood, taking the single strategy of grandparents as an example, is as follows:

$$\displaystyle\begin{array}{rcl} \mathcal{L}_{G}(p)& =& \prod _{\ell=1}^{L}\left (pf(\ell\vert G,\varTheta ) + (1 - p)f(\ell\vert \text{Rand},\varTheta )\right ) \\ & =& \prod _{\ell=1}^{L}\left (p \frac{\mathbf{1}_{G}(\ell)} {N_{G}(\ell)} + (1 - p) \frac{1} {N(\ell) - k(\ell) - 1}\right ) \\ & =& \prod _{\mathbf{1}_{G}(\ell)=1}\left ( \frac{p} {N_{G}(\ell)} + \frac{1 - p} {N(\ell) - k(\ell) - 1}\right )\prod _{\mathbf{1}_{G}(\ell)=0} \frac{1 - p} {N(\ell) - k(\ell) - 1}.{}\end{array}$$
(6.7)

Note that since a follow action can be ascribed to multiple strategies, it can contribute to multiple terms in the log-likelihood expression. For instance, a link could be counted in both \(f(\ell\vert G,\varTheta )\) and \(f(\ell\vert \text{Rand},\varTheta )\). For numerically stable computation, we maximize the log-likelihood:

$$\displaystyle\begin{array}{rcl} \log \mathcal{L}_{G}(p)& =& \sum _{\mathbf{1}_{G}(\ell)=1}\ln \left ( \frac{p} {N_{G}(\ell)} + \frac{1 - p} {N(\ell) - k(\ell) - 1}\right ) \\ & & +\sum _{\mathbf{1}_{G}(\ell)=0}\ln \frac{1 - p} {N(\ell) - k(\ell) - 1}. {}\end{array}$$
(6.8)

Similar expressions of log-likelihood can be obtained for other strategies (Δ, O, and \(G \cup O\)).

It is not trivial to obtain the best p analytically, so we explore the values of p ∈ (0, 1) numerically (see Fig. 6.5). Triadic closure dominates as a single strategy, with \(p_{\Delta } = 0.82\). Traffic-based strategies alone account for about 20 % of the links.

Fig. 6.5
figure 5

The plot of the log-likelihood \(\log \mathcal{L}(p)\) as a function of link creation strategy probabilities for models with a single strategy. The red circles mark the maximized \(\log \mathcal{L}(p)\). (a) Triadic closure, (b) Grandfather, (c) Origin, (d) Grandfather + Origin

2.2.2 Combined Strategies

For a more realistic model of the empirical data, let us consider combined strategies with both triadic closure and traffic-based shortcuts. For each link , the follower with probability p 1 creates a shortcut by linking to a grandparent or an origin (\(G \cup O\)); with probability p 2 follows a triadic node (Δ); and with probability \(1 - p_{1} - p_{2}\) connects to a random node. Taking the combined strategy with grandparent as an example, we compute the log-likelihood as:

$$\displaystyle\begin{array}{rcl} \log \mathcal{L}_{G+\varDelta }(p_{1},p_{2})& =& \log \prod _{\ell=1}^{L}\Big[p_{ 1}f(\ell\vert G \cup O,\varTheta ) + p_{2}f(\ell\vert \varDelta,\varTheta ) \\ & & +(1 - p_{1} - p_{2})f(\ell\vert \text{Rand},\varTheta )\Big].{}\end{array}$$
(6.9)

The detailed derivations of the likelihood functions and the cases of the other combined strategies are omitted for brevity.

Once again, we numerically explore the values of p 1 and p 2 in the unit square to maximize the log-likelihood. The likelihood landscape for the combined strategy considering both grandparents and origins as well as triadic closure is shown in Fig. 6.6. The parameter settings and the maximum likelihood values for all tested models are listed in Table 6.1. We can compare the quality of these models by comparing their maximized \(\log \mathcal{L}\)’s. The combined models with both traffic shortcuts and triadic closure yield the best accuracy. In these models, triadic closure accounts for 71 % of the links, grandparents and origins for 12 %, and the rest of the links are created at random.

Fig. 6.6
figure 6

The contour plot of log-likelihood \(\log \mathcal{L}(p_{1},p_{2})\) for the combined strategy of creating traffic shortcuts (\(G \cup O\)) with probability p 1 and triadic closure links (Δ) with probability p 2. The black triangle marks the optimum

Table 6.1 The best parameters in different models and corresponding values of maximized log-likelihood function

3 Discussion

Social link formation and information sharing are two major tracks of research on online interactions. The mechanisms of new link creation determine the topology of linkages among individuals, and the underlying network structure is critical for the dynamics of the diffusion process [68, 51]. At the same time, as many social links are driven by the need for more efficient information sharing in social media sites, social link formation is greatly affected by communication activity. Both the evolving structure of the social network and information diffusion have been studied for decades, but the coupling between these dynamical processes has not been well explored. In the present case study, we demonstrate a feedback loop between these two dynamics. While triadic closure is the dominant mechanism for social network evolution, it is mainly relevant in the early stages of a user’s lifetime. As time progresses, the traffic generated by communication dynamics on the network becomes an indispensable component for user linking behavior. As users become more active and influential, their links create shortcuts that make the spread of information more efficient in the network.

Studies of online interactions—how social networks evolve and how information spreads—help us gain a better understanding of social influence, user behavior, and network efficiency in the context of online systems. The coupling between dynamics of and on the network provides us with powerful insights into human interactions in the digital world.