1 Introduction

During the past decade, the advent of the “social Web” has given rise to a rich range of platforms that promote communication among users in shared spaces. These interpersonal interactions often take place in the context of a shared medium, e.g., an image (Flickr), a video (YouTube) or a “blog”/“microblog” (Twitter), or are built across social ties that reflect human relationships in the physical world (Facebook). The impact of the rapid proliferation of these social websites has been widespread. Individuals today can express their opinions on personal blogs and can share media objects to engage in discussion. From shopping for a new car to getting suggestions on investment, searching for the next holiday destination or even planning their next meal out, people have started to rely heavily on opinions expressed online, or on social resources that can provide them with useful insights into the diverse set of available options. Moreover, personal experiences as well as thoughts and opinions on external events also manifest themselves through “memes,” “online chatter” or various “voting” mechanisms in many people's blogs and social profiles. As a positive outcome of the interactional affordances provided by online social media and social network sites, a broad range of opportunities has begun to emerge for the social network analysis community. Instead of focusing on longitudinal studies of relatively small groups, using methods such as participant observation [16, 31] and surveys [8], researchers today can study social processes such as information diffusion or community emergence at very large scales. This is because electronic social data can be collected at comparatively low cost of acquisition and resource maintenance, can span diverse populations and can be acquired over extended time periods. As a result, the study of social processes on a scale of millions of nodes, which would have been barely possible a few years ago, is now attracting considerable interest [20, 22].

Our broad goal is to study how such online communication is reshaping and restructuring our understanding of different social processes. Communication is the process by which participating individuals create and share information with one another in order to reach a mutual understanding [6]. Typically, communication involves some form of channel, or medium, by means of which information, in the form of concepts, is transmitted from one individual to another. An illustrative example that describes the key ideas in the online communication process is shown in Fig. 4.1. Note that mass media channels are more effective in creating knowledge of innovations [5], whereas channels promoting social engagement are more effective in forming and changing attitudes toward a new concept, and thus in influencing the decision to adopt or reject a new concept or piece of information (Footnote 1).

Fig. 4.1 Illustration of the two key organizing ideas that embody online interpersonal communication processes: namely, the information or concept that is the content of communication and the channel or the media via which communication takes place

It thus goes without saying that communication is central to the evolution of social systems. In support of this, numerous studies of online social communication processes over the years have indicated that properties of the associated social system, i.e., the network structure and dynamics, can be useful pointers in determining the outcome of many important social and economic relationships [1, 2]. Despite the fundamental importance of understanding these structures and their temporal behavior in many social and economic settings [8–10, 20, 21], the development of characterization tools, foundational theoretical models and insightful observational studies on large-scale social communication datasets is still in its infancy. This is because communication patterns on online social platforms are significantly distinct from their physical world counterparts, which often invalidates the methods, tools and studies designed for longitudinal ethnographic studies of observed physical world interactions. This distinction appears in several aspects of the online communication process itself, such as inexpensive reach to a global audience, volatility of content and the ease of publishing content online. As a result, there is a pressing need to develop robust computational frameworks to characterize, model and conduct observational studies on the communication processes that pervade the online domain.

The contributions of this chapter are also motivated by the potential of online communication patterns to address multi-faceted sociological, behavioral and societal problems. For example, the patterns of social engagement reflected in the networks play a fundamental role in determining how concepts or information are exchanged. Such information may be as simple as an invitation to a party, or as consequential as information about job opportunities, literacy, consumer products, disease containment and so on. Additionally, understanding the evolution of groups and communities can give us meaningful insights into the ways in which concepts form and aggregate, opinions develop and ties are made and broken, or even how the decisions of individuals affect external temporal occurrences. Finally, studies of shared user-generated media content manifested via the communication channel can enable us to rethink the ways in which our communication patterns affect our social memberships or our observed behavior on online platforms.

In the light of the above observations, the following two parts summarize our key research investigations:

  • Rich Media Communication Patterns. This part investigates rich media communication patterns, i.e., the characteristics of the emergent communication, centered around the channel or the shared media artifact. The primary research question we address here is: what are the characteristics of conversations centered around shared rich media artifacts?

  • Information Diffusion. This part characterizes the concept, i.e., the information or meme, involved in the social communication process. Our central question is: how do we model user communication behavior that affects the diffusion of information in a social network, and what is the impact of user characteristics, such as individual attributes, on this diffusion process?

The rest of the chapter is organized as follows. In Sect. 4.2, we present the major characteristics of online communication dynamics. The next two sections deal with the methods that help us study rich media communication patterns (Sect. 4.3) and the impact of communication properties on diffusion processes (Sect. 4.4). They also present experimental studies conducted on large-scale datasets to evaluate our proposed methods of communication analysis. Finally, we conclude in Sect. 4.5 with a summary of the contributions and future research opportunities.

2 Characteristics of Online Communication

We present key characteristics of the online communication process. First we present a background survey of the different aspects of online communication. Next we discuss the different forms of communication affordances that are provided by different online social spaces today and discuss an overview of prior work on the different modalities.

2.1 Background

There are several ways in which online social media has revolutionized the means and manner of social communication today, with a huge impact on the characteristics of the social systems that encompass them. We discuss some characteristics of this widespread change in the communication process below:

  1. Reach. Social media communication technologies provide scale and enable anyone to reach a global audience.

  2. Accessibility. Social media communication tools are generally available to anyone at little or no cost, converting every individual participant in the online social interaction into a publisher and broadcaster of information content in their own right.

  3. Usability. Most social media do not require specialized skills, or in some cases reinvent existing skills, so anyone can operate the means of content production and subsequent communication, in most cases eliminating the need for specialized training.

  4. Recency. Social media communication is capable of virtually instantaneous responses; only the participants determine any delay in response. This makes the communication process highly reciprocal, with low lags in responses.

  5. Permanence. Unlike industrial media communication, which once created cannot be altered (e.g., once a magazine article is printed and distributed, changes cannot be made to that same article), social media communication is extremely volatile over time, because it can be altered almost instantaneously by comments, editing, voting and so on.

These key characteristics of online social communication pose novel challenges for the study of social systems in general. To highlight some key statistics of different social sites available on the Web today, we compiled Table 4.1. The natural question that arises is: how are online social communication patterns today affecting our social lives and our collective behavior? As the statistics suggest, traditional tools for understanding social interactions in physical spaces or over industrial media, and even prior work involving longitudinal studies of groups of individuals, are often only partially capable of characterizing, modeling and observing the online communication of today.

Table 4.1 Some social media statistics

In this chapter, we therefore identify two key components that subsume these diverse characteristics of the online social communication process on social media today. These two components are manifested as below:

  1. The entity or the concept (e.g., information, or ‘meme’).

  2. The channel or the media (e.g., textual, audio, video or image-based interactive channel).

2.2 Communication Modes in Social Networks

We discuss several communication modes commonly found in social networks and social media sites today. These diverse modalities allow users to engage in interaction, often around a common interest, shared activities or artifacts, geographical, ethnic or gender-based co-location, or even dialogue on external news events. In this chapter, we focus on the following forms of communication among users that are likely to promote social interaction:

  1. Messages. Social websites such as MySpace offer users the ability to post short messages on their friends’ profiles. A similar feature on Facebook allows users to post content on another user’s “Wall.” These messages are typically short and publicly viewable by the friends common to both users, providing evidence of interaction via communication.

  2. Blog Comments/Replies. The commenting and replying capabilities provided by different blogging websites, such as Engadget, Huffington Post, Slashdot, Mashable or MetaFilter, provide substantial evidence of back-and-forth communication among sets of users, often relating to the topic of the blog post. Note that replies are usually shown as an indented block in response to the particular comment in question.

  3. Conversations around Shared Media Artifact. Many social websites allow users to share media artifacts with their local network or set of contacts. For example, on Flickr a user can upload a photo that is viewable via a feed to her contacts, while YouTube allows users to upload videos encompassing different topical categories. Both these kinds of media sharing allow rich communication activity centered around the media elements via comments. These comments often take a conversational structure, involving considerable back-and-forth dialogue among users.

  4. Social Actions. A different kind of communication modality, provided by certain social sites such as Digg or del.icio.us, involves participation in a variety of social actions by users. For example, Digg allows users to vote on (or rate) shared articles, typically news, via a social action called “digging.” Another example is the “like” feature provided by Facebook on user statuses, photos, videos and shared links. Such social action often acts as a proxy for communication activity because, first, it is publicly observable and, second, it allows social interaction among the users.

  5. Micro-blogging. Finally, we define a communication modality based on the micro-blogging activity of users, e.g., as provided by Twitter. The micro-blogging feature, specifically called “tweeting” on Twitter, often takes conversational form, since tweets can also be directed to a particular user. Moreover, Twitter provides the “RT” or re-tweet feature, allowing users to propagate information from one user to another. Hence micro-blogging activity can be considered an active interactional medium.

2.3 Prior Work on Communication Modalities

In this section we will survey some prior work on the above presented communication modalities.

Conversations. Social networks evolve centered around communication artifacts. The conversational structure by dint of which several social processes unfold, such as diffusion of innovation and cultural bias, discovery of experts or evolution of groups, is valuable because it lends insights into the nature of the network at multi-grained temporal and topological levels and helps us understand networks as an emergent property of social interaction.

Comments and messaging structure in blogs and shared social spaces have been used to understand dialogue-based conversational behavior among individuals [34], as well as to summarize social activity on the online platform or to understand the descriptive nature of web comments [32]. Some prior work has also used the conversational nature of comments to understand social network structure and for the statistical analysis of networks [15]. There has also been considerable work on analyzing discussions or comments in blogs [28], as well as on using such communication to predict its consequences, such as user behavior, sales and stock market activity.

Prior research has also discovered value in using social interactional data to understand, and in certain cases predict, external behavioral phenomena [11]. There has been considerable work on analyzing social network characteristics in blogs [20], as well as on using such communication to predict its consequences, such as user behavior, sales and stock market activity [3, 17]. In [17], Gruhl et al. attempt to determine whether blog data exhibit any recognizable pattern prior to spikes in the sales rank of books on Amazon.com. Adar et al. [3] present a framework for modeling and predicting user behavior on the web. They created a model for several sets of user behavior and used it to automatically compare the reaction of a user population on one medium (e.g., search engines or blogs) to the reactions on another.

Social Actions. The participation of individual users in online social spaces is one of the most noted features in the recent explosive growth of popular online communities, ranging from picture and video sharing (Flickr.com and YouTube.com) and collective music recommendation (Last.fm) to news voting (Digg.com) and social bookmarking (del.icio.us). However, in contrast to traditional communities, these sites do not offer direct communication or conversational mechanisms to their members. This has given rise to an interesting pattern of social-action-based interaction among users. The users’ involvement and their contribution through non-message-based interactions, e.g., digging or social bookmarking, have become a major force behind the success of these social spaces. Studying this new type of user interactional modality is crucial to understanding the dynamics of online social communities and community monetization.

Social actions [12] performed on shared spaces often promote rich communication dynamics among individuals. In prior work, authors have discussed how the voting i.e., digging activity on Digg impacts the discovery of novel information [37]. Researchers [35] have also examined the evolution of activity between users in the Facebook social network to capture the notion of how social links can grow stronger or weaker over time. Their experiments reveal that links in the activity network on Facebook tend to come and go rapidly over time, and the strength of ties exhibits a general decreasing trend of activity as the social network link ages. Social actions revealed via third party applications as featured by Facebook have also lent interesting insights into the social characteristics of online user behavior.

In this chapter, we organize our approach based on these two different modalities of online communication, i.e., conversations and social actions. We use the former to study the dynamic characterization of the media channel that embodies online communication, and the latter to study the diffusion properties of the concept, or the unit of information, that is transmitted in a network via the communication process. These are presented in the following two sections.

3 Rich Media Communication Patterns

An interesting emergent property of large-scale user-generated content on social media sites is that this shared media content seems to generate rich dialogue centered around shared media objects, e.g., on YouTube and Flickr. Hence, apart from the impact of communication on the dynamics of individuals’ actions, their roles and the community in general, there are additional challenges: how to characterize such “conversations,” how to understand the relationship of the conversations to social engagement, i.e., the community under consideration, and how to study the observed user behavior responsible for publishing and participating in the content.

Today, there is significant user participation on rich media social networking websites such as YouTube and Flickr. Users can create media (e.g., upload a photo on Flickr) and consume media (e.g., watch a video on YouTube). These websites also allow for significant communication between the users, such as comments by one user on a media object uploaded by another. These comments reveal a rich dialogue structure between users (user A comments on the upload, user B comments on the upload, A comments in response to B’s comment, B responds to A’s comment, etc.), where the discussion is often about themes unrelated to the original video. In this section, the sequence of comments on a media object is referred to as a conversation. Note that the theme of the conversation is latent and depends on the content of the conversation.

The fundamental idea explored in this section is that analysis of communication activity is crucial to understanding repeated visits to a rich media social networking site. People return to a video post that they have already seen and post further comments (say in YouTube) in response to the communication activity, rather than to watch the video again. Thus it is the content of the communication activity itself that the people want to read (or see, if the response to a video post is another video, as is possible in the case of YouTube). Furthermore, these rich media sites have notification mechanisms that alert users of new comments on a video post/image upload promoting this communication activity.

We denote the communication property that causes people to further participate in a conversation as its “interestingness.” While the meaning of the term “interestingness” is subjective, we decided to use it to express an intuitive property of the communication phenomena that we frequently observe on rich media networks. Our goal is to determine a real scalar value corresponding to each conversation in an objective manner that serves as a measure of interestingness. Modeling the user subjectivity is beyond the scope of this section.

What makes a conversation interesting enough to prompt a user to participate? We conjecture that people will participate in conversations when they (a) find the conversation theme interesting (what the previous users are talking about), (b) see comments by people who are well known in the community, or by people whom they know directly (these people are interesting to the user), or (c) observe an engaging dialogue between two or more people (an absorbing back-and-forth). Intuitively, interesting conversations have an engaging theme and interesting people. An example of an interesting conversation from YouTube is shown in Fig. 4.2.

Fig. 4.2 Example of an interesting conversation from YouTube. Note it involves back-and-forth dialogue between participants as well as evolving themes over time

A conversation that is deemed interesting must be consequential [13] – i.e., it must impact the social network itself. Intuitively, there should be three consequences (a) the people who find themselves in an interesting conversation, should tend to co-participate in future conversations (i.e., they will seek out other interesting people that they’ve engaged with) (b) people who participated in the current interesting conversation are likely to seek out other conversations with themes similar to the current conversation and finally (c) the conversation theme, if engaging, should slowly proliferate to other conversations.

There are several reasons why measuring the interestingness of a conversation is of value. First, it can be used to rank and filter both blog posts and rich media, particularly when there are multiple sites on which the same media content is posted, guiding users to the most interesting conversation. For example, the same news story may be posted on several blogs; our measures can be used to identify those sites where the postings and commentary are of greatest interest. It can also be used to increase efficiency. Rich media sites can manage resources based on changing interestingness measures (e.g., by caching those videos that are becoming more interesting) and optimize retrieval for the dominant themes of the conversations. In addition, differentiated advertising prices for ads placed alongside videos can be based on their associated conversational interestingness.

3.1 Problem Formulation

3.1.1 Definitions

Conversation. We define a conversation in online social media (e.g., around an image, a video or a blog post) as a temporally ordered sequence of comments posted by individuals whom we call “participants.” In this section, the content of a conversation is represented as a stemmed and stop-word eliminated bag-of-words.

Conversational Themes. Conversational themes are sets of salient topics associated with conversations at different points in time.

Interestingness of Participants. Interestingness of a participant is a property of her communication activity over different conversations. We propose that an interesting participant can often be characterized by (a) several other participants writing comments after her, (b) participation in a conversation involving other interesting participants, and (c) active participation in “hot” conversational themes.

Interestingness of Conversations. We now define “interestingness” as a dynamic communication property of conversations which is represented as a real non-negative scalar dependent on (a) the evolutionary conversational themes at a particular point of time, and (b) the communication properties of its participants. It is important to note here that “interestingness” of a conversation is necessarily subjective and often depends upon context of the participant. We acknowledge that alternate definitions of interestingness are also possible.

Conversations used in this section are the temporal sequence of comments associated with media elements (videos) in the highly popular media sharing site YouTube. However our model can be generalized to any domain with observable threaded communication. Now we formalize our problem based on the following data model.

3.1.2 Data Model

Our data model comprises the tuple {C, P} having the following two inter-related entities: a set of conversations C on shared media elements, and a set of participants P in these conversations. Each conversation is represented by a set of comments, such that each comment that belongs to a conversation is associated with a unique participant, a timestamp and some textual content (bag-of-words).
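The data model above can be pictured as a pair of simple record types. The following is a minimal sketch only, not the implementation used in this chapter; the class names `Comment` and `Conversation`, the string identifiers and the helper methods are all hypothetical, and the tokens are assumed to be already stemmed and stop-word filtered.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Comment:
    participant: str   # hypothetical participant id (an element of P)
    timestamp: float   # posting time, e.g. seconds since some origin
    words: tuple       # tokens, assumed already stemmed and stop-word filtered


@dataclass
class Conversation:
    media_id: str                       # hypothetical id of the shared media element
    comments: list = field(default_factory=list)

    def add(self, comment):
        # Keep the comments temporally ordered, as the definition requires.
        self.comments.append(comment)
        self.comments.sort(key=lambda c: c.timestamp)

    def participants(self):
        # The subset of P that has commented on this conversation.
        return {c.participant for c in self.comments}

    def bag_of_words(self):
        # Word counts over all comments of the conversation.
        bag = Counter()
        for c in self.comments:
            bag.update(c.words)
        return bag


conv = Conversation("video-1")
conv.add(Comment("bob", 20.0, ("great", "song")))
conv.add(Comment("alice", 10.0, ("love", "song")))
```

Here each comment carries exactly one participant, one timestamp and its textual content, which is all the matrices defined next rely on.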

We now discuss the notation. We assume that there are N participants, M conversations, K conversation themes and Q time slices. Using the relationship between the entities in the tuple {C, P} from the above data model, we construct the following matrices for every time slice q, 1 ≤ q ≤ Q:

  • PF(q), of size N × N: Participant-follower matrix, where PF(q)(i, j) is the probability that at time slice q, participant j comments following participant i on the conversations in which i had commented at any time slice from 1 to (q − 1).

  • PL(q), of size N × N: Participant-leader matrix, where PL(q)(i, j) is the probability that in time slice q, participant i comments following participant j on the conversations in which j had commented in any time slice from 1 to (q − 1). Note, both PF(q) and PL(q) are asymmetric, since communication between participants is directional.

  • PC(q), of size N × M: Participant-conversation matrix, where PC(q)(i, j) is the probability that participant i comments on conversation j in time slice q.

  • CT(q), of size M × K: Conversation-theme matrix, where CT(q)(i, j) is the probability that conversation i belongs to theme j in time slice q.

  • TS(q), of size K × 1: Theme-strength vector, where TS(q)(i) is the strength of theme i in time slice q. Note, TS(q) is simply the normalized column sum of CT(q).

  • PT(q), of size N × K: Participant-theme matrix, where PT(q)(i, j) is the probability that participant i communicates on theme j in time slice q. Note, PT(q) = PC(q) · CT(q).

  • IP(q), of size N × 1: Interestingness of participants vector, where IP(q)(i) is the interestingness of participant i in time slice q.

  • IC(q), of size M × 1: Interestingness of conversations vector, where IC(q)(i) is the interestingness of conversation i in time slice q.

For simplicity of notation, we denote the i-th row of the above 2-dimensional matrices as X(i, : ).
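Two of the quantities above are derived directly from the others: the participant-theme matrix is the product of the participant-conversation and conversation-theme matrices, and the theme-strength vector is the normalized column sum of the conversation-theme matrix. The sketch below illustrates both for a single time slice with toy numbers. The helper names `matmul` and `theme_strength` are hypothetical, and normalizing the column sums so that they add up to 1 is an assumption, since the text does not state the normalization.

```python
def matmul(a, b):
    # Plain nested-list matrix product: (n x m) times (m x k) gives (n x k).
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def theme_strength(ct):
    # Column sums of CT, rescaled to sum to 1 (assumed normalization).
    col_sums = [sum(row[j] for row in ct) for j in range(len(ct[0]))]
    total = sum(col_sums)
    return [s / total for s in col_sums]


# Toy example for one time slice: N = 2 participants, M = 2 conversations,
# K = 2 themes. All numbers are made up for illustration.
PC = [[0.5, 0.5],     # participant 0 splits comments across both conversations
      [1.0, 0.0]]     # participant 1 comments only on conversation 0
CT = [[0.75, 0.25],   # conversation 0 is mostly about theme 0
      [0.25, 0.75]]   # conversation 1 is mostly about theme 1

PT = matmul(PC, CT)   # participant-theme matrix, PT = PC . CT
TS = theme_strength(CT)
```

With these inputs, participant 1 inherits the theme distribution of the only conversation she comments on, while participant 0 ends up evenly spread over both themes.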

3.1.3 Problem Statement

Now we formally present our problem statement: given a dataset {C, P} and associated meta-data, we intend to determine the interestingness of the conversations in C, defined as IC(q) (a non-negative scalar measure for a conversation) for every time slice q, 1 ≤ q ≤ Q. Determining interestingness of conversations involves two key challenges:

  1. How to extract the evolutionary conversational themes?

  2. How to model the communication properties of the participants through their interestingness?

Further in order to justify interestingness of conversations, we need to address the following challenge: what are the consequences of an interesting conversation?

In the following three sections, we discuss how we address these three challenges: (a) detecting conversational themes based on a mixture model that incorporates regularization with a time indicator, regularization for temporal smoothness and regularization for co-participation; (b) modeling the interestingness of participants and of conversations using a novel joint optimization framework that incorporates temporal smoothness constraints; and (c) justifying interestingness by capturing its future consequences.

3.2 Conversational Themes

In this section, we discuss the method of detecting conversational themes. We elaborate on our theme model in the following two sub-sections – first a sophisticated mixture model for theme detection incorporating time indicator based, temporal and co-participation based regularization is presented. Second, we discuss parameter estimation of this theme model.

3.2.1 Chunk-Based Mixture Model of Themes

Conversations are dynamically growing collections of comments from different participants. Hence, static keyword or tag based assignment of themes to conversations independent of time is not useful. Our model of detecting themes is therefore based on segmentation of conversations into “chunks” per time slice. A chunk is a representation of a conversation at a particular time slice and it comprises a (stemmed and stop-word eliminated) set of comments (bag-of-words) whose posting timestamps lie within the same time slice. Our goal is to associate each chunk (and hence the conversation at that time slice) with a theme distribution. We develop a sophisticated multinomial mixture model representation of chunks over different themes (a modified pLSA [18]) where the theme distributions are (a) regularized with time indicator, (b) smoothed across consecutive time slices, and (c) take into account the prior knowledge of co-participation of individuals in the associated conversations.
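The segmentation of a conversation into chunks can be sketched as follows. This is an illustrative example rather than the procedure used in our experiments: the function name `chunk_conversation` is hypothetical, time slices are assumed to have a fixed width, and the tokens are assumed to be already stemmed and stop-word filtered.

```python
from collections import Counter, defaultdict


def chunk_conversation(comments, start, slice_len):
    """Group (timestamp, tokens) comments into one bag-of-words per time slice.

    Returns a dict mapping the 1-based slice index q to a Counter of words.
    Fixed-width slices of length `slice_len` starting at `start` are an
    assumption made for this sketch.
    """
    chunks = defaultdict(Counter)
    for timestamp, tokens in comments:
        q = int((timestamp - start) // slice_len) + 1
        chunks[q].update(tokens)
    return dict(chunks)


# Toy conversation: three comments, with 60-unit time slices.
comments = [
    (0.0, ["love", "song"]),
    (30.0, ["song", "lyric"]),
    (70.0, ["concert", "ticket"]),
]
chunks = chunk_conversation(comments, start=0.0, slice_len=60.0)
```

The first two comments fall into the chunk for time slice 1 and the third into the chunk for time slice 2, so the same conversation can be assigned a different theme distribution in each slice.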

Let us assume that a conversation ci is segmented into Q non-overlapping chunks (or bag-of-words) corresponding to the Q different time slices. Let us represent the chunk corresponding to the i-th conversation at time slice q (1 ≤ q ≤ Q) as λi, q. We further assume that the words in λi, q are generated from K multinomial theme models θ1, θ2, …, θK whose distributions are hidden to us. Our goal is to determine the log likelihood that can represent our data, incorporating the three regularization techniques mentioned above. Thereafter we can maximize the log likelihood to compute the parameters of the K theme models.

However, before we estimate the parameter of the theme models, we refine our framework by regularizing the themes temporally as well as due to co-participation of participants. This is discussed in the following two sub-sections.

Temporal Regularization. We incorporate temporal characterization of themes in our theme model [27]. We conjecture that a word in the chunk can be attributed either to the textual context of the chunk λi, q, or the time slice q – for example, certain words can be highly popular on certain time slices due to related external events. Hence the theme associated with words in a chunk λi, q needs to be regularized with respect to the time slice q. We represent the chunk λi, q at time slice q with the probabilistic mixture model:

$$p(w \vert {\lambda }_{i,q},q) ={ \sum \nolimits }_{j=1}^{K}p(w,{\theta }_{ j}\vert {\lambda }_{i,q},q),$$
(4.1)

where w is a word in the chunk λi, q and θj is the jth theme. The joint probability on the right hand side can be decomposed as:

$$\begin{aligned} p(w,{\theta }_{j}\vert {\lambda }_{i,q},q) & = p(w\vert {\theta }_{j}) \cdot p({\theta }_{j}\vert {\lambda }_{i,q},q) \\ & = p(w\vert {\theta }_{j}) \cdot \left ((1 - {\gamma }_{q}) \cdot p({\theta }_{j}\vert {\lambda }_{i,q}) + {\gamma }_{q} \cdot p({\theta }_{j}\vert q)\right ),\end{aligned}$$
(4.2)

where γq is a parameter that balances the probability of a theme θj given the chunk λi, q against the probability of a theme θj given the time slice q. Note that since a conversation can alternatively be represented as a set of chunks, the collection of all chunks over all conversations is simply the set of conversations C. Hence the log likelihood of the entire collection of chunks is equivalent to the log likelihood of the M conversations in C, given the theme model. Weighting the log likelihood of the model parameters with the occurrence of different words in a chunk, we get the following equation:

$$L(C) =\log p(C) ={ \sum \nolimits }_{{\lambda }_{i,q}\in C}{\sum \nolimits }_{w\in {\lambda }_{i,q}}n(w,{\lambda }_{i,q}) \cdot \log {\sum \nolimits }_{j=1}^{K}p(w,{\theta }_{ j}\vert {\lambda }_{i,q},q),$$
(4.3)

where n(w, λi, q) is the count of the word w in the chunk λi, q and p(w, θj | λi, q, q) is given by (4.2). However, the theme distributions of two chunks of a conversation across two consecutive time slices should not be too divergent from each other. That is, they need to be temporally smooth. For a particular theme θj, this smoothness is based on minimizing the following L2 distance between its probabilities across every two consecutive time slices:

$${d}_{T}(j) ={ \sum \nolimits }_{q=2}^{Q}{(p({\theta }_{ j}\vert q) - p({\theta }_{j}\vert q - 1))}^{2}.$$
(4.4)

Incorporating this distance in (4.3) we get a new log likelihood function which smoothes all the K theme distributions across consecutive time slices:

$$\begin{array}{rcl} & & {L}_{1}(C) ={ \sum \nolimits }_{{\lambda }_{i,q}\in C}{\sum \nolimits }_{w\in {\lambda }_{i,q}}n(w,{\lambda }_{i,q}) \cdot \log {\sum \nolimits }_{j=1}^{K}(p(w,{\theta }_{ j}\vert {\lambda }_{i,q},q) \\ & & \qquad \qquad +\exp (-{d}_{T}(j))). \end{array}$$
(4.5)
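As a rough illustration of how (4.2), (4.4) and (4.5) fit together, the following NumPy sketch evaluates the temporally smoothed log likelihood for fixed parameters. The function names, the array layout (words in rows, themes in columns) and the zero-based slice index are assumptions made for this example; the sketch does not perform the parameter estimation itself.

```python
import numpy as np

def theme_mixture(p_w_theme, p_theme_chunk, p_theme_slice, gamma_q):
    """Eq. (4.2): p(w, theta_j | chunk, q) for every word w and theme j.

    p_w_theme:     (V, K) array, column j holds p(w | theta_j)
    p_theme_chunk: (K,) array, p(theta_j | chunk)
    p_theme_slice: (K,) array, p(theta_j | q)
    """
    mix = (1.0 - gamma_q) * p_theme_chunk + gamma_q * p_theme_slice
    return p_w_theme * mix  # broadcast over words, shape (V, K)

def temporal_distance(p_theme_time):
    """Eq. (4.4): d_T(j) per theme; p_theme_time is (Q, K), row q is p(theta | q)."""
    diffs = np.diff(p_theme_time, axis=0)
    return (diffs ** 2).sum(axis=0)  # shape (K,)

def smoothed_log_likelihood(chunks, p_w_theme, p_theme_time, gamma):
    """Eq. (4.5). `chunks` is a list of (counts, p_theme_chunk, q) triples,
    where counts is a (V,) word-count vector and q a zero-based slice index
    (hypothetical layout chosen for this sketch)."""
    penalty = np.exp(-temporal_distance(p_theme_time))  # (K,)
    total = 0.0
    for counts, p_theme_chunk, q in chunks:
        joint = theme_mixture(p_w_theme, p_theme_chunk, p_theme_time[q], gamma[q])
        per_word = np.log((joint + penalty).sum(axis=1))  # inner sum over themes
        total += float(counts @ per_word)                 # weight by n(w, chunk)
    return total
```

Words that do not occur in a chunk have a count of zero and therefore contribute nothing to the sum, which matches the restriction to w ∈ λi, q in (4.5).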

Now we discuss how this theme model is further regularized to incorporate prior knowledge about co-participation of individuals in the conversations.

Co-Participation Based Regularization. Our intuition behind this regularization is based on the idea that if several participants comment on a pair of chunks, then their theme distributions are likely to be closer to each other.

To recall, chunks being representations of conversations at a particular time slice, we therefore define a participant co-occurrence graph G(C, E) where each vertex in C is a conversation ci and an undirected edge ei, m exists between two conversations ci and cm if they share at least one common participant. The edges are also associated with weights ωi, m which define the fraction of common participants between two conversations. We incorporate participant-based regularization based on this graph by minimizing the distance between the edge weights of two adjacent conversations with respect to their corresponding theme distributions.

The following regularization function ensures that the theme distribution functions of conversations are very close to each other if the edge between them in the participant co-occurrence graph G has a high weight:

$$R(C) ={ \sum \nolimits }_{{c}_{i},{c}_{m}\in C}{\sum \nolimits }_{j=1}^{K}{({\omega }_{ i,m} - (1 - {(f({\theta }_{j}\vert {c}_{i}) - f({\theta }_{j}\vert {c}_{m}))}^{2}))}^{2},$$
(4.6)

where f(θj | ci) is defined as a function of the theme θj given the conversation ci, and the L2 distance between f(θj | ci) and f(θj | cm) ensures that the theme distributions of adjacent conversations are similar. Since a conversation is associated with multiple chunks, f(θj | ci) is given as in [26]:

$$f({\theta }_{j}\vert {c}_{i}) = p({\theta }_{j}\vert {c}_{i}) ={ \sum \nolimits }_{{\lambda }_{i,q}\in {c}_{i}}p({\theta }_{j}\vert {\lambda }_{i,q}) \cdot p({\lambda }_{i,q}\vert {c}_{i}).$$
(4.7)
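The regularizer in (4.6) and the conversation-level theme distribution in (4.7) can be sketched as follows. This is only an illustrative NumPy version: the function names are hypothetical, and the sum is restricted to ordered pairs of distinct conversations joined by an edge (ω > 0), which is an assumption based on the reference to adjacent conversations in the text.

```python
import numpy as np

def conversation_theme_dist(p_theme_chunks, p_chunk_given_conv):
    """Eq. (4.7): mix the chunk-level theme distributions of one conversation.

    p_theme_chunks:     (Q_i, K) array, row q is p(theta | chunk q)
    p_chunk_given_conv: (Q_i,) array, p(chunk q | conversation)
    """
    return p_chunk_given_conv @ p_theme_chunks  # shape (K,)

def participation_regularizer(weights, f):
    """Eq. (4.6) over pairs of conversations that share an edge.

    weights: (M, M) symmetric matrix of edge weights omega
    f:       (M, K) matrix, row i is f(theta | c_i)
    ASSUMPTION: only ordered pairs i != m with omega > 0 are summed.
    """
    diff_sq = (f[:, None, :] - f[None, :, :]) ** 2   # (M, M, K)
    resid = weights[:, :, None] - (1.0 - diff_sq)    # omega - (1 - squared distance)
    edge = (weights > 0) & ~np.eye(len(weights), dtype=bool)
    return float((resid ** 2)[edge].sum())
```

With this form, two conversations with an edge weight of 1 and identical theme distributions contribute zero to R(C), while a high weight combined with very different distributions is penalized.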

Now, using (4.5) and (4.6), we define the final combined optimization function which minimizes the negative of the log likelihood and also minimizes the distance between theme distributions with respect to the edge weights in the participant co-occurrence graph:

$$O(C) = -(1 - \varsigma ) \cdot {L}_{1}(C) + \varsigma \cdot R(C),$$
(4.8)

where the parameter ς controls the balance between the likelihood using the multinomial theme model and the smoothness of theme distributions over the participant graph. It is easy to note that when ς = 0, then the objective function is the temporally regularized log likelihood as in (4.5). When ς = 1, then the objective function yields themes which are smoothed over the participant co-occurrence graph. Minimizing O(C) for 0 ≤ ς ≤ 1 would give us the theme models that best fit the collection.

Now to learn the hidden parameters of the theme model in (4.8), we use a different technique of parameter estimation based on the Generalized Expectation Maximization algorithm (GEM [26]). Details of the estimation can be found in [13].

3.3 Interestingness

In this section we describe our interestingness models and then discuss a method that jointly optimizes the two types of interestingness incorporating temporal smoothness.

3.3.1 Interestingness of Participants

We pose the problem of determining the interestingness of a participant at a certain time slice as a simple one-dimensional random walk model where she communicates either based on her past history of communication behavior in the previous time slice, or relies on her independent desire of preference over different themes (random jump). This formulation is described in Fig. 4.3.

Fig. 4.3 Random walk model for determining interestingness of participants

We conjecture that the state signifying the past history of communication behavior of a participant i at a certain time slice q, denoted as \({\mathbf{A}}^{(q-1)}\), comprises the variables: (a) whether she was interesting in the previous time slice, \({\mathbf{{I}_{P}}}^{(q-1)}(i)\), (b) whether her comments in the past impacted other participants to communicate and their interestingness measures, \({\mathbf{{P}_{F}}}^{(q-1)}(i,:) \cdot {\mathbf{{I}_{P}}}^{(q-1)}\) (Footnote 2), (c) whether she followed several interesting people in conversations at the previous time slice q − 1, \({\mathbf{{P}_{L}}}^{(q-1)}(i,:) \cdot {\mathbf{{I}_{P}}}^{(q-1)}\), and (d) whether the conversations in which she participated became interesting in the previous time slice q − 1, \({\mathbf{{P}_{C}}}^{(q-1)}(i,:) \cdot {\mathbf{{I}_{C}}}^{(q-1)}\). The independent desire of a participant i to communicate is dependent on her theme distribution and the strength of the themes at the previous time slice q − 1: \({\mathbf{{P}_{T}}}^{(q-1)}(i,:) \cdot \,{\mathbf{{T}_{S}}}^{(q-1)}\).

Thus the recurrence relation for the random walk model to determine the interestingness of all participants at time slice q is given as:

$${ \mathbf{{I}_{P}}}^{(q)} = (1 - \beta ) \cdot {\mathbf{A}}^{(q-1)} + \beta \cdot ({\mathbf{{P}_{ T}}}^{(q-1)} \cdot {\mathbf{{T}_{ S}}}^{(q-1)}),$$
(4.9)

where,

$${ \mathbf{A}}^{(q-1)} = {\alpha }_{ 1} \cdot {\mathbf{{P}_{L}}}^{(q-1)} \cdot {\mathbf{{I}_{ P}}}^{(q-1)} + {\alpha }_{ 2} \cdot {\mathbf{{P}_{F}}}^{(q-1)} \cdot {\mathbf{{I}_{ P}}}^{(q-1)} + {\alpha }_{ 3} \cdot {\mathbf{{P}_{C}}}^{(q-1)} \cdot {\mathbf{{I}_{ C}}}^{(q-1)}.$$
(4.10)

Here α1, α2 and α3 are weights that determine mutual relationship between the variables of the past history of communication state A(q − 1), and β the transition parameter of the random walk that balances the impact of past history and the random jump state involving participant’s independent desire to communicate. In this paper, β is empirically set to be 0.5.
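One step of the recurrence in (4.9) and (4.10) could be written as below. This is a minimal sketch: the function name and the matrix shapes (N participants, M conversations, K themes) are assumptions inferred from the matrix products in the equations, not definitions taken from the text.

```python
import numpy as np

def participant_interestingness(P_L, P_F, P_C, P_T, I_P, I_C, T_S, alphas, beta=0.5):
    """One update of Eqs. (4.9)-(4.10), using quantities from slice q - 1.

    Assumed shapes: P_L, P_F (N, N); P_C (N, M); P_T (N, K);
    I_P (N,); I_C (M,); T_S (K,).
    """
    a1, a2, a3 = alphas
    # Past-history state A^(q-1), Eq. (4.10)
    history = a1 * (P_L @ I_P) + a2 * (P_F @ I_P) + a3 * (P_C @ I_C)
    # Random jump: each participant's theme preference weighted by theme strength
    jump = P_T @ T_S
    return (1.0 - beta) * history + beta * jump
```

Iterating this function over q = 2, …, Q, each time feeding in the previous output together with the previous conversation interestingness, yields the participant interestingness trajectory for a given choice of (α1, α2, α3).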

3.3.2 Interestingness of Conversations

Similar to interestingness of participants, we pose the problem of determining the interestingness of a conversation as a random walk where a conversation can become interesting based on two states as shown in Fig. 4.4. Hence to determine the interestingness of a conversation i at time slice q, we conjecture that it depends on whether the participants in conversation i became interesting at q − 1, given as, \({\mathbf{{P}_{C}}}^{(q-1)}{(i,:)}^{t} \cdot {\mathbf{{I}_{P}}}^{(q-1)}\), or whether the conversations belonging to the strong themes in q − 1 became interesting, which is given as, \(diag({\mathbf{{C}_{T}}}^{(q-1)}(i,:) \cdot {\mathbf{{T}_{S}}}^{(q-1)}) \cdot {\mathbf{{I}_{C}}}^{(q-1)}\). Thus the recurrence relation of interestingness of all conversations at time slice q is:

$${ \mathbf{{I}_{C}}}^{(q)} = \psi \cdot {{\mathbf{{P}_{ C}}}^{(q-1)}}^{t} \cdot {\mathbf{{I}_{ P}}}^{(q-1)} + (1 - \psi ) \cdot {diag}({\mathbf{{C}_{ T}}}^{(q-1)} \cdot {\mathbf{{T}_{ S}}}^{(q-1)}) \cdot {\mathbf{{I}_{ C}}}^{(q-1)},$$
(4.11)

where ψ is the transition parameter of the random walk that balances the impact of interestingness due to participants and due to themes. Clearly, when ψ = 1, the interestingness of a conversation depends solely on the interestingness of the participants at q − 1; and when ψ = 0, the interestingness depends solely on the theme strengths in the previous time slice q − 1.
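The update in (4.11) can be sketched in the same style. The function name and shapes are again assumptions (N participants, M conversations, K themes); the product with the diagonal matrix is written as an elementwise multiplication, which is equivalent.

```python
import numpy as np

def conversation_interestingness(P_C, C_T, I_P, I_C, T_S, psi):
    """One update of Eq. (4.11), using quantities from slice q - 1.

    Assumed shapes: P_C (N, M); C_T (M, K); I_P (N,); I_C (M,); T_S (K,).
    """
    from_participants = P_C.T @ I_P    # interestingness inherited from participants
    from_themes = (C_T @ T_S) * I_C    # diag(C_T @ T_S) @ I_C, written elementwise
    return psi * from_participants + (1.0 - psi) * from_themes
```

Setting psi to 1 or 0 isolates the participant term or the theme term respectively, which is a convenient sanity check on any implementation.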

Fig. 4.4 Random walk model for determining interestingness of conversations

3.3.3 Joint Optimization of Interestingness

We observe that the measures of interestingness of participants and of conversations described in previous sections involve several free (unknown) parameters. In order to determine optimal values of interestingness, we need to learn the weights α1, α2 and α3 in (4.10) and the transition probability for the conversations in (4.11). Moreover, the optimal measures of interestingness should ensure that the variations in their values are smooth over time. Hence we present a novel joint optimization framework, which maximizes the two interestingness measures for optimal (α1, α2, α3, ψ) and also incorporates temporal smoothness.

The joint optimization framework is based on the idea that the optimal parameters in the two interestingness equations are those which maximize the interestingness of participants and of conversations jointly. Let us denote the set of the parameters to be optimized as the vector, X = [α1, α2, α3, ψ]. We can therefore represent IP and IC as functions of X. We define the following objective function g(X) to estimate X by maximizing g(X):

$$g(\mathbf{X}) = \rho \cdot \|\mathbf{{I}_{P}}{(\mathbf{X})\|}^{2} + (1 - \rho ) \cdot \|\mathbf{{I}_{ C}}{(\mathbf{X})\|}^{2},$$
(4.12)

s.t. \(0 \leq \psi \leq 1,{\alpha }_{1},{\alpha }_{2},{\alpha }_{3} \geq 0,\mathbf{{I}_{P}} \geq 0,\mathbf{{I}_{C}} \geq 0,{\alpha }_{1} + {\alpha }_{2} + {\alpha }_{3} = 1\).

In the above function, ρ is an empirically set parameter to balance the impact of each interestingness measure in the joint optimization. Now to incorporate temporal smoothness of interestingness in the above objective function, we define an L2 norm distance for each of the two interestingness measures across all consecutive time slices q and q − 1:

$$\begin{array}{rcl} & {d}_{P} ={ \sum \nolimits }_{q=2}^{Q}(\|{\mathbf{{I}_{P}}}^{(q)}{(\mathbf{X})\|}^{2} -\|{\mathbf{{I}_{P}}}^{(q-1)}{(\mathbf{X})\|}^{2}),& \\ & {d}_{C} ={ \sum \nolimits }_{q=2}^{Q}(\|{\mathbf{{I}_{C}}}^{(q)}{(\mathbf{X})\|}^{2} -\|{\mathbf{{I}_{C}}}^{(q-1)}{(\mathbf{X})\|}^{2}).&\end{array}$$
(4.13)

We need to minimize these two distance functions to incorporate temporal smoothness. Hence we modify our objective function,

$${g}_{1}(\mathbf{X}) = \rho \cdot \|\mathbf{{I}_{P}}{(\mathbf{X})\|}^{2} + (1 - \rho ) \cdot \|\mathbf{{I}_{ C}}{(\mathbf{X})\|}^{2} +\exp (-{d}_{ P}) +\exp (-{d}_{C}),$$
(4.14)

where \(0 \leq \psi \leq 1,{\alpha }_{1},{\alpha }_{2},{\alpha }_{3} \geq 0,\mathbf{{I}_{P}} \geq 0,\mathbf{{I}_{C}} \geq 0,{\alpha }_{1} + {\alpha }_{2} + {\alpha }_{3} = 1\).

Maximizing the above function g1(X) for optimal X is equivalent to minimizing − g1(X). Thus this minimization problem can be reduced to a convex optimization form because (a) the inequality constraint functions are also convex, and (b) the equality constraint is affine. The convergence analysis of this optimization is omitted due to space limits.

Now, the minimum value of − g1(X) corresponds to an optimal X* and hence we can easily compute the optimal interestingness measures IP* and IC* for the optimal X*. Given our framework for determining interestingness of conversations, we now discuss the measures of consequence of interestingness followed by extensive experimental results.

3.4 Consequences of Interestingness

An interesting conversation is likely to have consequences. These include the (commenting) activity of the participants, their cohesiveness in communication and an effect on the interestingness of the themes. It is important to note here that the consequence is generally felt at a future point of time; that is, it is associated with a certain time lag (say, δ days) with respect to the time slice a conversation becomes interesting (say, q). Hence we ask the following three questions related to the future consequences of an interesting conversation:

Activity. Do the participants in an interesting conversation i at time q take part in other conversations relating to similar themes at a future time, q + δ? We define this as follows,

$${ \mathit{Act}}^{(q+\delta )}(i) = \frac{1} {\vert {\varphi }_{i,q+\delta }\vert }{ \sum \nolimits }_{k=1}^{\vert {\varphi }_{i,q+\delta }\vert }{\sum \nolimits }_{j=1}^{\vert {P}_{i,q}\vert }{\mathbf{{P}_{ C}}}^{(q+\delta )}(j,k),$$
(4.15)

where Pi, q is the set of participants in conversation i at time slice q, and φi, q + δ is the set of conversations m such that m ∈ φi, q + δ if the KL-divergence of the theme distribution of m at time q + δ from that of i at q is within an empirically set threshold: \(D({\mathbf{{C}_{T}}}^{(q)}(i,:)\,\Vert \,{\mathbf{{C}_{T}}}^{(q+\delta )}(m,:)) \leq \epsilon \).
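To make the activity measure concrete, the sketch below builds the set of thematically similar conversations with a KL-divergence threshold and then averages the participation counts as in (4.15). The function names are hypothetical, the small constant eps used to avoid division by zero is an assumption, and the conversation i itself is not excluded from the set, following the definition literally.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) for discrete distributions; eps smoothing is an assumption."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def similar_conversations(C_T_q, C_T_future, i, threshold):
    """Indices m whose theme distribution at q + delta is within `threshold`
    (in KL-divergence) of conversation i's distribution at q."""
    return [m for m in range(C_T_future.shape[0])
            if kl_divergence(C_T_q[i], C_T_future[m]) <= threshold]

def activity(P_C_future, participants, similar):
    """Eq. (4.15): mean participation of the given participants over the
    similar conversations; P_C_future is the (N, M) matrix at q + delta."""
    if not similar:
        return 0.0  # convention chosen here for an empty set
    block = P_C_future[np.ix_(participants, similar)]
    return float(block.sum()) / len(similar)
```

A typical call would first compute `similar_conversations(...)` for conversation i and then pass the resulting index list, together with the row indices of the participants of i at q, to `activity(...)`.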

Cohesiveness. Do the participants in an interesting conversation i at time q exhibit cohesiveness in communication (co-participate) in other conversations at a future time slice, q + δ? In order to define cohesiveness, we first define co-participation of two participants, j and k as,

$${O}^{(q+\delta )}(j;k) = \frac{{\mathbf{{P}_{P}}}^{(q+\delta )}(j,k)} {{\mathbf{{P}_{C}}}^{(q+\delta )}(j,k)},$$
(4.16)

where PP(q + δ)(j, k) is defined as the participant-participant matrix of co-participation constructed as, \({P}_{C}^{(q+\delta )} \cdot {({P}_{C}^{(q+\delta )})}^{t}\). Hence the cohesiveness in communication at time q + δ between participants in a conversation i is defined as,

$$C{o}^{(q+\delta )}(i) = \frac{1} {\vert {P}_{i,q}\vert }{\sum \nolimits }_{j=1}^{\vert {P}_{i,q}\vert }{\sum \nolimits }_{k=1}^{\vert {P}_{i,q}\vert }{O}^{(q+\delta )}(j;k).$$
(4.17)

Thematic Interestingness. Do other conversations having a similar theme distribution to the interesting conversation ci (at time q) also become interesting at a future time slice q + δ? We define this consequence as thematic interestingness and it is given by,

$${ \mathit{TInt}}^{(q+\delta )}(i) = \frac{1} {\vert {\varphi }_{i,q+\delta }\vert }{ \sum \nolimits }_{j=1}^{\vert {\varphi }_{i,q+\delta }\vert }{I}_{ C}^{(q+\delta )}(j).$$
(4.18)

To summarize, we have developed a method to characterize interestingness of conversations based on the themes, and the interestingness property of the participants. We have jointly optimized the two types of interestingness to get optimal interestingness of conversations. Finally, we have discussed three metrics which account for the consequential impact of interesting conversations. We now discuss the experimental results on this model.

3.5 Experimental Studies

The experiments performed to test our model are based on a dataset from the largest video-sharing site, YouTube, which serves as a rich source of online conversations associated with shared media elements. We crawled a total set of 132,348 videos involving 8,867,284 unique participants and 89,026,652 comments over a period of 15 weeks from June 20, 2008 to September 26, 2008. Now we discuss the results of experiments conducted to test our framework. First we present the results on the interestingness of participants, followed by that of conversations.

The results of interestingness of the participants of conversations are shown in a visualization in Fig. 4.5. We have visualized a set of 45 participants over the period of 15 weeks by pooling the top three most interesting participants from each week. The participants are shown column-wise in the visualization with decreasing mean number of comments written from left to right. The intensity of the red block represents the degree of interestingness of a participant at a particular time slice. The figure also shows plots of the comment distribution and the interestingness distributions for the participants at each time slice.

Fig. 4.5 Interestingness of 45 participants from YouTube, ordered by decreasing number of comments from left to right, is visualized. Interestingness is less affected by number of comments during periods of several external events

In order to analyze the dynamics of interestingness, we also qualitatively observe its association with a set of external events collected from The New York Times, related to Politics. The events along with their dates are shown in Table 4.2.

Table 4.2 Political events in the time period of analysis

From the results of interestingness of participants, we observe that interestingness closely follows the number of comments in weeks that are not associated with significant external events (weeks 1–4, 6–10). In other weeks, especially the last three (weeks 13, 14 and 15), there are several political happenings, and as a result the interestingness distribution of participants does not follow the comment distribution closely. Hence we conclude that during periods of significant external events, participants can become interesting despite writing fewer comments; their high interestingness can instead be explained by their preference for the conversational theme that reflects the external event.

The results of the dynamics of interestingness of conversations are shown in Fig. 4.6. We use a visualization similar to that of Fig. 4.5. Conversations are shown column-wise and time row-wise (15 weeks). A set of 45 conversations is pooled from the top three most interesting conversations of each week. From left to right, the conversations are shown in order of decreasing number of comments. We also show a temporal plot of the mean interestingness per week in order to understand the relationship of interestingness to the external happenings from Table 4.2.

Fig. 4.6 Interestingness of 45 conversations from YouTube, ordered by decreasing number of comments from left to right, is visualized. Mean interestingness of conversations increases during periods of several external events

From the visualization in Fig. 4.6, we observe that the mean interestingness of conversations increases significantly during weeks 11–15. This can be explained by the large number of political happenings in that period (Table 4.2). Hence we conclude that conversations in general become more interesting when there are significant events in the external world, an indication that online conversations reflect chatter about external happenings.

In closing for this problem, note that today there is significant online chatter, discussion and thought expressed over shared rich media artifacts, e.g., photos, videos etc., often reflecting public sentiment on socio-political events. While different media sites can provide coverage of the same information content with variable degrees of associated chatter, it becomes imperative to determine suitable methods and techniques to identify which media sources are likely to provide information that can be deemed “interesting” to a certain user. Suppose a user Alice is interested in identifying “interesting” media sources disseminating information on public sentiment regarding the 2009 elections in Iran. To serve Alice’s needs, we need to be able to characterize the chatter or conversations centered around rich media artifacts that she would find useful. We believe the proposed framework can help address such modern-day information needs on the social Web.

Nevertheless, it goes without saying that human communication activity, manifested via such “conversations” involves mutual exchange of information, and the pretext of any social interaction among a set of individuals is a reflection of how our behavior, actions and knowledge can be modified, refined, shared or amplified based on the information that flows from one individual to another. Thus, over several decades, the structure of social groups, society in general and the relationships among individuals in these societies have been shaped to a great extent by the flow of information in them. Diffusion is hence the process by which a piece of information, an idea or an innovation flows through certain communication channels over time among the individuals in a social system.

The pervasive use of online social media has made the cost of propagating a piece of information to a large audience negligible, providing extensive evidence of large-scale social contagion. Users today have access to multifaceted personal publishing modalities where such large-scale social contagion is prevalent, such as weblogs, social networking sites like MySpace and Facebook, and microblogging tools such as Twitter. These communication tools are open to frequent widespread observation by millions of users, and thus offer an inexpensive opportunity to capture large volumes of information flows at the individual level. If we want to understand the extent to which ideas are adopted via these communication affordances provided by different online social platforms, it is important to understand the extent to which people are likely to be affected by decisions of their friends and colleagues, or the extent to which “word-of-mouth” effects will take hold via communication. In the following section we propose models of diffusion of information in the light of how similar user attributes, that embody observed “homophily” in networks, affect the overall social process.

4 Information Diffusion

The central goal in this section is to investigate the relationship between homophily among users and the social process of information diffusion. By “homophily,” we refer to the idea that users in a social system tend to bond more with ones who are “similar” to them than ones who are dissimilar. The homophily principle has been extensively researched in the social sciences over the past few decades [7, 24, 25]. These studies were predominantly ethnographic and cross-sectional in nature and have revealed that homophily structures networks. That is, a person’s ego-centric social network is often homogeneous with regard to diverse social, demographic, behavioral, and intra-personal characteristics [24] or revolves around social foci such as co-location or commonly situated activities [14]. Consequently, in the context of physical networks, these works provide evidence that the existence of homophily is likely to impact the information individuals receive and propagate, the communication activities they engage in, and the social roles they form.

Homophilous relationships have also been observed on online media such as Facebook, Twitter, Digg and YouTube. These networks facilitate the sharing and propagation of information among their members. In these networks, homophilous associations can have a significant impact on very large scale social phenomena, including group evolution and information diffusion. For example, the popular social networking site Facebook allows users to engage in community activities via homophilous relationships involving common organizational affiliations. On the fast-growing social media site Twitter, several topics such as “#Elections2008,” “#MichaelJackson,” “Global Warming” etc. have historically featured extensive postings (also known as “tweets”) due to the common interests of large sets of users in politics, music and environmental issues respectively.

These networks, while diverse in terms of their affordances (i.e., what they allow users to do), share some common features. First, there exists a social action (e.g., posting a tweet on Twitter) within a shared social space (i.e., the action can be observed by all members of the users’ contact network), that facilitates a social process (e.g., diffusion of information). Second, these networks expose attributes including location, time of activity and gender to other users. Finally, these networks also reveal these user attributes, as well as the communication, to third-party users (via the API tools), thus allowing us to study the impact of a specific attribute on information diffusion within these networks.

The study of the impact of homophily on information diffusion can be valuable in several contexts. Today, due to the plethora of diverse retail products available online to customers, advertising is moving from the traditional “word-of-mouth” model, to models that exploit interactions among individuals on social networks. To this effect, previously, some studies have provided useful insights that social relationships impact the adoption of innovations and products [19]. Moreover there has been theoretical and empirical evidence in prior work [36] that indicates that individuals have been able to transmit information through a network (via messages) in a sufficiently small number of steps, due to homophily along recognizable personal identities. Hence a viral marketer attempting to advertise a new product could benefit from considering specific sets of users on a social space who are homophilous with respect to their interest in similar products or features. Other contexts in which understanding the role of homophily in information diffusion can be important, include, disaster mitigation during crisis situations, understanding social roles of users and in leveraging distributed social search.

4.1 Preliminaries

4.1.1 Social Graph Model

We define our social graph model as a directed graph G(V, E) (Footnote 3), such that V is the set of users and eij ∈ E if and only if users ui and uj are “friends” of each other (bi-directional contacts). Let us further suppose that each user ui ∈ V can perform a set of “social actions,” \(\mathcal{O} =\{ {O}_{1},{O}_{2},\ldots \}\), e.g., posting a tweet, uploading a photo on Flickr or writing on somebody’s Facebook Wall. Let the users in V also be associated with a set of attributes \(\mathcal{A} =\{ {a}_{k}\}\) (e.g., location or organizational affiliation) that are responsible for homophily. Corresponding to each value υ defined over an attribute \({a}_{k} \in \mathcal{A}\), we construct a social graph G(ak = υ) such that it consists of the users in G with the particular value of the attribute, while an edge exists between two users in G(ak) if there is an edge between them in G (Footnote 4). For example, for location, we can define sets of social graphs over users from Europe, Asia etc.
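The construction of an attribute social graph G(ak = υ) amounts to taking the subgraph of G induced by the users who share the value υ. A minimal sketch, with a hypothetical function name and a plain edge-list representation chosen for illustration:

```python
def attribute_graph(edges, attribute, value):
    """Induced subgraph of the baseline graph for one attribute value.

    edges:     iterable of (u, v) friend pairs from the baseline graph G
    attribute: dict mapping each user to their value of attribute a_k
    value:     the attribute value that selects the users to keep
    Returns (users, kept_edges) as two sets.
    """
    users = {u for u, a in attribute.items() if a == value}
    # Keep an edge only if both endpoints carry the selected value
    kept = {(u, v) for u, v in edges if u in users and v in users}
    return users, kept
```

For instance, with a location attribute, calling the function once per continent yields the family of attribute social graphs over users from Europe, Asia and so on.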

In this section, our social graph model is based on the social media Twitter. Twitter features a micro-blogging service that allows users to post short content, known as “tweets,” often comprising URLs usually encoded via bit.ly, tinyurl, etc. The particular “social action” in this context is the posting of a tweet; also popularly called “tweeting”. Users can also “follow” other users; hence if user ui follows uj, Twitter allows ui to subscribe to the tweets of uj via feeds; ui is then also called a “follower” of uj. Two users are denoted as “friends” on Twitter if they “follow” each other. Note that, in the context of Twitter, using the bi-directional “friend” link is more useful compared to the uni-directional “follow” link because the former is more likely to be robust to spam—a normal user is less likely to follow a spam-like account. Further, for the particular dataset of Twitter, we have considered a set of four attributes associated with the users:

Location of users, extracted using the timezone attribute of Twitter users. Specifically, the values of location correspond to the different continents, e.g., Asia, Europe and North America.

Information roles of users: we consider three categories of roles, “generators,” “mediators” and “receptors.” Generators are users who create several posts (or tweets) but receive few responses from other users (via the @ tag on Twitter, which is typically used with the username to respond to a particular user, e.g., @BillGates). Receptors are those who create fewer posts but receive several posts as responses. Mediators are users who lie between these two categories.

Content creation of users: we use the two content creation roles, “meformer” (users who primarily post content relating to self) and “informer” (users posting content about external happenings), as discussed in [29].

Activity behavior of users, i.e., the distribution of a particular social action over a certain time period. We consider the mean number of posts (tweets) per user over 24 h and compute similarities between pairs of users based on the Kullback-Leibler (KL) divergence measure of comparing across distributions.

4.1.2 Attribute Homophily

Attribute homophily [24, 25] is defined as the tendency of users in a social graph to associate and bond with others who are “similar” to them along a certain attribute or contextual dimension e.g., age, gender, race, political view or organizational affiliation. Specifically, a pair of users can be said to be “homophilous” if one of their attributes match in a proportion greater than that in the network of which they are a part. Hence in our context, for a particular value of \({a}_{k} \in \mathcal{A}\), the users in the social graph G(ak) corresponding to that value are homophilous to each other.

4.1.3 Topic Diffusion

Diffusion with respect to a particular topic at a certain time is given as the flow of information on the topic from one user to another via the social graph, and based on a particular social action. Specifically:

Definition 4.1.

Given two users ui and uj in the baseline social graph G such that eij ∈ E, there is diffusion of information on topic θ from uj to ui if uj performs a particular social action Or related to θ at a time slice tm − 1 and is succeeded by ui in performing the same action on θ at the next time slice tm, where tm − 1 < tm (Footnote 5).

Further, topic diffusion subject to homophily along the attribute ak is defined as the diffusion over the attribute social graph G(ak).

In the context of Twitter, topic diffusion can manifest itself through three types of evidence: (1) users posting tweets using the same URL, (2) users tweeting with the same hashtag (e.g., #MichaelJackson) or a set of common keywords, and (3) users using the re-tweet (RT) symbol. We utilize all three cases of topic diffusion in this work.

4.1.4 Diffusion Series

In order to characterize diffusion, we now define a topology called a diffusion series (Footnote 6) that summarizes diffusion in a social graph for a given topic over a period of time. Formally:

Definition 4.2.

A diffusion series sN(θ) on topic θ and over time slices t1 to tN is defined as a directed acyclic graph where the nodes represent a subset of users in the baseline social graph G, who are involved in a specific social action Or over θ at any time slice between t1 and tN.

Note, in a diffusion series sN(θ) a node represents an occurrence of a user ui creating at least one instance of the social action Or about θ at a certain time slice tm such that t1 ≤ tm ≤ tN. Nodes are organized into “slots”; where nodes associated with the same time slice tm are arranged into the same slot lm. Hence it is possible that the same user is present at multiple slots in the series if s/he tweets about the same topic θ at different time slices. Additionally, there are edges between nodes across two adjacent slots, indicating that user ui in slot lm performs the social action Or on θ at tm, after her friend uj has performed action on the same topic θ at the previous time slice tm − 1 (i.e., at slot lm − 1). There are no edges between nodes at the same slot lm: a diffusion series sN(θ) in this work captures diffusion on topic θ across time slices, and does not include possible flow occurring at the same time slice.

For the Twitter dataset, we have chosen the granularity of the time slice tm to be sufficiently small, i.e., a day to capture the dynamics of diffusion. Thus all the users at slot lm tweet about θ on the same day; and two consecutive slots have a time difference of 1 day. Examples of different diffusion series constructed on topics from Twitter have been shown in Fig. 4.7.
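The slot and edge structure described above can be illustrated with a short sketch. It assumes that actions on a single topic are given as (user, slot index) pairs with consecutive integers for consecutive days, and that friendships are stored as a dict from each user to the set of their friends; both representations and the function name are hypothetical choices for this example.

```python
from collections import defaultdict

def diffusion_edges(actions, friends):
    """Arrange users into slots and link friends across adjacent slots.

    actions: iterable of (user, slot) pairs for one topic, slot an integer day index
    friends: dict mapping a user to the set of their (bi-directional) friends
    Returns (slots, edges): slots maps a slot index to the set of users acting
    in it; edges is a set of ((u_j, m - 1), (u_i, m)) pairs.
    """
    slots = defaultdict(set)
    for user, slot in actions:
        slots[slot].add(user)

    edges = set()
    for m in sorted(slots):
        for u_i in slots[m]:
            # u_i acts at slot m after a friend u_j acted at slot m - 1
            for u_j in slots.get(m - 1, ()):
                if u_j in friends.get(u_i, ()):
                    edges.add(((u_j, m - 1), (u_i, m)))
    return dict(slots), edges
```

Because edges only ever point from slot m − 1 to slot m, the resulting graph is acyclic by construction, and a user who acts on several days appears as a separate node in each corresponding slot.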

Fig. 4.7

Example of different diffusion series from Twitter on three different topics. The nodes are users involved in diffusion while the edges represent “friend links” connecting two users

Since each topic θ can have multiple disconnected diffusion series sN(θ) at any given time slice tN, we call the family of all diffusion series a diffusion collection \({\mathcal{S}}_{N}(\theta ) =\{ {s}_{N}(\theta )\}\). Corresponding to each value of the attribute ak, the diffusion collection over the attribute social graph G(ak) at tN is similarly given as \({\mathcal{S}}_{N;{a}_{k}}(\theta ) =\{ {s}_{N;{a}_{k}}(\theta )\}\).
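The construction of a diffusion collection can be illustrated with a short Python sketch. This is a minimal sketch and not the implementation used in this work: the function name `build_diffusion_collection`, the encoding of a node as a `(user, slot)` pair, the representation of G as a dictionary of friend sets, and the choice to count an isolated node as its own series are all assumptions made for illustration.

```python
from collections import defaultdict


def build_diffusion_collection(actions, friends):
    """Build the nodes, edges and series of a diffusion collection.

    actions: iterable of (user, slot) pairs; slot is an integer day index
             on which the user performed the social action on the topic.
    friends: dict mapping a user to the set of her friends (assumed form of G).
    """
    nodes = set(actions)
    users_at_slot = defaultdict(set)
    for user, slot in nodes:
        users_at_slot[slot].add(user)

    # Edge (uj, m-1) -> (ui, m): friend uj acted one slot before ui did.
    edges = set()
    for user, slot in nodes:
        earlier = users_at_slot.get(slot - 1, set())
        for friend in friends.get(user, set()):
            if friend in earlier:
                edges.add(((friend, slot - 1), (user, slot)))

    # Each weakly connected component is treated as one diffusion series.
    neighbours = defaultdict(set)
    for parent, child in edges:
        neighbours[parent].add(child)
        neighbours[child].add(parent)

    series, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        series.append(component)
    return nodes, edges, series


# Toy data (hypothetical): user "a" tweets on days 0 and 2.
toy_actions = [("a", 0), ("b", 1), ("c", 1), ("d", 1), ("a", 2)]
toy_friends = {"a": {"b"}, "b": {"a"}, "c": {"a"}, "d": set()}
toy_nodes, toy_edges, toy_series = build_diffusion_collection(toy_actions, toy_friends)
```

Because edges only connect slot m − 1 to slot m, the resulting graph is acyclic by construction, and the same user can appear as several nodes, one per slot.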

4.2 Problem Statement

Given (1) a baseline social graph G(V, E); (2) a set of social actions \(\mathcal{O} =\{ {O}_{1},{O}_{2},\ldots \}\) that can be performed by users in V; and (3) a set of attributes \(\mathcal{A} =\{ {a}_{k}\}\) that are shared by users in V, we perform the following two preliminary steps. First, we construct the attribute social graphs {G(ak)}, for all values of \({a}_{k} \in \mathcal{A}\). Second, we construct diffusion collections corresponding to G and {G(ak)} for a given topic θ (on which diffusion is to be estimated over time slices t1 to tN) and a particular social action Or: these are given as \({\mathcal{S}}_{N}(\theta )\) and \(\{{\mathcal{S}}_{N;{a}_{k}}(\theta )\}\) respectively. The technical problem addressed in this section involves the following:

  1. Characterization: Based on each of the diffusion collections \({\mathcal{S}}_{N}(\theta )\) and \(\{{\mathcal{S}}_{N;{a}_{k}}(\theta )\}\), we extract diffusion characteristics on θ at time slice tN given as: dN(θ) and \(\{{\mathbf{d}}_{N;{a}_{k}}(\theta )\}\) respectively (Sect. 4.4.3);

  2. Prediction: We predict the set of users likely to perform the same social action at the next time slice tN + 1 corresponding to each of the diffusion collections \({\mathcal{S}}_{N}(\theta )\) and \(\{{\mathcal{S}}_{N;{a}_{k}}(\theta )\}\). This gives the diffusion collections at tN + 1: \(\hat{{\mathcal{S}}}_{N+1}(\theta )\) and \(\{\hat{{\mathcal{S}}}_{N+1;{a}_{k}}(\theta )\}\forall {a}_{k} \in \mathcal{A}\) (Sect. 4.4.4);

  3. Distortion Measurement: We extract diffusion characteristics at tN + 1 over the (predicted) diffusion collections, \(\hat{{\mathcal{S}}}_{N+1}(\theta )\) and \(\{\hat{{\mathcal{S}}}_{N+1;{a}_{k}}(\theta )\}\), given as \(\hat{{\text{ d}}}_{N+1}(\theta )\) and \(\{\hat{{\text{ d}}}_{N+1;{a}_{k}}(\theta )\}\) respectively. We then quantify the impact of attribute homophily on diffusion based on two kinds of distortion measurements on \(\hat{{\text{ d}}}_{N+1}(\theta )\) and \(\{\hat{{\text{ d}}}_{N+1;{a}_{k}}(\theta )\}\). A particular attribute \({a}_{k} \in \mathcal{A}\) would have an impact on diffusion if \(\hat{{\text{ d}}}_{N+1;{a}_{k}}(\theta )\), averaged over all possible values of ak: (a) has lower distortion with respect to the actual characteristics (i.e., \({\text{ d}}_{N+1}(\theta )\)); and (b) can quantify external time series (search, news trends) better, compared to either \(\hat{{\text{ d}}}_{N+1}(\theta )\) or \(\{\hat{{\text{ d}}}_{N+1;{a}_{k}'}(\theta )\}\), where k′ ≠ k (Sect. 4.4.7).

4.3 Characterizing Diffusion

We describe eight different measures for quantifying diffusion characteristics given by the baseline and the attribute social graphs on a certain topic and via a particular social action.

Volume: Volume is a notion of the overall degree of contagion in the social graph. For the diffusion collection \({\mathcal{S}}_{N}(\theta )\) over the baseline social graph G, we formally define volume vN(θ) with respect to θ and at time slice tN as the ratio of nN(θ) to ηN(θ), where nN(θ) is the total number of users (nodes) in the diffusion collection \({\mathcal{S}}_{N}(\theta )\), and ηN(θ) is the number of users in the social graph G associated with θ.

Participation: Participation pN(θ) at time slice tN [4] is the ratio of the number of non-leaf nodes in the diffusion collection \({\mathcal{S}}_{N}(\theta )\) to ηN(θ).

Dissemination: Dissemination δN(θ) at time slice tN is the ratio of the number of users in the diffusion collection \({\mathcal{S}}_{N}(\theta )\) who do not have a parent node to ηN(θ). In other words, these are the “seed users”, i.e., the ones who get involved in the diffusion due to some unobservable external influence, e.g., a news event.

Reach: Reach rN(θ) at time slice tN [23] is defined as the ratio of the mean of the number of slots to the sum of the number of slots in all diffusion series belonging to \({\mathcal{S}}_{N}(\theta )\).

Spread: For the diffusion collection \({\mathcal{S}}_{N}(\theta )\), spread sN(θ) at time slice tN [23] is defined as the ratio of the maximum number of nodes at any slot in \({s}_{N}(\theta ) \in {\mathcal{S}}_{N}(\theta )\) to nN(θ).

Cascade Instances: Cascade instances cN(θ) at time slice tN is defined as the ratio of the number of slots in the diffusion series \({s}_{N}(\theta ) \in {\mathcal{S}}_{N}(\theta )\) where the number of new users at a slot lm (i.e., non-occurring at a previous slot) is greater than that at the previous slot lm − 1, to LN(θ), the number of slots in \({s}_{N}(\theta ) \in {\mathcal{S}}_{N}(\theta )\).

Collection Size: Collection size αN(θ) at time slice tN is the ratio of the number of diffusion series sN(θ) in \({\mathcal{S}}_{N}(\theta )\) over topic θ, to the total number of connected components in the social graph G.

Rate: We define rate γN(θ) at time slice tN as the “speed” at which information on θ diffuses in the collection \({\mathcal{S}}_{N}(\theta )\). It depends on the difference between the median time of posting of tweets at all consecutive slots lm and lm − 1 in the diffusion series \({s}_{N}(\theta ) \in {\mathcal{S}}_{N}(\theta )\). Hence it is given as:

$${\gamma }_{N}(\theta ) = 1\Big/\Big(1 + \frac{1} {{L}_{N}(\theta )}{\sum \nolimits }_{{l}_{m-1},{l}_{m}\in {\mathcal{S}}_{N}(\theta )}\big({\overline{t}}_{m}(\theta ) -{\overline{t}}_{m-1}(\theta )\big)\Big),$$
(4.19)

where \({\overline{t}}_{m}(\theta )\) and \({\overline{t}}_{m-1}(\theta )\) are measured in seconds and \({\overline{t}}_{m}(\theta )\) corresponds to the median time of tweet at slot lm in \({s}_{N}(\theta ) \in {\mathcal{S}}_{N}(\theta )\).
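To make (4.19) concrete, the following Python sketch evaluates the rate for a single diffusion series. It is only an illustration under stated assumptions: the function name `diffusion_rate` and the input format (one list of posting times in seconds per slot, ordered by slot) are hypothetical, and LN(θ) is taken to be the number of slots of that one series.

```python
from statistics import median


def diffusion_rate(post_times_by_slot):
    """Rate of one diffusion series, following the form of (4.19).

    post_times_by_slot: list of lists; entry m holds the posting times
    (in seconds) of the tweets at slot l_m, with slots in temporal order.
    """
    medians = [median(times) for times in post_times_by_slot]
    num_slots = len(medians)
    # Sum of differences between median times at consecutive slots.
    total_gap = sum(later - earlier for earlier, later in zip(medians, medians[1:]))
    return 1.0 / (1.0 + total_gap / num_slots)


# Toy data (hypothetical): three daily slots.
toy_rate = diffusion_rate([[0, 100, 200], [86400, 86500, 86600], [172800, 172900]])
```

Within a single series the consecutive differences telescope, so the sum equals the median time of the last slot minus that of the first slot. A series with a single slot has no consecutive pair and therefore a rate of 1.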

These diffusion measures thus characterize diffusion at time slice tN over \({\mathcal{S}}_{N}(\theta )\) as the vector: \({\text{ d}}_{N}(\theta ) = [{v}_{N}(\theta ),{p}_{N}(\theta ),{\delta }_{N}(\theta ),{r}_{N}(\theta ),{s}_{N}(\theta ),{c}_{N}(\theta ),{\alpha }_{N}(\theta ),{\gamma }_{N}(\theta )]\). Similarly, we compute the diffusion measures vector over \(\{{\mathcal{S}}_{N;{a}_{k}}(\theta )\}\), given by: \(\{{\text{ d}}_{N;{a}_{k}}(\theta )\}\), corresponding to each value of ak.
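Four of the eight measures can be sketched in Python as follows. This is a minimal sketch, not the authors' code. It assumes that a diffusion collection is given as a list of series, each a set of `(user, slot)` nodes, together with a set of `(parent, child)` edges; it also assumes that nN(θ) counts nodes (so a user appearing at two slots is counted twice) and that `eta` stands for ηN(θ). The function name `diffusion_measures` is hypothetical, and reach, cascade instances, collection size and rate are omitted for brevity.

```python
from collections import Counter


def diffusion_measures(series, edges, eta):
    """Volume, participation, dissemination and spread of a collection.

    series: non-empty list of sets of (user, slot) nodes, one set per series.
    edges:  set of (parent_node, child_node) pairs across adjacent slots.
    eta:    number of users in G associated with the topic (assumed given).
    """
    nodes = set().union(*series)
    n = len(nodes)
    has_parent = {child for _, child in edges}
    has_child = {parent for parent, _ in edges}  # the non-leaf nodes

    # Largest number of nodes found at any single slot of any series.
    largest_slot = max(
        max(Counter(slot for _, slot in one_series).values())
        for one_series in series
    )
    return {
        "volume": n / eta,
        "participation": len(has_child) / eta,
        "dissemination": len(nodes - has_parent) / eta,
        "spread": largest_slot / n,
    }


# Toy collection (hypothetical): one series of four nodes, one isolated node.
toy_series = [{("a", 0), ("b", 1), ("c", 1), ("a", 2)}, {("d", 1)}]
toy_edges = {
    (("a", 0), ("b", 1)),
    (("a", 0), ("c", 1)),
    (("b", 1), ("a", 2)),
}
toy_measures = diffusion_measures(toy_series, toy_edges, eta=10)
```

In the toy collection the seed nodes are `("a", 0)` and `("d", 1)`, which is what the dissemination measure counts.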

4.4 Prediction Framework

In this section we present our method of predicting the users who would be part of the diffusion collections at a future time slice for the baseline and attribute social graphs. Our method comprises the following steps. (1) Given the observed diffusion collections until time slice tN (i.e., \({\mathcal{S}}_{N}(\theta )\) and \({\mathcal{S}}_{N;{a}_{k}}(\theta )\)), we first propose a probabilistic framework based on Dynamic Bayesian networks [30] to predict the users likely to perform the social action Or at the next time slice tN + 1. This yields the users at slot lN + 1 in the different diffusion series at tN + 1. (2) Next, these predicted users give the diffusion collections at tN + 1: \(\hat{{\mathcal{S}}}_{N+1}(\theta )\) and \(\{\hat{{\mathcal{S}}}_{N+1;{a}_{k}}(\theta )\}\).

We present a Dynamic Bayesian network (DBN) representation of a particular social action by a user over time, which helps us predict the set of users likely to perform the social action at a future time (Fig. 4.8a). Specifically, for any time slice tN, a given topic θ and a given social action, the DBN captures the relationship between three nodes:

Fig. 4.8

(a) Structure of the Dynamic Bayesian network used for modeling social action of a user ui. The diagram shows the relationship between environmental features (Fi, N(θ)), hidden states (Si, N(θ)) and the observed action (Oi, N(θ)). (b) State transition diagram showing the “vulnerable” (Si = 1) and “indifferent” states (Si = 0) of a user ui

Environmental Features. That is, the set of contextual variables that affect a user ui’s decision to perform the action on θ at a future time slice tN + 1 (given by Fi, N(θ)). It comprises three different measures: (1) ui’s degree of activity on θ in the past, given as the ratio of the number of posts (or tweets) by ui on θ, to the total number of posts between t1 and tN; (2) mean degree of activity of ui’s friends in the past, given as the ratio of the number of posts by ui’s friends on θ, to the total number of posts by them between t1 and tN; and (3) popularity of topic θ at the previous time slice tN, given as the ratio of the number of posts by all users on θ, to the total number of posts at tN.

States. That is, latent states (Si, N(θ)) of the user ui responsible for her involvement in diffusion at tN + 1. Our motivation in conceiving the latent states comes from the observation that, in the context of Twitter, a user can tweet on a topic under two kinds of circumstances: first, when she observes her friend doing so already, making her vulnerable to diffusion; and second, when her tweeting is indifferent to the activities of her friends. Hence the state node at tN + 1 that impacts ui’s action can take two values: the vulnerable and the indifferent state (Fig. 4.8b).

Observed Action. That is, evidence (Oi, N(θ)) of the user ui performing (or not performing) the action, corresponding values being: {1, 0} respectively.
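The three environmental features can be illustrated with a small Python sketch. It is a sketch under assumptions rather than the feature extraction used in this work: posts are assumed to be `(author, topic_label, slot)` tuples with a single topic label each, the denominator of the first feature is assumed to be ui's own total number of posts, and the names `environmental_features`, `_ratio` and `_on_topic` are hypothetical.

```python
def _ratio(numerator, denominator):
    # Guard against users or slots without any posts.
    return numerator / denominator if denominator else 0.0


def _on_topic(posts, topic):
    return sum(1 for _, label, _ in posts if label == topic)


def environmental_features(user, topic, posts, friends, last_slot):
    """Return the feature triple F for one user and one topic.

    posts:     list of (author, topic_label, slot) observed between t_1 and t_N.
    friends:   dict mapping a user to the set of her friends.
    last_slot: index of the most recent observed time slice t_N.
    """
    own = [p for p in posts if p[0] == user]
    circle = [p for p in posts if p[0] in friends.get(user, set())]
    latest = [p for p in posts if p[2] == last_slot]
    return (
        _ratio(_on_topic(own, topic), len(own)),        # (1) own activity
        _ratio(_on_topic(circle, topic), len(circle)),  # (2) friends' activity
        _ratio(_on_topic(latest, topic), len(latest)),  # (3) topic popularity at t_N
    )


# Toy data (hypothetical): user "u" with friends "f" and "g".
toy_posts = [
    ("u", "x", 1), ("u", "y", 1), ("u", "x", 2),
    ("f", "x", 1), ("f", "y", 2), ("g", "y", 2), ("h", "x", 2),
]
toy_features = environmental_features("u", "x", toy_posts, {"u": {"f", "g"}}, last_slot=2)
```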

Now we show how to predict the probability of the observed action at tN + 1 (i.e., \(\hat{{O}}_{i,N+1}(\theta )\)) using Fi, N(θ) and Si, N + 1(θ), based on the DBN model. Our goal is to estimate the following expectation (Footnote 7):

$$\hat{{O}}_{i,N+1} = E({O}_{i,N+1}\vert {O}_{i,N},{\mathbf{F}}_{i,N}).$$
(4.20)

This involves computing P(Oi, N + 1 | Oi, N, Fi, N). This conditional probability can be written as an inference equation using the temporal dependencies given by the DBN and assuming first order Markov property:

$$\begin{array}{rcl} & & P({O}_{i,N+1}\vert {O}_{i,N},{\mathbf{F}}_{i,N}) \\ & & \quad ={ \sum \nolimits }_{{S}_{i,N+1}}\left [P({O}_{i,N+1}\vert {S}_{i,N+1},{O}_{i,N},{\mathbf{F}}_{i,N}).P({S}_{i,N+1}\vert {O}_{i,N},{\mathbf{F}}_{i,N})\right ]. \\ & & \quad ={ \sum \nolimits }_{{S}_{i,N+1}}P({O}_{i,N+1}\vert {S}_{i,N+1}).P({S}_{i,N+1}\vert {S}_{i,N},{\mathbf{F}}_{i,N}). \end{array}$$
(4.21)

Our prediction task thus involves two parts: predicting the probability of the hidden states given the environmental features, P(Si, N + 1 | Si, N, Fi, N); and predicting the probability density of the observation nodes given the hidden states, \(P({O}_{i,N+1}\vert {S}_{i,N+1})\), and thereby the expected value of observation nodes \(\hat{{O}}_{i,N+1}\). These two steps are discussed in the following subsections.
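The last line of (4.21) is a marginalization over the two hidden states, which a few lines of Python can illustrate. The sketch below assumes that both factors have already been estimated and are supplied as plain lists; the function name `expected_action` and the numerical values are hypothetical. Since the observed action is binary, its expectation equals the probability that the action is performed.

```python
def expected_action(emission, state_probability):
    """Evaluate the sum over S_{i,N+1} in (4.21) for a binary action.

    emission[s][o]:       P(O_{i,N+1} = o | S_{i,N+1} = s)
    state_probability[s]: P(S_{i,N+1} = s | S_{i,N}, F_{i,N})
    Returns P(O_{i,N+1} = 1 | ...), which is also E(O_{i,N+1} | ...).
    """
    return sum(
        emission[state][1] * probability
        for state, probability in enumerate(state_probability)
    )


# Hypothetical numbers: state 0 is "indifferent", state 1 is "vulnerable".
toy_emission = [[0.9, 0.1], [0.3, 0.7]]
toy_state_probability = [0.6, 0.4]
toy_expectation = expected_action(toy_emission, toy_state_probability)
```

With these numbers the expectation is 0.1 · 0.6 + 0.7 · 0.4 = 0.34. Turning this value into a yes/no prediction requires a decision rule such as a threshold, which is an additional assumption not specified here.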

4.5 Predicting Hidden States

Using Bayes’ rule, we apply conditional independence between the hidden states and the environmental features at the same time slice (Fig. 4.8a). The probability of the hidden states at tN + 1 given the environmental features at tN, i.e., P(Si, N + 1 | Si, N, Fi, N) can be written as:

$$P({S}_{i,N+1}\vert {S}_{i,N},{\mathbf{F}}_{i,N}) \propto P({\mathbf{F}}_{i,N}\vert {S}_{i,N}).P({S}_{i,N+1}\vert {S}_{i,N}).$$
(4.22)

Now, to estimate the probability density of P(Si, N + 1 | Si, N, Fi, N) using (4.22) we assume that the hidden states Si, N + 1 follow a multinomial distribution over the environmental features Fi, N with parameter ϕi, N, and a conjugate Dirichlet prior over the previous state Si, N with parameter λi, N + 1. The optimal parameters of the pdf of P(Si, N + 1 | Si, N, Fi, N) can now be estimated using MAP:

$$\begin{array}{rcl} & & \mathcal{L}(P({S}_{i,N+1}\vert {S}_{i,N},{\mathbf{F}}_{i,N})) \\ & & \quad =\log (P({\mathbf{F}}_{i,N}\vert {S}_{i,N})) +\log (P({S}_{i,N+1}\vert {S}_{i,N})) \\ & & \quad =\log \textrm{ multinom}(\textrm{ vec}({\mathbf{F}}_{i,N});{\phi }_{i,N}) \\ & & \qquad +\log \textrm{ Dirichlet}(\textrm{ vec}({S}_{i,N+1});{\lambda }_{i,N+1}) \\ & & \quad =\log \frac{\left ({\sum \nolimits }_{jk}{\mathbf{F}}_{i,N;jk}\right )!} {{\prod \nolimits }_{jk}{\mathbf{F}}_{i,N;jk}!} {\prod \nolimits }_{jk}{\phi }_{i,N;jk}^{{\mathbf{F}}_{i,N;jk} } +\log \frac{1} {B({\lambda }_{i,N+1})}{\prod \nolimits }_{jl}{S}_{i,N+1;jl}^{{S}_{i,N;jl} } \\ & & \quad ={ \sum \nolimits }_{jk}{\mathbf{F}}_{i,N;jk}.\log {\phi }_{i,N;jk} +{ \sum \nolimits }_{jl}{S}_{i,N;jl}.\log {S}_{i,N+1;jl} + \textrm{ const.} \end{array}$$
(4.23)

where \(B({\lambda }_{i,N+1})\) is a beta function with the parameter λi, N + 1. Maximizing the log likelihood in (4.23) hence yields the optimal parameters for the pdf of P(Si, N + 1 | Si, N, Fi, N).

4.6 Predicting Observed Action

To estimate the probability density of the observation nodes given the hidden states, i.e., \(P({O}_{i,N+1}\vert {S}_{i,N+1})\), we adopt a generative model approach and train two discriminative Hidden Markov Models: one corresponding to the class when ui performs the action, and the other when she does not. Based on observed actions from t1 to tN, we learn the parameters of the HMMs using the Baum-Welch algorithm. We then use the emission probability \(P({O}_{i,N+1}\vert {S}_{i,N+1})\) given by the observation-state transition matrix to determine the most likely sequence at tN + 1 using the Viterbi algorithm. We finally substitute the emission probability \(P({O}_{i,N+1}\vert {S}_{i,N+1})\) from above and P(Si, N + 1 | Si, N, Fi, N) from (4.23) into (4.21) to compute the expectation E(Oi, N + 1 | Oi, N, Fi, N) and get the estimated observed action of ui, \(\hat{{O}}_{i,N+1}\), as in (4.20). The details of this estimation can be found in [33].
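For readers unfamiliar with the decoding step, the following Python sketch shows a generic Viterbi decoder for a discrete HMM. It illustrates only the decoding technique named above, under the assumption that the start, transition and emission probabilities have already been learned (e.g., by Baum-Welch, which is not shown); it is not the estimation procedure of [33], and the function name `viterbi` and all numbers are hypothetical.

```python
import math


def _log(probability):
    # Log-space avoids numerical underflow on long sequences.
    return math.log(probability) if probability > 0 else float("-inf")


def viterbi(observations, start, transition, emission):
    """Most likely hidden state sequence of a discrete HMM.

    observations:   non-empty list of observation indices (here 0 or 1).
    start[s]:       P(first state = s)
    transition[p][s]: P(next state = s | current state = p)
    emission[s][o]: P(observation = o | state = s)
    """
    states = range(len(start))
    score = [_log(start[s]) + _log(emission[s][observations[0]]) for s in states]
    back_pointers = []
    for obs in observations[1:]:
        pointers, new_score = [], []
        for cur in states:
            best_prev = max(states, key=lambda p: score[p] + _log(transition[p][cur]))
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + _log(transition[best_prev][cur]) + _log(emission[cur][obs])
            )
        back_pointers.append(pointers)
        score = new_score

    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state model: 0 = "indifferent", 1 = "vulnerable".
toy_path = viterbi(
    [0, 0, 1, 1],
    start=[0.5, 0.5],
    transition=[[0.8, 0.2], [0.3, 0.7]],
    emission=[[0.9, 0.1], [0.2, 0.8]],
)
```

For this toy model the decoded path switches from state 0 to state 1 at the point where the observed action changes from 0 to 1.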

We now use the estimated social actions \(\hat{{O}}_{i,N+1}(\theta )\) of all users at time slice tN + 1 to get a set of users who are likely to be involved in the diffusion process at tN + 1 for both the baseline and the attribute social graphs. Next we use G and {G(ak)} to associate edges between the predicted user set and the users in each diffusion series corresponding to the diffusion collections at tN. This gives the diffusion collections at tN + 1, i.e., \(\hat{{\mathcal{S}}}_{N+1}(\theta )\) and \(\{\hat{{\mathcal{S}}}_{N+1;{a}_{k}}(\theta )\}\) (Sect. 4.4.1.4).

4.7 Distortion Measurement

We now compute the diffusion feature vectors \(\hat{{\text{ d}}}_{N+1}(\theta )\) and \(\{\hat{{\text{ d}}}_{N+1;{a}_{k}}(\theta )\}\) based on the predicted diffusion collections \(\hat{{\mathcal{S}}}_{N+1}(\theta )\) and \(\{\hat{{\mathcal{S}}}_{N+1;{a}_{k}}(\theta )\}\) from Sect. 4.4.4. To quantify the impact of attribute homophily on diffusion at tN + 1 corresponding to \({a}_{k} \in \mathcal{A}\), we define two kinds of distortion measures: (1) saturation measurement, and (2) utility measurement metrics.

Saturation Measurement. We compare distortion between the predicted and actual diffusion characteristics at tN + 1. The saturation measurement metric is thus given as \(1 - D(\hat{{\text{ d}}}_{N+1}(\theta ),{\text{ d}}_{N+1}(\theta ))\) for the baseline social graph and as \(1 - D(\hat{{\text{ d}}}_{N+1;{a}_{k}}(\theta ),{\text{ d}}_{N+1}(\theta ))\), averaged over all values of \({a}_{k} \in \mathcal{A}\), for the attribute social graphs. Here \({\text{ d}}_{N+1}(\theta )\) gives the actual diffusion characteristics at tN + 1 and D(A, B) is the Kolmogorov-Smirnov (KS) statistic, defined as max( | A − B | ).
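The saturation metric can be sketched in Python as follows. This is an illustration only: D is implemented as the maximum absolute element-wise difference between two measure vectors, following the definition above, and the function names `ks_distance`, `saturation` and `mean_attribute_saturation` as well as the numerical vectors are hypothetical.

```python
def ks_distance(a, b):
    """D(A, B) = max |A - B| over corresponding entries of two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def saturation(predicted, actual):
    """Saturation measurement 1 - D(predicted, actual)."""
    return 1.0 - ks_distance(predicted, actual)


def mean_attribute_saturation(predicted_by_value, actual):
    """Average the saturation over all values of one attribute a_k."""
    scores = [saturation(vector, actual) for vector in predicted_by_value.values()]
    return sum(scores) / len(scores)


# Hypothetical eight-dimensional measure vectors.
toy_actual = [0.45, 0.25, 0.20, 0.10, 0.50, 0.30, 0.05, 0.55]
toy_predicted = [0.50, 0.20, 0.20, 0.10, 0.40, 0.30, 0.05, 0.60]
toy_saturation = saturation(toy_predicted, toy_actual)
```

Here the largest discrepancy is in the fifth entry (0.40 versus 0.50), so the saturation is approximately 0.9.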

Utility Measurement. We describe two utility measurement metrics for quantifying the relationship between the predicted diffusion characteristics \(\hat{{\text{ d}}}_{N+1}(\theta )\) or \(\{\hat{{\text{ d}}}_{N+1;{a}_{k}}(\theta )\}\) on topic θ, and the trends of the same topic θ obtained from external time series. We collect two kinds of external trends: (1) search trends, i.e., the search volume of θ over t1 to tN + 1 (Footnote 8); and (2) news trends, i.e., the frequency of archived news articles about θ over the same period (Footnote 9). The utility measurement metrics are defined as follows:

Search trend measurement: We first compute the cumulative distribution function (CDF) of diffusion volume as \({E}_{N+1}^{D}(\theta ) ={ \sum \nolimits }_{m\leq (N+1)}\vert {l}_{m}(\hat{{\mathcal{S}}}_{N+1}(\theta ))\vert /{Q}_{D}\), where \(\vert {l}_{m}(\hat{{\mathcal{S}}}_{N+1}(\theta ))\vert \) is the number of nodes at slot lm in the collection \(\hat{{\mathcal{S}}}_{N+1}(\theta )\). QD is the normalization term and is defined as \({\sum \nolimits }_{m}\vert {l}_{m}(\hat{{\mathcal{S}}}_{N+1}(\theta ))\vert \). Next, we compute the CDF of search volume as \({E}_{N+1}^{S}(\theta ) ={ \sum \nolimits }_{m\leq (N+1)}{f}_{m}^{S}(\theta )/{Q}_{S}\), where \({f}_{m}^{S}(\theta )\) is the search volume at tm, and QS is the normalization term. The search trend measurement is defined as \(1 - D({E}_{N+1}^{D}(\theta ),{E}_{N+1}^{S}(\theta ))\), where D(A, B) is the KS statistic.

News trend measurement: Similarly, we compute the CDF of news volume as \({E}_{N+1}^{\mathcal{N}}(\theta ) ={ \sum \nolimits }_{m\leq (N+1)}{f}_{m}^{\mathcal{N}}(\theta )/{Q}_{\mathcal{N}}\), where \({f}_{m}^{\mathcal{N}}(\theta )\) is the number of archived news articles available from Google News for tm, and \({Q}_{\mathcal{N}}\) is the normalization term. The news trend measurement is similarly defined as \(1 - D({E}_{N+1}^{D}(\theta ),{E}_{N+1}^{\mathcal{N}}(\theta ))\).

Using the same method as above, we compute the search and news trend measurement metrics for the attribute social graphs, given as \(1 - D({E}_{N+1;{a}_{k}}^{D}(\theta ),{E}_{N+1}^{S}(\theta ))\) and \(1 - D({E}_{N+1;{a}_{k}}^{D}(\theta ),{E}_{N+1}^{\mathcal{N}}(\theta ))\) respectively, each averaged over all values of \({a}_{k} \in \mathcal{A}\).
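A trend measurement of this kind can be sketched in Python as below. The sketch assumes that the CDFs are read as running (cumulative) sums over the slots m ≤ N + 1, normalized by the total, and that both series are aligned on the same time slices; the function names `empirical_cdf` and `trend_measurement` and the toy volumes are hypothetical.

```python
from itertools import accumulate


def empirical_cdf(volumes):
    """Cumulative share of the total volume at each time slice."""
    total = sum(volumes)  # the normalization term Q
    return [running / total for running in accumulate(volumes)]


def trend_measurement(diffusion_volumes, external_volumes):
    """1 - D between the diffusion CDF and an external trend CDF.

    diffusion_volumes: number of nodes per slot in the predicted collection.
    external_volumes:  search (or news) volume per time slice, same length.
    """
    if len(diffusion_volumes) != len(external_volumes):
        raise ValueError("both series must cover the same time slices")
    cdf_diffusion = empirical_cdf(diffusion_volumes)
    cdf_external = empirical_cdf(external_volumes)
    return 1.0 - max(abs(x - y) for x, y in zip(cdf_diffusion, cdf_external))


# Hypothetical volumes over four time slices.
toy_measurement = trend_measurement([10, 30, 40, 20], [20, 20, 40, 20])
```

In this toy case the two CDFs are (0.1, 0.4, 0.8, 1.0) and (0.2, 0.4, 0.8, 1.0), so the measurement is approximately 0.9. The same function applies to search and news volumes alike, and averaging it over the values of an attribute gives the attribute-level metric.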

4.8 Experimental Studies

We present our experimental results in this section that validate the proposed framework of modeling diffusion. We utilize a dataset that is a snowball crawl from Twitter, comprising about 465K users, with 837K edges and 25.3M tweets over a time period between Oct’06 and Nov’09. For our experiments, we focus on a set of 125 randomly chosen “trending topics” that were featured on Twitter over a 3-month period between Sep and Nov 2009. For ease of analysis, we organize the different trending topics into generalized themes based on the popular open source natural language processing toolkit called “OpenCalais” (http://www.opencalais.com/).

We discuss attribute homophily subject to variations across the different themes, and averaged over time (Oct–Nov 2009). Figure 4.9 shows that there is considerable variation in performance (in terms of saturation and utility measures) over the eight themes.

Fig. 4.9

Mean saturation and utility measurement of predicted diffusion characteristics shown across different themes

In the case of saturation measurement, we observe that the location attribute (LOC) yields high saturation measures over themes related to events that are often “local” in nature: e.g., (1) “Sports” comprising topics such as “NBA,” “New York Yankees,” “Chargers,” “Sehwag” and so on, each of them being of interest to users respectively from the US, NYC, San Diego and India; and (2) “Politics” (that includes topics like “Obama,” “Tehran” and “Afghanistan”), all of which were associated with important, essentially local happenings during the period of our analysis. In contrast, for themes that are of global importance, such as “Social Issues,” including topics like “#BeatCancer,” “Swine Flu,” “#Stoptheviolence” and “Unemployment,” the results indicate that the information roles attribute (IRO) yields the best performance, since it is able to capture user interests via their information generation and consumption patterns.

From the results on utility measurement, we observe that for themes associated with current external events (e.g., “Business-Finance,” “Politics,” “Entertainment-Culture” and “Sports”), the activity behavior attribute (ACT) yields high utility measures. This is because information diffusing in the network on current happenings is often dependent upon the temporal pattern of activity of the users, i.e., their time of tweeting. For “Human-Interest,” “Social Issues” and “Hospitality-Recreation,” we observe that the content creation attribute (CCR) yields the best performance in prediction, because it reveals the habitual properties of users in dissipating information on current happenings that they are interested in.

From these studies, we observe that attribute homophily indeed impacts the diffusion process; however, the particular attribute that can best explain the actual diffusion characteristics often depends upon: (1) the metric used to quantify diffusion, and (2) the topic under consideration.

5 Summary and Future Work

Our central research goal in this chapter has been to instrument the two organizing principles that characterize our communication processes online: the information or concept that is the content of communication, and the channel, i.e., the media via which communication takes place. We have presented characterization techniques, developed computational models and finally discussed large-scale quantitative observational studies for both of these organizing ideas.

Based on the outcomes of the two research perspectives that we discussed here, we believe that this research can make a significant contribution to a better understanding of how we communicate online and how it is redefining our collective sociological behavior. Beyond exploring new sociological questions, the collective modeling of automatically measurable interactional data will also enable new applications that can take advantage of knowledge of a person’s social context or provide feedback about her social behavior. Communication modeling may also improve the automated prediction and recognition of human behavior in diverse social, economic and organizational settings. For collective behavior modeling, the social network can define dependencies between people’s behavior with respect to their communication patterns, and features of the social network may be used to improve prediction and recognition. Additionally, some of the statistical techniques developed in this chapter for analyzing interpersonal communication may find new application to behavior modeling (collective or otherwise) and machine learning.

In the future, we are interested in two different non-trivial problems that can provide us with a deeper and more comprehensive understanding of the online communication process. The first of the two problems deals with the idea of evolution of network structure from an ego-centric perspective, in the context of online social spaces that feature multiplex ties. The second problem is geared towards exploring how sociological principles such as homophily (or heterophily) impact media creation (e.g., uploading a photo on Flickr, or favoriting a video on YouTube) on the part of the users. We are interested in studying how the observed social interactions among the individuals impact such dynamics. Note that both of the proposed problems consider an observed sociological phenomenon prevalent on social media sites, and attempt to understand it with the help of large-scale quantitative observational studies.