Keywords

1 Introduction

The concept of communication is fundamental in the study of social systems [13], and the approaches for modeling them as networks make no exception. For example, if we focus on temporal social networks a large majority of the scenarios studied in the literature are clearly describing communication processes, including conversations on social media [14], mobile telephone calls [7], as well as face-to-face interactions [22]. Even when we consider static models of social networks, such as a friendship graphs without any associated temporal information, many of the metrics used to analyze them are still based on the assumption that some information is shared through the network. For example, we can measure the ability of actors or groups of actors to efficiently spread information (closeness, diameter, Page-Rank centrality), or we can identify actors with the ability to influence existing information flows (betweenness centrality). In summary, the most typical application of social network models and in particular temporal social networks is to study systems of communication.

Despite the central role of information in communication systems, the information exchanged through the social ties has often been neglected in network analysis. The most popular methods for the analysis of social networks are only defined on simple graph models, only including actors and their relationships, and hence temporal network analysis methods only rely on the additional availability of time annotations.

Information diffusion processes are often modeled including all three components: the actors propagating the information, the times of the propagation and the content. In practice, however, the content propagated (e.g., the text itself) is used only to define how actors are connected with each other based on, for example, the order of links between blog posts [12], who re-shared the same content in social media [5] or how actors interact with messages shared across multiple social networks [18, 19]. In [23], the authors use the concept of polyadic conversation, a model where chains of Twitter user interactions (replies, mentions and retweets) during a time interval are first grouped into conversation trees, and then aggregated into a static weighted graph of interactions between authors. This type of graph aggregation has recurrently appeared in the literature of network modeling and information retrieval [14], but there is no consensus on either what is the best method to build such model (e.g., how to compute the length of a conversation in terms of time and/or tree’s depth) or how the textual content affects the grouping of actors. In summary, studying communication networks without considering what is communicated can only allow a partial understanding of the underlying social system.

To allow a more accurate representation of human communication, a model for temporal text networks was recently proposed [25]. This model describes communication events among actors, including the actors exchanging information, the textual representation of this information and the times when the communication happens. While this model is still limited to textual information, text is a very common way to communicate (for example by email, or via Twitter posts) and can also be used to represent other forms of expression, for instance oral communication that can be translated to text either manually or semi-automatically through speech-to-text algorithms, and also images, that can be turned into a set of keywords describing them [24].

While mathematically temporal text networks can be seen as extensions of temporal networks, which are themselves extensions of simple networks, there are two important differences that require the introduction of specific analysis methods. The most intuitive difference is of course the presence of text. An additional and more subtle difference lies in the semantics of the temporal annotations.

In the literature on temporal social networks the time on edges is typically used to indicate when an edge exists, e.g., that during that time the two actors are in contact and can exchange information. An implicit assumption in existing works is that information can be exchanged at any time when an edge is active, and that the exchange of information is instantaneous.

When we explicitly model communication networks, we should make a difference between edges representing the possibility of communicating and edges representing the actual production and consumption of information. In many cases the first type of edges exist between all actors; for example, we can always send an email to an existing email address. Therefore, in this chapter we focus on edges representing communication acts, that is, the actual exchange of information. These acts may have a non-negligible duration, therefore the time annotations in a temporal text network indicate when the transmission of a (text) message starts and when it finishes. Examples where this is important are messages exchanged through physical networks, where the communication channel has a physical delay, and asynchronous communication such as by email and via social media, where the text is sent at some time but in general only received at a later time.

This different semantics of the temporal edges in temporal networks and in temporal text networks requires the re-definition of some central concepts, such as time-consistent paths, which in turn leads to the definition of new specific metrics.

Finally, it is worth mentioning that NLPFootnote 1 methods such as sentiment analysis [3, 16] have been used in the past to study the evolution of tweets, songs, blogs, presidential speeches without requiring information about the underlying communication structure (who exchanges these data sources and how), using only data from time-annotated documents and time series information [11]. The temporal text network model does not only allow researchers to use NLP methods during the analysis, but it provides specific metrics to combine them with other measures from temporal networks.

This chapter introduces the concept of path in temporal text networks and various metrics to characterize them. In Sect. 2 we introduce the temporal text network model to encode communication networks. In Sect. 3 we introduce the concepts of walk and path in temporal text networks, and in Sect. 4 we define alternative ways of summarizing a path, based either on the times when the communication acts happen or on the text exchanged through a path. Finally, in Sect. 5 we conclude with an empirical comparison of some of the measures introduced in this chapter in a sample network formed by the Twitter interactions between Swedish politicians.

2 Representing Temporal Text Networks

From a mathematical point of view, a temporal text network [25] can be represented as a triple (G, x, t) where G = (A, M, E) is a directed bipartite graph representing the communication network, x : M → X is a mapping between the messages in M and a set of sequences of characters (text) in X and t : E → T represents the time associated to each edge, where T is an ordered set of time annotations. The edge directionality indicates the flow of the communication: (a i, m k) ∈ E indicates that actor a i has produced text m k, while (m k, a j) ∈ E indicates that actor a j is the recipient of message m k. Actors with out-degree larger than 0 are information producers, actors with in-degree greater than 0 are information consumers, and actors with both positive in- and out-degrees are information prosumers.

Figure 1 describes a working example we will use during the remainder of this chapter, representing a temporal text network with |A| = 8 actors, |M| = 6 messages and |E| = 15 edges. It is important to observe that, in most cases, the edges to/from a message have different time attributes; the only restriction imposed by the model is that (a i, m), (m, a j) ∈ E ⇒ t(a i, m) ≤ t(m, a j). In other words, a message can be consumed at different times by each actor (e.g., different social media users can visit their notifications page at different times), but can never be received before it has been generated (e.g., a user cannot access information that has not been shared yet).

Fig. 1
figure 1

A temporal text network model. Circles represent actors, squares represent messages and the edges between them represent the production and reception of the messages by the actors. Edges are also annotated with a time attribute t i ∈ T

This simple model can be used to differentiate between so-called unicast (messages m 2 and m 3 in the figure) and multicast (messages m 1, m 4 and m 5) communication. The model can also be used to represent a variety of communication platforms such as email and Twitter mention networks, and can be easily extended adding edges between actors or between messages to represent additional relationships such as a follower/followee network. Unless we explicitly mention it, in the remainder of the chapter we will ignore these extensions.

A similar model to represent temporal interactions is the contact sequence [4, 6] model, which expresses temporal networks as a set of directed edges (called contacts) during a finite span of time. While this model has been successfully used to study spreading processes of information [1, 10] or the structural evolution of social networks [8, 17, 26], the model ignores the role of the content of the messages.

A natural alternative to represent time in networks is to use a sequence of time-annotated graphs, forming a so-called multi-layer network [2, 9]. In time-sliced models [15], for example, each one of the aggregated networks represents a fixed interval of time, and an edge e ij is created if at least one contact has been registered between nodes i and j in the corresponding time interval. The aggregated graphs are sometimes weighted, in which case the edges have an assigned weight attribute w ij proportional to the number of original edges, their frequency or another relevant time summarization function. In longitudinal networks, instead, the relations between the same or similar actors are detected at different points of time [20, 21]. From the modeling point of view there is not much difference between the two models, apart from the fact that in time-sliced networks the time intervals of two adjacent aggregated graphs are contiguous, which is not necessarily true for longitudinal networks.

3 Path-Based Metrics

Metrics for simple networks are based on basic concepts in graph theory, such as adjacency and incidence, and on counting discrete objects such as edges. Temporal networks extend simple networks with time. This requires the extension of some basic concepts in graph theory, and as time is often represented as a real number or interval, then temporal measures also require some additional simple arithmetical operations, such as time difference.

Temporal text networks also contain a text attribute. Text is a much more complex type of data, with a large number of possible operations. For example, the comparison of two texts can be done using different models (edit distance, word overlapping, vector representation, etc.), applying different “normalization” operators (stemming, stop word removal, dictionary based word replacement) or mapping the text to other domains (for example sentiments or topics). While these choices are very important in practice, hard-coding all these details in the metrics would make the model very complex.

Therefore, as discussed in [25], when dealing with temporal text networks we assume to have at least one of the following two types of text functions. The first type corresponds to a so-called continuous analysis approach, based on the idea of having different grades of similarity between messages. In this case we assume to have a distance function d : M × M → [0, ), indicating how similar two messages are; if d = 0, the two messages are considered indistinguishable (for example because they contain the same text), and higher values of d indicate that the two messages are less similar. Notice that one can then plug specific functions into the model based on the text operations described above. An example of a message distance function is the cosine of the angle between vector representations of the two texts.

The second type of functions is targeted to a so-called discrete analysis approach, where each message is assigned to 0, 1 or more classes. For each class i we have a function c i : M →{0, 1}, which returns 1 if the message belongs to class i, 0 otherwise. One example is a topic modelling function with k topics, where c i(m) = 1 if m belongs to topic i. Notice that starting from a discretization function we can also define a text distance function, for example based on how many common topics are shared between the two input messages.

3.1 Incidence and Adjacency

In digraphs two vertices are adjacent if there is an edge between them, and two edges are incident if the tail of the first is the head of the second. In temporal text networks two vertices are adjacent at time t if there is an edge between them at that time. The concept of adjacency has also been extended to edges (also known as events or contacts): an edge entering a vertex is adjacent to an edge leaving the same vertex at a later time. This enables the definition of Δt-adjacency between edges, which is satisfied when they are adjacent and the time between them is less or equal than Δt. Note that this terminology is not completely consistent with the one in digraphs, where only vertices can be adjacent.

Temporal text networks differ from the previous cases in two regards. First, we do not need to extend the concept of adjacency to edges: we have two types of vertices (actors and messages), so for example the concept of adjacency between edges in temporal networks corresponds to adjacency between messages. This also means that we can retain the concept of incident edges from the theory of digraphs. Second, the idea of filtering those pairs of vertices that are close enough in time can also be extended to actors. In summary, all the concepts discussed above can be reduced to the following definitions.

Definition 1 (Edge Incidence)

Let e 1 = (v i, u k, t 1) and e 2 = (u k, v j, t 2) be two edges in a temporal text network. We say that e 1 is incident to e 2 if t 1 ≤ t 2.

Definition 2 (Adjacency)

Let e 1 = (v i, u k, t 1) and e 2 = (u k, v j, t 2) be two edges in a temporal text network. Then:

  1. 1.

    v i is adjacent to u k at time t 1.

  2. 2.

    v i is Δt-temporally adjacent to v j if t 2 − t 1 ≤ Δt.

  3. 3.

    v i is Δx-textually adjacent to v j if v i, v j ∈ M and d(v i, v j) ≤ Δx.

Notice that the definition of incidence and adjacency hold independently of the type of vertices (v i, u k and v j) involved. If v i, v j ∈ A are actors, then their temporal adjacency is defined by the delay between the production and consumption of the message u k ∈ M. We call an edge from an actor a to a message m a producer edge (e p), while an edge from a message m to an actor a is called a consumer edge (e c). If v i, v j ∈ M are messages, then their temporal adjacency is defined by the delay between when the intermediate actor consumes (e.g., receives) the first message and the time when it produces (e.g., sends) the second. For example, the producer edge e 4 = (a l, m 4) in Fig. 1 is incident to the consumer edge e 10 = (m 4, a n), therefore actor a l is Δt-adjacent to actor a n for all Δt ≥ t 9 − t 4.

3.2 Walks and Paths

Definition 3 (Walk)

A walk in a temporal text network (also called a temporal walk) is a sequence of edges e 1, e 2, …, e l where e i is incident to e i+1 for all i from 1 to l − 1.

In the following we will write a ∈ w to indicate that a vertex (actor or message) is present in walk w.

Notice that the definition above does not constrain the starting and ending vertices of a path to be actors or messages. However, we will often be interested in walks starting from an actor, because every message has a single producer in the model used in this chapter.

Definition 4 (Path)

A path in a temporal text network (also called a temporal path) is a walk where no vertex (message or actor) is traversed twice.

Each path establishes a precedence relation between actors indicating that the network allows a flow of information between them. Similarly, we have a precedence relation between messages indicating that the two messages can be part of the same flow of information.

Definition 5 (Temporal Precedence)

An actor a i temporally precedes another actor a j if there is a path from a i to a j. A message m i temporally precedes another message m j if there is a path from m i to m j.

Figure 2 represents the same temporal text network of Fig. 1 as a temporal sequence of edges between actors and messages. In this example, w 1 = [e 4, e 7, e 8, e 9] and w 2 = [e 4, e 10, e 11, e 12, e 14] are two walks of 4 and 5 edges.Footnote 2 The second walk is also a path, starting at an actor and ending in a message m 6, but the first walk is not a path because the last edge e 9 = (m 2, a l, t 9) visits for a second time the actor a l. Finally, notice that in this example a l precedes actor a k in path p 1 = [e 4, e 7] and vice-versa in path p 2 = [e 8, e 9], while m 3 precedes m 6 but not otherwise.

Fig. 2
figure 2

Temporal text network represented as a sequence of edges. The horizontal lines represent the actors (gray color) and messages (green color) and vertical lines represent the transmission or consumption of a message. The shaded lines indicate all existing paths beginning at actor a l at the exact time t = 4

In some cases we may want to consider only those paths with a limited delay and with a limited textual difference between adjacent messages. We can thus use the definitions of Δ-adjacency introduced above to select specific paths where sufficiently similar messages are exchanged often enough with respect to some user-defined thresholds.

4 Path Lengths

From now on we will focus on paths starting at an actor and ending at an actor. While a path can also start or end at a message, paths from and to actors are the ones providing the most accurate description of an information flow, because for every message there must always be an actor producing it, and messages that are not consumed by anyone (as message m 6 in our example) do not correspond to any exchange of information.

The length of a path in a temporal text network can be defined based on the topology, on time and on text.

The topological length is an unambiguous measure in simple and temporal networks, which are only made of vertices and edges. In a temporal text network a path contains actors, edges and messages, and the definition of length that is compatible with the one used in temporal networks corresponds to the number of messages in the path. This is because when a temporal network is translated into a temporal text network every edge is transformed into a message.

The temporal length, instead, defines the overall duration of the communication and is computed as the difference between the time of the last consumer edge and the time of the first producer edge in the path.

The topological and temporal length measures we have just described can be used to characterize the several paths that traverse our graph. In Fig. 2 we have highlighted all the existing paths starting at actor a l at exactly t = 4, including those ending in a message. For example if we compare the path p 1 = [e 4, e 10, e 11, e 13] with the path p 2 = [e 4, e 10, e 11, e 15] we can see that both have the same topological length of 2 messages. However, while both paths start at the same time e 4 = (a l, m 4, t 4), the time of the last consumer edge is different and so their temporal length: t(e 13) = t(m 5, a p, t 10) ≤ t(e 15) = t(m 5, a m, t 15).

Interestingly, in temporal text networks the temporal length of a path measures two different types of delays. On the one hand it measures the transmission time (δt) as the difference between the time of the consumer edge t(e c) and the time when the content has been produced t(e p). On the other hand, it indicates the idle time (τ) of the actors involved in the communication between two consecutive edges.

Definition 6 (Transmission Time)

Let e 1 = (a i, m, t 1) and e 2 = (m, a j, t 2) two incident edges, with m ∈ M. Then the quantity t 2 − t 1 is called transmission time.

Definition 7 (Idle Time)

Let e 1 = (m i, a, t 1) and e 2 = (a, m j, t 2) two incident edges, with a ∈ A. Then the quantity t 2 − t 1 is called idle time.

Once one has defined transmission and idle times, one can also compute the sum of all transmission times in a path, the sum of all idle times in a path, as well as the ratio between these values and the temporal length of the path. Back to our previous example, we can observe that the total transmission time of the messages in the first path δ 1 = (t 9 − t 4) + (t 10 − t 9) = 6 is three units smaller than in the second path δ 1 = (t 9 − t 4) + (t 13 − t 9) = 9 while their idle time is the same τ F = t 9 − t 9 = 0; which explains why the first path had a smaller temporal length.

The last type of length concerns the textual content in the path. Every time a message is exchanged, this increases the temporal length of the corresponding amount of time. Similarly, every time a new text is included in the path, this increases the textual information in it.

Definition 8 (Textual Length)

Given a text distance function, the textual length of a communication path is defined as the sum of the distances between the texts of all pairs of adjacent messages in the path.

This definition quantifies the variations between adjacent messages. At the same time, it is possible that the texts of the message keep being updated when transmitted through the path, but never significantly deviates from the original message. In this case, an alternative definition of length can be used to compute the maximum distance between any pair or messages.

In the case of discrete text analysis, where each message can belong to some classes (for example topics), this idea of estimating how homogeneous the text is across the path can be computed using a classical measure of entropy, for example the Shannon index:

Definition 9 (Entropy)

Let c 1, …, c n be text discretization functions mapping text into one of n classes. Given a path p, we define \(\rho _i(p) = \frac {\sum _{m \in p} c_i(x(m))}{M_p}\), where M p is the number of messages in p. The textual entropy of path p is then defined as:

$$\displaystyle \begin{aligned} H(p) = - \sum_{i=1}^{L} \rho_i(p) \ln{\rho_i(p)} \end{aligned} $$
(1)

According to this definition, if all messages that are part of a path belong to the same class (e.g., to the same topic), then the textual entropy will be 0, indicating a homogeneous path when we look at its text. Higher values of entropy would indicate that multiple classes (e.g., topics) are included in the path. This information can be useful in various analysis tasks, including the identification of information flows (when the same textual content is transferred through the network) or community detection, where one wants a community to be homogeneous not only with respect to the topology but also with the exchanged messages.

Once we decide which definition of length to use, this defines what are the shortest paths between any pair of actors, which implies that we can compute all the existing network measures based on shortest paths, including closeness centrality, betweenness centrality, eccentricity, diameter, etc. For the definitions of these metrics we refer the reader to any basic books on network analysis.

5 Empirical Study

In this section, we show an empirical comparison of the measures introduced in this chapter in a real communication network. Our sample dataset consists of all the public Twitter mentions (messages including another Twitter @username) written by Swedish politicians during January, 2019. The period of observation takes place 4 months after the Swedish general elections in 2018, and includes the time when the new government coalition was formed.Footnote 3 Our final network consists of |A| = 886 actors, including 26 politicians (8 information producers and 18 prosumers) and 860 mentioned users (all of them consumers), |M| = 1707 Twitter messages with their corresponding text and |E| = 4, 882 edges between actors and messages. Modelling the reception time is more difficult, because many social media platforms like Twitter do not provide information about when and who consumed a piece of information. In our experiments we assumed that the consumption time of all messages is the same as the production time, which might not be necessarily true (e.g., users are not always connected to all their social media and, even if they are, the tweet might be lost in the myriad of information provided by the user’s wall).

Figure 3 shows, for each one of the 6773 pairs of actors temporally reachable, a comparison of their topological and temporal shortest path length. It includes 5787 (85.4%) paths with only two edges, representing two Δ0-textually (and temporally) adjacent actors who have been in direct communication. The average temporal path length of the remaining paths increases with the number of hops (topological length) while its statistical dispersion is reduced, as we usually observe in other type of temporal networks (e.g., contact networks). For example, the 56 pairs of actors connected through 3 messages (6 hops) have an average communication time (shortest temporal length) of approximately 14 days. The order of magnitude of these numbers can be explained by the skewed distribution of roles (producer, consumer and prosumers) of the actors in the data and the small sample of the original social network.

Fig. 3
figure 3

Temporal length. Summary of the temporal length distribution for all shortest paths found in the Swedish politicians network, grouped by their topological path length. All topological paths involve an even number of hops because we are measuring only pairs of reachable actors

Another important component to understand communication networks is the specific content their members intend to share with each other. For example, in a conversation within a group of close people the content (text) of the messages will be probably different between communications, while news spreading processes will probably have a more similar topic distribution. The consistency of the topics in an information cascade phenomenon, therefore, can be a good metric to describe the dynamics of a complex system.

Following the methodology described in Sect. 4, we have first identified the topics of the messages exchanged in our sample network and then, computed the textual length using the Shannon index described in Eq. (1) to identify the shortest paths of each pair of temporally reachable actors. While identifying the topics, we have used the hashtags as proxies, which is a simple and well accepted solution in many contexts; but as we will see, problematic in practice. As we mentioned in the previous section, the definition of textual length assumes that there is a discretization function mapping the text into at least one topic. Hence, because many tweets do not contain any hashtag, their topic assignment is empty.

Figure 4 shows the empirical cumulative distribution function (ECDF) of the textual shortest path in our sample network. In this particular example, only 420 observations of 6773 were computed, as many paths have an unidentified length, either because none of the messages have a topic assigned or because they contain only one message.

Fig. 4
figure 4

Empirical cumulative distribution function (ECDF) of the textual length

We can observe that more than 75% of the textual paths computed have 0 entropy, indicating that there is one single topic in the messages of the path. A closer look does not indicate any correlation of these results with the topological or temporal length of the paths. The minimum textual length paths include, for example, all the paths with 5 messages (10 hops) and 85.45% of the paths with 4 messages, but less than 50% of the paths with 3 messages.

6 Final Remarks

In this chapter, we have revisited some of the fundamental graph measures for temporal networks and extended them to be compatible with the temporal text network model for communication systems. We have shown that using the proposed model we can directly represent, in a simple but extensible way, all the elements necessary to study communication (time, text and topology), without requiring complex graph transformations. While mathematically temporal text networks are not much different from time-varying graphs, the semantics of its interactions and the presence of textual information in the model, require the introduction of specific analysis methods. In particular, in this chapter we have focused on redefining the idea of connectivity and most of its related measures such as incidence, adjacency, paths and distance, providing alternative metrics for actors and messages when we found it was relevant and necessary. Finally, we have shown how the different distance measures can be used in practice to discover patterns of connectivity.

We believe that, beyond their direct application to different analysis tasks, these measures are fundamental to redefine other relevant measures for studying communication systems such as centrality measures or developing analysis methods like community detection algorithms.