
10.1 Introduction

10.1.1 Motivation

Electronic mail (email), despite its long history, remains the most popular communication tool today. Unlike newer communication tools such as weblogs, Twitter, and instant messengers, email has been widely adopted in the corporate world and is often seamlessly integrated with business applications. As users email one another within and across corporate boundaries, they form different kinds of email networks. Within each network, users exhibit behaviors that also affect how emails are sent and replied to.

In this paper, we study user interaction behaviors in email networks and how they are relevant to predicting future email activities. An email network is essentially a directed graph with nodes and links representing users and messages from users to other users respectively. Each email is assigned a timestamp and has other attributes including sender, recipients, subject heading, and email content. We focus on two user interaction behaviors that are closely related to how users respond to one another in email networks, namely engagingness and responsiveness.

We define engagingness behavior as the ability of a user to solicit responses from other users, and responsiveness behavior as the willingness of a user to respond to other users. A user at the low (or high) extreme of engagingness behavior is said to be disengaging (or engaging). Similarly, a user can range from unresponsive to highly responsive. As their definitions suggest, user engagingness and responsiveness have direct or indirect implications for the way emails are sent and responded to, and for the strength of the relationships users have with other users in the network. Nevertheless, these implications have not been well studied, and the use of interaction behaviors to enhance email functions has been largely unexplored.

This paper therefore aims to provide a fresh approach to modeling the engagingness and responsiveness behaviors in email networks. The models are quantitative, assigning each user an engagingness score and a responsiveness score. The scores lie in the [0,1] range, with 0 and 1 representing the lowest and highest scores respectively. With these scores, we can rank all users by engagingness or responsiveness. Moreover, we derive new features from the behavior scores and use them in an example email activity prediction task, namely email reply order prediction.

The engagingness and responsiveness behavior models can be very useful in several applications. In the context of business organizations, they help to identify engaging and responsive users who may be good candidates for management roles, and to weed out lethargic users who are neither engaging nor responsive and may become bottlenecks in the organization. For informal social email networks, engaging and responsive users could be high network potential candidates for viral marketing applications. Engaging users may solicit more responses for viral messages, while responsive users may act quickly on these messages. By selecting these users to spread viral messages to targeted user segments by word-of-mouth, marketing objectives can be achieved more effectively.

In this paper, we specifically introduce the email reply order prediction task as an application, and show that the engagingness and responsiveness behavior models contribute significantly to prediction accuracy. Email reply order prediction refers to deciding which of a pair of emails received by the same user will be replied to first. This prediction task effectively helps an email recipient prioritize his replies. For example, if \(e_1\) and \(e_2\) are two emails sent to user \(u_k\) who plans to reply to both, the outcome of prediction is either that \(e_1\) is replied to before \(e_2\) or vice versa. The ability to predict the reply order of emails has several benefits, including helping users prioritize the emails to be replied to and estimating the time by which emails get replied to. Here, our main purpose is to use the task to evaluate the utility of the engagingness and responsiveness behavior models.

10.1.2 Research Objectives

This paper proposes to model engagingness and responsiveness behaviors quantitatively. In order to develop these quantitative behavior models, we first preprocess the emails to remove noise from the data and to construct the reply and forward relationships among emails. From the email relationships, we also derive email threads, which are hierarchies of emails connected by reply and forward relationships. We then systematically develop a taxonomy of engagingness and responsiveness models using the reply relationships and email threads. These models are applied to the Enron email dataset, a publicly available dataset consisting of 517,431 emails from 151 ex-Enron employees. The email reply order prediction task is addressed as a classification problem. Our approach derives a set of features for an email pair based on the emails’ metadata as well as the engagingness and responsiveness behaviors of their senders. As we evaluate the performance of the learnt prediction models, we would like to identify the interplay between behavior features and prediction accuracy. Our approach does not depend on email content or domain knowledge, which are sometimes unavailable and costly to process. Given that there are only two possible order outcomes, we expect any method to achieve an accuracy of at least 50 %. In order for email reply order prediction to be useful, a much higher prediction accuracy is required without relying on content analysis.

Both behavior modeling and email reply order prediction are novel problems in email networks. Research on engagingness and responsiveness behaviors is a branch of social network analysis that studies node properties in a network. Unlike traditional social network analysis, which focuses on node and network statistics based on static information (e.g., centralities, network diameter), behavior analysis is conducted on networks whose users dynamically interact with one another.

In the following, we summarize the important research contributions of this paper.

  • We define four categories of models for engagingness and responsiveness behaviors prevalent in email networks: (a) email based, (b) email thread based, (c) email sequence based, and (d) social cognitive models. For each model category, one can define different behavior models based on different email attributes. To the best of our knowledge, this is the first time engagingness and responsiveness behavior models have been studied systematically.

  • We apply our proposed behavior models to the Enron email network, and analyze and compare them. We conduct data preprocessing on the email data and establish links between emails and their replies. In our empirical study, we find that engagingness and responsiveness are distinct from each other, while most engagingness (responsiveness) models are consistent with one another.

  • We introduce email reply order prediction as a novel task that uses engagingness, responsiveness, and other email features as input features. An SVM classifier is learnt from the features of training email pairs and applied to test email pairs. According to our experimental results, the accuracy of our SVM classifier is about 77 %, much better than random guessing (50 %). This indicates that user behaviors are useful in the prediction task.

Unlike most previous research on behavior analysis in email networks, which characterizes an email user mainly by direct statistics of emails such as recipient list size, the rate of emails from receiver to sender, and email size [4, 13], our modeling of engagingness and responsiveness behaviors relies mainly on email reply and forward relationships, which are not directly available in the email data. Previous research on email prediction tasks includes the prediction of (a) the social hierarchy of email users [12], (b) the topics of emails [7], and (c) viral emails [13]. Email reply order prediction is thus a new task to be investigated. Although the engagingness and responsiveness behaviors and the reply order prediction task are defined in the context of email networks, our proposed approaches and results are also applicable to other forms of information exchange networks, such as messaging and blog networks.

10.1.3 Enron Email Dataset

Throughout this research, we use the Enron email dataset in our empirical study of real data. This dataset is so far the only known publicly available email dataset with messages assigned to specific senders and recipients [6]. It provides 517,431 emails from 151 Enron employees. Each email message has a unique message ID and contains header information such as the date and time the message was sent, the sender, the recipients (To and Cc lists), the subject, and the body in plain text format. We performed two data preprocessing steps on the email data, namely duplicate elimination and email relationship identification.

Duplicate elimination. As noted in previous studies of this corpus, there are many duplicate emails, either in different folders of the same user (e.g., in computer generated folders such as all_documents and discussion_threads) or in folders of different users (e.g., a message in the sender’s sent_mail folder often appears in some recipient’s inbox or another folder). Message IDs cannot be used to identify duplicates because duplicate emails also carry unique message IDs. We therefore use a strategy similar to [7], computing the MD5 hash over the email fields Date/Time, Sender, To, Cc, Subject, and Body. This assigns the same MD5 value (a 128-bit integer) to all duplicate emails that match exactly on these fields. After duplicate elimination, the dataset contains 257,044 unique emails.
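The duplicate-elimination step can be sketched as follows. The dictionary representation and lowercase field names are hypothetical stand-ins for the email header fields named above; only the hashing scheme mirrors the one described:

```python
import hashlib

def email_fingerprint(email):
    """Hash the identifying fields; emails matching exactly on all
    of them receive the same 128-bit MD5 digest."""
    fields = ("datetime", "sender", "to", "cc", "subject", "body")
    joined = "\x1f".join(str(email.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def deduplicate(emails):
    """Keep the first copy of each distinct fingerprint."""
    seen = {}
    for e in emails:
        seen.setdefault(email_fingerprint(e), e)
    return list(seen.values())
```

Two copies of the same message in different folders collapse to one entry, while any email differing in even one field survives.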

Email relationship identification. To identify reply and forward relationships between emails, we first group all emails with matching subjects after ignoring the reply and forward prefixes (e.g., RE, FW, FWD) and order them by time. Each reply (or forward) email \(e_i\) in a group is then assigned a reply relationship (or forward relationship) with the most recent earlier email \(e_j\) such that \(e_i\)’s sender is one of \(e_j\)’s recipients and \(t(e_i) - t(e_j) \leq 90\) days, where \(t(e)\) denotes the send time of \(e\). With this approach (see Footnote 1), we found 34,008 email relationships, of which 27,730 are reply and 6,278 are forward relationships. When a set of emails forms a connected component under reply and forward relationships, we call it an email thread. From the identified relationships, we derive 18,593 threads that connect 52,601 emails (about 20 % of all unique emails).
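The linking procedure above can be sketched as follows; the record layout (`id`, `subject`, `sender`, `recipients`, `time`) is hypothetical, but the grouping, time ordering, sender-in-recipients check, and 90-day window follow the description:

```python
import re
from datetime import datetime, timedelta

# Strip one or more leading RE:/FW:/FWD: prefixes from a subject
PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def link_replies(emails, window_days=90):
    """Group emails by normalized subject, sort each group by send time,
    and link every RE:/FW: email to the most recent earlier email whose
    recipient list contains the later email's sender."""
    groups = {}
    for e in emails:
        key = PREFIX.sub("", e["subject"]).strip().lower()
        groups.setdefault(key, []).append(e)
    links = []
    for group in groups.values():
        group.sort(key=lambda e: e["time"])
        for i, ei in enumerate(group):
            if not PREFIX.match(ei["subject"]):
                continue  # only reply/forward emails get linked to a parent
            for ej in reversed(group[:i]):  # most recent earlier email first
                in_window = ei["time"] - ej["time"] <= timedelta(days=window_days)
                if in_window and ei["sender"] in ej["recipients"]:
                    links.append((ej["id"], ei["id"]))
                    break
    return links
```

A returned pair `(j, i)` records that email `i` replies to (or forwards) email `j`.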

To evaluate our email relationship identification approach, we first compute precision, which measures how many of the links detected by our method are correct. We selected a random sample of 100 link relationships from the total of 34,008 links and, for each pair, manually verified whether the second email is a correct reply to the first. This manual evaluation showed a precision of 91 %. To compute recall, we randomly selected 30 subject groups, each containing about five to ten emails. For each subject group, we manually created threads by connecting emails to their follow-up responses. This sample includes about 120 emails with 79 reply links and 21 forward links. For each manually established link, we examined whether our method found it; the method correctly recovered 79 % of these links. These results suggest that our identified relationships are quite accurate, and we therefore use them in our subsequent experiments.

10.1.4 Paper Outline

The remainder of this paper is organized as follows: In Sect. 10.2, we present engagingness and responsiveness behavior models. Subsequently, we discuss a challenging problem of predicting email reply order based on our behavior models in Sect. 10.3. The proposed models are evaluated and compared in Sect. 10.4 using a set of experiments on the Enron dataset. In particular, Sect. 10.4.3 describes experiments that evaluate the performance of email reply order prediction using different classifiers trained with different features including those based on email behaviors. In Sect. 10.5, we briefly introduce works related to our behavioral analysis problem. Finally, we offer our concluding remarks in Sect. 10.6.

10.2 Engagingness and Responsiveness Behavior Models

In this section, we describe our proposed behavior models for user engagingness and responsiveness. All the models assume that emails have been preprocessed as described in Sect. 10.1.3. We divide our models into the following categories:

  • Email based models: These models consider emails as the basic data units for measuring user behaviors. Email attributes such as sender, recipient list, date, etc., are used.

  • Email thread based models: These models consider email threads as the basic data units for measuring user behaviors. The models therefore use attributes of email thread to quantify behaviors.

  • Email sequence based models: These models examine the sequence of emails received and replied to by each user and derive user behaviors from the gaps between received emails and their replies.

  • Social cognitive models: These models consider social perception of user behaviors within the email network and measure behaviors accordingly.

Figure 10.1 shows the taxonomy of behavior models in the above categories, which are further defined in the following sections. Each model M consists of a pair of score formulas, an engagingness score \(E^M\) and a responsiveness score \(R^M\), each defined based on some principle. The \(E^M\) and \(R^M\) values lie in the [0,1] range, with 0 and 1 representing the lowest and highest values respectively. Table 10.1 lists the symbols used in this paper and their meanings.

Fig. 10.1 Taxonomy of models

Table 10.1 Notations

10.2.1 Email Based Models

10.2.1.1 Email Count Model (EC)

The email count model is defined based on the principle that an engaging user should have most of his emails replied, while a responsive user should have most of his received emails replied. The engagingness and responsiveness formulas are thus defined by:

$${E}^{EC}(u_{ i}) = \frac{\vert RT(u_{i})\vert } {\vert S(u_{i})\vert }$$
(10.1)
$${R}^{EC}(u_{ i}) = \frac{\vert RB(u_{i})\vert } {\vert R(u_{i})\vert }$$
(10.2)

For users with empty \(S(u_i)\) (or \(R(u_i)\)), \(E^{EC}(u_i)\) (or \(R^{EC}(u_i)\)) is assigned a zero value.
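Equations (10.1) and (10.2) can be sketched directly; the set arguments below are hypothetical stand-ins for \(S(u_i)\), \(R(u_i)\), \(RT(u_i)\), and \(RB(u_i)\):

```python
def email_count_scores(sent, received, replied_to, replied_by):
    """EC model scores.
    sent / received : emails the user sent / received (S, R)
    replied_to      : sent emails that received a reply (RT)
    replied_by      : received emails the user replied to (RB)
    An empty S (or R) yields a zero score, as in the text."""
    engagingness = len(replied_to) / len(sent) if sent else 0.0
    responsiveness = len(replied_by) / len(received) if received else 0.0
    return engagingness, responsiveness
```

With five sent emails of which three were replied to, and two received emails both replied to, this reproduces the worked values E = 0.6 and R = 1 used later in Sect. 10.2.1.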

10.2.1.2 Email Recipient Model (ER)

The intuition behind this model is that an email with many recipients is likely to receive very few replies. Hence, an engaging user is one who gets replies from many recipients of his emails, while a disengaging user receives few or no replies even when his emails are sent to many recipients. On the other hand, a responsive user is one who replies to emails regardless of the number of recipients, whereas a non-responsive user does not reply even to emails directed to him alone. The engagingness and responsiveness formulas are thus defined by:

$${E}^{ER}(u_{ i}) = \frac{1} {\vert S(u_{i})\vert }\displaystyle\sum _{e\in S(u_{i})}\frac{\vert \{u_{j} \in \mathit{Rcp(e)} \wedge r(e) \in RB(u_{j})\}\vert } {\vert Rcp(e)\vert }$$
(10.3)
$${R}^{ER}(u_{ i}) = \frac{1} {\vert R(u_{i})\vert }\displaystyle\sum _{{ e\in RB(u_{i})\ s.t. \atop \exists u_{j},\exists {e}^{{\prime\prime}}\in S(u_{j}),r({e}^{{\prime\prime}})=e} } \frac{\vert Rcp(e)\vert } {\mathit{MaxRcpCnt}}$$
(10.4)

where MaxRcpCnt ( = 291) denotes the largest recipient count among all Enron emails.
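The engagingness side of Eq. (10.3) can be sketched as follows, where each sent email is summarized by a hypothetical (recipient count, replier count) pair rather than full email objects:

```python
def recipient_engagingness(sent_emails):
    """ER model engagingness: the fraction of an email's recipients who
    replied, averaged over all of the user's sent emails.
    sent_emails: list of (num_recipients, num_repliers) tuples."""
    if not sent_emails:
        return 0.0
    return sum(repliers / rcps for rcps, repliers in sent_emails) / len(sent_emails)
```

For five sent emails where one of two recipients replied to the first and two of three replied to the second, this gives (1/2 + 2/3)/5 ≈ 0.23, matching the worked example later in this section.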

10.2.1.3 Email Reply Time Model (ET)

The reply time of an email can be an indicator of user engagingness and responsiveness. The email reply time model adopts the principle that engaging users receive the reply emails sooner than non-engaging users, while responsive users reply to the received emails quicker than non-responsive users.

Given an email \(e'\) which is a reply to email \(e\), \(e' = r(e)\), the reply time of \(e'\) is \(Rpt(e') = t(e') - t(e)\). The z-normalized reply time \(\hat{Rpt}(e')\) is defined as \(\frac{Rpt(e') - \overline{Rpt}}{\sigma _{Rpt}}\), where \(\overline{Rpt}\) and \(\sigma _{Rpt}\) are the mean and standard deviation of reply times respectively. We now define the engagingness and responsiveness of the ET model as:

$${E}^{ET}(u_{i}) = \frac{1} {\vert S(u_{i})\vert }\displaystyle\sum _{e\in S(u_{i})} \frac{1} {\vert Rcp(e)\vert }\displaystyle\sum _{{ u_{j}\in \mathit{Rcp}(e), \atop \exists e'\in RB(u_{j}),e'=r(e)} }(1 - f(\hat{Rpt}(e')))$$
(10.5)
$${R}^{ET}(u_{i}) = \frac{1} {\vert R(u_{i})\vert }\displaystyle\sum _{e'\in RB(u_{i}),e\in R(u_{i}),r(e)=e'}(1 - f(\hat{Rpt}(e')))$$
(10.6)

where

$$f(x) = \frac{1} {1 + {e}^{-x}}$$
(10.7)

The function f() maps the normalized reply time into the range [0,1]: values near 0 correspond to extremely fast replies and values near 1 to extremely slow ones. The complement 1 − f(·) used in the formulas above therefore gives higher scores for faster replies.
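The transformation in Eqs. (10.5)–(10.7) can be sketched as below; the mean 11.67 s and standard deviation 7.64 s are the illustrative corpus statistics used in the worked example of Sect. 10.2.1, not fixed constants:

```python
import math

def f(x):
    """Logistic function (Eq. 10.7), mapping a z-score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def reply_time_credit(reply_time, mean, std):
    """1 - f(z-normalized reply time): close to 1 for fast replies,
    close to 0 for slow ones."""
    return 1.0 - f((reply_time - mean) / std)
```

A 5 s reply earns a credit of about 0.71 while a 20 s reply earns about 0.25, reproducing the x-values in the worked example below.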

10.2.1.4 Email Size Model (ES)

The email size model is analogous to the email reply time model, except that we consider the content size of emails rather than their reply time. The principle behind this model is that the size of a reply email roughly represents the amount of effort the recipient invested. For instance, suppose SZ(e) = k and e′ is a reply to e; if SZ(e′) > k, the engagingness score of e’s sender will be high. The amount of content in a reply email thus measures the eagerness of the user sending it. Let \(\hat{SZ}(e)\) be the z-normalized SZ(e). We then define the engagingness and responsiveness measures based on email size as

$${E}^{ES}(u_{i}) = \frac{1} {\vert S(u_{i})\vert }\displaystyle\sum _{e\in S(u_{i})} \frac{1} {\vert Rcp(e)\vert }\displaystyle\sum _{{ u_{j}\in \mathit{Rcp}(e), \atop \exists e'\in RB(u_{j}),e'=r(e)} }f(\hat{SZ}(e'))$$
(10.8)
$${R}^{ES}(u_{i}) = \frac{1} {\vert R(u_{i})\vert }\displaystyle\sum _{e'\in RB(u_{i}),e\in R(u_{i}),r(e)=e'}f(\hat{SZ}(e'))$$
(10.9)

10.2.1.5 Email Time and Size Model (TS)

This model combines both email reply time and size into a hybrid model as

$${E}^{TS}(u_{i}) = \frac{1} {\vert S(u_{i})\vert }\displaystyle\sum _{e\in S(u_{i})} \frac{1} {\vert Rcp(e)\vert }\displaystyle\sum _{{ u_{j}\in \mathit{Rcp}(e), \atop \exists e'\in RB(u_{j}),e'=r(e)} }(1 - f(\hat{Rpt}(e')))f(\hat{SZ}(e'))$$
(10.10)
$${R}^{TS}(u_{i}) = \frac{1} {\vert R(u_{i})\vert }\displaystyle\sum _{e'\in RB(u_{i}),e\in R(u_{i}),r(e)=e'}(1 - f(\hat{Rpt}(e')))f(\hat{SZ}(e'))$$
(10.11)

10.2.1.6 Examples

To illustrate the behavior models in Sect. 10.2.1, consider the simple email network shown in Fig. 10.2. In Fig. 10.2a, \(u_i\) is a sender, while \(u_1\), \(u_2\), and \(u_3\) are recipients. \(u_i\) sends email \(e_1\) to \(u_1\) and \(u_2\); \(u_1\) replies with \(e'_1\), but \(u_2\) does not respond. In this network, the engagingness scores of user \(u_i\) are \({E}^{EC}(u_{i}) = \frac{3}{5} = 0.6\) and \({E}^{ER}(u_{i}) = \frac{\frac{1}{2} + \frac{2}{3}}{5} = 0.23\). In Fig. 10.2b, \(u_2\), \(u_4\), and \(u_i\) are recipients, whereas \(u_1\) and \(u_3\) are senders. \(u_3\) sends email \(e_2\) to \(u_2\) and \(u_i\). While \(u_2\) does not reply to \(u_3\), \(u_i\) replies to \(u_3\) with \(u_2\) on Cc. The responsiveness scores of \(u_i\) are \({R}^{EC}(u_{i}) = \frac{2}{2} = 1\) and \({R}^{ER}(u_{i}) = \frac{\frac{1}{4} + \frac{2}{4}}{2} = 0.38\), where we assume MaxRcpCnt = 4. In practice, we order emails by their number of recipients in ascending order and assign to MaxRcpCnt the recipient count of the email at the 90th percentile. Note that the email count model takes a macro approach, while the email recipient model takes a micro approach.

Fig. 10.2 An example for email based models. (a) Engagingness of \(u_i\). (b) Responsiveness of \(u_i\)

In the email reply time model, \({E}^{ET}(u_{i}) = \frac{\frac{1}{2}\,(x_{1} + \frac{1}{\infty }) + \frac{2}{3}\,(x_{2} + \frac{1}{\infty } + x_{3})}{2}\), where the \(x_{i}\) are computed as follows:

  • \(x_{1} = 1 - f(\hat{Rpt}(e'_{1})) = 1 - 0.29 = 0.71\), where \(Rpt(e'_{1}) = t(e'_{1}(u_{1},\{u_{i}\})) - t(e_{1}(u_{i},\{u_{1}\})) = 5\,\text{s}\)

  • \(x_{2} = 1 - f(\hat{Rpt}(e'_{2})) = 1 - 0.45 = 0.55\), where \(Rpt(e'_{2}) = t(e'_{2}(u_{1},\{u_{i}\})) - t(e_{2}(u_{i},\{u_{1}\})) = 10\,\text{s}\)

  • \(x_{3} = 1 - f(\hat{Rpt}(e'_{2})) = 1 - 0.75 = 0.25\), where \(Rpt(e'_{2}) = t(e'_{2}(u_{3},\{u_{i}\})) - t(e_{2}(u_{i},\{u_{3}\})) = 20\,\text{s}\)

where \(e_{v}(u_{x},\{U_{y}\})\) denotes email \(e_v\) sent by \(u_x\) to recipients \(U_y\), and \(e'_v\) denotes the reply to email \(e_v\). To compute \(f(\hat{Rpt})\), we transform Rpt to z-scores. For instance, the z-score of the 5 s reply time is \(\frac{5 - 11.67}{7.64} = -0.87\), where \(\bar{x} = 11.67\) and \(\sigma = 7.64\) denote the mean and standard deviation of the reply times. According to our observations of reply times in the Enron emails (see Table 10.2), the mean reply time is much larger than the median. This indicates that there are many outlier reply times, which would make most z-scores negative; we therefore remove extreme reply times before computing z-scores. Then \(f(-0.87) = \frac{1}{1 + {e}^{0.87}} = 0.29\). The term \(\frac{1}{\infty }\) in \(E^{ET}(u_i)\) indicates that \(u_i\) sends \(e_1\) and \(e_2\) to \(u_2\) but \(u_2\) never replies. As a result, \({E}^{ET}(u_{i}) = \frac{\frac{1}{2}\,(0.71 + 0) + \frac{2}{3}\,(0.55 + 0 + 0.25)}{2} = 0.45\). The responsiveness of \(u_i\) is calculated in the same manner. The email size model computes engagingness scores in the same way, except that the length of the email content is used instead of the reply time.

Table 10.2 Distribution of reply times in the Enron email dataset

10.2.2 Email Thread Based Models

Here, we define the thread count model (TC) as an email thread based model. In the email count model, engagingness is measured using the emails sent by a sender and those directly replied to by some recipient(s). However, a direct reply is not the only type of response to an email: an email may be replied to indirectly within an email thread via forwarded emails. For example, as illustrated in Fig. 10.3, user \(u_1\) advertises a job position by sending an email to professor \(u_5\), who subsequently forwards it to his student \(u_3\). If \(u_3\) replies to \(u_1\), we say that the original email is replied to indirectly within an email thread.

Fig. 10.3 An email thread example

An email thread is defined as a tree of emails connected by reply and forward relationships. Table 10.3 shows the distribution of threads by the number of emails per thread. As we can see, the distribution follows Zipf’s law: the majority of threads (11,302) contain only two emails, 3,925 threads contain three emails, and the largest thread contains 37 emails.

Table 10.3 Distribution of emails per thread in the Enron email dataset

Based on email threads, the thread count model counts indirect replies to emails forwarded between users, using the following principle: a user is highly engaging if many of his sent emails are replied to, directly or indirectly, by recipients, and highly responsive if he replies to or forwards most of the emails he receives. The engagingness and responsiveness of a user \(u_i\) are defined as:

$${E}^{TC}(u_{i}) = \frac{1} {\vert S(u_{i})\vert }\vert \{e \in S(u_{i})\ \vert \ \exists t \in TH(u_{i}),\exists e',\ e\twoheadrightarrow _{t}e' \wedge u_{i} \in \mathit{Rcp}(e')\}\vert $$
(10.12)
$${R}^{TC}(u_{i}) = \frac{1} {\vert R(u_{i})\vert }\vert \{e \in R(u_{i})\ \vert \ \exists u_{j},e',t \in TH(u_{j}),\ e\twoheadrightarrow _{t}e' \wedge u_{j} \in \mathit{Rcp}(e')\}\vert $$
(10.13)

where \(e\twoheadrightarrow _{t}e'\) returns TRUE when \(e\) is directly or indirectly connected to \(e'\) in the thread \(t\), and FALSE otherwise.
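The predicate \(e\twoheadrightarrow _{t}e' \wedge u_{i} \in \mathit{Rcp}(e')\) can be sketched as a traversal of the thread tree. The adjacency-map representation and function names here are hypothetical:

```python
def replied_in_thread(thread_children, sent_email, sender, recipients_of):
    """Returns True if some email reachable from sent_email via the
    thread's reply/forward edges has the original sender among its
    recipients, i.e. sent_email was replied to directly or indirectly."""
    stack = [sent_email]
    visited = set()
    while stack:
        e = stack.pop()
        for child in thread_children.get(e, []):
            if child in visited:
                continue
            visited.add(child)
            if sender in recipients_of.get(child, set()):
                return True
            stack.append(child)
    return False
```

In the Fig. 10.3 scenario, u_1's job advertisement is forwarded by u_5 to u_3, whose reply reaches u_1, so the predicate holds for u_1 even though no direct reply exists.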

10.2.3 Email Sequence Based Models

Email sequence refers to the sequence of emails sent and received by a user, ordered by time. To derive engagingness and responsiveness from email sequences, we use the principle that an engaging user has his sent emails replied to soon after they are received by the recipients, and a responsive user replies soon after he receives emails. As users may not always be online, the time taken to reply to an email can vary greatly. Instead, we use the number of emails that a user received later than an email e but replied to before e as a proxy for how soon e is replied to.

The above principle is used to develop the reply gap model (RG). Let \(seq_i\) denote the email sequence of user \(u_i\). When an email received by \(u_i\) is replied to before other email(s) received earlier, the reply to the former is called an out-of-order reply. Formally, for an email e received by \(u_i\), we define the number of emails received between e and its reply e′ in \(seq_i\), and the number of out-of-order replies among them, denoted by \(n_r(u_i, e)\) and \(n_{\overline{o}}(u_i, e)\) respectively, as

$$n_{r}(u_{i},e) = \left \{\begin{array}{ll} \text{\# emails received between }e\text{ and }e'\text{ in }seq_{i},&\text{if }\exists e' \in RB(u_{i}),\ r(e) = e' \\ -1,&\text{otherwise} \end{array} \right.$$
(10.14)
$$n_{\overline{o}}(u_{i},e) = \left \{\begin{array}{ll} \text{\# emails received between }e\text{ and }e'\text{ in }seq_{i}\text{ that have been replied to},&\text{if }\exists e' \in RB(u_{i}),\ r(e) = e' \\ -1,&\text{otherwise} \end{array} \right.$$
(10.15)

The value − 1 is assigned to \(n_r\) and \(n_{\overline{o}}\) when e is not replied to at all. The user engagingness and responsiveness of the RG model are then defined as:

$${E}^{RG}(u_{ i}) = \frac{\displaystyle\sum\nolimits_{e\in S(u_{i})}( \frac{1} {\vert Rcp(e)\vert }\displaystyle\sum\nolimits_{u_{j}\in \mathit{Rcp(e)}}(1 -\frac{n_{\overline{o}}(u_{j},e)} {n_{r}(u_{j},e)} ))} {\vert S(u_{i})\vert }$$
(10.16)
$${R}^{RG}(u_{ i}) = \frac{\displaystyle\sum\nolimits_{e\in R(u_{i})}(1 -\frac{n_{\overline{o}}(u_{i},e)} {n_{r}(u_{i},e)} )} {\vert R(u_{i})\vert }$$
(10.17)

For example, let \(seq_{i} = \langle e_{1},e_{2},e_{3},e_{4},e'_{1},e'_{4},e'_{2}\rangle\) be the email sequence of user \(u_i\), where \(e'_k = r(e_k)\). The ratios \(\frac{n_{\overline{o}}(u_{i},e_{k})}{n_{r}(u_{i},e_{k})}\) for \(e_1\), \(e_2\), \(e_3\), and \(e_4\) are \(\frac{0}{3}\), \(\frac{1}{2}\), \(\frac{-1}{-1}\), and 0 respectively. Hence, \({R}^{RG}(u_{i}) = \frac{(1-\frac{0}{3}) + (1-\frac{1}{2}) + (1-\frac{-1}{-1}) + (1-0)}{4} = 0.625\). The engagingness of \(u_i\) can be computed in the same manner.
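The responsiveness side of the RG model (Eq. 10.17) can be sketched as below. The event-list representation of a mailbox sequence is hypothetical; the ratio handling follows the definitions above, with the extra convention that a reply with no intervening received emails contributes a ratio of 0:

```python
def reply_gap_responsiveness(seq, pairs):
    """RG-model responsiveness for one user.
    seq   : the user's mailbox events in time order, e.g.
            ["e1", "e2", "e3", "e4", "r1", "r4", "r2"]
    pairs : maps each received email -> its reply event (None if unreplied)."""
    received = [e for e in seq if e in pairs]
    total = 0.0
    for e in received:
        reply = pairs[e]
        if reply is None:
            ratio = 1.0  # n_r = n_o = -1: an unreplied email earns no credit
        else:
            lo, hi = seq.index(e), seq.index(reply)
            between = [x for x in seq[lo + 1:hi] if x in pairs]
            n_r = len(between)
            # out-of-order: emails received after e but replied to before e's reply
            n_o = sum(1 for x in between
                      if pairs[x] is not None and seq.index(pairs[x]) < hi)
            ratio = n_o / n_r if n_r > 0 else 0.0
        total += 1.0 - ratio
    return total / len(received) if received else 0.0
```

On the sequence from the worked example, this returns 0.625.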

10.2.4 Social Cognitive Models

A social cognitive model is based on social cognitive theory, which suggests that people learn by watching what others do [8]. Models of this kind measure a user’s engagingness and responsiveness by observing how other users react to emails sent by that user and how users interact with one another. In this paper, we introduce a random walk (RW) social cognitive model.

For engagingness, each user \(u_k\) perceives a user \(u_i\) to be more engaging than another user \(u_j\) if, among the emails in \(u_k\)’s mailbox, more emails from \(u_i\) are replied to ahead of emails from \(u_j\). For instance, suppose that \(u_k\) has the email sequence \(seq_{k} =\langle e_{1}(u_{1},\{u_{k}\}),e_{2}(u_{2},\{u_{k}\}),e'_{2}(u_{k},\{u_{2}\}),e'_{1}(u_{k},\{u_{1}\})\rangle\), where \(e_{v}(u_{x},\{U_{y}\})\) denotes email \(e_v\) sent by \(u_x\) to recipients \(U_y\) and \(e'_v\) denotes the reply to email \(e_v\). \(u_k\) receives \(e_1\) before \(e_2\), but the reply \(e'_1\) comes after \(e'_2\). This indicates that \(u_k\) considers \(u_2\) more important than \(u_1\), and hence that \(u_2\) is more engaging than \(u_1\) from \(u_k\)’s standpoint. Based on this observation, we say that \(u_k\) observes the engagingness superiority of \(u_2\) over \(u_1\).

Similarly for responsiveness, \(u_k\) perceives a user \(u_1\) to be more responsive than another user \(u_2\) if \(u_k\) observes reply emails from \(u_1\) earlier than from \(u_2\) for the same emails sent to both, whether those emails come from \(u_k\) or from other users.

Formally, we represent an engagingness weighted directed graph \(G^E = \langle U, L^E\rangle\) as follows:

  • U represents the set of all users.

  • \(L^E\) consists of directed edges. When, in the mailbox of some \(u_k\), \(u_i\) has \(x_k\) emails replied to ahead of emails from \(u_j\), we represent this by a directed edge \(u_j \rightarrow u_i\).

  • The weight of \(u_j \rightarrow u_i\), \(weight(u_j \rightarrow u_i)\), is the sum of the \(x_k\)’s over all users \(u_k\). The larger \(weight(u_j \rightarrow u_i)\) is, the more users observe that \(u_i\) is more engaging than \(u_j\).

In a similar manner, we can define a responsiveness weighted directed graph \(G^R = \langle U, L^R\rangle\).

The engagingness (responsiveness) weighted directed graph is then further processed to derive the degree of engagingness (responsiveness) of each user. The graph captures the perceived relative differences between users in engagingness (responsiveness), but it does not immediately assign scores to them. We therefore perform a random walk on the engagingness (responsiveness) graph and take the stationary probabilities of visiting the users as their engagingness (responsiveness) values.

Fig. 10.4 Social cognitive model. (a) Engagingness weighted directed graph \(G^E\). (b) Engagingness graph for random walks

The random walk process on the engagingness graph that yields the user engagingness scores \(E^{RW}(u_k)\) consists of the following steps:

  1. Determine the largest aggregated out-edge weight, \(\mathit{MaxWeight} = \max _{u_{j}}\{\sum _{u_{i}}weight(u_{j} \rightarrow u_{i})\}\).

  2. For each user \(u_j\):

     (a) \(sum_{j} = 0\)

     (b) For each edge \(u_j \rightarrow u_i\):

        (i) Assign the transition probability \(p(u_{j},u_{i}) = \frac{weight(u_{j}\rightarrow u_{i})}{\mathit{MaxWeight}}\)

        (ii) \(sum_{j} = sum_{j} + p(u_{j},u_{i})\)

     (c) Distribute the remaining probability mass over all users: create an edge \(u_j \rightarrow u_t\) with \(p(u_{j},u_{t}) = \frac{1 - sum_{j}}{\vert U\vert }\) if \(u_j \rightarrow u_t\) does not exist; otherwise increment \(p(u_{j},u_{t})\) by \(\frac{1 - sum_{j}}{\vert U\vert }\).

  3. For each user \(u_i\), initialize \(E_{new}^{RW}(u_i)\) randomly.

  4. Repeat:

     (a) For each \(u_i\), \({E}^{RW}(u_{i}) = E_{new}^{RW}(u_{i})\)

     (b) For each \(u_i\), \(E_{new}^{RW}(u_{i}) =\sum _{u_{j}\rightarrow u_{i}}p(u_{j},u_{i}) \cdot {E}^{RW}(u_{j})\)

  5. Until \(\vert {E}^{RW}(u_{i}) - E_{new}^{RW}(u_{i})\vert \leq \epsilon \) (see Footnote 2) for all \(u_i\)’s.

To illustrate the above algorithm, consider the example in Fig. 10.4. Users observe the engagingness superiority of \(u_2\) over \(u_1\) with \(weight(u_1 \rightarrow u_2) = 0.9\), and of \(u_1\) over \(u_2\) with \(weight(u_2 \rightarrow u_1) = 0.4\). In Fig. 10.4a, the total engagingness weight of \(u_1\) to nodes \(u_2\) and \(u_3\) in \(G^E\) is \(weight(u_{1}) = weight(u_{1} \rightarrow u_{2}) + weight(u_{1} \rightarrow u_{3}) = 1.4\); similarly, the total weights of \(u_2\) and \(u_3\) are both 0.6. Each edge weight is then normalized by the maximum total weight, \(\mathit{MaxW} = weight(u_1) = 1.4\). For example, \(p(u_{2},u_{3}) = \frac{weight(u_{2}\rightarrow u_{3})}{\mathit{MaxW}} = \frac{0.2}{1.4}\). For nodes whose total normalized weight is less than 1, the unused probability mass is spread as equal-weight edges to all nodes. For example, \(u_2\) has unused mass \(\frac{\mathit{MaxW} - weight(u_{2})}{\mathit{MaxW}} = \frac{1.4 - 0.6}{1.4}\); after adding the new edges, \(p(u_{2},u_{3}) = \frac{0.2}{1.4} + \frac{1.4-0.6}{1.4} \cdot \frac{1}{3} = 0.33\). The resulting engagingness graph is row-stochastic: its rows are nonnegative and each row sums to 1. This stochastic matrix can be viewed as the transition matrix of a Markov chain, where each entry \((u_i, u_j)\) represents the probability of a transition from state \(u_i\) to state \(u_j\).
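The whole procedure, from edge-weight normalization through power iteration to the stationary distribution, can be sketched as follows. The `weights` dictionary encoding and function name are hypothetical; the normalization and leftover-mass spreading follow the algorithm above:

```python
def random_walk_scores(users, weights, eps=1e-9, max_iter=10000):
    """Stationary scores on the engagingness graph.
    weights[(uj, ui)] = observed engagingness superiority of ui over uj."""
    n = len(users)
    idx = {u: i for i, u in enumerate(users)}
    out = {u: 0.0 for u in users}  # aggregated out-weight per node
    for (uj, _), w in weights.items():
        out[uj] += w
    max_w = max(out.values())  # MaxWeight in the algorithm
    P = [[0.0] * n for _ in range(n)]
    for (uj, ui), w in weights.items():
        P[idx[uj]][idx[ui]] += w / max_w
    for u in users:  # spread each node's unused mass over all users
        leftover = (1.0 - out[u] / max_w) / n
        for t in range(n):
            P[idx[u]][t] += leftover
    score = [1.0 / n] * n
    for _ in range(max_iter):  # power iteration until convergence
        new = [sum(P[j][i] * score[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, score)) <= eps:
            break
        score = new
    return dict(zip(users, new))
```

Running it on edge weights consistent with Fig. 10.4 (with an assumed edge \(u_3 \rightarrow u_2\) of weight 0.6 for \(u_3\)) yields a proper probability distribution in which \(u_2\), the most observed-to-be-engaging user, scores highest.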

10.3 Email Reply Order Prediction

We now consider email reply order prediction, which has the following setup. Given a pair of emails (e i , e j ) sent to the same user u from users u i and u j respectively, we want to determine the order in which the two emails will be replied to. Here, we assume that both e i and e j require replies and that u i and u j are not the same person. The outcome of the prediction is either e i first or e j first.

Our proposed method is to train a Support Vector Machine (SVM) classifier using labeled email pairs, and to apply the trained classifier to unseen email pairs. For each email pair, we derive features directly from the emails themselves and from their senders, including the previous emails the senders have sent and received. There are three types of features, namely: (a) comparative email features (\(\mathbb{E}\)), (b) comparative interaction features (\(\mathbb{I}\)), and (c) comparative behavior features (\(\mathbb{B}\)).

Table 10.4 lists the email features used in our classifier. For each email feature f k , we derive a corresponding comparative feature f k c of an email pair (e i , e j ) by

$$(e_{i},e_{j}).f_{k}^{c} = e_{ i}.f_{k} - e_{j}.f_{k}.$$

For the email send time feature t(e), we further convert positive and negative comparative feature values to 1 and − 1 respectively. Interaction features are the set of features derived from the interaction between the sender of an email and the common recipient u r , as shown in Table 10.5. The behavior features are the eight E M and eight R M behavior scores of the email senders. The comparative interaction and behavior features are defined similarly to the comparative email features. In the following sections, we discuss the non-behavior features in more depth.

Table 10.4 Email features \(\mathbb{E}\)
Table 10.5 Interaction features \(\mathbb{I}\)

Formally, we formulate email reply order prediction as a binary classification problem. Each email pair is assigned a class label such that

$$\left \{\begin{array}{ll} \mbox{ Class } = -1&\text{if }t(e_{i}) - t(e_{j}) < 0 \\ \mbox{ Class } = 1 &\text{if }t(e_{i}) - t(e_{j}) > 0\end{array} \right..$$

The class label stands for u’s preference to reply to e i before or after e j , with the assumption that e i and e j are received and replied to by u. Now suppose that E ET(u i ) = 0.8 and E ET(u j ) = 0.4. If we consider E ET to be a feature (f 1), the comparative feature of f 1 is \({E}^{ET}(u_{i}) - {E}^{ET}(u_{j}) = 0.8 - 0.4 = 0.4\). Furthermore, if we suppose t(e i ) − t(e j ) < 0, the feature vector used in the SVM can be represented as { − 1 f 1:0.4 …}.
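The comparative feature and class label computations above can be sketched directly; the SVMlight-style row formatting at the end is an assumption about the training file format, not something the chapter specifies:

```python
def comparative_feature(fi, fj):
    """Comparative feature of a pair: (e_i, e_j).f_k^c = e_i.f_k - e_j.f_k."""
    return fi - fj

def reply_order_label(t_i, t_j):
    """Class label: -1 if e_i is replied to first (t(e_i) - t(e_j) < 0)."""
    return -1 if t_i - t_j < 0 else 1

# Worked example from the text: E^ET(u_i) = 0.8, E^ET(u_j) = 0.4,
# and e_i is replied to first (t(e_i) - t(e_j) < 0).
label = reply_order_label(1.0, 2.0)       # -1
f1c = comparative_feature(0.8, 0.4)       # 0.4
row = f"{label} 1:{f1c}"                  # "-1 1:0.4", SVMlight-style row
```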

10.3.1 Non-behavior Features

10.3.1.1 Email Features \(\mathbb{E}\)

As shown in Table 10.4, Feature No. 1 represents the order in which emails e i and e j are received. For simplicity, let us denote the feature value by f 1. \(f_{1} = -1\) if e i arrived before e j , f 1 = 1 if e i arrived after e j , and f 1 = 0 if e i and e j arrived at the same time. Features No. 2–4 measure the amount of effort required by a replier in terms of reading the content of a received email and writing the content of a reply email. The size of an email e represents the reading effort, while the size of the reply email r(e) stands for the writing effort. Feature No. 5 counts the number of recipients of an email e, based on the observation that an email sent to many recipients is unlikely to be replied to. Features No. 6–8 measure indegree, outdegree, and total degree, respectively. Given the sender Sdr(e) of an email e, the indegree of the sender is the number of users who send emails to the sender. The outdegree of the sender is the number of neighbors who receive emails from the sender. The total degree of the sender is the total number of users who exchange emails with the sender. Features No. 9 and 10 are the total numbers of emails sent or received by a user, while Features No. 11 and 12 are the average numbers of emails sent or received by a user per day. Features No. 13 and 14 estimate the proportion of reply emails in a user’s sent and received emails, while Features No. 15 and 16 compute the proportion of emails that a user replies to or receives a reply for. Features No. 17 and 18 represent the average response time for the reply emails sent or received by a user.
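For illustration, the degree features (Nos. 6–8) can be computed from a log of (sender, recipients) pairs. The schema below is hypothetical, standing in for the parsed Enron headers:

```python
from collections import defaultdict

def degree_features(emails):
    """Indegree, outdegree, and total degree per user.

    `emails` is a list of (sender, recipients) pairs -- an assumed schema.
    Indegree counts distinct users who sent to u; outdegree counts distinct
    users u sent to; total degree counts distinct users u exchanged with.
    """
    out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
    for sender, recipients in emails:
        for r in recipients:
            out_nbrs[sender].add(r)
            in_nbrs[r].add(sender)
    users = set(out_nbrs) | set(in_nbrs)
    return {u: {"in": len(in_nbrs[u]),
                "out": len(out_nbrs[u]),
                "total": len(in_nbrs[u] | out_nbrs[u])} for u in users}

log = [("a", ["b", "c"]), ("b", ["a"]), ("c", ["a"])]
deg = degree_features(log)
# a sends to {b, c} and receives from {b, c}: out=2, in=2, total=2
```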

10.3.1.2 Interaction Features \(\mathbb{I}\)

Recall the framework of our email reply order prediction task, where u receives the emails from u i and u j , and then replies to either e i or e j first. Feature No. 19 counts the number of emails from u i to u. We expect that u is likely to reply to e i earlier than e j if u i usually sends more emails to u than u j does. Similarly, Feature No. 20 counts the number of emails from u to u i . Feature No. 21 counts the total number of emails exchanged between u i and u. Feature No. 22 counts the number of reply emails from u i to u, and Feature No. 23 counts the number of reply emails from u to u i . Similarly, Feature No. 24 counts the total number of reply emails exchanged between u and u i . Feature No. 25 estimates the proportion of emails from u i to u that are replied to. Feature No. 26 computes the proportion of emails replied to out of the emails sent by u to u i . We expect that u is likely to quickly reply to u i if u i also responds to most of the emails received from u. Feature No. 27 measures the ratio of the total number of replies to the total number of emails exchanged between u and u i . Feature No. 28 computes the average response time over all reply emails from u i to u, and Feature No. 29 computes the average response time from u to u i . Feature No. 30 counts the number of threads shared between u i and u, since users who are involved in many threads together are likely to be co-workers. Feature No. 31 counts the threads in which u and u i actively participate. We define active participants as users who send at least one email in a thread.
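A few of these per-pair counts can be sketched as follows. The (sender, recipients, is_reply) triples are an assumed schema, and the final proportion is one plausible reading of Feature No. 25 (replies from u to u i over emails from u i to u):

```python
def interaction_counts(emails, u, ui):
    """Raw counts behind some of the interaction features in Table 10.5.

    `emails` is a list of (sender, recipients, is_reply) triples -- a
    hypothetical schema. Returns (#emails u_i -> u, #emails u -> u_i,
    #replies u -> u_i, proportion of u_i's emails to u that u replies to).
    """
    from_ui = sum(1 for s, r, _ in emails if s == ui and u in r)
    to_ui = sum(1 for s, r, _ in emails if s == u and ui in r)
    replies_to_ui = sum(1 for s, r, rep in emails
                        if s == u and ui in r and rep)
    replied_prop = replies_to_ui / from_ui if from_ui else 0.0
    return from_ui, to_ui, replies_to_ui, replied_prop

log = [("ui", ["u"], False),   # u_i emails u
       ("u", ["ui"], True),    # u replies to u_i
       ("ui", ["u"], False)]   # u_i emails u again, not yet replied to
counts = interaction_counts(log, "u", "ui")
```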

10.4 Empirical Study

10.4.1 Set-Up

10.4.1.1 Dataset

For our task, we used the Enron email dataset that is publicly available at http://www.cs.cmu.edu/~enron. This dataset provides 517,431 emails for 151 Enron employees. Each email message contains a unique message_ID, header information such as the date and time when the message was sent/received, sender, recipients (To and Cc lists), subject and body in plain text format.

Using the email thread assembly algorithm (please see Appendix), we created a link database that stores a pair of emails linked via Reply or Forward relationships. The database consists of 34,008 links which includes 27,730 Reply links and 6,278 Forward links. These binary links make up a total of 18,593 threads that connect 52,601 emails.

10.4.1.2 Data Characteristics

We conducted some analysis on the preprocessed email dataset to derive statistics on how Enron employees use and reply to/forward emails. The interesting findings include:

  1.

    52.6 K emails are involved in some thread.

  2.

    A large majority ( > 90 %) of the 18.5 K threads are short, with two email messages each.

  3.

    A large majority of threads last for at most 1 day.

  4.

    A large majority of emails are replied to within a day.

  5.

    A user’s response time is correlated with the number of emails he receives, the number of users he emails, and the number of users emailing him.

10.4.1.3 Evaluation Metric

We are not able to evaluate our behavior models directly because ground truth is absent from the Enron dataset. Instead, we evaluate them indirectly by comparing the four types of behavior models on the Enron dataset. To compare the ranked user lists produced by two models, we use the Kendall τ distance measure. In each ranked list, the first- and last-ranked users represent the most and least engaging (or responsive) users respectively. Formally, we denote the rank of a user u i in a ranked list L k by l k (u i ). The Kendall τ distance between two ranked lists L 1 and L 2 is defined as \(\frac{K(L_{1},L_{2})} {\frac{1} {2} n(n-1)}\) such that \(K(L_{1},L_{2}) = \vert (u_{i},u_{j}) : u_{i} < u_{j},(l_{1}(u_{i}) < l_{1}(u_{j})\wedge l_{2}(u_{i}) > l_{2}(u_{j}))\vee (l_{1}(u_{i}) > l_{1}(u_{j})\wedge l_{2}(u_{i}) < l_{2}(u_{j}))\vert \). Note that the Kendall τ distance is 0 if l 1 = l 2 for all users, and 1 if one list is the reverse of the other [3, 5].
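The normalized Kendall τ distance above can be computed directly from two rank maps; a minimal sketch:

```python
from itertools import combinations

def kendall_tau_distance(l1, l2):
    """Normalized Kendall tau distance between two rankings.

    l1 and l2 map each user to its rank in lists L_1 and L_2.
    Returns 0 for identical orderings and 1 when one ordering is
    the exact reverse of the other.
    """
    users = list(l1)
    n = len(users)
    discordant = sum(
        1 for u, v in combinations(users, 2)
        if (l1[u] - l1[v]) * (l2[u] - l2[v]) < 0  # pair ordered differently
    )
    return discordant / (n * (n - 1) / 2)
```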

10.4.2 Analyzing Behavioral Models

10.4.2.1 Correlation Between Engagingness and Responsiveness

We first show the correlation between engagingness and responsiveness in our proposed models. Table 10.6 shows the Kendall τ distance between two lists, where the Enron employees in one list are ordered by engagingness scores and the same employees in the other list are ordered by responsiveness scores under each model. By definition, a Kendall τ distance of 0 means the two lists match perfectly, while a distance of 1 means one list is the reverse of the other. Interestingly, for our proposed models most τ distances range between 0.4 and 0.5. These results indicate that engaging employees are not necessarily the same as responsive employees in the Enron email data.

Table 10.6 Kendall τ distance between engagingness and responsiveness

10.4.2.2 Correlation Between Different Models

Tables 10.7 and 10.8 show the correlations of pairs of different models by engagingness and responsiveness respectively. For instance, we calculate the Kendall τ distance between two lists, where employees in one list are ordered by E EC and the same employees in the other list are ordered by E ER; this distance is 0.14, as shown in Table 10.7. In general, our proposed models are more correlated by responsiveness than by engagingness. The email based models, i.e., ER, ET, ES, and TS, are highly correlated in both engagingness and responsiveness. On the other hand, the social cognitive approach shows low correlation with the other models: the Kendall τ distances between RW and the other models are 0.26 on average, while the distances among the other models are considerably smaller. For example, the Kendall τ distance between E ES and E RW is 0.24, and the Kendall τ distance between R RG and R RW is 0.27. In the social cognitive approach, each user u k perceives a user u i to be more engaging than another user u j if more emails from u i are replied to ahead of emails from u j , based on the emails in the mailbox of u k . Our further investigation reveals that most emails tend to be replied to following the last-in-first-out principle. While some users may reply to emails in the order they arrive (first-in-first-out), most users exhibit a strong recency bias towards more recently received emails, which appear higher in the inbox. As a result, there are few emails from u i that are replied to ahead of emails from u j based on the emails in the mailbox of u k . For instance, consider Sean Crandall (u k ), Fran Chang (u i ), and Alan Comnes (u j ) in the Enron dataset.
u i has a set of replied emails {e′ 126, e′ 127, e′ 15,126, e′ 15,129, e′ 15,456, e′ 15,457, e′ 15,458, e′ 15,459, e′ 27,518}, where subscripts stand for email IDs. Similarly, u j has a set of replied emails {e′ 400, e′ 3,065, e′ 9,321, e′ 12,248, e′ 17,495, e′ 19,143, e′ 19,144, e′ 19,672}. Then, u k has a sequence of emails ordered by time {e 9,321, e 19,675, e′ 19,672, e 15,126, e 15,129, e′ 15,129, e 126, e′ 126, e 127, e′ 127, e 19,144, e′ 19,144, e 19,143, e′ 19,143, e 400, e 15,495, e 3,065, e′ 3,065, e 27,518, e′ 27,518, e 15,457, e 15,456, e 15,458, e 12,248, e 17,495}. Some emails, such as e 15,129 and e 126, are replied to right after they arrive at the recipient. On the other hand, for each replied email of Alan Comnes (e.g., e 3,065), there is no email from Fran Chang that arrives before it yet is replied to after the reply to Alan Comnes.

Table 10.7 Kendall τ distance between two models by engagingness
Table 10.8 Kendall τ distance between two models by responsiveness

Interestingly, the email thread based model shows results similar to those of the email count model, for both engagingness and responsiveness. This is because there are few forwarded emails among the 151 Enron employees. From our email thread assembly, we obtained 7,291 email threads with three or more emails each. In addition, we observed cases where an email sent by a sender is forwarded by the recipients, and the sender finally receives reply emails not from the recipients but from some other users; there are only 313 such emails among the 151 Enron employees. For E TC, only one thread contains eight forwarding emails, while most threads include at most one or two forwarding emails. Such a small number of forwarding emails causes TC to behave similarly to EC.

10.4.2.3 Most Engaging and Responsive Users

Table 10.9 shows the top five engaging users and the top five responsive users after averaging the ranks from our proposed models. The table shows that the two sets of top users are different, consistent with our earlier results. It is interesting to note that most of the top engaging users are traders, while, other than CEO John Lavorato, the top responsive users are general employees. Notably, there are no common actors between the two top-five lists; in other words, no actor among the 151 Enron employees is both highly engaging and highly responsive. This result is consistent with Table 10.6.

Table 10.9 Top-five users by engagingness and responsiveness. Note that we derive the overall engagingness and responsiveness of each user by averaging the engagingness and responsiveness of different models

10.4.2.4 Role Analysis in Correlation

Figure 10.5 shows the scatter plot of engagingness and responsiveness scores of Enron employees with different roles. In the figure, we present only the 93 Enron employees whose job positions are known: 3 chief executive officers, 9 directors, 35 employees, 3 house lawyers, 8 managers, 2 managing directors, 4 presidents, 12 traders, and 17 vice presidents. Since the job positions of the remaining employees among the 151 Enron users are not known, we exclude them from Fig. 10.5. In particular, we show the engagingness and responsiveness scores of Enron employees with different roles under the email reply time model. Note that most employees, managers, and traders tend to have higher engagingness scores than the other actors; in other words, employees, managers, and traders can effectively solicit responses from other actors. In contrast, vice presidents show a wide range of engagingness scores. Unlike engagingness, we observed that responsiveness is not correlated with particular job appointments. Rather than actor roles, responsiveness is more related to actors’ individual personalities: some actors are willing to respond to other actors, while others are not.

Fig. 10.5

Actor role in the email reply time model (X-axis and Y-axis denote E ET and R ET, respectively). Since the other models show similar results to the email reply time model, we omit those of the other models. (a) CEO and house lawyer. (b) President and vice president. (c) Employee. (d) Manager. (e) Director and managing director. (f) Trader

10.4.3 Email Reply Order Prediction

The goal of this experiment is to evaluate the performance of our proposed classification approach for predicting email reply order. We also want to examine the usefulness of engagingness and responsiveness behaviors in the prediction task. Five SVM classifiers are trained, namely: (a) using comparative email and interaction features (denoted by \(\text{SVM}_{\mathbb{E}+\mathbb{I}}\)); (b) using comparative behavior features only (denoted by \(\text{SVM}_{\mathbb{B}}\)); (c) using all features (denoted by \(\text{SVM}_{\mathbb{U}}\)); (d) using comparative email and interaction features except t(e)Footnote 3 (denoted by \(\text{SVM}'_{\mathbb{E}+\mathbb{I}}\)); and (e) using all features except t(e) (denoted by \(\text{SVM}'_{\mathbb{U}}\)). Classifiers (d) and (e) are included because an earlier study [4] has shown that email replies often follow the last-in-first-out principle; \(\text{SVM}'_{\mathbb{E}+\mathbb{I}}\) and \(\text{SVM}'_{\mathbb{U}}\) allow us to find out whether we can predict without knowing the email time information. From the 27,730 email reply relationships, we extracted a total of 19,167 email pairs for the prediction task. The emails in each pair have replies that come after both emails are received by the same user. For each email pair, we computed feature values based only on email data that occurred before the pair. In addition, we used complement email pairs in training. The complement of an email pair (e i ,e j ) with class label c is the email pair (e j ,e i ) with class label \(\bar{c}\). Fivefold cross-validation was used to measure the average accuracy of the classifiers over the five folds. The accuracy measure is defined by \(\frac{\#\ \mathit{correctly}\ \mathit{classified}\ \mathit{pairs}} {\#\ \mathit{email}\ \mathit{pairs}}\).
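This training setup can be sketched as follows, using scikit-learn's SVC in place of whichever SVM implementation the authors used, and random vectors standing in for the real comparative feature vectors. The complement-pair augmentation exploits the fact that swapping (e i , e j ) negates every comparative feature (a difference) and flips the class label:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # stand-in comparative feature vectors
y = np.where(X[:, 0] < 0, -1, 1)      # -1 means e_i is replied to first

# Complement pairs: (e_j, e_i) has negated features and the flipped label.
X_aug = np.vstack([X, -X])
y_aug = np.concatenate([y, -y])

clf = SVC(kernel="linear")
scores = cross_val_score(clf, X_aug, y_aug, cv=5)  # fivefold cross-validation
accuracy = scores.mean()
```

Because the synthetic labels here are determined by the first feature alone, the linear classifier recovers them almost perfectly; real comparative features are far noisier, as the reported accuracies in Table 10.10 reflect.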

Table 10.10 shows the results of all five SVM classifiers. \(\text{SVM}_{\mathbb{U}}\) produces the highest accuracy of 77.04 % due to the use of all available features. By excluding the email arrival order feature, the accuracy (of \(\text{SVM}'_{\mathbb{U}}\)) drops to 68.97 %. This performance is still reasonably good given that random prediction gives an accuracy of 50 %. These results show that the email arrival order feature is important in the prediction task. We notice, however, that behavior features contribute to prediction accuracy especially when the email arrival order feature is not available.Footnote 4

Table 10.10 Results of email reply order prediction

Table 10.11 depicts the top ten features for the \(\text{SVM}_{\mathbb{U}}\) classifier. The table shows that engagingness based on the email reply time model ET is the most discriminative feature. This suggests that engagingness and responsiveness are useful in predicting email reply order.

Table 10.11 Top-ten features for \(\text{SVM}'_{\mathbb{U}}\)

10.5 Related Work

We first review related work on engagingness and responsiveness behavior modeling. Engagingness and responsiveness behaviors have not been well studied in the past. There is one work on responsiveness [1] (though it is limited) but no work on engagingness. In [1], the responsiveness behavior of a user (in the context of the Enron email dataset) was defined as the average deviation of the user’s response time from that of the other users. Users with positive deviations are considered lethargic and those with negative deviations responsive.

Since we use the Enron dataset, we also review other research on this dataset and compare it with our work. These works can be divided into:

  • Knowledge extraction: Rowe et al. present an automatic method for extracting social hierarchy data from user communication behavior on the Enron dataset [12]. Such communication patterns are captured by computing a social score based on a set of features: number of emails, average response time, number of cliques, degree centrality, clustering coefficient, mean shortest path length from a specific vertex to all vertices in the graph, betweenness centrality, and Hubs-and-Authorities importance. Then, by performing behavior analysis and determining the communication patterns, their method ranks the main users of an organization, groups similarly ranked and connected users to reproduce the organizational structure, and characterizes relationship strengths among users. Pathak et al. investigate socio-cognitive networks based on email communication in an organization [11]. Socio-cognitive network analysis involves understanding who knows who knows whom in a social network. For this analysis, the authors propose a model using probability distributions for communication probabilities, in which a Bayesian inference technique is used to update the probabilities.

  • Email thread detection: To exploit parent-child relationships from email messages, grouping messages together based on which messages are replies to which others, Yeh and Harnly propose email thread detection using undocumented header information from the Microsoft Exchange Protocol and string similarity metrics [14]. Then, their method recovers missing messages from email threads.

  • Email label prediction: Karagiannis and Vojnovic study various parameters, including email size, the number of recipients per email, the roles of the sender and recipient in the organization, the information load on the user, etc., and their effect on reply probability and response time [4]. While their results shed some interesting insights into how these parameters affect users’ replying behavior, further research is required to actually implement a learning model that can automatically prioritize emails based on these findings. Interestingly, through our experimental analysis, we found that email replies often follow the last-in-first-out principle, which has also been reported by Karagiannis and Vojnovic [4]. The study in [15] builds a supervised classifier to automatically label emails with priority levels on a scale of 1–5. Their model primarily focuses on graph-based metrics such as node degree, centrality, clique count, etc., derived from the underlying social networks of users. McCallum et al. present the author-recipient-topic model, which learns topic distributions based on the direction-sensitive messages sent between users [7]. In particular, this model builds on Latent Dirichlet Allocation and the author-topic model, with the distribution over topics conditioned distinctly on both the sender and recipient according to the relationships between users. Unlike our models, the authors explore the Enron dataset mostly from a Natural Language Processing (NLP) perspective. Recently, On et al. conducted a preliminary study of behavior models on mobile social networks [10].

  • Email interaction prediction: To predict whether emails need replies, Dredze et al. present a logistic regression model with a variety of features e.g., dates and times, salutations, questions, and header fields of emails [2].

10.6 Conclusion

In a nutshell, we have studied user engagingness and responsiveness behaviors in an email network. We have developed four types of behavior models based on different characterization principles and evaluated them using the Enron dataset. We have also applied the models to the email reply order prediction task.

This work is a significant step beyond the usual node and network statistics towards deriving node behavior measures for a given network. While our results are promising, there is still much room for further research. We will develop new behavior models based on probability and email content. We also plan to conduct a more comprehensive study of engagingness and responsiveness behaviors on a much larger and more complete information exchange dataset (e.g., tweets, blogs, SMS). This will address some shortcomings of the existing Enron dataset, which does not contain the complete emails of each user. We will also expand our work by applying the behaviors to other interesting email prediction tasks.

10.7 Appendix

Email Thread Assembly The algorithm to identify reply and forward relationships among emails is as follows:

  • Step 1: Group all emails with matching subjects after ignoring the prefixing Reply (RE, Re) and Forward (FW, Fw, FWD, Fwd) tags from the subject field. Emails with blank subject or subject matching “no subject” are ignored.

  • Step 2: Sort emails in each subject group by date and time. We use the Perl module Time::Local to convert the message date and time into an integer that indicates the number of seconds since the system epoch (midnight, January 1, 1970 GMT).

  • Step 3: For each email e 1 whose subject starts with one of the Reply or Forward tags (Re, RE, Fw, FW, Fwd, FWD)

    • Step 3a: Scan all the previous emails in its group

    • Step 3b: Find the most recent email e 2 such that the sender of e 1 is one of the recipients of e 2

    • Step 3c: If the subject of e 1 begins with a Reply tag, also check that the sender of e 2 is one of the recipients of e 1

  • Step 4: Compute the time difference t(e 1) − t(e 2)

  • Step 5: If t(e 1) − t(e 2) ≤ ξ, add link e 1 → e 2 to indicate that e 1 is a reply or forwarded email of e 2

Here, the parameter ξ specifies the time window between emails e 1 and e 2 within which the link is considered a valid thread link. In our experiments, we set ξ = 90 days (3 months) and discard pairs with a larger time difference.
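The steps above can be sketched as follows. The email dict schema ('id', 'subject', 'time', 'sender', 'recipients') is an assumption about the parsed data, not the paper's actual representation, and epoch seconds replace the Perl Time::Local conversion:

```python
import re
from collections import defaultdict

# Leading Reply/Forward tag (Re, RE, Fw, FW, Fwd, FWD), case-insensitive.
TAG = re.compile(r'^\s*(re|fw|fwd)\s*:\s*', re.IGNORECASE)

def normalize_subject(subject):
    """Strip all leading Reply/Forward tags from a subject line (Step 1)."""
    prev = None
    while prev != subject:
        prev, subject = subject, TAG.sub('', subject)
    return subject.strip().lower()

def assemble_threads(emails, xi=90 * 24 * 3600):
    """Link each tagged email to its most recent plausible parent (Steps 1-5).

    `emails` is a list of dicts with keys 'id', 'subject', 'time' (epoch
    seconds), 'sender', 'recipients'. Returns (child_id, parent_id) links;
    xi is the 90-day window in seconds.
    """
    groups = defaultdict(list)
    for e in emails:
        subj = normalize_subject(e['subject'])
        if subj and subj != 'no subject':   # ignore blank / "no subject"
            groups[subj].append(e)
    links = []
    for group in groups.values():
        group.sort(key=lambda e: e['time'])          # Step 2
        for i, e1 in enumerate(group):
            if not TAG.match(e1['subject']):         # Step 3: tagged only
                continue
            is_reply = e1['subject'].lower().lstrip().startswith('re')
            for e2 in reversed(group[:i]):           # Steps 3a-3c
                if e1['sender'] in e2['recipients'] and (
                        not is_reply or e2['sender'] in e1['recipients']):
                    if e1['time'] - e2['time'] <= xi:  # Steps 4-5
                        links.append((e1['id'], e2['id']))
                    break
    return links
```

For example, an original message from alice to bob followed an hour later by bob's "RE:" to alice yields the single link (reply id, original id).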