
1 Introduction

Enterprises constantly share and exchange digital documents containing sensitive information: internally with employees, and externally with partners and customers. With the increase in data sharing, incidents of data breaches have also increased significantly [2]. A data breach is an incident in which sensitive, protected or confidential data, such as personally identifiable information (PII), personal health information (PHI), trade secrets and enterprise financial information, is maliciously or inadvertently viewed, stolen or used by unauthorized entities. Until recently, the most common notion of a data breach involved only malicious attackers hacking into enterprise networks to steal sensitive information. Data Loss Prevention (DLP) products, e.g., McAfee and Symantec, are a class of software that detects and prevents such breaches by continuously monitoring sensitive data and applying appropriate encryption based on the sensitivity of the data.

However, according to a 2008 study by Johnson [14], inadvertent disclosure of sensitive information represents one of the largest classes of security breaches, exceeding even the number of malicious data attacks. The consequences of such inadvertent disclosures for enterprises can be severe: compromised customer privacy, lost market share or damaged intellectual property. DLP software cannot tackle inadvertent disclosures since they occur outside the enterprise boundary. This has led to the development of Digital Rights Management (DRM) solutions, e.g., Microsoft's, which are designed to protect sensitive data beyond the enterprise boundary as well, by ensuring that only intended recipients can access the shared sensitive information regardless of their location.

The data shared under DRM vary in the degree of sensitive information they contain. A security policy is applied to protect the data against unauthorized access. A security policy is a collection of information that includes confidentiality settings and a list of authorized users corresponding to those settings. The confidentiality settings in the policy determine how recipients can use the shared data; for example, whether recipients can print, copy, or edit text in a protected document is dictated by the confidentiality settings corresponding to those recipients.

There are two major pitfalls with existing DRM solutions in identifying data breaches. First, data breach identification in DRM is heavily based on keyword matching against a manually curated dictionary, which severely limits its capabilities [13]. Further, the policies in DRM systems are manually assigned, an error-prone process [20]. With the increase in online document transactions, such a manual process cannot scale to enterprise document volumes, calling for an automated approach to protect against unauthorized access to information.

In this paper, we present an algorithm that analyzes the sensitive information in a document and suggests appropriate access to it, thus mitigating the risk of inadvertent disclosures. Our algorithm analyzes historic information and extracts the semantics of the underlying content. The access permissions (which constitute the document's DRM policy) associated with the information are then analyzed to identify the content-access correspondence via a multi-label classifier formulation. The resulting classifier is capable of recommending policies/access permissions for any new document.

This paper is organized as follows. In Sect. 2, we describe the existing literature in the light of our problem. In Sect. 3, we formulate DRM policy modeling in a multi-label classification framework [17] and adapt the cost function to simultaneously optimize for precision and recall, to suit the needs of policy modeling. In Sect. 4, we compare the proposed framework against several alternative frameworks as well as state-of-the-art security modeling systems, showing the viability and superior performance of the proposed framework. Section 5 concludes the paper.

2 Related Work

While automatic recommendation of security/DRM policies for documents is less studied, one direction of exploration that is close to our problem is the prediction of email recipients based on content. Carvalho and Cohen [8] proposed an algorithm to generate a ranked list of the intended recipients of an email via several algorithmic formulations, including an expert search framework and a multi-class framework. Graus et al. [12] address the same problem via a graphical model over already composed email messages. Carvalho and Cohen [7] and Liu et al. [16] model the problem of identifying unintended recipients of emails as an outlier detection problem and a binary classification problem respectively, based on textual analysis of the email content overlaid with correlations from social network analysis. Zilberman et al. [22] propose an algorithm that extracts topics for all recipients and approves recipients for an email based on its topics and the common topics between the sender and each recipient. All these works focus only on identifying the right recipients of an e-mail from a curated list of recipients. In the context of DRM, however, it is necessary not only to identify the 'recipients' of a content but also to determine the access permissions of the identified recipients on that content.

Evaluating the sensitive nature of information in a content is another direction of exploration related to our work. Existing DLP systems use regular expressions [6] and keyword matching to identify PII and other confidential information. Cumby and Ghani [9] present a semi-automatic method to identify and redact private content from documents. However, they do not factor in the intended recipients of these documents while making such decisions. Hart et al. [13] propose a binary text classification algorithm to categorize a document as sensitive or non-sensitive. Geng et al. [11] use association mining between different types of PII to predict PII in emails. All these works are based purely on the document content, without any emphasis on the intended recipients of the content, which is a key factor in our problem.

To the best of our knowledge, no prior work addresses the problem of suggesting an appropriate DRM policy via joint modeling of the document content and its intended recipients (along with their access permissions), which is the novel contribution of our work.

3 Method

Consider an intra-organizational document repository that comprises a set of documents with appropriate DRM policies attached to each of them. Let D denote the set of documents in the system, U the set of all authorized users in the system and ACL the set of access rights (permissions) in the system. For example, 'read', 'read-modify', 'read-modify-delete' are potential access rights for documents. Each document \(d \in D\) is assigned a security policy p which describes the access rights of the authorized users for the document. A security policy is represented as a set of (user, permission) pairs, where each pair defines an authorized \(user \in U\) and its corresponding \(permission \in ACL\), i.e., \(p = \{(user_i, permission_j)\ |\ 1 \le i \le |U|, 1 \le j \le |ACL|\}\). Let P represent the set of all such policies available in the system. Further, we denote by L the set of all (user, permission) pairs; thus, \(|L| = |U| \times |ACL|\).

Given such a repository with documents and their respective security policies, we propose an algorithm to suggest a set of (user, permission) pairs for any new document based on its textual content. More formally, for a new document d, the algorithm aims to provide a ranked list \(L^{'} = \big [ (user_1, permission_1),\ (user_2, permission_2), \ldots , (user_k, permission_k) \big ]\) as the policy to protect the document from unauthorized access.

Feature Extraction: For a given document, we first extract features that capture the personal/sensitive information present in the document. These features include mentions of individuals and organizations, locations, dates, references to currency/money, phone numbers, social security numbers and email addresses. We use the Stanford Named Entity Tagger [3] to identify individual and organization names, locations, dates and currency. We further define regular expressions to identify phone numbers, social security numbers and email addresses in the document text. The feature set is then expanded to include the individual words found in the document (after stop-word removal and stemming) using a “bag of words” representation, from which TF-IDF (Term Frequency-Inverse Document Frequency) [15] features are extracted for every document d.
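
To make the pipeline concrete, the following is a minimal sketch of this extraction step in Python. The regular-expression patterns, function names and example text are illustrative assumptions, not those used in the paper, and the named-entity features from the Stanford tagger are only stubbed in a comment.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative patterns only; real deployments need locale-aware variants.
PATTERNS = {
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def sensitive_entity_features(text):
    """Count regex hits per pattern. Mentions of persons, organizations,
    locations, dates and currency would come from an NER tagger
    (the paper uses the Stanford Named Entity Tagger)."""
    return {name: len(pat.findall(text)) for name, pat in PATTERNS.items()}

# Bag-of-words TF-IDF with stop-word removal; a stemmer could be
# plugged in via the `tokenizer` argument.
docs = ["Wire $5,000 to john.doe@corp.com, SSN 123-45-6789, by Friday."]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

print(sensitive_entity_features(docs[0]))  # {'phone': 0, 'ssn': 1, 'email': 1}
```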

With the increasing size of the document repository, the dimensionality of the feature space also increases due to the introduction of new words. We therefore reduce the dimensionality by removing features that are likely irrelevant to the policy modeling, improving both the overall accuracy and the scalability of the model. We calculate the information gain [10] of each feature f in the feature set as,

$$\begin{aligned} IG(f) = \sum _{l \in L}\ \sum _{f' \in \{f, \tilde{f}\}} P(f', l) \cdot \log \dfrac{P(f', l)}{P(f')\,P(l)} \end{aligned}$$
(1)

where l is a (user, permission) pair. Information gain measures the amount of information, in bits, obtained for (user, permission) pair prediction by knowing the presence or absence of a feature f. We retain the top \(k\%\) (70% in our later experiments) most informative features for our modeling.
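
A hedged sketch of this selection step follows; the implementation details (binary feature presence, base-2 logarithm, helper names) are our assumptions around Eq. (1).

```python
import numpy as np

def information_gain(feature_present, label_matrix, eps=1e-12):
    """Eq. (1) for one feature f: sum over labels l and over the feature's
    presence/absence of P(f', l) * log2( P(f', l) / (P(f') P(l)) ).

    feature_present: (n_docs,) 0/1 array, presence of f in each document.
    label_matrix:    (n_docs, n_labels) 0/1 array of (user, permission) labels.
    """
    n = len(feature_present)
    ig = 0.0
    for f_val in (1, 0):                            # f and its absence
        mask = feature_present == f_val
        p_f = mask.mean()                           # P(f')
        for l in range(label_matrix.shape[1]):
            p_l = label_matrix[:, l].mean()         # P(l)
            p_fl = label_matrix[mask, l].sum() / n  # joint P(f', l)
            if p_fl > 0:
                ig += p_fl * np.log2(p_fl / (p_f * p_l + eps))
    return ig

def select_features(X_binary, Y, keep=0.70):
    """Keep the top 70% of features by information gain, as in the paper."""
    scores = np.array([information_gain(X_binary[:, j], Y)
                       for j in range(X_binary.shape[1])])
    k = int(keep * X_binary.shape[1])
    return np.argsort(scores)[::-1][:k]             # indices of retained features
```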

Policy Modeling: Our framework to model security policies aims to suggest (user, permission) pairs for any new document. A brute-force way to formulate this is as a multi-class classification task where each class represents a security policy. A classifier \(g: R^d \longrightarrow P\) can be learned from the training set \(\{(x_i, p_i)\ |\ 1 \le i \le |D|\}\), where \(x_i \in R^d\) is the set of features for document \(d_i\), and \(p_i \in P\) is the corresponding security policy. For a new document, this classifier provides a probability of each existing policy \(p_i \in P\) being suitable for the document, which can be used to decide on its final security policy. However, such a multi-class formulation captures neither the relations between the various security policies nor the overlap between policies in terms of common users and permissions. Moreover, a multi-class framework is incapable of suggesting new policies outside those already in the repository and is hence restrictive.

These shortcomings can be addressed by formulating the problem in a multi-label classification framework, considering each unique (user, permission) pair as a separate label on the document. In this framework, a function \(f: R^d \longrightarrow 2^L\) is learned from the training set \(\{ (x_i, y_i)\ |\ 1 \le i \le |D| \}\), where \(x_i\) is defined as before and \(Y = \{ y_1, y_2, \ldots , y_{|L|}\}\) denotes the label space. For a new document \(d_i\), the multi-label classifier f predicts the set of labels, i.e., identifies the most relevant (user, permission) pairs that must be assigned to the document as its policy. Unlike the multi-class framework, such a formulation is capable of recommending any set of (user, permission) pairs in the repository.
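
As a toy illustration of this label space, the sketch below (with made-up users and permissions) encodes a policy as a binary vector over \(L = U \times ACL\), which is the target representation the multi-label classifier learns to predict.

```python
from itertools import product

users = ["alice", "bob", "carol"]                      # toy U
acl = ["read", "read-modify", "read-modify-delete"]    # toy ACL
L = list(product(users, acl))                          # |L| = |U| * |ACL| = 9
index = {pair: i for i, pair in enumerate(L)}

def policy_to_labels(policy):
    """Encode a policy, a set of (user, permission) pairs, as a binary
    label vector over L; this is the multi-label training target."""
    y = [0] * len(L)
    for pair in policy:
        y[index[pair]] = 1
    return y

p = {("alice", "read-modify"), ("bob", "read")}
print(policy_to_labels(p))  # [0, 1, 0, 1, 0, 0, 0, 0, 0]
```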

There are several alternatives for multi-label classification. One class of algorithms transforms the multi-label classification problem into binary classification, either by training a binary classifier for each label in the data-set (the binary relevance method) [18] or by training binary classifiers for multiple label subsets of the data-set [21]. Another class of algorithms adapts traditional classification frameworks to the multi-label task [18]. However, both approaches are highly sensitive to the distribution of labels and do not perform well when labels have very few training examples. Also, as these methods rely on independent models for each label/label set, prediction cost increases with the number of labels.

In light of these drawbacks, a recent exploration in this space is Fast Extreme Multi-label Learning (FastXML), which handles large numbers of labels with skewed label distributions. FastXML learns a hierarchy over the feature space by recursively partitioning a parent's feature space between its children. The partitioning at each node is done by optimizing a ranking loss function, the normalized Discounted Cumulative Gain (nDCG). Agrawal et al. [5, 17] observe that in XML problems only a small number of labels are active in each partition of the feature space, which improves the modeling capacity.

However, the ranking-loss function in FastXML [17] only rewards high recall (correctly predicting all the relevant labels) without accounting for precision (reducing incorrectly predicted labels). For modeling security policies, it is also important to penalize wrongly predicted (user, permission) pairs, since each such error means an ineligible user has been given permission to sensitive information. Given a ranking r of (user, permission) pairs and the ground truth vector \(y_i\), we modify the discounted cumulative gain \(L_{DCG}(r, y_i)\) as,

$$\begin{aligned} L_{DCG@k}(r, y_i) = \sum _{l=1}^{k} \dfrac{y_{r_{l}} + (y_{r_{l}} - 1)}{\log (1 + l)}, \end{aligned}$$
(2)

where \(y_{r_{l}}\) is the binary ground truth for the \(l^{th}\) label according to ranking r (as defined in [17]), i.e., it has the value 1 if the \(l^{th}\) label is attached to document \(d_i\). We add the \((y_{r_{l}} - 1)\) term to the definition of \(L_{DCG}\) [17] to introduce a \(-1/\log (1+l)\) term for each wrongly predicted (user, permission) pair in the top-k labels. This ensures that, apart from rewarding positive labels predicted at high ranks, highly ranked negative labels are also penalized. Algorithm 1 outlines the steps required to obtain the hierarchy.

Algorithm 1. Steps to obtain the tree hierarchy.
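
The modified gain of Eq. (2) is straightforward to compute; the sketch below is our reading of it, with illustrative inputs.

```python
import math

def modified_dcg_at_k(ranking, y, k):
    """Eq. (2): each of the top-k labels contributes +1/log(1+l) if it is
    truly part of the policy and -1/log(1+l) otherwise.

    ranking: label indices sorted by predicted relevance, best first.
    y:       binary ground-truth vector over all (user, permission) pairs.
    """
    score = 0.0
    for l, label in enumerate(ranking[:k], start=1):
        gain = y[label] + (y[label] - 1)   # +1 for a hit, -1 for a miss
        score += gain / math.log(1 + l)
    return score

y = [1, 0, 1, 0]                            # labels 0 and 2 form the policy
print(modified_dcg_at_k([0, 1, 2, 3], y, k=3))
# = 1/log(2) - 1/log(3) + 1/log(4): the rank-2 false positive is penalized
```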

Finally, the FastXML algorithm learns a linear separator w at each node of the tree, which divides the feature space into a positive and a negative partition, by minimizing a ranking loss function given by,

$$\begin{aligned} \min _{w}\ ||w||_1 + \sum _{i}\log (1 + \exp (-\delta _{i}w^Tx_i )) - \sum _{i}(1+\delta _{i})L_{nDCG}(r^+, y_i) \nonumber \\ - \sum _{i}(1-\delta _{i})L_{nDCG}(r^-, y_i) \end{aligned}$$
(3)

where \(w \in R^d\), \(\delta _i \in \{-1, +1\}\) indicates whether point i is assigned to the negative or the positive partition, and \(r^+, r^-\) are the ranked lists of labels in the positive and negative partitions respectively. The labels are ranked in decreasing order of the number of documents in the partition that they are assigned to.

Prediction: To suggest (user, permission) pairs for a new document \(d_i\), its feature representation \(x_i\) is first extracted as before. The algorithm starts at the root node of each tree in the model and traverses down until it reaches a leaf node. At each internal node, it computes \(w^Tx_i\), where w is the linear separator at that node. Since the separator divides the feature space into two parts, the document is passed down to the left child (if \(w^Tx_i < 0\)) or the right child (if \(w^Tx_i > 0\)) until a leaf node is reached. Each leaf node contains a subset of points from the training data. The algorithm returns the ranked list of the top k (user, permission) pairs active in the leaf nodes across all trees, where the rank is defined as:

$$\begin{aligned} r(x_i) = rank\left( \frac{1}{T}\sum _{t=1}^{T}{P_{t}^{leaf}(x_i)}\right) \end{aligned}$$
(4)

where T is the number of trees, \(P_{t}^{leaf}(x_i) \propto \sum _{S_{t}^{leaf}(x_i)}{y_i}\) is the label distribution of the leaf node that \(x_i\) reaches in tree t, and \(S_{t}^{leaf}(x_i)\) is the set of training points in that leaf.
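
The prediction step can be sketched as follows. The `Node` layout and field names are assumptions for illustration; an actual FastXML implementation stores the trees differently.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Node:
    w: Optional[np.ndarray] = None            # separator at internal nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label_dist: Optional[np.ndarray] = None   # P_t^leaf, stored at leaves

    @property
    def is_leaf(self):
        return self.label_dist is not None

def predict_pairs(x, trees, k=10):
    """Route x down every tree by the sign of w^T x, then average the
    leaf label distributions and rank labels, as in Eq. (4)."""
    agg = None
    for root in trees:
        node = root
        while not node.is_leaf:
            node = node.left if node.w @ x < 0 else node.right
        agg = node.label_dist if agg is None else agg + node.label_dist
    scores = agg / len(trees)                 # mean over the T trees
    top = np.argsort(scores)[::-1][:k]        # top-k (user, permission) ids
    return top, scores[top]
```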

4 Experimentation and Results

The problem of automatic DRM policy assignment can be framed as a multi-class classification problem, with each policy considered a separate class. However, we articulated the shortcomings of such a formulation in Sect. 3 and overcame them with our multi-label framework. In our experiments, we first compare our formulation with a multi-class framework and show its superior performance, testing this hypothesis. We also test our algorithm against several multi-label frameworks to show the viability of the proposed algorithm. Finally, we compare the performance of the proposed framework against existing security modeling in the context of email protection.

Dataset: All our experiments were performed on the Wikileaks Cablegate [4] data, which includes diplomatic cables sent by the US State Department to its consulates, embassies, and diplomats around the world. Each cable is marked with a group of recipients to whom it was addressed, and a classification scale denoting the security level of the document. The classification scale on each document was one of Unclassified, Limited Official Use, Confidential or Secret.

In order to replicate an intra-organizational document collection, we considered only the unique documents sent by the Department of State for our experiments. Documents with insufficient textual content were discarded. Each document's classification scale was used as the access permission given to its recipients, and the set of all such (recipient, classification level) combinations for a document yielded the document's policy for our experiments. The filtered data-set contained 11,760 unique documents, 842 unique policies, 114 unique users/recipients and 452 (user, permission) pairs across all documents. This included 3,301 unclassified, 3,318 limited official use, 3,676 confidential and 1,465 secret documents. Table 1(a) provides additional details about the data-set at the user level and the (user, permission) level.

Table 1. Wikileaks cablegate dataset

The average label cardinality, i.e., the number of (user, permission) pairs per document, is 5.11, which indicates that the security policy on each document comprises about 5 (user, permission) pairs on average. The average number of documents for each (user, permission) pair is 47. However, the (user, permission) pair distribution is highly skewed, as seen in Table 1(b), ranging from a maximum of 371 documents for one user-permission pair to a minimum of 1 document for another.

For all our experiments, the FastXML model was built with 50 trees and the maximum number of instances allowed in a leaf node (MaxLeaf) was set to 10. The number of labels k in a leaf node whose probability scores are retained was also set to 10.

4.1 Comparison with Multi-class Frameworks

Here, we compare our proposed method against two multi-class formulations to check the viability of our formulation over the multi-class one. The first is a frequency-based approach, where all candidate policies are ranked in decreasing order of the number of times they were assigned to any document in the training set. The top k policies, thus ranked, constitute the set of policies suggested to the user. The second is a 1-vs-All approach where a separate SVM [19] is trained for each policy; for any new document, the scores from the corresponding classifiers decide the policy ranking. To obtain a ranking of policies from our proposed approach, we form a binary vector representation (\(\{0,1\}^{|L|}\)) of the algorithmically suggested (user, permission) pairs, with value 1 for entries corresponding to the top k pairs. We likewise represent all existing security policies in the system as equivalent binary vectors. Then, we use the cosine similarity metric to identify the nearest security policies and rank them accordingly.
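
The mapping from predicted (user, permission) pairs back to existing policies might look like the following sketch, where the policy representation (a dictionary from policy id to a set of label indices) is an assumed encoding.

```python
import numpy as np

def rank_existing_policies(predicted_label_ids, policies, num_labels, top_k=10):
    """Turn the top-k predicted (user, permission) pairs into a binary
    query vector and rank existing policies by cosine similarity.

    policies: dict mapping policy id -> set of label indices in that policy.
    """
    q = np.zeros(num_labels)
    q[list(predicted_label_ids)[:top_k]] = 1.0
    ranked = []
    for pid, pair_ids in policies.items():
        v = np.zeros(num_labels)
        v[list(pair_ids)] = 1.0
        sim = q @ v / (np.linalg.norm(q) * np.linalg.norm(v))
        ranked.append((pid, sim))
    return sorted(ranked, key=lambda t: t[1], reverse=True)
```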

To evaluate the performance of the different approaches, we define Accuracy@k, which measures the probability of the actual policy being among the top k predicted policies,

$$\begin{aligned} Accuracy@k =\frac{1}{|test\_set|} \sum _{doc \in test\_set}{\mathbbm {1}_{p \in f_k(doc)}}, \end{aligned}$$
(5)

where p is the actual policy of document doc and \(f_k(doc)\) is the set of top-k policies predicted by the algorithm f for the document doc.
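
A direct transcription of Eq. (5), assuming predictions and ground truth are held in dictionaries keyed by document id:

```python
def accuracy_at_k(predictions, ground_truth, k):
    """Eq. (5): fraction of test documents whose actual policy appears
    among the top-k policies returned by the algorithm.

    predictions:  dict, document id -> ranked list of policy ids.
    ground_truth: dict, document id -> actual policy id.
    """
    hits = sum(1 for doc, ranked in predictions.items()
               if ground_truth[doc] in ranked[:k])
    return hits / len(predictions)
```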

Because of the training data requirements of an SVM model, we evaluated only those policies that had at least n corresponding documents, and we report the accuracy for various values of n. Note that the proposed approach does not suffer from this limitation since it does not generate individual models for each (user, permission) pair.

Table 2. Comparison against multi-class approaches

Table 2 shows the performance of the proposed framework against the multi-class baselines. The results indicate the superiority of our multi-label approach over the multi-class baselines. In particular, the proposed approach performs significantly better at identifying the relevant policy for smaller values of n. For instance, the Accuracy@10 for our approach is almost \(6\%\) higher than that achieved by 1-vs-All for the data-set with \(n=30\). This indicates that our algorithm is relatively robust to a low number of documents per policy in the training data. As the number of documents per label increases, the performance of the 1-vs-All approach becomes on par with the proposed approach, indicating that a multi-class framework requires substantial training data per policy to match the proposed framework.

Real-world data-sets often contain numerous policies with very few supporting documents to train on, as reflected in the Wikileaks data-set, where only 95 of the 842 policies have been applied to more than 20 documents. Thus, any algorithm that aims to suggest policies by training on such a data-set must be robust enough to provide reasonably accurate predictions with few documents per policy.

4.2 Comparison with Multi-label Approaches

Next, we compare our approach against two multi-label frameworks, 1-vs-All and Rakel [21]. In both approaches, each (user, permission) pair is considered a separate label, as in the proposed algorithm. In the 1-vs-All framework, a separate linear SVM classifier is trained for each unique (user, permission) pair, unlike the multi-class framework where modeling was done at the policy level. For Rakel, linear SVM classifiers are built for random ensembles of labels. For any new document, the final score for a label ((user, permission) pair in our context) is the cumulative score over all ensembles that the label is part of.

In multi-label frameworks, the output is generated at a lower granularity: (user, permission) pairs instead of entire policies. We therefore use a different set of metrics to evaluate the algorithms in this setting. Since the candidate set of labels is typically large and only a small fraction of those labels is attached to any document, the algorithm needs to be measured both on its ability to correctly predict the highly ranked labels (precision) and on its ability to retrieve all relevant labels (recall). Here, we define these metrics in the context of our problem. Precision@k counts the fraction of correct pairs among the top k predicted pairs, whereas Recall@k counts the fraction of the actual pairs that the algorithm predicts within the top k. More formally,

$$\begin{aligned} Precision@k(y',y) = \frac{1}{k}\sum _{i \in rank_k(y')}{y_i},\text { and } Recall@k(y',y) = \frac{\sum _{i \in rank_k(y')}{y_i}}{\sum _{i=1}^{|y|}{y_i}}. \end{aligned}$$
(6)

where \(y'\) is the vector of predicted (user, permission) pair scores, y is the binary ground truth vector and \(rank_k(y')\) gives the indices of the k highest-scoring labels. In our evaluations, we report the average Precision@k and Recall@k across all documents in the test set.
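
Both metrics of Eq. (6) can be computed per document from the predicted scores, as in this sketch (array-based representation assumed):

```python
import numpy as np

def precision_recall_at_k(y_scores, y_true, k):
    """Eq. (6) for a single document.

    y_scores: (n_labels,) predicted relevance scores for the document.
    y_true:   (n_labels,) binary ground-truth (user, permission) vector.
    """
    top_k = np.argsort(y_scores)[::-1][:k]     # rank_k(y')
    hits = y_true[top_k].sum()
    return hits / k, hits / y_true.sum()       # Precision@k, Recall@k
```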

Table 3. Precision for different values of k evaluated at user and user-permission level
Table 4. Recall for different values of k evaluated at user and user-permission level

We first evaluate our algorithm's prediction capability at the user level by retaining only the unique users from the k predicted (user, permission) pairs for each document. These are compared against the actual users that are given some permission by the policy. Modeling at the user level measures how correctly the set of users who should have permission to a given document is identified. High precision suggests that few irrelevant users are being given access to a document, whereas high recall suggests that most users who must be given some level of access have been identified. Tables 3 and 4 summarize the results at the user level. Table 4 shows that the algorithm surfaces 85% of all users if it returns a list of the top 25 labels. Thus, the administrator only has to go through this condensed list instead of the entire list of 114 users in the system.

Next, we evaluate the performance of the algorithms at the user-permission level. As elaborated previously, each policy assigns a unique permission to a user; that is, a policy cannot contain multiple (user, permission) pairs with the same user. In multi-label classification, however, the predicted list of labels may contain multiple (user, permission) pairs corresponding to the same user. We leverage the nature of our problem to resolve this. Since the algorithm aims to suggest policies that safeguard against data leaks, we hypothesize that a stricter policy is preferable to a more lenient one that may grant extraneous rights to some users. Hence, if a particular user is part of multiple predicted (user, permission) pairs, we make use of the inherent hierarchy in the permissions and assign the strictest permission to that user. For instance, if both \((u_i, p_m)\) and \((u_i, p_n)\) are in the top k predicted labels, we compare the permissions \(p_m\) and \(p_n\) and retain the stricter one. Among ACL permissions, for example, a Read permission is stricter than a Modify permission, as the latter grants more rights to the user. Tables 3 and 4 show the results at the (user, permission) level. Table 4 shows that the algorithm surfaces 78% of all (user, permission) pairs if it returns a list of the top 25 labels, a great reduction from the 452 total (user, permission) pairs that exist in the system.
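
This collapsing rule can be sketched as below; the permission hierarchy shown is a hypothetical three-level ACL, with lower index meaning stricter.

```python
# Hypothetical three-level ACL; lower index = stricter (fewer rights).
STRICTNESS = ["read", "read-modify", "read-modify-delete"]
RANK = {perm: i for i, perm in enumerate(STRICTNESS)}

def collapse_to_strictest(predicted_pairs):
    """Keep only the strictest predicted permission for each user."""
    best = {}
    for user, perm in predicted_pairs:
        if user not in best or RANK[perm] < RANK[best[user]]:
            best[user] = perm
    return sorted(best.items())

pairs = [("alice", "read-modify"), ("alice", "read"), ("bob", "read")]
print(collapse_to_strictest(pairs))  # [('alice', 'read'), ('bob', 'read')]
```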

The proposed algorithm performs the best among all compared frameworks. One reason for this is the ability of our model to handle the skewed distribution of (user, permission) pairs. Some (user, permission) pairs have been assigned to very few documents, as reflected in the Wikileaks dataset, and our proposed algorithm is robust in such scenarios. Traditional methods like 1-vs-All and Rakel, which depend on training independent models for each label/label set, do not perform well in policy modeling with very few training examples.

Table 5. Results on the Enron dataset

4.3 Evaluation on Enron Email Dataset

While none of the existing literature addresses the exact problem of assigning security policies to documents, a related exploration is the task of predicting the recipients of an email based on its content, e.g., [8]. Owing to the generality of our framework, our approach can handle this task as well; here, we evaluate the performance of our algorithm on the Enron Email Dataset [1] and compare it against the best performing approach in [8]. Our experiments are run on emails sent by 30 Enron users, selected based on the volume of emails sent, after discarding emails with insufficient textual content. Table 5 summarizes the results. The proposed approach clearly outperforms the baseline in [8], demonstrating the robustness of the proposed formulation.

5 Conclusion

In this work, we have addressed the problem of automatically recommending appropriate access permissions for users by deciding the appropriate DRM security policy for a document. We proposed an algorithm that analyzes the sensitive information in the document and determines the right policy for it. Experiments on a real-world dataset against several alternative frameworks establish the viability of the proposed approach. Comparison with existing baseline modeling approaches further establishes its superiority across several evaluation criteria.