1 Introduction

Insider threat refers to the threat arising from organizational insiders, who can be employees, contractors, or business partners. These insiders usually have authorized access to organizational resources such as systems, data, and networks. A popular approach to detecting malicious insiders is to analyze the insider activities recorded in audit data [14] and apply supervised learning models. However, insider audit data is usually unbalanced because only a few malicious insiders are detected, and applying supervised learning models to such unbalanced datasets can result in poor detection accuracy. To address this limitation, we present a framework, CLDet, that uses contrastive learning to detect malicious sessions, i.e., sessions containing malicious insider activities.

Our CLDet framework has two components: self-supervised pre-training and supervised fine-tuning. The self-supervised pre-training component generates encodings for user activity sessions by utilizing contrastive learning, whereas the supervised fine-tuning component uses these encodings to classify a session as malicious or normal. Contrastive learning requires data augmentations that generate augmented versions of an original data point, and it ensures that these augmented versions are closer to each other than to the augmented versions of other data points. Since each user activity session can be modeled as a sentence and each activity as a word of this sentence [14], we adapt sentence-based data augmentations from the Natural Language Processing (NLP) domain [10] in our framework. We conduct an empirical evaluation of our framework, and the results demonstrate a noticeable performance improvement over state-of-the-art baselines.

2 Related Work

Insider Threat Detection. Traditional insider threat detection models employ handcrafted features extracted from user activity log data. Yuan et al. [11] argued that handcrafting features for insider threat detection is tedious and time consuming, and hence proposed a deep learning model to learn the features automatically. Specifically, they employed an LSTM to extract encoded features from user activities and then detected malicious insiders through a Convolutional Neural Network (CNN). Similarly, Lin et al. [5] used an unsupervised Deep Belief Network to extract features from user activity data and applied a one-class Support Vector Machine (SVM) to detect malicious insiders. Lu et al. [6] modeled the user activity log information as a sequence and extracted user-specific features through a trained LSTM model. Yuan et al. [13] combined an RNN with a temporal point process to utilize both intra- and inter-session time information. The most closely related work is [14], where the authors proposed a few-shot learning based framework that specifically addresses the data imbalance issue in insider threat detection. Their framework applies a word-to-vector model to generate encoded features from user activity data and then uses a trained BERT language model to refine these features. We refer readers to the survey [12] for other related work.

Contrastive Learning. Contrastive learning has been extensively studied for the image and NLP domains. Jaiswal et al. [3] presented a comprehensive survey of contrastive learning techniques in both domains. Marrakchi et al. [7] effectively utilized contrastive learning on unbalanced medical image datasets to detect skin diseases and diabetic retinopathy. Their algorithm uses a supervised pre-training component, built on a Residual Network, to generate image representations, which are then fed to a fine-tuning component consisting of a single linear layer. In our framework, we utilize some of the data augmentation concepts presented in [10] and [9]. Wu et al. [10] presented a contrastive learning based framework for analyzing text similarity; their framework employs sentence-based augmentation techniques for self-supervised pre-training. Wang et al. [9] presented a new contrastive loss function for the image domain.

3 Framework

User activities are modeled through activity sessions, where each session consists of multiple user activities. Let \(S_k\) denote the \(k^{th}\) activity session of a user, \(S_k=\{e_{k_1}, e_{k_2},\ldots,e_{k_T}\}\), where \(e_{k_i}\;(1\le i\le T)\) is the \(i^{th}\) user activity. Let \(D=\{S_i,y_i\}_{i=1}^{m}\) denote the insider threat dataset, where \(m\) denotes the number of sessions and \(y_i\) is the label of \(S_i\); \(y_i=1\) and \(y_i=0\) denote that \(S_i\) is a malicious or a normal session, respectively. The two main components of our CLDet framework are self-supervised pre-training and supervised fine-tuning. The pre-training component is responsible for generating session encodings, and the fine-tuning component, using these session encodings as input, classifies a given session as malicious or normal.
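For illustration, a session and the labeled dataset \(D\) could be represented as follows; the activity names are hypothetical and not taken from any particular dataset schema.

```python
# Minimal sketch with hypothetical activity names.
# A session S_k is an ordered list of activities e_{k_1}, ..., e_{k_T};
# the dataset D pairs each session S_i with its label y_i.
S_k = ["logon", "email_send", "usb_insert", "file_copy", "logoff"]

D = [
    (["logon", "http_visit", "logoff"], 0),               # y = 0: normal
    (["logon", "usb_insert", "file_copy", "logoff"], 1),  # y = 1: malicious
]
```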

3.1 Self-supervised Pre-training Component

3.1.1 Encoder and Projection Head

Each activity in a session is represented through a trained word-to-vector model. Let \(\mathbf {x}_{k_{i}}\in \mathbb {R}^{d}\) denote the word-to-vector representation of activity \(e_{k_{i}}\), where \(d\) denotes the number of representation dimensions. Each activity of an input session is converted to its corresponding word-to-vector representation and fed as input to a specially designed encoder. We choose a Recurrent Neural Network (RNN) to design our encoder. The encoder is responsible for generating the session encoding \(\mathbf {x}_k\in \mathbb {R}^{d}\) of session \(S_k\). Finally, a projection head projects \(\mathbf {x}_k\) to a new representation \(\mathbf {z}_k\in \mathbb {R}^{d}\). The projection head is only used when training the self-supervised component; after this training, it is discarded and only the encoded session representation is used as input to the supervised fine-tuning component.

The encoder consists of an RNN and a linear layer. The RNN has two hidden layers, denoted \(H^{(1)}\) and \(H^{(2)}\). The first hidden layer \(H^{(1)}\) is represented as \(\mathbf {h}^{(1)}_{k_{t}}=\tanh (W^1_1 \mathbf {x}_{k_{t}}+\mathbf {b}^1_1+W^1_2 \mathbf {h}^{(1)}_{k_{t-1}}+\mathbf {b}^1_2)\), where \(1\le t\le T\), \(W^1_1\) and \(W^1_2\) are \((d\times d)\) weight matrices, \(\mathbf {b}^1_1\in \mathbb {R}^{d}\) and \(\mathbf {b}^1_2\in \mathbb {R}^{d}\) are bias vectors, and \(\mathbf {h}^{(1)}_{k_{t}}\) denotes the encoded output of \(H^{(1)}\) for the input \(\mathbf {x}_{k_{t}}\). The second hidden layer \(H^{(2)}\) is similarly represented as \(\mathbf {h}^{(2)}_{k_{t}}=\tanh (W^2_1 \mathbf {h}^{(1)}_{k_{t}}+\mathbf {b}^2_1+W^2_2 \mathbf {h}^{(2)}_{k_{t-1}}+\mathbf {b}^2_2)\), where \(W^2_1\) and \(W^2_2\) are \((d\times d)\) weight matrices, \(\mathbf {b}^2_1\in \mathbb {R}^{d}\) and \(\mathbf {b}^2_2\in \mathbb {R}^{d}\) are bias vectors, and \(\mathbf {h}^{(2)}_{k_{t}}\) denotes the encoded output of \(H^{(2)}\) for the input \(\mathbf {h}^{(1)}_{k_{t}}\). Finally, \(\{\mathbf {h}_{k_i}^{(2)}\}_{i=1}^T\) is flattened into the RNN output \(\mathbf {v}_k\in \mathbb {R}^{Td}\), which is fed to the linear layer \(L^{(1)}\) to obtain the session encoding \(\mathbf {x}_k\). This linear layer is represented as \(\mathbf {x}_k= A_1\mathbf {v}_k+\mathbf {b}_1\), where \(A_1\) is a \((d\times Td)\) weight matrix and \(\mathbf {b}_1\in \mathbb {R}^{d}\) is a bias vector. The projection head, denoted \(L^{(2)}\), is represented as \(\mathbf {z}_k=A_2\mathbf {x}_k+\mathbf {b}_2\), where \(A_2\) is a \((d\times d)\) weight matrix and \(\mathbf {b}_2\in \mathbb {R}^{d}\) is a bias vector.
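For concreteness, a minimal PyTorch sketch of this encoder and projection head is given below; it assumes a fixed session length \(T\), and the class and variable names are ours rather than from an official implementation.

```python
import torch
import torch.nn as nn

class SessionEncoder(nn.Module):
    """Two-layer tanh RNN whose per-step outputs are flattened and mapped
    to a d-dimensional session encoding x_k; L2 is the projection head
    that is discarded after self-supervised pre-training."""
    def __init__(self, d: int, T: int):
        super().__init__()
        # H^(1) and H^(2): two recurrent layers with tanh activations.
        self.rnn = nn.RNN(input_size=d, hidden_size=d, num_layers=2,
                          nonlinearity="tanh", batch_first=True)
        self.L1 = nn.Linear(T * d, d)  # maps flattened v_k (T*d) to x_k (d)
        self.L2 = nn.Linear(d, d)      # projection head: z_k = A_2 x_k + b_2

    def forward(self, x):              # x: (batch, T, d) word-to-vector inputs
        h, _ = self.rnn(x)             # h: (batch, T, d), the outputs {h^(2)_{k_t}}
        v = h.flatten(start_dim=1)     # v_k in R^{T*d}
        x_k = self.L1(v)               # session encoding x_k
        z_k = self.L2(x_k)             # projected representation z_k
        return x_k, z_k
```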

3.1.2 Contrastive Loss

A contrastive learning loss function is used for a contrastive prediction task, i.e., predicting positive augmentation pairs. We adapt the SimCLR contrastive loss function [10] in our framework and augment each batch of sessions. Let \(B_s=\{S_1,S_2,\ldots,S_N\}\) denote a batch of sessions. Each \(S_k\in B_s\) is subjected to data augmentation, and two augmented sessions, denoted \(S^1_k\) and \(S^2_k\), are generated. Let \(B^a_s=\{S^1_1,S^2_1,S^1_2,S^2_2,\ldots,S^1_N,S^2_N\}\) denote the batch of augmented sessions. The augmented sessions \((S^1_k, S^2_k)\) form a positive sample pair, and all remaining sessions in \(B^a_s\) are considered negative samples. Let \(\mathbf {z}^1_k\) and \(\mathbf {z}_k^2\) denote the projection head representations of \(S^1_k\) and \(S^2_k\) respectively. The loss function for the positive pair \((\mathbf {z}^1_k,\mathbf {z}_k^2)\) is represented as

$$\begin{aligned} l(\mathbf {z}^1_k,\mathbf {z}_k^2)= -\log \frac{\exp (\cos (\mathbf {z}^1_k,\mathbf {z}_k^2)/\alpha )}{\exp (\cos (\mathbf {z}^1_k,\mathbf {z}_k^2)/\alpha )+\sum _{i=1}^{N}\mathbf {1}_{[i\ne k]}\sum _{j=1}^2 \exp (\cos (\mathbf {z}^1_k,\mathbf {z}_i^j)/\alpha )} \end{aligned}$$
(1)

Here, \(\cos (\cdot ,\cdot )\) denotes the cosine similarity function, \(\mathbf {1}_{[i\ne k]}\) denotes an indicator variable, and \(\alpha \) denotes the tunable temperature parameter. This pair loss function is not symmetric since, in general, \(l(\mathbf {z}^1_k,\mathbf {z}^2_k)\ne l(\mathbf {z}^2_k,\mathbf {z}^1_k)\): the anchor of the negative terms differs. For the batch of augmented sessions \(B^a_s\), there are \(N\) positive pairs. The contrastive loss function for \(B^a_s\), defined as the sum of both orderings of every positive pair's loss in the batch, is \(CL(B^a_s)=\sum _{i=1}^{N}\left[ l(\mathbf {z}^1_i,\mathbf {z}^2_i)+l(\mathbf {z}^2_i,\mathbf {z}^1_i)\right] \).
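A direct, unoptimized rendering of Eq. (1) and of \(CL(B^a_s)\) might look as follows; `z1` and `z2` are assumed to hold the projection-head outputs of the first and second augmented views of all \(N\) sessions in a batch.

```python
import torch
import torch.nn.functional as F

def pair_loss(za, zb, k, alpha=1.0):
    """l(za[k], zb[k]) from Eq. (1): za[k] is the anchor; the negatives are
    both augmented views of every other session i != k in the batch."""
    pos = torch.exp(F.cosine_similarity(za[k], zb[k], dim=0) / alpha)
    neg = 0.0
    for i in range(za.size(0)):
        if i == k:                     # indicator 1_{[i != k]}
            continue
        neg = neg + torch.exp(F.cosine_similarity(za[k], za[i], dim=0) / alpha)
        neg = neg + torch.exp(F.cosine_similarity(za[k], zb[i], dim=0) / alpha)
    return -torch.log(pos / (pos + neg))

def batch_loss(z1, z2, alpha=1.0):
    """CL(B^a_s): sum of both orderings of each of the N positive pairs."""
    N = z1.size(0)
    return sum(pair_loss(z1, z2, k, alpha) + pair_loss(z2, z1, k, alpha)
               for k in range(N))
```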

For session \(S_k\), we adapt three basic NLP-based sentence augmentation techniques [10], sketched in code below: 1) Activity Replacement (Rpl): the augmented session \(S_k^1\) (\(S_k^2\)) is generated by randomly replacing \(g_1\) (\(g_2\)) activities with token activities; 2) Activity Reordering (Rod): the augmented session \(S_k^1\) (\(S_k^2\)) is generated by randomly selecting a sub-sequence of length \(g_1\) (\(g_2\)) and shuffling all activities within it while keeping all other activities unchanged; 3) Activity Deletion (Del): \(g_1\) (\(g_2\)) activities in \(S_k\) are deleted to generate the augmented session \(S^1_k\) (\(S^2_k\)). We will investigate the effectiveness of more complex data augmentation techniques in future work.
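A possible sketch of the three augmentations is shown below; the `<mask>` token activity and the reliance on Python's `random` module are our assumptions, and each function expects \(g\le T\).

```python
import random

MASK = "<mask>"  # hypothetical token activity used for replacement

def replace(session, g):
    """Rpl: randomly replace g activities with the token activity."""
    s = list(session)
    for i in random.sample(range(len(s)), g):
        s[i] = MASK
    return s

def reorder(session, g):
    """Rod: shuffle a random sub-sequence of length g, keep the rest unchanged."""
    s = list(session)
    start = random.randrange(len(s) - g + 1)
    sub = s[start:start + g]
    random.shuffle(sub)
    s[start:start + g] = sub
    return s

def delete(session, g):
    """Del: randomly delete g activities."""
    drop = set(random.sample(range(len(session)), g))
    return [a for i, a in enumerate(session) if i not in drop]
```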

3.2 Supervised Fine Tuning Component

The supervised fine-tuning component has two layers, denoted \(L^{(3)}\) and \(L^{(4)}\). The first layer \(L^{(3)}\) is represented as \(\mathbf {m}_k= A_3\mathbf {x}_k+\mathbf {b}_3\), where \(A_3\) is a \((d\times d)\) weight matrix, \(\mathbf {b}_3\in \mathbb {R}^{d}\) is a bias vector, and \(\mathbf {m}_k\in \mathbb {R}^{d}\) denotes the output encoding of \(L^{(3)}\). The output layer \(L^{(4)}\) is represented as \(\mathbf {o}_k= \mathrm {Softmax}(A_4\mathbf {m}_k+\mathbf {b}_4)\), where \(A_4\) is a \((2\times d)\) weight matrix, \(\mathbf {b}_4\in \mathbb {R}^{2}\) is a bias vector, and \(\mathbf {o}_k\) denotes the output of the component. The supervised fine-tuning component is trained with the cross entropy loss function.
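A matching PyTorch sketch of this component is below; note that `nn.CrossEntropyLoss` expects raw logits, so an actual implementation would typically defer the softmax, whereas here we mirror the text and pair the explicit softmax with `nn.NLLLoss`.

```python
import torch
import torch.nn as nn

class FineTuner(nn.Module):
    """Fine-tuning head: L^(3) (d -> d) followed by L^(4) (d -> 2)
    with a softmax over {normal, malicious}."""
    def __init__(self, d: int):
        super().__init__()
        self.L3 = nn.Linear(d, d)
        self.L4 = nn.Linear(d, 2)

    def forward(self, x_k):            # x_k: session encoding from the encoder
        m_k = self.L3(x_k)             # output of L^(3)
        return torch.softmax(self.L4(m_k), dim=-1)  # o_k

# Cross entropy training on the softmax outputs (illustrative):
# loss = nn.NLLLoss()(torch.log(model(x_batch)), y_batch)
```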

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets

The empirical evaluation of our framework is conducted on two datasets: CERT Insider Threat [2] and UMD-Wikipedia [4]. In CERT, each user's activities are chronologically recorded over 516 days. We split the dataset into training and test sets chronologically: the activities recorded during the first 460 days are used for training, and those recorded between days 461 and 516 are used for testing. We further split the training set between the two components, with the activities from the first 400 days used to train the pre-training component and those from days 401 to 460 used to train the fine-tuning component. For the supervised fine-tuning component, four scenarios with different numbers of malicious sessions are used in the training phase: 5, 8, 10, and 15 malicious sessions. The UMD-Wikipedia dataset is more balanced than CERT. Since our framework is specifically designed to operate effectively on unbalanced datasets, we use only a limited number of malicious sessions to train the supervised fine-tuning component. The training set is split between the two components: 4436 and 50 normal sessions are used for training the pre-training and fine-tuning components respectively, and similarly 3577 and 50 malicious sessions. Again, we use four scenarios in the training phase of the supervised fine-tuning component, with 5, 15, 30, and 50 malicious sessions respectively. The detailed settings are shown in Table 1.

Table 1. Training and test sets

4.1.2 Training Details

The activity features are extracted through a trained word-to-vector model. Specifically, the word-to-vector model is trained with the skip-gram approach and the minimum word frequency parameter set to 1. The hyper-parameters \(g_1\) and \(g_2\) employed in our data augmentation techniques control the amount of distortion introduced by augmenting the original session. For the activity replacement and deletion based augmentations we set \(g_1=g_2=1\), and for the activity reordering based augmentation we set \(g_1=g_2=3\). We set the dimensionality \(d\) of the activity and session encodings to 5 and the temperature parameter \(\alpha \) in the contrastive loss to 1. Four metrics are used to quantify the performance of our framework: Precision, Recall, \(F_1\), and FPR.
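The paper does not name a specific word-to-vector implementation; one plausible sketch uses gensim with skip-gram (`sg=1`), minimum frequency 1, and \(d=5\):

```python
from gensim.models import Word2Vec

# sessions: list of activity-token lists (hypothetical examples).
sessions = [["logon", "http_visit", "logoff"],
            ["logon", "usb_insert", "file_copy", "logoff"]]

# Skip-gram (sg=1), minimum word frequency 1, d = 5 dimensions.
w2v = Word2Vec(sentences=sessions, sg=1, min_count=1, vector_size=5)
x_logon = w2v.wv["logon"]  # 5-dimensional representation of one activity
```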

4.1.3 Baselines

We compare our CLDet framework with three baselines: Few-Shot [14], Deep-Log [1], and BEM [8]. Few-Shot has a design similar to our framework's: it also has self-supervised pre-training and supervised fine-tuning components, where the pre-training component generates session encodings and the fine-tuning component uses them to detect malicious sessions. Its session encodings are generated through the BERT language model, and its pre-training component is trained with the Masked Language Modeling (MLM) loss. We train both of its components using the same settings shown in Table 1. BEM employs an LSTM to model user activity sessions: it considers past user activities and predicts the probabilities of future activities. If the predicted probability of an activity in a session is low, the session is flagged as malicious. The LSTM has a single hidden layer and is trained with the cross entropy loss. We train this baseline on the same training set used for the fine-tuning component of our framework. Deep-Log differs from BEM in two ways: (1) it employs two hidden layers in its LSTM model; (2) it predicts the top-K most probable future activities, and if some activity in the session is not in this top-K list, the session is flagged as malicious. Deep-Log is also trained with the cross entropy loss, using the same training settings as BEM.
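For concreteness, the Deep-Log decision rule can be sketched as follows; the `model` interface (a prefix of activity ids mapped to next-activity logits) is our assumption rather than part of the original description.

```python
import torch

def flag_session(model, session_ids, K):
    """Flag a session as malicious if any observed next activity falls
    outside the model's top-K predicted next activities."""
    for t in range(1, len(session_ids)):
        prefix = torch.tensor(session_ids[:t]).unsqueeze(0)  # (1, t)
        logits = model(prefix)                               # (1, vocab_size)
        topk = torch.topk(logits, K, dim=-1).indices.squeeze(0)
        if session_ids[t] not in topk.tolist():
            return True   # malicious
    return False          # normal
```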

Table 2. Performance of our framework and baselines under different scenarios. Higher is better for Precision, Recall, and \(F_1\); lower is better for FPR. Cells with '—' indicate the extreme scenario where all sessions are predicted as normal. Best values are highlighted in bold. M denotes the number of malicious sessions.

4.2 Experimental Results

We consider three versions of our CLDet framework, one per data augmentation technique employed for pre-training. The performance of the three versions and the baselines under the four scenarios is shown in Table 2. Our CLDet framework consistently outperforms the baselines in all considered scenarios and datasets. The main reason is that the self-supervised pre-training component, by utilizing contrastive learning, generates a favorable encoding for each session, and using these encodings as inputs, the supervised fine-tuning component can effectively separate malicious and normal sessions. We point out that we purposely introduced the first scenario, where only 5 malicious sessions are used in training, for both CERT and UMD-Wikipedia. Under this extreme setting, all baselines completely fail (all sessions in the test data are predicted as normal). On the contrary, our framework still achieves reasonable performance, except for the activity deletion version on CERT. There is no clear winner among the three data augmentation techniques when all scenarios and datasets are considered; however, all three can be considered quite effective in achieving the main goal of our framework.

Table 3. Ablation analysis results. M denotes the number of malicious sessions.

Ablation Analysis. We conduct an ablation study by removing the self-supervised pre-training component from our framework and only utilizing the supervised fine-tuning component. The fine-tuning component consists only of linear layers and cannot model sequence data. To resolve this limitation for the ablation study, we suitably format the input data and layer \(L^{(3)}\). Consider the word-to-vector representations \(\{\mathbf {x}_{k_1}, \mathbf {x}_{k_2},\ldots, \mathbf {x}_{k_T}\}\) of the activities in session \(S_k=\{e_{k_1}, e_{k_2},\ldots,e_{k_T}\}\). We flatten this sequence \(\{\mathbf {x}_{k_t}\}_{t=1}^{T}\) into a single vector and feed it as input to layer \(L^{(3)}\), as sketched below. Table 3 shows the results of this ablation study. For both datasets, the supervised fine-tuning component used in isolation under-performs our full framework in all four scenarios. This clearly demonstrates that the self-supervised pre-training component is crucial for our framework to achieve good performance.
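The input formatting for this ablation amounts to a single flattening step, sketched here with placeholder values for \(T\) and \(d\):

```python
import torch

T, d = 10, 5               # placeholder session length and encoding dimensionality
x_seq = torch.randn(T, d)  # stand-in for the sequence {x_{k_1}, ..., x_{k_T}}
x_flat = x_seq.flatten()   # (T*d,)-dimensional input to the resized layer L^(3)
```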

5 Conclusion

We presented a contrastive learning based framework, CLDet, to detect malicious insiders. Our framework is specifically designed to operate on unbalanced datasets and consists of self-supervised pre-training and supervised fine-tuning components. The former generates user session encodings with the aid of contrastive learning; the latter uses these encodings to detect malicious sessions. Our empirical study demonstrated that our framework is more effective than the baselines.