
1 Introduction

The ever-increasing prevalence of cyber threats compels network administrators and information security specialists to constantly monitor ongoing activity inside sensitive networks. Such monitoring often boils down to collecting and analyzing event logs, which can be generally defined as structured records reflecting all sorts of actions. These logs are continuously produced in massive amounts by nearly all devices within a monitored network, which is both a blessing and a curse: while wide and systematic log collection provides a detailed situational picture, extracting actionable insights from such volumes of data is highly challenging. Data mining algorithms are thus needed to efficiently analyze and summarize all the available information.

Interestingly, various types of event logs can be naturally represented as heterogeneous graphs. Consider for instance a network flow recorded at the border of a protected network, indicating that a connection has occurred between internal host u and external host v. A collection of such events can be abstracted into a bipartite graph whose top (resp. bottom) nodes are the internal (resp. external) hosts. In addition, each network flow is further characterized by additional categorical variables, such as the transport layer protocol (e.g. TCP or UDP) or the source and destination ports. Therefore, each internal–external host pair can be linked by various kinds of edges. This suggests that a multiplex graph [17] might be an appropriate representation. However, when considering a real-world network, such a graph could easily have tens of thousands of nodes, making it hard to interpret without any further processing.

To alleviate this issue, we propose to use a model-based multiplex graph biclustering algorithm. This algorithm extends previous work on latent block modelling [10, 12] by factoring in the existence of multiple types of edges. It allows us to extract relevant clusters of entities along with their interactions across various network layers, thereby creating a simplified view of the information contained in the logs. While graph biclustering has already been used for event log analysis [20], to the best of our knowledge, the use of multilayer models to account for the heterogeneity of events is novel. In addition, we present detailed case studies on two publicly available datasets consisting of network flow records and authentication logs, respectively.

The rest of this paper is structured as follows. We first introduce the statistical model and inference procedure we use for multiplex graph biclustering in Sect. 2. The two case studies are then presented in Sects. 3 and 4, respectively. Finally, we review some related work in Sect. 5 and discuss potential areas of improvement in Sect. 6.

2 The Multilayer Latent Block Model

Before diving into the specifics of authentication logs and network flows, we first introduce the statistical tools we use to analyze them. Section 2.1 presents our generative model for multiplex bipartite graphs, and Sect. 2.2 describes the algorithms we use for model inference and selection.

2.1 Model Description

Key notations. Let \(\mathcal {U}=\{u_1,\ldots ,u_I\},\mathcal {V}=\{v_1,\ldots ,v_J\}\) be two node sets of size I and J, respectively. We consider a multiplex bipartite graph \(\mathcal {G}=(\mathcal {U},\mathcal {V},\mathcal {E})\), where \(\mathcal {E}\subset \mathcal {U}\times \mathcal {V}\times [L]\) denotes the edge set (with \([L]\overset{\text {def}}{=}\{1,\ldots ,L\}\)). Specifically, each edge \((u,v,\ell )\in \mathcal {E}\) between a top node u and a bottom node v is further characterized by an edge type \(\ell \), and L denotes the number of possible edge types. The multiplex graph \(\mathcal {G}\) can then be characterized by L biadjacency matrices \(\textbf{B}^{(1)},\ldots ,\textbf{B}^{(L)}\), where for each \(\ell \in [L]\), the matrix \( \textbf{B}^{(\ell )}= \big (b_{ij}^{(\ell )}\big )\in \mathbb {R}^{I\times {J}} \) is defined by

$$ \forall (i,j)\in [I]\times [J],\quad b_{ij}^{(\ell )}={\left\{ \begin{array}{ll} 1 & \text { if }(u_i,v_j,\ell )\in \mathcal {E} \\ 0 & \text { otherwise.} \end{array}\right. } $$
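For concreteness, the following sketch shows how the L biadjacency matrices could be assembled from a list of (top node, bottom node, edge type) index triples. It is a minimal numpy-based illustration under our own assumptions (0-based indices, matrices stacked into a single array of shape (I, J, L)), not code from the paper.

```python
import numpy as np

def biadjacency_matrices(edges, I, J, L):
    """Build the L binary biadjacency matrices from (i, j, l) index triples.

    edges: iterable of 0-based (top index, bottom index, layer) triples.
    Returns an array of shape (I, J, L) stacking B^(1), ..., B^(L).
    """
    B = np.zeros((I, J, L), dtype=np.int8)
    for i, j, l in edges:
        B[i, j, l] = 1  # repeated triples still yield a single edge
    return B

# Toy example: 3 top nodes, 2 bottom nodes, 2 layers.
B = biadjacency_matrices([(0, 1, 0), (2, 0, 1), (0, 1, 0)], I=3, J=2, L=2)
```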

Finally, we aim to partition \(\mathcal {U}\) (resp. \(\mathcal {V}\)) into a fixed number H (resp. K) of clusters, which we denote \(\mathbb {U}=\{\mathcal {U}_1,\ldots ,\mathcal {U}_H\}\) (resp. \(\mathbb {V}=\{\mathcal {V}_1,\ldots ,\mathcal {V}_K\}\)). For each top node \(u_i\) (resp. bottom node \(v_j\)), the unique index h such that \(u_i\in \mathcal {U}_h\) (resp. k such that \(v_j\in \mathcal {V}_k\)) is denoted \(U_i\) (resp. \(V_j\)). In order to find the optimal partition, we adopt a model-based approach relying on the generative model described in the next paragraph.

Generative model. We propose a multilayer extension of the Poisson latent block model introduced by Govaert and Nadif [12]. This model relies on the fundamental assumption that the probability of an edge between two nodes depends on the cluster assignments of these nodes. The biadjacency matrices \(\textbf{B}^{(1)},\ldots ,\textbf{B}^{(L)}\) are then generated by the following hierarchical model:

  (i) for each \(i\in [I]\), sample \(U_i\sim \textrm{Multinomial}(\boldsymbol{\pi })\);

  (ii) for each \(j\in [J]\), sample \(V_j\sim \textrm{Multinomial}(\boldsymbol{\rho })\);

  (iii) for each \((i,j,\ell )\in [I]\times [J]\times [L]\), sample \(b_{ij}^{(\ell )}\sim \textrm{Poisson}\left( \mu _i\nu _j\theta _{U_iV_j}^{(\ell )} \right) \).

In words, the proportion of top (resp. bottom) nodes falling into each cluster is controlled by a parameter \(\boldsymbol{\pi }\in (0,1)^H\) (resp. \(\boldsymbol{\rho }\in (0,1)^K\)). The probability of an edge of type \(\ell \) linking a top node \(u_i\) and a bottom node \(v_j\) then depends on three parameters: the node-specific factors \(\mu _i\) and \(\nu _j\) represent the overall propensity of each node to form edges across all layers, and the rate \(\theta _{U_iV_j}^{(\ell )}\) controls the number of edges between clusters \(\mathcal {U}_{U_i}\) and \(\mathcal {V}_{V_j}\) in the \(\ell \)-th layer. Note that the Poisson distribution is only used here to facilitate calculations: in practice, the biadjacency matrices only contain zeros and ones.
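The three sampling steps translate directly into code. The sketch below illustrates the generative process under assumed parameter shapes (\(\boldsymbol{\pi }\) of length H, \(\boldsymbol{\rho }\) of length K, \(\mathbf {\varTheta }\) stored as an (H, K, L) array); parameter values in the toy usage are hypothetical.

```python
import numpy as np

def sample_multilayer_lbm(pi, rho, mu, nu, theta, rng=None):
    """Sample cluster labels and biadjacency matrices from steps (i)-(iii).

    pi: (H,), rho: (K,), mu: (I,), nu: (J,), theta: (H, K, L).
    Returns U (I,), V (J,) and B of shape (I, J, L).
    """
    rng = np.random.default_rng() if rng is None else rng
    I, J = len(mu), len(nu)
    U = rng.choice(len(pi), size=I, p=pi)    # (i)  top-node cluster labels
    V = rng.choice(len(rho), size=J, p=rho)  # (ii) bottom-node cluster labels
    # (iii) layer-wise Poisson draws with rate mu_i * nu_j * theta_{U_i V_j}^{(l)};
    # observed matrices are binary in practice, the Poisson assumption being a
    # computational convenience.
    rate = mu[:, None, None] * nu[None, :, None] * theta[U][:, V, :]
    return U, V, rng.poisson(rate)

# Toy usage with hypothetical parameter values (H=2, K=3, L=2, I=10, J=8).
rng = np.random.default_rng(0)
theta = rng.gamma(1.0, 0.1, size=(2, 3, 2))
U, V, B = sample_multilayer_lbm(np.array([0.5, 0.5]), np.array([0.2, 0.3, 0.5]),
                                np.ones(10), np.ones(8), theta, rng)
```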

2.2 Model Inference and Selection

Given an observed graph \(\mathcal {G}=(\mathcal {U},\mathcal {V},\mathcal {E})\), we now aim to find the optimal partitions \(\mathbb {U}\) and \(\mathbb {V}\) using the model introduced above. To that end, we need to infer the parameters \(\boldsymbol{\pi }\), \(\boldsymbol{\rho }\), \(\boldsymbol{\mu }=\big (\mu _i\big )\in \mathbb {R}_+^{I}\), \(\boldsymbol{\nu }=\big (\nu _j\big )\in \mathbb {R}_+^{J}\), and \(\mathbf {\varTheta }=\big (\mathbf {\varTheta }^{(1)},\ldots ,\mathbf {\varTheta }^{(L)}\big )\) (where \(\mathbf {\varTheta }^{(\ell )}=\big ( \theta ^{(\ell )}_{hk} \big )\in \mathbb {R}_+^{H\times {K}} \) for each \(\ell \in [L]\)). The number of top clusters H and bottom clusters K must also be set in a principled manner. These two problems are addressed in the next paragraphs.

Model inference. The parameters of the model and the cluster assignments are obtained by maximizing the complete data log-likelihood

$$\begin{aligned} L_{\textrm{C}}(\mathbb {T},\mathbb {U},\mathbb {V}) =&\sum _{i=1}^{I}\log \pi _{U_i} + \sum _{j=1}^{J}\log \rho _{V_j} \\&+ \sum _{i=1}^{I}\sum _{j=1}^{J}\sum _{\ell =1}^L\left\{ b_{ij}^{(\ell )}\log \left( \mu _i\nu _j\theta _{U_iV_j}^{(\ell )} \right) - \mu _i\nu _j\theta _{U_iV_j}^{(\ell )} \right\} , \end{aligned}$$
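For reference, this quantity can be evaluated as follows for hard cluster assignments; the \(-\log b_{ij}^{(\ell )}!\) term of the Poisson log-density vanishes for binary matrices, which is why it does not appear above. Array shapes follow the earlier sketches and are an assumption on our part.

```python
import numpy as np

def complete_log_likelihood(B, U, V, pi, rho, mu, nu, theta):
    """Complete-data log-likelihood L_C for hard assignments U (I,) and V (J,).

    B: (I, J, L) binary array, theta: (H, K, L) array of block rates.
    Assumes strictly positive rates mu_i * nu_j * theta_{U_i V_j}^{(l)}.
    """
    rate = mu[:, None, None] * nu[None, :, None] * theta[U][:, V, :]  # (I, J, L)
    return (np.log(pi[U]).sum()
            + np.log(rho[V]).sum()
            + (B * np.log(rate) - rate).sum())
```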

where \(\mathbb {T}=\{\boldsymbol{\pi },\boldsymbol{\rho },\boldsymbol{\mu }, \boldsymbol{\nu },\mathbf {\varTheta }\}\) denotes the complete set of parameters. To that end, we use the block expectation-maximization algorithm described in [12], with minor adjustments to factor in the multilayer nature of the data. This algorithm first introduces soft cluster assignment matrices \(\textbf{U}=\big (\tilde{u}_{ih}\big )\in [0,1]^{I\times {H}}\) and \(\textbf{V}=\big (\tilde{v}_{jk}\big )\in [0,1]^{J\times {K}}\), with \(\sum _{h=1}^H\tilde{u}_{ih}=\sum _{k=1}^K\tilde{v}_{jk}=1\) for all \(i\in [I]\) and \(j\in [J]\). It then maximizes the fuzzy criterion

$$ G(\mathbb {T},\textbf{U},\textbf{V})= L_{\textrm{S}}(\mathbb {T},\textbf{U},\textbf{V}) + H(\textbf{U}) + H(\textbf{V}), $$

where \(H(\textbf{U})=-\sum _{i=1}^{I}\sum _{h=1}^H\tilde{u}_{ih}\log \tilde{u}_{ih}\) denotes the total entropy of the soft cluster assignments of the top nodes, \(H(\textbf{V})\) denotes the total entropy for the bottom nodes, and

$$\begin{aligned} L_{\textrm{S}}(\mathbb {T},\textbf{U},\textbf{V})=&\sum _{i=1}^{I}\sum _{h=1}^H\tilde{u}_{ih}\log \pi _h + \sum _{j=1}^{J}\sum _{k=1}^K\tilde{v}_{jk}\log \rho _k \\&+ \sum _{h=1}^H\sum _{k=1}^K\sum _{i=1}^{I}\sum _{j=1}^{J}\sum _{\ell =1}^L \tilde{u}_{ih}\tilde{v}_{jk}\left[ b_{ij}^{(\ell )}\log \left( \theta _{hk}^{(\ell )} \right) - \mu _i\nu _j\theta _{hk}^{(\ell )} \right] \end{aligned}$$

is the fuzzy likelihood function. The criterion \(G(\mathbb {T},\textbf{U},\textbf{V})\) is maximized by alternately performing two steps: optimizing \(\textbf{U}\) and \(\textbf{V}\) with \(\mathbb {T}\) fixed (E-step), and optimizing \(\mathbb {T}\) with \(\textbf{U}\) and \(\textbf{V}\) fixed (M-step). These two steps are iterated until G stabilizes. The node partitions \(\mathbb {U}\) and \(\mathbb {V}\) can then be obtained by assigning each node to its most probable cluster according to \(\textbf{U}\) and \(\textbf{V}\). Algorithm 1 describes the detailed inference procedure. Note that since the result depends on the random initialization of the parameters, we run the whole procedure 50 times with different initializations and return the model with the highest likelihood.

Algorithm 1. Detailed inference procedure (block EM for the multilayer latent block model).
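Since Algorithm 1 is not reproduced here, the sketch below shows one possible implementation of the alternating updates, derived by maximizing \(G\) with the other quantities held fixed. It is not the authors' exact procedure: in particular, the node factors \(\boldsymbol{\mu }\) and \(\boldsymbol{\nu }\) are kept fixed at degree-based values, which is one common convention for degree-corrected models and may differ from the choice made in the paper.

```python
import numpy as np

def block_em(B, H, K, n_iter=100, seed=0):
    """Sketch of the block EM iterations for the multilayer latent block model.

    B: binary array of shape (I, J, L). Returns hard partitions and parameters.
    Assumes a non-empty graph (B.sum() > 0).
    """
    rng = np.random.default_rng(seed)
    I, J, L = B.shape
    eps = 1e-12
    # Node factors held fixed at normalized degrees -- an assumed convention.
    mu = B.sum(axis=(1, 2)) / np.sqrt(B.sum())
    nu = B.sum(axis=(0, 2)) / np.sqrt(B.sum())
    # Random soft initialization of the cluster assignment matrices.
    U = rng.dirichlet(np.ones(H), size=I)
    V = rng.dirichlet(np.ones(K), size=J)
    for _ in range(n_iter):
        # M-step: mixing proportions and block rate matrices.
        pi, rho = U.mean(axis=0), V.mean(axis=0)
        num = np.einsum('ih,jk,ijl->hkl', U, V, B)
        den = np.einsum('ih,jk->hk', U * mu[:, None], V * nu[:, None])[..., None]
        theta = num / np.maximum(den, eps)
        log_theta = np.log(np.maximum(theta, eps))
        # E-step for U: responsibilities given V and the current parameters.
        logU = (np.log(pi)[None, :]
                + np.einsum('ijl,jk,hkl->ih', B, V, log_theta)
                - np.einsum('i,j,jk,hkl->ih', mu, nu, V, theta))
        U = np.exp(logU - logU.max(axis=1, keepdims=True))
        U /= U.sum(axis=1, keepdims=True)
        # E-step for V, symmetrically.
        logV = (np.log(rho)[None, :]
                + np.einsum('ijl,ih,hkl->jk', B, U, log_theta)
                - np.einsum('i,j,ih,hkl->jk', mu, nu, U, theta))
        V = np.exp(logV - logV.max(axis=1, keepdims=True))
        V /= V.sum(axis=1, keepdims=True)
        # A convergence test on the fuzzy criterion G would go here.
    return U.argmax(axis=1), V.argmax(axis=1), pi, rho, mu, nu, theta
```

In line with the text above, such a procedure would be restarted with 50 random initializations, keeping the run achieving the highest likelihood.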

Model selection. In order to run the aforementioned inference procedure, we must first set the number of top clusters H and bottom clusters K. In the absence of any prior knowledge, we use the integrated completed likelihood (ICL [3]) to pick the best values out of a predefined set of candidates. More specifically, for each \((H,K)\in \{2,\ldots ,16\}^2\), we run the inference procedure to obtain the optimal parameter set \(\mathbb {T}\) and node partitions \(\mathbb {U},\mathbb {V}\). We then compute

$$ \textrm{ICL}(\mathbb {U},\mathbb {V};H,K) = L_{\textrm{C}}(\mathbb {T},\mathbb {U},\mathbb {V}) - \frac{H-1}{2}\log I - \frac{K-1}{2}\log J - \frac{LHK}{2}\log (LIJ) $$

for each candidate model, and the highest-scoring model is selected. The ICL penalizes the complete data log-likelihood by subtracting terms proportional to the number of free parameters, thus enabling a trade-off between goodness-of-fit and model complexity. Note that it relies on several approximations, which makes it an arguably imperfect evaluation criterion. This has motivated many subsequent contributions on model selection for stochastic and latent block models (see e.g. [4, 19, 27]). However, for the sake of simplicity, we leave the use of more sophisticated criteria for future work.
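As a small illustration, the criterion and the grid search over candidate cluster numbers can be written as follows; `fit` stands for any routine returning the maximized complete-data log-likelihood for given (H, K) and is purely hypothetical.

```python
import numpy as np

def icl(log_lik_c, I, J, L, H, K):
    """ICL criterion: complete-data log-likelihood penalized by model complexity."""
    return (log_lik_c
            - (H - 1) / 2 * np.log(I)
            - (K - 1) / 2 * np.log(J)
            - L * H * K / 2 * np.log(L * I * J))

# Hypothetical grid search over the candidate numbers of clusters {2, ..., 16}^2:
# best_H, best_K = max(((H, K) for H in range(2, 17) for K in range(2, 17)),
#                      key=lambda hk: icl(fit(B, *hk), I, J, L, *hk))
```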

3 First Case Study—Network Flows

Having described our modelling tools, we now move on to our first case study. Section 3.1 describes the dataset we consider and the preprocessing steps we apply to turn it into a multiplex network. We then discuss the results obtained by applying our methodology in Sect. 3.2.

3.1 Data Description

The dataset we use was originally created for the Mini-Challenge 3 of the VAST 2013 competition [26]. It represents two weeks of simulated network traffic between approximately 1400 hosts, most of which belong to an enterprise network. We refer to these hosts as internal ones, while hosts that are not part of the enterprise network are called external hosts. The dataset contains benign traffic as well as various kinds of attacks. Some of these attacks are rather noisy—e.g. distributed denials of service (DDoS), port scans—while others are more subtle (e.g. data exfiltration). Overall, 53 out of 200 external hosts are involved in malicious traffic, which is a rather high ratio. This, along with the small number of hosts and synthetic nature of the data, makes this dataset a somewhat easy test case for our methodology. Moreover, a thorough description of the hosts and their respective roles is available, as well as a complete list of attack-related events. This allows us to evaluate the relevance of our results.

Note that while the dataset encompasses several data sources, we only keep the network flow records between internal hosts and external hosts. We turn these records into a bipartite multiplex graph as follows. First, each internal (resp. external) host is represented by a top (resp. bottom) node. For each flow between an internal host IH and an external host EH, we then build an edge between IH and EH, of type (Protocol, Destination Port, Direction). Note that due to the large number of possible destination ports, we only keep 10 distinct values corresponding to well-known protocols and represent the other values as a single Other token. As for the direction of the flow, it is either inbound (the internal host is the destination) or outbound (the internal host is the source). Note that each observed (IH, EH, Type) triple is represented by a single edge, regardless of its number of occurrences. See Table 1 for some descriptive statistics about the dataset and the obtained graph.
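To make this construction concrete, the following sketch maps flow records to typed edges and deduplicates them. Field names and the list of well-known ports are illustrative assumptions; the actual selection of ports used in the paper is not reproduced here.

```python
# Illustrative list of 10 well-known destination ports (assumption).
WELL_KNOWN_PORTS = {20, 21, 22, 25, 53, 80, 110, 143, 443, 587}

def flow_to_edge(flow, internal_hosts):
    """Map one flow record (dict with assumed field names) to an
    (internal host, external host, edge type) triple."""
    if flow["src_ip"] in internal_hosts:
        ih, eh, direction = flow["src_ip"], flow["dst_ip"], "outbound"
    else:
        ih, eh, direction = flow["dst_ip"], flow["src_ip"], "inbound"
    port = flow["dst_port"] if flow["dst_port"] in WELL_KNOWN_PORTS else "Other"
    return ih, eh, (flow["protocol"], port, direction)

def build_edge_set(flows, internal_hosts):
    """Each observed (IH, EH, Type) triple yields a single edge."""
    return {flow_to_edge(f, internal_hosts) for f in flows}
```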

Table 1. Description of the datasets: number of top nodes I, number of bottom nodes J, number of layers L, number of distinct edges M, and total number of events N.

3.2 Results

Our methodology yields three clusters of internal hosts and three clusters of external hosts. These clusters, as well as the aggregated edges between them, are shown in Fig. 1. Note that for the sake of readability, only edges representing at least 40 events are displayed. The size of each cluster is proportional to the logarithm of the number of hosts it contains. Similarly, the width of each edge grows logarithmically with the number of underlying events.

Fig. 1. Clusters found in the VAST dataset and interactions between them.

Starting with internal clusters, we observe that cluster 1 initiates HTTP and SSH connections, and that it has more outbound flows than inbound ones. Similarly, cluster 3 primarily interacts through outbound flows, including HTTP and FTP traffic. In contrast, cluster 2 receives many inbound connections, with HTTP, SMTP and DNS among the most represented protocols. It also sends SMTP traffic to external cluster 3. Overall, we can thus safely assume that clusters 1 and 3 contain workstations while cluster 2 contains servers, which is indeed confirmed by the ground truth description. As for external clusters, cluster 1 mostly receives HTTP connections from internal workstations, suggesting that it contains Web servers. This is consistent with the ground truth. Cluster 3 contains a mail server and an FTP server, which explains the inbound SMTP traffic coming from internal servers as well as the inbound FTP connections coming from internal workstations. Finally, cluster 2 contains the majority of external hosts, which are primarily observed connecting to the internal servers. In particular, attackers all fall into cluster 2, along with many benign hosts.

Even though partitioning the set of external hosts does not isolate malicious ones, studying the interactions between clusters allows us to uncover several attacks. For instance, all TCP traffic from external cluster 2 to internal cluster 2 on ports 20, 21, 22, 23, 53, 443, 465 and 587 originates from two hosts (10.9.81.5 and 10.10.11.15), which happen to be attackers. This also holds for all UDP traffic between these two clusters. These edges stand out in Fig. 1 because of the relatively small number of events they represent. Such sparse connections on many different ports suggest port scanning activity, which is confirmed by the ground truth. Similarly, ICMP traffic from external cluster 2 to internal cluster 2 also results from network scans. Finally, all SSH traffic from internal cluster 1 to external cluster 2 is directed towards a single host (10.0.3.77), which calls for further investigation. These flows actually represent beaconing activity from compromised internal hosts to a command and control server.

Overall, this first case study shows that our approach can indeed infer meaningful clusters, thereby revealing the functional roles of most hosts. It can also help identify some malicious behaviors by providing a reduced number of starting points for deeper investigation. Note, however, that some types of attacks are not easily detectable. Typically, data exfiltrations and DDoS attacks only differ from normal traffic through their volume. Since our graph-based representation does not include this characteristic, it does not allow us to distinguish exfiltrations from regular FTP traffic, or DDoS attacks from regular Web traffic. However, these high-volume attacks can be detected using other methods.

4 Second Case Study—Authentication Logs

Our second case study focuses on a different data source, namely authentication logs collected within a real-world enterprise network. The dataset, which we describe in Sect. 4.1, is significantly larger and more complex than the one studied in the previous section. It is thus more challenging to extract meaningful insights from it. However, as we show in Sect. 4.2, our methodology still allows us to infer some functional roles and uncover malicious behavior.

4.1 Data Description

We use the “Comprehensive, Multi-Source Cyber-Security Events” [14, 15] dataset released by the Los Alamos National Laboratory (LANL). This dataset represents 58 days of activity within LANL's enterprise network, recorded through several kinds of event logs. A red team exercise took place during this time span, meaning that penetration testers tried to breach the network in order to assess its security. The remote authentications performed by the red team are labelled, providing an interesting example of an advanced intrusion within an enterprise network. In particular, this attack is stealthier than those considered in the previous section, which makes it a more challenging test case for our method.

We focus on Windows authentication logs, more specifically on successful LogOn events. Each one of these events is described by a source user SU, a destination user DU, a source host SH, a destination host DH, a logon type LT and an authentication package AP. Note that SU and DU can be identical, and the same goes for SH and DH. As for LT and AP, they describe the type of session being created and the protocol used to authenticate the user, respectively. Examples of logon types include Interactive (local session with graphical interface) or Network (used for remote file accesses or remote procedure calls, for instance), and the most frequent authentication packages are Kerberos and NTLM. Note that these two additional fields are especially meaningful as they make it possible to distinguish different behaviors (e.g., remote administrative session versus simple file access). They can also help characterize users and hosts: for instance, since NTLM is considered less secure than Kerberos, NTLM authentication can sometimes be disabled for highly privileged accounts. Conversely, some legacy applications may not support Kerberos authentication, so servers hosting such applications can be expected to use NTLM frequently.

We represent users as top nodes and hosts as bottom nodes, and we turn logon events into typed edges as follows. If SH and DH are identical, we create one edge between DU and DH with type (LT, AP, Local). Otherwise, we create one edge between SU and SH with type (LT, AP, From), and one edge between DU and DH with type (LT, AP, To). While this construction breaks the link between the source and destination of an authentication, it still highlights the difference between hosts receiving many remote authentications (typically servers) and hosts from which these authentications originate (typically workstations). See Table 1 for some descriptive statistics about the dataset.
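The corresponding edge construction can be sketched as follows; field names are assumptions on our part, not the dataset's actual column names.

```python
def logon_to_edges(event):
    """Map one successful LogOn event to one or two typed (user, host) edges."""
    su, du = event["src_user"], event["dst_user"]
    sh, dh = event["src_host"], event["dst_host"]
    lt, ap = event["logon_type"], event["auth_package"]
    if sh == dh:                            # local authentication
        return [(du, dh, (lt, ap, "Local"))]
    return [(su, sh, (lt, ap, "From")),     # edge on the source side
            (du, dh, (lt, ap, "To"))]       # edge on the destination side
```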

Fig. 2. Clusters found in the LANL dataset and interactions between them.

4.2 Results

The optimal model has thirteen user clusters and twelve host clusters. We display them in Fig. 2, along with the main interactions between them. Specifically, we display all edges \((h,k,\ell )\) such that \(\theta _{hk}^{(\ell )}\ge 0.7\). Note that we select edges using the set of rate matrices \(\mathbf {\varTheta }\) rather than the number of underlying events, because the behaviors we seek to detect generate relatively few events. Cluster-to-cluster edges with a high rate, which indicate that many distinct user–host pairs are connected at least once, are thus more relevant than edges merely backed by a large number of individual events.
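Concretely, the displayed cluster-to-cluster edges can be selected with a simple threshold on the estimated rates, as in the sketch below (with \(\mathbf {\varTheta }\) stored as an (H, K, L) array, as in the earlier sketches).

```python
import numpy as np

def edges_to_display(theta, threshold=0.7):
    """Return (h, k, l) triples whose estimated rate reaches the display threshold."""
    return [tuple(idx) for idx in np.argwhere(theta >= threshold)]
```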

As expected, the global picture is more complex than in the previous section. In addition, little information is available on the true functional roles of users and hosts. However, we can still infer the meaning of some clusters and confirm their relevance. For instance, user clusters 10, 11 and 13 mainly interact through local authentications with logon type Service, indicating that they contain service accounts. User cluster 1 opens remote interactive sessions on many hosts, which suggests the presence of administrator accounts. As for user clusters 3 and 5, they mostly consist of anonymous user credentials, which are typically used to access shared resources that do not require any access control (such as internal Web pages). This is consistent with the NTLM-authenticated network logons associated with these clusters. Interestingly, a significant proportion of the non-anonymous user names from these two clusters were used for red team activity. This suggests that looking for inconsistencies between the expected functional role of an entity and the cluster it falls into can help uncover malicious behaviors.

Inferring the functional roles of host clusters is more difficult, although it seems reasonable to assume that cluster 1 contains domain controllers and application servers. Indeed, this cluster is the one receiving the most remote authentications. Conversely, clusters 2, 3, 8, 9, and 11 are sources of remote authentications, suggesting that they mostly contain workstations. Finally, cluster 4 stands out as the source of remote NTLM authentications involving three user clusters. Further investigation reveals that most of these authentications were performed by the red team and originate from host C17693, which also happens to be the main source of red team activity. Once again, the simplified situational picture we generate thus preserves important clues that can lead to the detection of malicious behavior.

5 Related Work

Biclustering and latent block models. Biclustering, i.e., the idea of simultaneously partitioning the rows and columns of a matrix so as to form blocks of coefficients with similar values, can be traced back to the early 1970s [13]. Spectral methods [7, 18] are among the most popular approaches to this task. However, model-based biclustering has gained traction in the last two decades, following the introduction of the latent block model [10]. Various inference procedures have been proposed, most of which rely on the expectation-maximization (EM) algorithm. While we apply the simple variational approximation proposed in [10], other approaches include the addition of intermediary classification steps [11] or the use of Bayesian inference [16].

Multilayer stochastic block models. Even though we are not aware of any previous work on multilayer generalizations of the latent block model, a straightforward connection can be made with multiplex stochastic block models (SBMs). More specifically, our model is analogous to the degree-corrected SBM with independent layers described in [22]. Note, however, that many other multilayer generalizations of the SBM can be found in the literature. In particular, the existence of diversely complex statistical dependencies between layers has been addressed in various ways [2, 6, 21, 24].

Data exploration and visualization for cybersecurity. Finally, our work contributes to a vast research effort aiming to ease network security monitoring by providing insightful visualizations. In particular, displaying network flows between internal and external hosts is a recurring challenge. VISUAL [1], VisFlowConnect [28], NFlowVis [8], and FloVis [25] are some of the existing tools designed to perform this task. However, none of them focuses on statistical modelling to build a more condensed view. Note that FloVis and NFlowVis use the hierarchical nature of IP addresses to aggregate hosts by subnetwork, and NFlowVis also uses k-medoids clustering to find external hosts with similar communication patterns. Even so, the use of behavior-based biclustering makes our method more effective at reducing the number of elements to display while preserving essential information. As for authentication logs, APTHunter [23] also focuses on interactivity rather than automated data reduction, only letting the user manually apply filters to make the authentication graph legible. Finally, the work of Glatz et al. [9] adopts an approach somewhat similar to ours: they first extract frequent itemsets from network flow records, then display them as a bipartite itemset–item graph. While this does provide a simple summary of large volumes of logs, the extracted itemsets still cover a small minority of the total number of events, leading to significant information loss.

6 Conclusion and Perspectives

We propose a graph-oriented approach to event log exploration for network security monitoring. Through the use of model-based multiplex graph biclustering, we aim to extract meaningful clusters of entities, such as groups of users or hosts sharing functional roles or behavioral patterns. Our case studies demonstrate that such meaningful clusters can indeed be uncovered. In addition, displaying interactions between these clusters can facilitate malicious behavior detection.

Aside from investigating better model selection criteria (see Sect. 2.2), extending our model to factor in the temporal dimension could be an interesting lead for future work. Previous contributions on temporal latent block models [5] provide foundations for such an extension. Finally, looking for meaningful groups of edge types in addition to entity clusters, similarly to the strata multilayer SBM [24], is another promising direction. Indeed, it could lead to an even simpler situational picture, especially for large and complex datasets such as the one studied in Sect. 4.