Keywords

1 Introduction

Nowadays malware becomes prevalent everywhere: from conventional servers and PCs to mobile and IoT devices [1]. Among an average of 1 million malware instances appear every day, most of them are variants of the existing ones [2]. Polymorphism, obfuscation and encryption techniques are applied by malware developers to evade detection by anti-malware programs [3, 4]. They generate different malware instances from the same malware family to avoid traditional detection techniques. As a result, members of the same malware family are functionally similar to one another [5] although their binaries can be quite different. A recent study revealed that there was a decrease in the number of new malware families detected between the year of 2014 and 2015. However, there was noticeable increase with a percentage of 36% in the number of malware variants detected between 2014 and 2015 [6]. With the increased number of malware instances, effective and computationally efficient methods are required to detect similar malware behaviors in order to accelerate the malware analysis process. Leveraging data mining techniques like clustering can help us fulfill this requirement.

In this paper, we conduct a clustering analysis of malware based on their dynamic behavioral patterns represented by Application Programming Interface (API) call sequences. Although supervised learning (classification) approach, such as [7,8,9,10], is the most common data mining method for malware analysis (detection), unsupervised learning (clustering), such as [11,12,13,14], is also a useful malware analysis tool that can complement supervised learning. The benefits of a clustering analysis are as follows.

  1. 1.

    Organizing malware into homogeneous clusters can help us better understand the malware activities [15]. Visualization of clustering results can provide us with the “bird-eye view” of the similarities/differences of malware in their behavioral patterns such as host infection procedures, attack routines, and dissemination mechanisms.

  2. 2.

    Clustering analysis enables “variant detection”. It allows us to uncover behaviorally similar malware that should belong to the same family even though identified as distinct and sometimes even assigned to different malware categories by commercial anti-malware programs. This facilitates malware analysis experts to better sample malware variants of the same/different malware families when carrying an in-depth manual analysis process [16].

  3. 3.

    Clustering allow us to identify the emergence of new malware families by examining the formation of new clusters and/or sub-clusters. In fact, clustering is recommended as an adaptive technique to look for unknown patterns potentially corresponding to new types of attacks [15, 17].

In our study, we first transform API call sequences into symbolic strings to allow us to detect similar patterns. Then, we calculate all-against-all pairwise similarities (and hence distances) of malware samples in our dataset in order to construct the distance matrix and subsequently perform a hierarchical clustering using the Weighted Pair Group Method with Arithmetic Mean (WPGMA) algorithm.

The new contributions of this paper are as follows.

  1. 1.

    We have investigated three different similarity/distance functions, namely Optimal Matching (OM), Longest Common Subsequence (LCS) (as in [14]), and Longest Common Prefix (LCP) and compare the results in terms of clustering quality (silhouette coefficient) and time complexity.

  2. 2.

    It turns out that LCP offers the best results with the best clustering quality and the lowest time complexity (linear-time). To our best knowledge, we are the first to apply the LCP function in API sequence-based malware analysis and demonstrate its usefulness. The results suggest that prefix (i.e., initial API subsequence) could be used as a quick but reliable indicator for overall similar behaviors in malware.

This paper is a summary version of the master’s thesis [18] by the first author.

2 Related Work

Supervised learning (classification) and unsupervised learning (clustering) are the two main approaches used in data mining-based malware analysis. The works [7,8,9,10] are some examples of malware classification. Some examples of malware clustering are described below.

Bayer et al. [11] provided a scalable approach for malware clustering. They investigated malware applications that share similar behavioral traits through three phase model, namely, dynamic analysis phase, identification of behavioral signatures based on system calls phase, and the clustering phase. To suggest a feasible solution to the proximity search of nearest neighbor, the clustering algorithm they used in their study is based on locality sensitive hashing. Their experiment effectively and efficiently clustered samples sharing similar behavioral patterns.

Kim et al. [12] calculated the similarity between android malware applications by utilizing Androguard to construct control flow graphs. Similar behaviors were compared by reviewing the structural information collected from the flow graphs to match target applications. Moreover, clustering sub-families of each family and identifying representative malware samples help reduce the number of similarity calculation. The clustering algorithm used was DBSCAN.

Qiao et al. [13] clustered API sequences based on frequent pattern mining algorithms. The analysis framework consists of three steps, API call abstraction, frequent itemset mining, and analysis. For API call abstraction, dynamic analysis tools, namely, CWSandbox and Cuckoo Sandbox were used to extract behavioral reports of malware binaries. API pattern mining for frequent itemsets was identified using Apriori Program. Nevertheless, frequent API calls are used as an input to calculate the similarity between different malware binaries. The output served as an input for computing signature sequences. Instead of API calls, strings of opcodes representing malware behaviors are clustered using a suffix tree in the recent work by Oprişa et al. [19].

Zhong et al. [14] proposed analysis system, namely ARIGUMA, which is composed of four components, Sandbox, code analyzer, visualizer, and web User Interface (UI). The system use code analyzer to classify malware applications based on their behavior. Comparison between malware instructions was completed using LCS. Given that LCS algorithm requires polynomial time by dynamic programming, the computational time was reduced by filtering out the numeric features of codes.

3 Proposed Malware Clustering Framework

3.1 Generating Encoded API Call Sequences

Instead of looking directly into the characteristics of malware through its binaries (executable codes), we examine its Application Programming Interface (API) call information in order to better understand its behavioral pattern. This approach gives us three advantages. (1) It enables a dynamic analysis of malware though run-time API calls instead of a static analysis. (2) It is much more efficient to deal with API call sequences (whose lengths are typically less than 1000) than malware binaries (whose lengths are typically a few megabytes). (3) Two instances of malware from the same family may not have a detectable common signature in their binaries because of polymorphism and/or obfuscation; however, their behaviors in terms of API call sequences are expected to be quite similar.

The API-based malware detection system (APIMDS) dataset used in this study is obtained from Ki et al. [20]. It contains the API system call sequences for 6 nominal malware categories, namely, backdoor, packed, PUP (Potentially Unwanted Programs), Trojan, virus and worm. Those sequences were obtained by executing the malware samples using virtual machine environment such as VirtualBox. (The detailed process can be referred to in the paper [20].) From the whole dataset, we randomly select 1,797 of them for our clustering studies.

After obtaining the API call sequences of those selected malware, we preform “sequence encoding”. Its purpose is to reduce the dimensionality of sequence unique states as well as to reduce the time complexity for calculating distances between the different sequences. API code sequence generation step is carried out by referring to MSDN library [21] for categorizing API system calls into 26 categories as given in Table 1.

Table 1. Codes for API call categories.

3.2 Generating Distance Matrix

The concept of “similarity” can be applied to a pair of encoded API call sequences representing dynamic behaviors of malware. Similar malware instances share a certain amount of behaviors in common. The most common algorithms that used to measure the similarity between the two given sequences are: (i) Optimal Matching (OM), (ii) Longest Common Subsequence (LCS), and (iii) Longest Common Prefix (LCP) [22, 23].

The purpose of these algorithms are to extract the lengths of the “common patterns” of malware behaviors in their own unique ways. The longer the two sequences that follow a common pattern, the more similar they are and the lesser the difference (a.k.a. “distance”) between them. Distance can be regarded as the inverse of similarity and can be easily derived from it, and vice versa.

Given a dataset of n malware samples, for each distance function, we need to construct an all-against-all (i.e., \(n\,{\times }\,n\)) distance matrix, where each matrix entry \(d_{ij}\) is the pairwise distance between two malware sequences indexed i and j. In fact, we only need the compute the distance values in the upper triangle of the matrix (whose size is \(n(n-1)/2\)), because the lower triangle is the mirror image of the upper one. This distance matrix is to be used as an input in the subsequent clustering procedure.

In our research, we use R’s TraMineR package [23] to compute the OM, the LCS, and the LCP distances. Their brief descriptions are given below.

Optimal Matching (OM) Distance: OM produces edit distance by calculating the minimal amount of edits required to change a pair of malware behaviors X and Y into a different pair such that they are similar to some extent [14]. It is based on Needleman-Wunsch sequence alignment algorithm [24], which is commonly used in Bioinformatics. The insertion/deletion cost is single value to be specified by the user, while the substitution cost matrix (of replacing one API call category for another) can be produced by assigning constant values, or using the transition rate between sequences observed in the dataset. All the insertion/deletion/substitution costs are set to 1 in our experiment. Equation 1 defines the OM distance recursively [22].

$$\begin{aligned} d_{OM}(X_{i},Y_{j})= \min \left\{ \begin{array}{l} d_{OM}(X_{i-1},Y_{j}) + \text{ deletion } \text{ cost }\\ d_{OM}(X_{i},Y_{j-1}) + \text{ insertion } \text{ cost }\\ d_{OM}(X_{i},Y_{j-1}) + \text{ substitution } \text{ cost } \text{ if } X_{i} \ne Y_{j}\\ \end{array} \right. \end{aligned}$$
(1)

The OM distance of two sequences of lengths |X| and |Y| can be computed in a quadratic \(O(|X| \times |Y|)\) time [22, 24]. It should be noted that OM directly computes the distance without needing to compute the similarity first.

Longest Common Subsequence (LCS) Distance: Given two sequences X, and Y, LCS calculates the largest number of common sequential elements between them. For example, let the two sequences be X = A-B-A-C-A-D and Y = A-B-C-A-P-Q-R. Then, their LCS is A-B-C-A. LCS measures the similarity and difference of instructions between sequences that have same relative order, but not necessarily adjacent. For the two sequences \(X_{i}\), and \(Y_{j}\), where i and j represent the length of the longest common subsequence present in both sequences, Eq. 2 defines the length of LCS recursively [22].

$$\begin{aligned} LCS(X_{i},Y_{j})= \left\{ \begin{array}{ll} 0 &{} \text{ if } {i\mathord {=}0 \text{ or } j\mathord {=}0} \\ LCS(X_{i-1},Y_{j-1}) + 1 &{} \text{ if } X_{i} = Y_{j} \\ \max (LCS(X_{i},Y_{j-1}),LCS(X_{i-1},Y_{j})) &{} \text{ if } X_{i} \ne Y_{j}\\ \end{array} \right. \end{aligned}$$
(2)

The LCS distance \(d_{LCS}\) is derived from the LCS length using Eq. 3.

$$\begin{aligned} d_{LCS}(X, Y) = |X| + |Y| - 2{\cdot }LCS(X, Y) \end{aligned}$$
(3)

LCS was also used in the previous malware clustering study by Zhong et al. [14]. Computing LCS of two sequences with |X| and |Y| elements using dynamic programming is of a quadratic \(O(|X| \times |Y|)\) time complexity [14].

Longest Common Prefix (LCP) Distance: Given two sequences X, and Y, LCP calculates the length of the longest common prefix shared by them. For example, let the two sequences be X = A-B-A-C-A-D and Y = A-B-C-A-P-Q-R. Then, their LCP is A-B. LCP offers a quick first-order approximation of the similarity of two given sequences by looking at their prefixes only. For the two sequences X, and Y, Eq. 4 defines the length of LCP [22].

$$\begin{aligned} LCP(X, Y) = |\, \{u \ne \lambda : X^{|u|}=u=Y^{|u|}\}\, | \end{aligned}$$
(4)

where \(\lambda \) is an empty string, and \(X^{|u|}\) and \(Y^{|u|}\) (\(|u|\le \min (|X|,|Y|)\)) are the longest prefix sequences of length |u| of X and Y respectively. Just like the LCS distance, the LCP distance \(d_{LCP}\) can be derived from the LCP length in the same manner as in Eq. 3.

LCP of two sequences with |X| and |Y| elements can be calculated in linear time, which is \(O(\min (|X|, |Y|))\).

3.3 Clustering

Clustering enables grouping of similar encoded API sequences representing malware behaviors to detect repeating patterns. Moreover, sub-patterns in the same cluster can also be recognized and further analyzed. To evade binary code signature-based detection by anti-malware tools, malware developers make minor changes (polymorphism) or apply obfuscation to the source code of existing malware to produce new variants. This greatly increases the number of variants for the same malware with different signatures. API sequence-based clustering can help reveal those kinds of variants easily.

Clustering uses the distance matrix generated from the above step. In our case, we conduct 3 distinct clustering exercises using 3 distance matrices with OM, LCS, and LCP distance functions respectively. We employ a hierarchical clustering algorithm called Weighted Pair Group Method with Arithmetic Mean (WPGMA) [25] to generate clusters. (Note: we also tried k-means clustering, but the results are not as good as those by WPGMA. We do not show the k-means results in this paper because of the page limit.)

The WPGMA algorithm constructs a rooted tree (dendrogram) that reflects the structure present in a pairwise distance matrix. At each step, the nearest two clusters, say i and j, are combined into a higher-level cluster \(i\cup j\). Then, its distance to another cluster k is simply the arithmetic mean of the distances between k and members of \(i\cup j\) [25].

$$\begin{aligned} d_{(i{\cup }j),k} = (d_{i,k}+d_{j,k})/2 \end{aligned}$$
(5)

Clustering n elements using WPGMA takes \(O(n^{2})\) space and \(O(n^{2})\) time. There are \(n-1\) iterations, with O(n) work in each one [26]. We use R’s Cluster package [27] to perform WPGMA clustering.

4 Results and Discussions

In this section, we will discuss the results of our proposed framework on the 1,797 malware samples in our experimental dataset. The will identify the suspicious API calls and examine the suspicious call patterns based on the clustering results. Visualization and statistical validation of the clustering results are also performed along with a running time analysis.

4.1 Clusters of Similar Malware Behavioral Patterns

Clustering helps us to find similar groups among malware API sequences. Hierarchical clustering (WPGMA) algorithm is used to cluster sequences and identify similar patterns of malicious behavior. Hierarchical clustering divides data into clusters and produce a tree structure to display the results. As mentioned above, three distinct clustering tasks using OM, LCS, and LCP distance matrices are performed. However, only the results of LCP will be shown here as an example.

Figure 1 represents clustering results for LCP. By referring to the figure, we can notice that there is a clear distinction between the clusters, except for Cluster #1 which is not very homogeneous. Each cluster exhibits a different set of behaviors. For example, in Cluster #4, we can notice that there are 6 sub-clusters that are very similar in initial subsequences yet differ in overall sequence length. These sub-clusters follow the same pattern, namely, Networking and Internet > Get Info > Debugging and Error Handling > I/O Create > File I/O > Get Info > I/O Open > Window Class. This pattern is an example of the initial subsequence taken by all sub clusters within the same cluster.

Fig. 1.
figure 1

Clustering results with LCP distance function (number of clusters = 6).

Likewise, for a large and heterogeneous cluster like Cluster #1 (with 1278 members), we can also apply sub-clustering to generate more homogeneous sub-clusters. Figure 2 shows the sub-clustering result for Cluster #1. It should be noted that for the sub-clusters that are still heterogeneous and sufficiently large, such as Sub-Cluster #1 (with 984 members), it is still possible to further sub-divide them. (The sub-division results are not shown in this paper.)

Fig. 2.
figure 2

Sub-clustering of Cluster #1 in Fig. 1.

One notable observation is that malware variants which belongs to different nominal categories (like virus, backdoor, worm, and Trojan) can exhibit similar behaviour and thus belong to the same cluster. An example can be seen in Fig. 3, which shows malware variants with different nominal categories having similar API sequence patterns—indicating a strong possibility that they are variants of each other, albeit differently categorized by the anti-malware companies.

Fig. 3.
figure 3

Zooming into a subset of Cluster #2 of LCP clustering result, exhibiting different nominal categories of malware in the same cluster.

Fig. 4.
figure 4

Silhouette coefficients by 3 distance functions (OM, LCS, and LCP).

4.2 Clustering Quality

To statistically validate clustering results, silhouette coefficient (\(\bar{s}\)) is used to determine the optimal number of clusters which produces best clustering results. Given the distance matrix, the normalized distance \(m_{ij}\) between two sequences indexed i and j is calculated from the matrix element \(d_{ij}\) as: \(m_{ij} = d_{ij}/d_{\max }\), where \(d_{\max }\) is the maximum distance value in the matrix.

The average intra-cluster distance of the sequence i to other members of its own cluster C (donated as a), the average inter-cluster distance of the sequence i to the members in the nearest cluster \(C'\) (denoted as b), and finally the silhouette coefficient (\(\bar{s}\)) for the sequence i are defined in Eq. 6. The silhouette coefficient value always lies between \(-1\) and 1. The general guideline is that a coefficient greater than 0.5 represents a meaningful clustering result [28]. The higher the coefficient, the better the clustering quality.

$$\begin{aligned} a = \sum _{j \in C \setminus \{i\}} \frac{m_{ij}}{|C|-1}; \;\; b = \min \Big (\sum _{j \in C'} \frac{m_{ij}}{|C'|} \Big );\;\;\bar{s} = \frac{b-a}{\max (a,b)} \end{aligned}$$
(6)

Figure 4 shows the average silhouette coefficients for 3 WPGMA clusterings results of the 1,797 malware samples based on 3 different distance functions of OM, LCS, and LCP. Here, LCP is found to offer the best result with the highest average silhouette coefficient of 0.65 when the number of clusters is 3.

5 Conclusion

In this paper, we have proposed a clustering framework to detect similar malware behaviors based on the comparison of API call sequences. We have investigated the performances of three sequence–sequence comparison functions, namely, OM, LCS, and LCP, as a basis for clustering. To the best of our knowledge, we are the first to apply the LCP function in API sequence-based malware analysis, which offers the most promising results in terms of clustering quality and time complexity (linear-time). This gives us a strong indication that prefix (i.e., initial API subsequence) can serve as a quick and reliable descriptor for overall similar behaviors in malware.