1 Introduction

Ransomware, a class of self-propagating malware, uses encryption to hold victims’ data hostage, and attacks grew 750% in frequency in 2018 [1]. Recently, the majority of ransomware attacks have targeted local governments and small businesses [2]. For example, the 2018 SamSam ransomware hit the city of Atlanta, encrypted at least one third of users’ applications, disrupted the city’s vital services [3], and resulted in $17M of remediation costs to rebuild its computer network [4]. Unlike large multinational businesses, small cities and businesses usually face tight financial constraints and struggle to establish or keep pace with cyber defensive technology and adversary/malware advancements. Consequently, they are less capable of defending against cyber threats. More generally, SOCs’ resource constraints and the shortage of cybersecurity talent [5,6,7] motivate us to develop automated tools for SOCs.

Currently, manual investigation of logs is commonplace in SOCs and extremely tedious. For example, our interaction with SOC operators revealed a 160 man-hour forensic effort to manually analyze the logs of a few CryptoWall 3.0 infected hosts [8], with the goals of (a) distinguishing the adversary/malware actions from user actions in the logs and (b) leveraging the learned information to reconfigure tools for timely detection. This motivates our target use case: given SOC-collected logs from an attacked host (esp. a ransomware infection) and logs from non-attacked hosts, we seek to automate the (currently manual) process of identifying the attack’s actions. In the ransomware case, this should provide a pre-encryption ransomware detector. For testing in a controlled environment, we use “artificial logs”, that is, logs obtained by running malware and ambient (emulated user) activities in a sandbox.

Note that this mirrors classical dynamic analysis, i.e., (a) performing dynamic malware analysis to (b) extract indicators or signatures; hence, dynamic analysis is a second use case. Malware analysis takes considerable time and requires an individual or a team with extensive domain knowledge and reverse engineering expertise. Therefore, malware analysts usually collaborate across industry, academia and government to analyze the ransomware that causes disruptive global attacks (e.g., WannaCry). However, the security community has insufficient resources to manually analyze less destructive attacks such as Defray, nRansom and certain versions of GandCrab, so manual analysis reports of such malware do not provide enough information for early detection [9,10,11,12,13,14,15,16]. Our approach, regardless of the malware’s real-world impact and potential damage, helps automate this tedious manual analysis by accurately extracting the most discriminating features from large amounts of host logs and identifying the malicious behavior induced by the malware.

While our approach holds promise for more general malware and other attacks, we focus on ransomware. Note that upon the first infection identified in an enterprise, the logs from the affected host can be automatically turned into a detector via our tool. The tool applies three machine learning algorithms, (1) Term Frequency-Inverse Document Frequency (TF-IDF), (2) Fisher’s Linear Discriminant Analysis (Fisher’s LDA) and (3) Extra Trees/Extremely Randomized Trees (ET), to (a) automatically identify discriminating features of an attack from system logs (generated by an automatic analysis system, namely, the Cuckoo Sandbox [17]) and (b) detect future attacks in the same log streams. Using Cuckoo together with scripts that run ransomware and emulated user activity provides source data for experimentation with ground truth. We test the tool using infected system logs of seven disruptive ransomware attacks (i.e., WannaCry, DBGer, Cerber, Defray, GandCrab, Locky, and nRansom) and non-attack logs from emulated user activities, and we present experiments varying log quality and quantity to test robustness. These system logs include file, folder, memory, network traffic, process and API call activities.

Contributions of the pattern-extraction and early detection tool are:

  1. analyzing ransomware (esp. initial infection) using Cuckoo Sandbox logs (more generally, ambient collected host logs) and generating features from the host behavior reports;

  2. extracting the sequence of events (features) induced by ransomware given logs from (a few) hosts that are infected and (a potentially large amount of) ambient logs from presumably uninfected hosts;

  3. ranking the most discriminating features (unique patterns) of malware and identifying malicious activity before data is encrypted by the ransomware;

  4. creating graph visualizations of ET models to facilitate malware forensic efforts, and allowing operators to visualize discriminating features and their correlations.

We compare the outputs with ransomware intelligence reports and validate that our tool is robust to variations of the input data. TF-IDF is the best method for identifying discriminating features, and ET is the most time-efficient approach, achieving an average accuracy of 98% in detecting the seven ransomware. This work builds on the preliminary results of our workshop paper [8], which only considered feature extraction, only used TF-IDF, and only tested one ransomware.

2 Background and Related Work

Ransomware. In contrast to the 2017 WannaCry ransomware that infected 300K machines across the globe, the majority of ransomware attacks in 2018 and 2019 targeted small businesses. These crypto-ransomware attacks usually use Windows API function calls to read, encrypt and delete files. A ransom message is displayed on the screen after the ransomware infects the host. This paper selects and analyzes seven recent, disruptive ransomware attacks.

  1. WannaCry (2017), a ransomware with historic world-wide effect, was launched on May 12, 2017 [18]. The WannaCry dropper is a self-contained program consisting of three components: an application that encrypts and decrypts data, an encryption key file, and a copy of Tor. WannaCry exploits vulnerabilities in Windows Server Message Block (SMB) and propagates malicious code to infect other vulnerable machines on connected networks.

  2. DBGer (2018), a new variant of the Satan ransomware [19], scans the victim’s local network for vulnerable computers with outdated SMB services. DBGer incorporates the open-source password-dumping utility Mimikatz to harvest and store credentials from vulnerable computers [20]. The dropped Satan file is then executed to encrypt the infected computers’ files with the AES encryption algorithm. A text file _How_to_decrypt_files.txt containing a note of demands from the attackers is displayed on the victim’s screen.

  3. Defray (2017) is a ransomware attack targeting the healthcare, education, manufacturing and technology industries [16]. Defray propagates via phishing emails with an attached Word document embedding an OLE package object. Once the victim executes the OLE file, the Defray payload is dropped in the %TMP% folder and disguises itself as a legitimate executable (e.g., taskmgr.exe or explorer.exe). Defray encrypts the file system but does not change file names or extensions. Finally, it deletes the volume shadow copies of the encrypted files [15]. Defray’s developers encourage victims to contact them and negotiate a payment to get the encrypted files back [14].

  4. Locky (2016, 2017) has more than 15 variants. It first appeared in February 2016, infecting the Hollywood Presbyterian Medical Center in Los Angeles, California. The attackers send millions of phishing emails containing attachments of malicious code that is activated via Microsoft Word macros [11]. Locky encrypts data using the RSA-2048 and AES-128 ciphers so that only the developers can decrypt the data. In this research, we analyze the malicious behavior of a new variant of Locky ransomware called Asasin, which encrypts the files and renames them with a .asasin extension.

  5. Cerber (2016–2018) infected 150K Windows computers in July 2016 alone. Several Cerber variants that appeared in the following two years gained widespread distribution globally. Once the Cerber ransomware is deployed on the victim’s computer, it drops and runs an executable copy with a random name from a hidden folder created in %APPDATA%. The ransomware also creates a link to the malware, changes two Windows Registry keys, and encrypts files and databases offline, appending the .cerber extension [21, 22].

  6. GandCrab (2018, 2019), a Ransomware-as-a-Service (RaaS) attack, spread rapidly across the globe starting in January 2018. The GandCrab RaaS online portal was finally shut down in June 2019. During those 15 months, GandCrab’s creators regularly updated the code and sold it, enabling attackers who lack the knowledge to write their own ransomware [23]. Attackers then distribute GandCrab through compromised websites built with WordPress. Newer versions of GandCrab use the Salsa20 stream cipher to encrypt files offline instead of applying RSA-2048 encryption, which requires connecting to the C2 server [24]. GandCrab scans logical drives from A: to Z: and encrypts files by appending a random Salsa20 key and a random 8-byte initialization vector (IV) to the contents of each file. The private key is encrypted in the registry using another Salsa20 key, and the IV is encrypted with an RSA public key embedded in the malware. This new encryption method makes GandCrab a very strong ransomware, and the encrypted files can be decrypted only by GandCrab’s creators [25].

  7. nRansom (2017) blocks access to the infected computer rather than encrypting the victim’s data [13]. It demands ten nude photos of the victim instead of digital currency to unlock the computer. As recovery from nRansom is relatively easy, it is not sophisticated malware but rather a “test” or a “joke”.

Ransomware Pattern Extraction and Detection Works. Homayoun et al. [26] apply sequential pattern mining to find maximal sequential patterns (MSPs) of the malicious activities of four ransomware attacks. Unlike generating behavioral features directly from host logs, their approach summarizes activity using types of MSPs. Using four machine learning classifiers, the team found that atomic Registry MSPs are the most important sequences of events for detecting ransomware attacks with 99% accuracy.

Verma et al. [27] embed host logs into a semantically meaningful metric space. The representation is used to build behavioral signatures of ransomware from host logs exhibiting pre-encryption detection, among other interesting use cases.

Morato et al. introduce REDFISH [28], a ransomware detection algorithm that identifies ransomware actions when it tries to encrypt shared files. REDFISH is based on the analysis of passively monitored SMB traffic and uses three traffic-statistic parameters to detect malicious activity. The authors use 19 different ransomware families to test REDFISH, which can detect malicious activity in less than 20 seconds. REDFISH achieves a high detection rate but cannot detect ransomware before it starts to encrypt data. Our approach, which discovers ransomware’s pre-encryption footprint, promises more accurate and timely detection.

The Related Work section of our preliminary work [8] covers works published prior to those above. As the more general topic of dynamic analysis is large and diverse, a comprehensive survey is out of scope, but many exist, e.g., [29].

Fig. 1. Flowchart of research methodology

3 Methodology

The proposed approach requires a set of normal (presumably uninfected) system logs and at least one log stream containing ransomware behavior. In this study, the seven ransomware executables introduced in Sect. 2 are deployed inside a realistic but isolated environment with a sandbox tool, Cuckoo [17], for harvesting reproducible and shareable host logs. The Cuckoo host logs are dynamic analysis reports outlining behavior (i.e., API calls, files, registry keys, mutexes), network traffic and dropped files. Meanwhile, Cuckoo also captures logs from scripted, emulated normal user activity such as reading and writing executables, deleting files, opening websites, watching YouTube videos, sending and receiving emails, searching for flight tickets, and posting and deleting tweets on Twitter (see [8]). The normal-user and ransomware events/behavior in the raw host logs produced by Cuckoo are then converted to features, and the three machine learning techniques are used to automatically obtain the most discriminating features from the normal and ransomware-containing logs. Afterwards, we discard the features that have little or no influence and update the feature vector to reduce the search space of the ET decision tree models. Decision tree graphs are created to present the most discriminating features of the ransomware attacks. See the flowchart in Fig. 1.

3.1 Feature Generation

To build features, we use only the enhanced category and part of the behavior category of the Cuckoo-captured logging output. The details of the feature building can be found in our previous work [8]. As malware often uses random names to create files, modules and folders, in this study we rewrite the paths of specific files to emphasize their names only. For example, the full path to rsaenh.dll is converted to the string “c:..rsaenh.dll”. Here, “..” is used as a wild-card to avoid generating duplicate features that represent similar host behavior.
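
To make the transformation concrete, below is a minimal sketch of such path normalization; the regular expression and the example path are our own illustrative assumptions rather than the exact rules of [8].

```python
import re

def normalize_path(path: str) -> str:
    """Collapse a full Windows path into a drive + file-name feature string,
    replacing the (often randomly named) intermediate folders with a ".."
    wild-card so that similar behavior maps to a single feature."""
    path = path.lower().replace("/", "\\")
    match = re.match(r"^([a-z]):\\(?:.*\\)?([^\\]+)$", path)
    if not match:
        return path  # leave non-path strings untouched
    drive, filename = match.groups()
    return f"{drive}:..{filename}"

# Hypothetical example: any path ending in rsaenh.dll becomes "c:..rsaenh.dll".
print(normalize_path(r"C:\Windows\System32\rsaenh.dll"))  # c:..rsaenh.dll
```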

3.2 Discriminating Feature Extraction with Machine Learning

TF-IDF, Fisher’s LDA and ET are algorithms used in this research to automatically extract the most discriminating features of ransomware from host logs.

TF-IDF was defined to identify the relative importance of a word in a particular document within a collection of documents [30]. Our TF-IDF application follows our previous work for accurate comparison. Given a corpus D of N documents, let \(f_{t,d}\) denote the frequency of term t in document d. The TF-IDF weight is the product of the Term Frequency, \(\text{ tf }(t,d)=f_{t,d}/\sum _{t'\in d} f_{t',d}\) (giving the likelihood of t in d), and the Inverse Document Frequency, \(\text{ idf }(t,D) = \log [ {N}/(1+|\{d\in D: t\in d\}|) ]\) (giving the Shannon information of a document containing t). Intuitively, given a document, only the terms with uncommonly high frequency in that document receive high scores. To apply TF-IDF, we use the log streams from infected hosts as one set of documents and a set of normal log streams as the other; hence, highly ranked features occur often in (and are guaranteed to occur at least once in) the “infected” document, but infrequently anywhere else [8].
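
A minimal sketch of this weighting, implementing the formulas above and assuming (as an illustration) that each log stream is represented as a bag of feature strings:

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF weights with tf(t,d) = f(t,d) / sum_t' f(t',d)
    and idf(t,D) = log(N / (1 + |{d in D : t in d}|)), as defined above."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))              # document frequency of each term
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({t: (c / total) * math.log(N / (1 + df[t]))
                        for t, c in counts.items()})
    return weights

# Hypothetical usage: docs[0] holds the infected host's feature strings,
# docs[1:] the normal streams; ranking docs[0]'s terms by weight surfaces
# candidate discriminating features.
docs = [["c:..rsaenh.dll", "reg_write_x"],
        ["open_website", "reg_write_x"],
        ["open_website"]]
print(sorted(tfidf(docs)[0].items(), key=lambda kv: -kv[1]))
```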

Fisher’s LDA is a supervised classification algorithm that operates by projecting the input feature vectors onto a line that (roughly speaking) maximizes the separation between the two classes [31]. For our application we consider a binary classification where one class (\(C_1\)) comprises the feature vectors \(\{x_i\}_i \subseteq \mathbb {R}^m\) representing host logs that included ransomware, and the second class (\(C_2\)) comprises the vectors of ambient logs. We use this classifier to identify the discriminating features between the classes. Consider the set \(\{v^tx_i : x_i \in C_1 \cup C_2\}\subset \mathbb {R}\), which is the projection of all feature vectors onto a line in \(\mathbb {R}^m\) defined by the unit vector v. Fisher’s LDA identifies the unit vector v that maximizes \(S(v):= [v^t (\mu _1 -\mu _2)]^2 / [v^t(\Sigma _1 + \Sigma _2)v]\), with \(\mu _j, \Sigma _j\) the mean and covariance of \(C_j, j = 1,2\), respectively. S(v) is the squared difference of the projected classes’ means divided by the sum of the projected classes’ variances. It is an exercise in linear algebra to see that the optimal \(v \propto (\Sigma _1 + \Sigma _2)^{-1}(\mu _1 - \mu _2).\) Geometrically, v can be thought of as a unit vector pointing from \(C_1\) to \(C_2\); hence, ranking the components of v by absolute value sorts the features that most discriminate ransomware from normal activity.
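
For concreteness, a short numpy sketch of this ranking; the small ridge added to the pooled covariance is our own addition for numerical stability, and the input matrices are placeholders:

```python
import numpy as np

def fisher_direction(X1, X2, eps=1e-6):
    """Fisher discriminant direction v ~ inv(S1 + S2) @ (mu1 - mu2) for two
    classes given as row-wise feature matrices X1 and X2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    Sw = Sw + eps * np.eye(X1.shape[1])          # small ridge for invertibility
    v = np.linalg.solve(Sw, mu1 - mu2)
    return v / np.linalg.norm(v)

# Features with the largest |v_j| discriminate the two classes the most.
rng = np.random.default_rng(0)
X_ransom, X_normal = rng.random((5, 4)), rng.random((30, 4))   # placeholder data
print(np.argsort(-np.abs(fisher_direction(X_ransom, X_normal))))
```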

Extremely Randomized Trees (ET) is a tree-based ensemble algorithm for supervised classification and regression. “It consists of randomizing strongly both attribute and cut point choice while splitting the tree node” [32]. In the extreme case, the algorithm provides “totally randomized trees whose structures are independent of the output values of the learning sample” [32, 33]. The randomization introduces increased bias and variance of individual trees. However, the effect on variance can be ignored when the results are averaged over a large ensemble of trees. This approach is tolerant with respect to over-smoothed (biased) class probability estimates [32]. See the cited works for details.
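
A minimal sketch of fitting such an ensemble with scikit-learn’s ExtraTreesClassifier (the implementation used in Sect. 4) and ranking features by importance; the placeholder data and parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Placeholder inputs: rows are log streams, columns are binary features
# (1 if the behavior occurs in the stream); real inputs come from Sect. 3.1.
rng = np.random.default_rng(0)
X = np.vstack([rng.integers(0, 2, (50, 20)),    # 50 normal streams
               rng.integers(0, 2, (2, 20))])    # 2 ransomware streams
y = np.array([0] * 50 + [1] * 2)

clf = ExtraTreesClassifier(
    n_estimators=100,                # illustrative value
    class_weight={0: 1, 1: 10},      # bias toward the rare malicious class (cf. Sect. 4)
    random_state=0,
).fit(X, y)

# Impurity-based importances: top-ranked columns correspond to the most
# discriminating behaviors; features never chosen as split nodes score zero.
print(np.argsort(-clf.feature_importances_))
```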

4 Experimental Results

Experiment One: Extracting Discriminating Features from Host Logs. This experiment applies the machine learning approaches to extract the most discriminating features/behaviors of each ransomware attack. In addition to obtaining a Cuckoo analysis report (raw behavior log) for each ransomware sample, Python scripts imitating various users’ normal activities (such as reading, writing and deleting files, opening websites, etc.) are submitted to the Cuckoo sandbox to generate a large volume of normal reports.

Table 1. The most discriminating features of the seven ransomware attacks

Table 1 illustrates the most discriminating features of the seven ransomware attacks. The first column of the table (#) lists the names of the seven ransomware. The second column (Pattern) presents the pre-encryption patterns (activities) of each ransomware attack obtained from detailed ransomware technical (static) analyses produced by cybersecurity companies (e.g., FireEye [34]), security help websites (e.g., Bleeping Computer [35, 36]) and malware research teams (e.g., the Cylance Threat Research team [16]). The third column (Feature) presents the features extracted from the host logs using the proposed approaches that match the unique patterns of the ransomware attacks. The last column (Rank) lists the TF-IDF, Fisher’s LDA and ET rankings of the features that represent the unique patterns of the seven ransomware attacks. The features with the largest TF-IDF and Fisher’s LDA scores, or the non-leaf nodes (features) of the Extremely Randomized Trees closest to the root, are the top-ranked discriminating features. For the ET algorithm, the features at the top of the tree contribute more to correctly classifying a larger portion of the input logs. E.g., a feature with rank = 1 is one of the most indicative features of the malware according to that algorithm. Ties are possible, as scores may be the same for multiple features. We use the rankings of these features to evaluate the effectiveness of the three proposed machine learning methods: methods that rank the selected features higher are more effective than those that rank the same features lower.
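
As a small illustration of this ranking convention (a higher score yields a lower, i.e., better, rank, and tied scores share a rank), with placeholder score values:

```python
from scipy.stats import rankdata

scores = {"feat_a": 0.42, "feat_b": 0.42, "feat_c": 0.10, "feat_d": 0.03}
# Negate so the largest score receives rank 1; method="min" lets ties share a rank.
ranks = rankdata([-s for s in scores.values()], method="min").astype(int)
print(dict(zip(scores, ranks)))   # {'feat_a': 1, 'feat_b': 1, 'feat_c': 3, 'feat_d': 4}
```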

We set a large class_weight parameter for the target class in the ExtraTreesClassifier of Python’s Scikit-Learn library to bias the ET classifier toward learning the patterns of the malicious logs more meticulously. As a result, some features representing the ransomware patterns are not selected as nodes of the tree. In this scenario, we use “NA” to denote the rankings of features that are not nodes in the tree. Details are elaborated per ransomware:

  1. WannaCry: The six patterns of WannaCry before the attack encrypts data are presented in Table 1. All of these patterns match WannaCry-generated features in the host logs. A total of 1,207 unique features were extracted from host logs containing both normal and abnormal behavior, while only a small portion result from WannaCry actions. The experimental results indicate that TF-IDF is better than the other two methods at identifying WannaCry’s behaviors. The rankings generated by the ET classifier are slightly lower than TF-IDF’s. However, ET is more time-efficient at extracting the most discriminating features from a large volume of host logs, requiring only 215 features (nodes) to make decisions (i.e., WannaCry or Normal). Therefore, the results suggest using TF-IDF to analyze the few infected hosts’ logs in an attempt to produce shareable threat intelligence reports, and using the ET algorithm to obtain pre-encryption detection capabilities. This experiment also illustrates that the top-ranked features generated by Fisher’s LDA are quite different from those of the other two techniques: most of its top-ranked features are normal activities, and features representing WannaCry’s patterns are ranked as low as #200. Additionally, we notice that the loading and reading events of the rsaenh.dll module are ranked highly (i.e., #2 and #4 for TF-IDF and #3 and #8 for ET). The module implements the Microsoft enhanced cryptographic service provider used by WannaCry to encrypt the victim’s data with 128-bit RSA encryption. These two top-ranked features are not listed in our table, as they do not distinguish WannaCry from other crypto-ransomware attacks.

  2. DBGer: The three unique patterns of DBGer ransomware reported by [37] are presented in Table 1. dbger.exe, the mother file of DBGer, first creates a new folder, drops the EternalBlue and Mimikatz executables in it, and then saves satan.exe to the C drive. A file named KSession is dropped to store the host ID. TF-IDF and Fisher’s LDA rank 1,104 features generated from the normal and DBGer Cuckoo reports. The ET classifier builds the decision tree using 216 of the 1,104 features. The three DBGer features are ranked highly. TF-IDF yields the highest rankings of the three features, better than the other two methods, while ET is more time-efficient. However, many features ranked above these three correspond to normal activity; e.g., the dynamic link library (DLL) files kernel32.dll and advapi.dll are at the top of all three rankings, but are not discriminating features for DBGer.

  3. Defray: The three unique patterns of Defray are loading the ole32.dll file, dropping and executing the ransomware executable file explorer.exe, and executing a shell command. The three machine learning algorithms rank the first feature, “loading the ole32.dll file”, #9 among the total of 1,243 features. As Defray’s executable file is disguised as the legitimate Windows Explorer executable, all three methods struggle to distinguish it from normal activities. The second feature therefore is not selected to build the ET model, and its TF-IDF and Fisher’s LDA weights are much lower than the first feature’s. The three machine learning approaches rank another three features (shown in Table 2) highest among the 1,243 features. These features represent unique malicious activities performed by Defray and are thus discriminating features that distinguish Defray from other ransomware. However, none of these three patterns are discussed in Defray’s manual analysis reports [14,15,16].

  4. Locky: We execute Asasin Locky, a 2017 variant of the Locky ransomware, in the Cuckoo sandbox, and collect and analyze its behavior using our tool. The static analysis reports [9, 11] indicate that after being deployed, Locky’s executable file disappears and its dropped copy svchost.exe is executed from the %TEMP% folder. However, the features our tool generates from the behavior logs show that Asasin Locky does not drop the executable file. Instead, the attack modifies the workstation services launched by the svchost.exe process. As a member of the CryptoWall family, Asasin Locky also modifies a file that communicates with the Local Security Authority subsystem [38]. The attack then reads the network provider name and the path to the network provider DLL file from the registry by loading the network provider ntlanman.dll. The registry is also queried by Asasin Locky to obtain the name of the Security Identifier. TF-IDF and ET provide the same, higher rankings for these five features out of a total of 1,047 normal and ransomware features. Both methods rank rsaenh.dll as the top feature; however, this feature is not a unique pattern of Asasin Locky.

  5. Cerber: This ransomware copies itself as cerber.exe to the hidden %APPDATA% folder, creates a directory with a random name, and drops two .tmp files [10]. Cerber also escalates its privilege to the admin level and reads profiles from the users’ profile image paths. Afterwards, Cerber finds the image path of rsaenh.dll and reads and loads the DLL file to encrypt data. Cerber obtains the Machine GUID (globally unique identifier) and uses its fourth part as the encrypted files’ extension; the Cerber sample tested uses the extension 93ff. The three methods rank a total of 1,137 features. ET selects 145 features to compose the decision tree. TF-IDF and ET provide similar rankings of the discriminating features, both higher than Fisher’s LDA’s.

  6. GandCrab: This experiment uses GandCrab V2.3.1, a variant that scans the victim machine and collects the user name, domain name, computer name, session manager name and processor type [12]. Execution is terminated if the ransomware finds that the system language is Russian or that the victim machine has specific anti-virus (AV) software installed. Otherwise, it copies the executable file into %APPDATA%/Microsoft and adds an entry with the copied executable file’s path to the RunOnce key as a one-time persistence mechanism. GandCrab then decrypts the ransom notes and generates RSA keys for encryption. After encrypting data, the malware uses Windows’ NSLOOKUP tool to (1) find the IP address of GandCrab’s C2 (command and control) server and (2) communicate with the C2 server (i.e., sending information collected from the victim’s machine to the C2 server and/or receiving commands from it). Table 1 presents two unique pre-encryption patterns of GandCrab V2.3.1. TF-IDF and ET rank them highly among 1,017 features; the rankings of these features are much lower with Fisher’s LDA.

  7. nRansom: This attack first creates a subfolder in %TEMP% with a random name ending in .tmp; in our experiment, the subfolder is named 1.tmp. nRansom drops an executable file (i.e., nransom.exe) and two Windows Media Player control library files (i.e., Interop.WMPLib.dll and AxInterop.WMPLib.dll) in 1.tmp. An audio file your-mom-gay.mp3 is dropped in 1.tmp Tools. Then nransom.exe is executed through the command prompt cmd.exe. After locking the victim’s computer screen, nRansom plays a looped song from the dropped mp3 file, and deletes the subfolders and dropped files. TF-IDF and ET both rank the five discriminating features of nRansom highly among 1,046 features; 55 features are used to compose the ET model.

Table 2. Static analysis missed unique patterns and their behavioral features
Table 3. WannaCry discriminating feature ranking with varying normal data

Ransomware Unique Patterns Missed by Manual Analysis. As discussed above, besides the patterns obtained from Defray’s threat intelligence reports, the three features shown in Table 2 are also unique behaviors that distinguish Defray attacks. From the dynamic analysis provided by our methodology, we also found that many ransomware attacks share similar patterns. For example, Defray, Locky and Cerber all load the ole32.dll file; however, neither Locky’s nor Cerber’s static analysis mentions this pattern. Similarly, the manual analysis of GandCrab does not discuss that the malware sample imports CryptoAPI from advapi32.dll, which is also a discriminating feature of WannaCry attacks. Thus, our tool provides automated malware behavior analysis that is more efficient, does not rely on security experts, and is of better quality.

Experiment Two: Ransomware Feature Rankings with Varying Normal Activities. This experiment aims to validate that the rankings of the seven ransomware attacks’ discriminating features are not influenced by varying the number of normal logs. To test this hypothesis, we calculate the TF-IDF, Fisher’s LDA and ET weights of the ransomware features in the following three scenarios (a code sketch of the comparison follows the list).

  • Case 1 (C1): Using Experiment One’s normal logs as the baseline.

  • Case 2 (C2): Adding 30% more new normal host logs into training data.

  • Case 3 (C3): Adding 60% more new normal host logs into training data.
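
The sketch below illustrates this comparison for the ET ranking; the feature matrices are placeholders standing in for the real baseline and the added normal-only streams:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def top10_et(X, y):
    """Top-10 feature indices by ET importance (one of the three rankings compared)."""
    clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
    return list(np.argsort(-clf.feature_importances_)[:10])

# Placeholder data: C1 is the baseline from Experiment One; C2/C3 append
# roughly 30%/60% more normal-only streams (all labeled 0).
rng = np.random.default_rng(1)
X_c1 = rng.integers(0, 2, (100, 30)); y_c1 = np.r_[np.ones(4), np.zeros(96)]
extra30 = rng.integers(0, 2, (29, 30))
extra60 = rng.integers(0, 2, (58, 30))

c1 = top10_et(X_c1, y_c1)
c2 = top10_et(np.vstack([X_c1, extra30]), np.r_[y_c1, np.zeros(len(extra30))])
c3 = top10_et(np.vstack([X_c1, extra60]), np.r_[y_c1, np.zeros(len(extra60))])
print(len(set(c1) & set(c2) & set(c3)))   # 10 would mean identical top-10 sets
```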

Fig. 2. Decision path based on the training logs, showing how the most discriminating features are correlated in the decision-making process.

Table 3 presents the top ten features of WannaCry calculated by the three machine learning methods as the ambient logging data varies. The experimental results show that the ET method is robust: it provides the same rankings of the top ten features under all three tested scenarios. TF-IDF is less robust than ET, while Fisher’s LDA provides completely different rankings of the top ten features in the three scenarios. Similar results were found when analyzing the top-ranked features of the other six ransomware attacks. Therefore, the ET algorithm is the most robust to training data with varying quality and quantity of normal activity.

Table 4. ET early detection results

Experiment Three: Ransomware Early Detection. The ET decision tree classifier is applied to detect the seven ransomware attacks before encryption amid a large majority of non-malicious activity. Table 4 presents the detection rates for the seven ransomware attacks. Note that while recall varies, meaning the method produces false negatives, precision is always perfect, meaning there are no false positives. In terms of overall performance metrics, the GandCrab detection model performs the best and the DBGer model performs the worst. We also create graphs of each decision tree to better interpret and visualize the detection results. Using the WannaCry attack as an example, Fig. 2 displays the first three levels of the decision tree. The brown non-leaf nodes (rectangular boxes) represent features of normal activity and the blue non-leaf nodes represent features induced by WannaCry. By retrieving the blue nodes at the top of the decision tree, we can identify WannaCry’s discriminating features. The correlation coefficients of these features are provided in the non-leaf boxes. The graphs facilitate malware forensic analysis and allow operators to visualize disruptive activity and determine the damage induced by the malware in order to propose an optimal protection and response plan.
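
A sketch of how such a tree graph and the detection metrics can be produced with scikit-learn; the data, feature names and output file name below are placeholders:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.tree import export_graphviz

# Placeholder data and feature names; real inputs are the Sect. 3 feature vectors.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, (60, 12))
y = np.r_[np.zeros(57), np.ones(3)]
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
pred = clf.predict(X)                       # in practice, evaluate on held-out streams
print(precision_score(y, pred), recall_score(y, pred))

# Export one tree of the ensemble (its top levels correspond to the most
# discriminating features) as a Graphviz .dot file for visualization.
export_graphviz(clf.estimators_[0], out_file="ransomware_tree.dot",
                feature_names=feature_names, class_names=["Normal", "Ransomware"],
                filled=True, max_depth=3)
```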

5 Conclusion

We develop an automated ransomware pattern-extraction and early detection tool that extracts the sequences of events induced by seven ransomware attacks, identifies the most discriminating features using three machine learning methods, and creates graphs to facilitate forensic efforts by visualizing features and their correlations. The experimental results show that TF-IDF feature ranking yields the most accurate identification of the ransomware-discriminating features, while the ET method is the most time-efficient and the most robust to variation of the inputs. Notably, this method automatically promotes discriminating features that manual malware analysis reports failed to identify.

As the target application is analyzing real host logs collected by SOCs, future research will test our tool on real-world host-based data captured in enterprise networks to determine the conditions for success. Moreover, large enterprises generate large volumes of host data, and the offline machine learning techniques used in this paper (creating features from host logs, determining malware-discriminating features and detecting attacks) may not scale. Future research using online machine learning techniques (e.g., incremental decision trees) and deep learning methods (e.g., LSTMs) could enhance the tool.