1 Introduction

Current mainstream antivirus and malicious behavior detection methods decide whether behavior is malicious by matching it against virus signature databases. This approach has inherent limitations: the signature databases are derived from manual analysis and extraction, so newly developed malicious behaviors can evade them. Existing solutions rely on human experts to define features and miss many threats (i.e., they incur a high false negative rate). Moreover, because network technology evolves continuously, malicious behavior variants multiply rapidly; feature libraries must therefore be updated constantly, resulting in low detection accuracy and high labor costs. Furthermore, these approaches can detect only malicious behavior that already exists in the feature database and cannot detect new variants. In this study, we describe an approach that uses deep learning-based malware detection to relieve human experts from the tedious and subjective task of manually defining features, with the specific aim of lowering the false negative rate (FNR) and the false positive rate (FPR). By mining relationships in the data, the error rate introduced by human experts is reduced, and the accuracy of the detection models is improved.

The behavior-based malicious behavior detection methods proposed by researchers can be divided mainly into static analysis detection and dynamic analysis detection. Early malware detection efforts employed static analyses [10, 16], and static analysis has been the main technique used to acquire information about software behavior from code; however, these approaches cannot handle files that adopt techniques such as packing or reverse decompression [15]. Dynamic behavior analysis captures and analyzes a program's behavior while the software is actually running. This approach can effectively address problems that static detection cannot solve, and many experts in the field currently use dynamic behavior techniques to detect malicious behavior [3, 22]. Classification techniques are generally used to classify unknown malware into known types [25]. By treating extracted malicious code behavior as a detection feature, this method avoids problems caused by code obfuscation because it focuses on the actual behavior of the malicious code. However, such behavior-based features are still limited to grammatical features and are easily confused by equivalent behavior substitution. For example, the evasion technique proposed by Sekar et al. [23], which injects garbage behavior so that an attack simulates a normal behavioral sequence, can bypass detection systems. Other researchers have used various APIs and other dynamic fields to detect certain types of malware [11] or have used API function calls, API function parameters, and collections of paired features to compute detection sets based on the concept of information entropy. Identifying malicious behavior by distinguishing the difference in information gain between benign and malicious behavior was proposed in [27]. Although the above methods can detect most malicious behaviors, some of them, or their analysis processes, are extremely complex; others rely on a specific feature library, or their algorithms consider only high-probability events while ignoring rare hidden ones, making them impractical in real-world situations.

The rapid development of neural networks and deep learning has led many researchers to apply these models to malware detection because they are adept at recognizing complex and abstract patterns in large numbers of malware samples. There is always some probability of detecting malicious behavior, even in instances with frequent mutations [21, 28, 34, 35]; in this respect, deep learning offers clear advantages. Nevertheless, deep learning itself is vulnerable to so-called "adversarial samples" [2, 26], meaning that these systems can be deceived by carefully crafted manipulation [8]. Recent research has demonstrated that a malware author can leverage feature amplitude inequilibrium to bypass malware detectors powered by deep neural networks [1, 9]. To address this problem, this paper monitors API call sequences at program runtime and proposes a malware depth detection method based on behavior chains. Current operating systems provide a large number of APIs, and essentially all programs must call the APIs that correspond to specific tasks. In this paper, we analyze API call sequences and study how to perform feature extraction and implement detection methods for malicious behavior. First, a method for describing behavior based on the behavioral associations in a program's API call sequences is established to construct the corresponding behavior chains. Then, the behavior chains are used to train a long short-term memory (LSTM) model. Finally, the trained model is used to detect malicious behavior. The presented experiments demonstrate that this method improves detection ability.

The remainder of this paper is organized as follows. In Section 2, we discuss the relevant background information concerning malicious behavior detection and the LSTM model. Section 3 describes the proposed work and the methodology in detail. Section 4 presents the experiments and an analysis of the results, and Section 5 concludes the paper and proposes directions for future work.

2 Background

The main contribution of this paper is a method for constructing behavior chains from program behavior and a detection method that uses them. To construct a behavior chain, the first step is to analyze the malicious behavior in a running process. Based on this analysis, the behaviors and their descriptive characteristics are extracted, and a corresponding behavior chain is constructed that can be used for malicious behavior detection with the LSTM model.

2.1 Literature survey

Behavior-based malicious behavior detection can generally be divided into two types of approaches: static analysis detection and dynamic analysis detection [31]. Static analysis involves disassembling malicious code using disassembly tools such as IDA Pro or W32Dasm. This approach does not require executing the program; instead, malicious behavior information is obtained solely through code analysis. For example, Wang [32] proposed a method for comparing code in a malicious behavior file with related data: first, the approach determines the code block; then, it compares the data block. However, this method is only applicable to malicious code that has not undergone large changes. The authors of [7] proposed a system for detecting malicious Android applications by statically analyzing application behavior. This system extracts static features from secure applications to detect malicious behavior; however, it cannot protect devices from transient attacks or modified malware. Vida Ghanaei [12] presented a static analysis of the basic block frequencies of malware samples to classify malware families; however, this study used only static analysis and did not consider the actual operation of the malware. Dullien [4] analyzed the execution semantics of malicious programs based on their control flow. This method improved the accuracy of malicious behavior judgments but could not deal with obfuscated code because it compared only basic program blocks. Because static analysis relies on disassembly techniques, some malicious code protects itself through techniques such as compression, encryption [29], and so on. Moreover, because statically analyzed code does not accurately represent the real code that implements the actual functionality, judging the true behavior is extremely difficult. Therefore, dynamic analysis methods have emerged.

Dynamic analysis methods generally monitor malicious behavior through system monitoring or debugging tools. The analysis is performed while the program is running to judge whether its actions represent malicious behavior. This method circumvents interference from techniques such as packing or obfuscating malicious code; therefore, it is better suited to environments in which malicious behaviors occur, and most recent studies have focused on dynamic behavior analysis. For example, Wang Rui [33] proposed a semantics-based approach to extract malware behavioral signatures and perform detection. This approach extracts critical malware behaviors and the dependencies among these behaviors and then acquires anti-interference malware behavior signatures using an anti-obfuscation engine to identify semantically irrelevant and semantically equivalent behaviors, which improves the ability to recognize malicious code. Compared with pattern-based approaches, the code similarity-based approach is advantageous in that a single code instance can detect the same malicious behavior in the target program. However, it can detect only malicious behavior implemented by identical or almost identical code clones [24]. To achieve more effective malicious behavior detection, experts need to define features that automatically select the correct code-similarity algorithm for each kind of malicious behavior [19]. Zhen Li [20] studied the use of deep learning-based vulnerability detection to relieve human experts from the tedious and subjective task of manually defining features. Yujie Fan [6] proposed an effective sequence-mining algorithm to discover malicious sequence patterns and performed malware detection using an ANN classifier, achieving good results. Yanfang Ye [14] proposed the HinDroid system architecture, which first generates smali code through decompilation and then analyzes the resulting smali code. A complete Android API call list, representing two entity types and four types of relationship characteristics, can be extracted in this manner; the relationships among the extracted API calls can then be analyzed further. A heterogeneous information network (HIN) is used to model the complex relationships, and relationships between applications are found through the meta-path method. This approach uses multikernel learning to build a classification network that makes binary "safe" or "malicious" judgments about applications. In [13], a new dynamic analysis method called component traversal was proposed. This method automatically decompiles a given Android application as much as possible and then analyzes its code to determine whether it is a malicious program.

However, the above studies considered neither the call sequences of the application nor the overall behavior trajectory of the software during operation; thus, the accuracy of their results is not high. Software vulnerabilities are detected in [20], but that method does not determine whether software with vulnerabilities actually exhibits malicious behavior, so some hidden attacks can escape detection. This paper proposes a new method that uses a deep learning model to detect malicious behavior, with the aim of achieving a lower false negative rate and, ultimately, higher malware detection efficiency.

2.2 Long short-term memory (LSTM)

On the one hand, because malicious code mutates quickly and its variants take many forms, many malicious behavior detection tools cannot be updated in time, resulting in huge losses. On the other hand, because many current malicious attacks have good latent characteristics, it is imperative to find a way to detect malicious attacks quickly and accurately.

Traditional neural networks do not consider chronological factors and cannot remember previous content. To address this problem, the recurrent neural network (RNN) was developed. The logical architecture of an RNN is depicted in Figure 1. The hidden state h_t is obtained from the input X_t at time t and from the output h_{t-1} at the previous moment; h_t is used both to calculate the model loss of the current layer and to calculate h_{t+1} in the next step. However, because an RNN suffers from gradient decay, the hidden structure at sequence index position t was improved to avoid the vanishing gradient problem. The result is a special RNN model called an LSTM, which can learn long-term dependency information. An LSTM differs from the typical repeated neural network module A of an RNN. In an RNN, the repeated module A has a very simple structure, such as a single tanh layer:

$$ h_t = \tanh\left(W_h\left[h_{t-1}, X_t\right] + b_h\right). $$
Figure 1. Recurrent neural network structure

In contrast, an LSTM has four neural network layers that interact in a special way, as shown in Figure 2. An LSTM can delete information from or add information to its memory through a specially designed structure called a "gate." A gate selects which data may pass; it consists of a sigmoid neural network layer and a pointwise multiplication operation. The sigmoid layer transforms its input through the sigmoid function and outputs a value between 0 and 1 that describes how much of the input can pass through that part of the network: a "0" indicates that no data are allowed to pass, while a "1" indicates that all data are allowed to pass. The gate structure of an LSTM at each sequence index position t generally includes a forget gate, an input gate and an output gate, and the output of each sigmoid is a value in the range [0, 1].

Figure 2. Neural network LSTM structure

The forget gate decides what information to discard or retain from the memory of the previous moment:

$$ f_t = \sigma\left(W_f\left[h_{t-1}, X_t\right] + b_f\right). $$

The input gate determines the information that should be saved:

$$ i_t = \sigma\left(W_i\left[h_{t-1}, X_t\right] + b_i\right). $$

A tanh layer creates a new candidate value vector:

$$ \tilde{C}_t = \tanh\left(W_c\left[h_{t-1}, X_t\right] + b_c\right), $$

which will be added to the state. The input gate determines how much of the candidate vector is added, and the forget gate determines how much of the previous memory is retained, together constructing the final memory:

$$ C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t. $$

Finally, the output gate determines which part of the memory is ultimately output:

$$ o_t = \sigma\left(W_o\left[h_{t-1}, X_t\right] + b_o\right). $$

Then, the memory passes through a tanh layer, producing a value in the range [−1, 1], which is multiplied by the output gate. Finally, the output is determined by

$$ h_t = o_t \ast \tanh\left(C_t\right). $$
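
To make the above equations concrete, the following is a minimal NumPy sketch of a single LSTM step; it is our own illustration of the gate formulas, not the implementation used later in this paper. The weight matrices act on the concatenation [h_{t-1}, X_t] exactly as written above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, X_t]
    f_t = sigmoid(W_f @ z + b_f)         # forget gate
    i_t = sigmoid(W_i @ z + b_i)         # input gate
    c_tilde = np.tanh(W_c @ z + b_c)     # candidate memory vector
    c_t = f_t * c_prev + i_t * c_tilde   # new cell state
    o_t = sigmoid(W_o @ z + b_o)         # output gate
    h_t = o_t * np.tanh(c_t)             # new hidden state
    return h_t, c_t
```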

All the methods discussed in the literature survey achieved quite good experimental results. However, some used only static analysis, which has inherent limitations, while others adopted both static and dynamic analysis but focused mainly on the dynamic analysis process. Most existing research pertains to analysis and malware detection for Android applications; these methods first decompile the application, then analyze the API calls in the decompiled code to determine whether the application is malicious. In contrast, this paper analyzes and detects malware under a Windows environment. The analysis process adopted here is simpler and more convenient than those described above: we avoid decompilation entirely and instead extract the corresponding API behavior points from the running process of the monitored program, build behavior chains, and generate a dataset from them. Then, by training detection models with the deep learning LSTM model, higher accuracy can be obtained.

3 Construction of the MALDC model

Malicious behavior detection is actually a binary classification problem. We extract behaviors from the collected data and divide them into two groups: malicious behaviors and benign behaviors. We denote the sequence of behaviors that we collect by X = {X1, X2, ⋯, Xn}, where Xi represents one behavior among multiple behavior sets and n represents the number of behavior sequences. Y = {Y1, Y2} represents the behavior categories, where Y1 and Y2 represent malicious behavior and benign behavior, respectively. Therefore, our goal is to find a suitable mapping relationship f(Xi) → Yj, where i ∈ {1, ⋯, n}, j ∈ {1, 2}, and f is the mapping function of the classification model. We collected the sample data, trained an appropriate LSTM model on it, and finally judged whether a sample was malicious based on the trained model's output. Figure 3 shows the processing flow of this paper.

Figure 3. Overview of the system

3.1 Related definition

3.1.1 Behavior point

All programs are executed to achieve a certain goal. Each operational step constitutes a "behavior point" in the process of reaching that goal. From the perspective of the operating system, the behavior points are calls by the program to certain API functions, which can be represented by the triplet B = (R, P, Pro), where R is the return value of the behavior point call, P denotes the input and output of the behavior point, and Pro is an attribute that represents the purpose of the behavior point, which can be a file behavior point, a registry behavior point, a network behavior point, and so on.

The element P = {PI(P1 : V1; P2 : V2); PO(P3 : V3; P4 : V4)} in the triplet contains multiple parameters, where PI represents the inputs of the behavior point and PO represents its outputs; {P1, P2} and {V1, V2} represent the input parameters and their values, respectively, while {P3, P4} and {V3, V4} respectively represent the output parameters and their values.
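
For illustration only, the triplet and its parameter sets could be represented in Python as follows; the class and field names are hypothetical and not part of this paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class BehaviorPoint:
    """Hypothetical representation of the triplet B = (R, P, Pro)."""
    name: str                               # API name, e.g., "CreateFile"
    R: object = None                        # return value of the call
    PI: dict = field(default_factory=dict)  # input parameters {P: V}
    PO: dict = field(default_factory=dict)  # output parameters {P: V}
    Pro: str = ""                           # purpose: "file", "registry", "network", ...
```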

3.1.2 Behavior

During a running process, the behavior points occur in sequence. The time and position of each behavior point can differ, its meaning can differ, and its final goal can differ. Therefore, we call a collection of one or more behavior points a "behavior". A behavior is represented by an N-tuple A = (B1, B2, ⋯, Bn), where B1 indicates the behavior point of the program running at time t, B2 the behavior point at time t + 1, and Bn the behavior point at time t + n. Overall, the API calls from time t to time t + n constitute one behavior.

3.1.3 Description of association behavior

A behavioral relationship is a relationship between behavior points and behaviors and refers primarily to their timing relationship. For example, behavior point B1 indicates the first API call when the program runs, and behavior point B2 indicates the second API call. The parameters and values of B2 are inherited from the previous behavior point B1; that is, the output of B1 forms the input of B2. Therefore, through the output of B1, B2 depends on B1. Consequently, the relationship between the two behavior points can be expressed as B1 → B2.
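
Using the hypothetical BehaviorPoint structure sketched in Section 3.1.1, this output-to-input dependency could be checked as follows; this is an illustrative sketch under those assumptions, not the paper's algorithm.

```python
def depends_on(b1, b2):
    """True if B2 consumes any output value produced by B1 (B1 -> B2)."""
    return any(v in b2.PI.values() for v in b1.PO.values())

def build_edges(points):
    """Collect B_i -> B_j edges over time-ordered behavior points (i < j)."""
    return [(b1.name, b2.name)
            for i, b1 in enumerate(points)
            for b2 in points[i + 1:]
            if depends_on(b1, b2)]
```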

3.2 Construction of association behavior

Before a behavior chain can be constructed, it is first necessary to determine the corresponding behavior point features from the collected data, a process called “behavior feature extraction”. The collected data will include much useless information that constitutes interference. To remove the interference, the data must be preprocessed and the required behavior points extracted. Then, these behavior points are combined into behaviors according to the call sequence. The behavior chain is composed of these behavior combinations.

3.2.1 Feature extraction

In our experiments, the API call format in the monitoring log is relatively standardized: the API names, parameters and parameter values appear regularly and can be expressed as follows:

$$ \mathrm{File}\left\{B_1\left(P_1:V_1; P_2:V_2; P_3:V_3; \cdots\right), B_2\left(P_3:V_3; P_4:V_4; \cdots\right), \cdots\right\} $$

We first need to extract the APIs, i.e., the behavior points, together with their parameters and parameter values, from the collected log files. Because irrelevant interference data appear in every log file, it is difficult to extract only the desired parameters and values; accomplishing this task requires string-processing techniques from text analysis. The raw data hold three different types of entries: API names alone, API names with parameters, and API names with parameters and parameter values. This article mainly addresses the behavior points and considers only the API functions themselves, ignoring their parameters and parameter values. A behavior point may appear multiple times in a log file, but the number of occurrences is not considered; only the order in which the APIs are called during program execution and the behavior points with which each has a sequential relationship matter.

The purpose of behavior extraction is to combine the behavior points obtained through monitoring into behaviors, such as the monitored behavior points B1, B2, and B3, into obtain behaviors, such as A = (B1, B2, B3). When integrating behavior points, only the same types can be combined; the type of a behavior point depends on behavior point classification. Therefore, the extracted behavior points must first be mapped to their corresponding behavior categories . Then, we can address the problem of merging the behavior points in the same category. In this paper, the corresponding behavior points B and A are extracted by Algorithm 1 (shown in Figure 4). Based on this process, we can see that behavior extraction actually involves data extraction. For example, during read-file behavior, the data in the file must be read into memory; therefore, the application will call APIs such as CreateFile, ReadFile, OpenFile, WriteFile, and so on.
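
Algorithm 1 itself appears only in Figure 4; as a rough sketch of the extraction step, assuming log lines of the form "ApiName(params...)" and an illustrative set of monitored APIs, the behavior points could be pulled out in call order like this:

```python
import re

MONITORED = {"CreateFile", "ReadFile", "OpenFile", "WriteFile"}  # example subset
API_CALL = re.compile(r"^(\w+)\s*\(")  # assumed "ApiName(params...)" log format

def extract_behavior_points(log_lines):
    """Return monitored API names in the order they were called,
    ignoring parameters and parameter values."""
    calls = []
    for line in log_lines:
        m = API_CALL.match(line.strip())
        if m and m.group(1) in MONITORED:
            calls.append(m.group(1))
    return calls
```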

Figure 4. Feature extraction

3.2.2 Construction of the behavior chain

Program execution mainly involves calling various APIs to achieve a certain purpose; that is, the relationship between behaviors is represented by the transfer relationships between their behavior points. Therefore, the associations between behaviors can be established through the transmission relationships between behavior points.

Assume that the following sequences are extracted in three log files:

  • File1: {A1(B1, B2), A2(B1, B3, B5), A5(B3, B4, B6), ⋯}

  • File2: {A1(B1, B3, B4), A3(B2, B4, B5), ⋯}

  • File3: {A2(B2, B3), A3(B1, B4, B6), A4(B3, B4, B5), ⋯}

In these three files, based on the feature extraction operations mentioned above, we can extract behavior points and behaviors from the original files to construct a behavior chain that includes the following three behavior chains:

  • {A1(B1, B2) → A2(B1, B3, B5) → A5(B3, B4, B6) → ⋯}

  • {A1(B1, B3, B4) → A3(B2, B4, B5) → ⋯}

  • {A2(B2, B3) → A3(B1, B4, B6) → A4(B3, B4, B5) → ⋯}

Note that the three behavior chains constructed above consider only the behavior point APIs; the parameters and parameter values used when calling the APIs during program execution are ignored. The final construction result is shown in Figure 5. Each behavior chain consists of different behaviors, and each behavior contains different behavior points. A minimal sketch of this construction appears below.
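
The sketch assumes a hypothetical mapping from behavior points to categories; it groups consecutive same-category points into behaviors and links the behaviors in temporal order.

```python
from itertools import groupby

CATEGORY = {"CreateFile": "file", "ReadFile": "file", "WriteFile": "file",
            "RegOpenKey": "registry", "connect": "network"}  # illustrative map

def build_behavior_chain(api_calls):
    """Chain of behaviors: [(category, [behavior points...]), ...]
    in the order the APIs were called."""
    return [(cat, list(points))
            for cat, points in groupby(api_calls,
                                       key=lambda a: CATEGORY.get(a, "other"))]

# e.g., build_behavior_chain(["CreateFile", "ReadFile", "connect"])
# -> [("file", ["CreateFile", "ReadFile"]), ("network", ["connect"])]
```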

Figure 5. Behavior chain construction

Each time a program runs, it performs a series of system API calls until it reaches its desired purpose. Therefore, we construct the corresponding behavior chain by extracting the API function calls, building a chain with temporal characteristics to express the process of intrusive malicious behavior. When a program is known to call a certain API, a certain behavior is triggered; the probability that the next behavior is malicious or benign can then be determined through the behavior chain, which allows interception to be prepared in advance. In this study, we use the dataset collected above, extract the behavior chains constructed from the features, and combine the behavior chains with the deep learning LSTM model to predict whether a program exhibits malicious behavior.

3.3 MALDC model construction based on behavior chains

The previous section described extracting the required feature data from the log files and constructing behavior chains with temporal characteristics. Here, we train models based on these behavior chains and the deep learning LSTM. Because the anomalous behavior of latent unknown attacks is quite subtle, attackers often try to obscure their attack behaviors: a single behavior usually appears normal, but when behaviors are correlated, they may combine into abnormal behavior. Therefore, this paper analyzes abnormal behaviors from the perspective of system API calls. Both normal and malicious programs make API calls, so we establish the API behavior chain and then use the LSTM recurrent neural network model as an effective recognition method. The overall architecture of the algorithm is shown in Figure 6. The processed behavior chains are input into the LSTM individually for detection and recognition, and the hidden states obtained at each moment are aggregated. Average pooling is then used to reduce the dimensions and obtain a converted data representation; finally, the model classifies the converted data using classification algorithms.

Figure 6. MALDC model based on behavior chains

As a simple behavioral example, we introduce the cooccurrence feature of behavior points within behaviors into the LSTM network design and use it as a parameter-learning constraint to optimize recognition performance. The purpose of malicious behavior is often related to a specific set of behavior points whose interactions are closely related. To judge whether a program is a malicious Trojan, behavior points such as "open port", "receive remote host connection", "receive remote host information", "send information to remote host", "start program", and "end process", and behavior series such as "reading files" and "screening", are very important. Different malicious programs feature different combinations of such closely related behavior points, but the order in which they call the corresponding APIs during execution varies. Generally, malicious programs will call APIs such as "CreateFile", "ReadFile", and "WriteFile", forming a set of nodes with discriminative properties. We designate these discriminative combinations of behavior points as cooccurrences.

In the model training phase, we introduce a constraint on the weights between behavior points and neurons in the objective function; thus, the neurons of a given group have larger weight connections to a subset containing certain behavior points and smaller weight connections to other behavior points, reflecting the cooccurrences of behavior points. As shown in Figure 7, an LSTM layer is composed of multiple LSTM neurons, which are divided into M groups. Each neuron in a group has larger connection weights with certain behavior points (the behavior points associated with a certain class or with certain types of malicious behavior constitute a subset of behavior points) but smaller connection weights with the remaining behavior points. Different groups of neurons are sensitive to different behaviors, and the subsets of behavior points that receive the larger connection weights differ from group to group.
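
The paper describes this constraint only at a high level; one possible Keras sketch uses a binary mask marking the in-group (behavior point, neuron) connections and penalizes the out-of-group weights so that they stay small. GroupMaskRegularizer and the mask layout are our own assumptions, not the paper's implementation.

```python
import tensorflow as tf

class GroupMaskRegularizer(tf.keras.regularizers.Regularizer):
    """Penalize connections outside the behavior-point/neuron groups."""
    def __init__(self, mask, strength=1e-3):
        self.mask = tf.constant(mask, dtype=tf.float32)  # 1 = in-group, 0 = out
        self.strength = strength

    def __call__(self, w):
        # Only out-of-group weights contribute to the penalty.
        return self.strength * tf.reduce_sum(tf.square(w * (1.0 - self.mask)))

# Usage sketch (the mask must match the layer's kernel shape):
# tf.keras.layers.Dense(64, kernel_regularizer=GroupMaskRegularizer(mask))
```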

Figure 7. Cooccurrences of behavior points

Because various latent unknown attacks exist, the LSTM model is used for analysis and processing. Using the preprocessed behavior chain data with temporal characteristics as the training set, the LSTM learns a mapping relationship between the input and the output. The MALDC model-training algorithm is shown in Figure 8. In a behavior chain, the calling order of the APIs and the context before and after each API call are highly important factors.

Figure 8. MALDC model-training algorithm

4 Experiments and analysis

This paper selects a certain number of benign and malware samples to generate the sample datasets. The benign software is selected from the Windows system and consists of widely used and well-known software. The malware samples contain viruses, Trojans, worms, etc.; the malicious Windows programs were acquired mainly from https://virusshare.com/. A selection of 578 benign and 950 malware samples is used as the raw dataset in this study, and Table 2 shows the different categories of malware samples and the number and percentage of samples in each category (raw data). Because the dataset provided by the site is relatively small, we expand the dataset for each malicious type in the malware dataset to ensure that all types of malicious features remain intact. We randomly sample 60% of the data for each type from the malicious sample set and combine these same-type samples to expand the dataset. Similarly, all the other malicious types and 40% of the benign sample data are expanded by this method, yielding a total of 54,324 malware and 53,361 benign samples. The experiment was executed in a virtual machine whose configuration is shown in Table 1.

Table 1 VM configuration

The log data were obtained by running each sample program in the virtual machine and monitoring it using the WinAPIOverride64 tool [27, 30]. Each sample was executed until all its processes terminated or a 90-s timeout elapsed. This timeout value was selected because 90 s is generally enough time for most malware programs to execute their immediate payloads. When the monitoring timeout was reached, the monitoring tool ended the process, generated a corresponding monitoring log, and saved it. The system was then reverted to its initial state, and the next program was run and monitored in a clean environment until all the data were collected (Table 2).

Table 2 The different types of malware samples (raw data)

The Windows operating system exposes a large number of API calls in the form of dynamic link libraries (DLLs). If all these libraries were monitored, the amount of data collected would be very large; moreover, some APIs are irrelevant to this research and would simply add noise to the analysis. By referring to the DLLs selected by researchers worldwide [5, 17, 27], this paper identified six important dynamic link libraries to monitor: advapi32.dll, rasapi32.dll, kernel32.dll, ntdll.dll, shell32.dll, and user32.dll. Each DLL's functions are summarized in Table 3. The APIs in these libraries can create, delete or open registry keys and set or save values to them; create a process, directory or file; search, delete or move a file; create a network connection; etc.

Table 3 A summary of the selected DLLs

The API calls of a malicious program, as recorded in the log of the monitoring tool WinAPIOverride64, are shown in Figure 9. The Call column specifies the API call, including the name, parameters, and parameter values. The other columns specify the process ID, thread ID, address, registration value, and so on.

Figure 9. WinAPIOverride64 monitoring data

The collected log files are preprocessed by Algorithm 1, which extracts the features from the logs. Then, the behavior chain is constructed according to the construction method described earlier. The behavior chain is encoded, and finally, the LSTM network is trained to determine whether the analyzed sample is malicious.

This paper uses the word2vec continuous bag-of-words (CBOW) model to encode and train on the collected sample data, finally generating a 50-dimensional feature vector representation for each API. Each behavior point in a behavior chain has a strong contextual relationship with its neighbors. If the traditional bag-of-words (BOW) model were used, the relationships between behavior points could not be considered and their context would be ignored; therefore, word vectors are used to encode the behavior chain, and the vector representation of each API is obtained by training word2vec. During word vector training, we set the vector size to 50 dimensions, the number of iterations to the default of 5, and the training window size to 3.
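
With gensim (4.x parameter names), the reported settings might be reproduced as in the sketch below; the toy chains stand in for the extracted behavior chains.

```python
from gensim.models import Word2Vec

chains = [["CreateFile", "ReadFile", "WriteFile"],
          ["RegOpenKey", "RegSetValue"]]          # toy behavior chains

w2v = Word2Vec(sentences=chains, vector_size=50, window=3,
               epochs=5, sg=0, min_count=1)       # sg=0 selects CBOW
vec = w2v.wv["ReadFile"]                          # 50-d vector for one API
```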

The preconstructed behavior chains are input into the LSTM network for training. The LSTM consists of a three-layer network: an input layer with 256 input units, a hidden layer with 128 LSTM units, and an output layer. The sigmoid activation function is used in the output layer to normalize the value as the output of the neural network.
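
A hedged Keras sketch of this network follows, assuming the behavior chains are padded to 256 time steps of 50-dimensional word2vec vectors (the sequence length is our reading of the 256 input units).

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, input_shape=(256, 50)),   # 128 LSTM units
    tf.keras.layers.Dense(1, activation="sigmoid"),     # malicious vs. benign
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```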

Keeping the experimental samples fixed, we use the controlled-variable method to tune the optimal parameters. First, we fixed the number of units in each LSTM layer, varied the batch_size used for training, and analyzed the results. The number of training iterations was set to 30, and the batch_size was set to 4, 8, 16, 32, 64, and 128. The final accuracy levels are shown in Figure 10 and Table 4.
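
The sweep can be sketched as below; build_model is a hypothetical helper that returns a freshly initialized copy of the network above, and X_train, y_train, X_val, y_val are placeholders for the encoded behavior chains and labels.

```python
results = {}
for bs in [4, 8, 16, 32, 64, 128]:
    m = build_model()   # fresh weights for each run (hypothetical helper)
    hist = m.fit(X_train, y_train, epochs=30, batch_size=bs,
                 validation_data=(X_val, y_val), verbose=0)
    results[bs] = (hist.history["val_loss"][-1],
                   hist.history["val_accuracy"][-1])
```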

Figure 10. Comparison of different batch sizes

Table 4 Comparison of different batch sizes

Here, val_loss and val_acc represent the loss value and accuracy on the test set after 10 iterations, and val_loss_30 and val_acc_30 represent the loss value and accuracy on the test data after 30 iterations. As Figure 10 shows, the loss values and accuracies obtained with the two iteration counts differ. In general, when the batch_size is small, the loss value and accuracy are relatively stable, but when the batch_size is large, they become unstable; this is highly related to the amount of data. Because the dataset used in this paper is relatively small, and considering overall performance, a batch_size of 32, at which the loss value and accuracy are relatively stable, was selected as the experimental condition.

Next, with the batch_size fixed at 32, the number of iterations set to 30, and the other conditions left unchanged, the activation function was varied and the results analyzed. In Figure 11, val_loss represents the loss value on the test set, and val_acc the accuracy on the test set. A total of six different activation functions were tested. According to the results, the sigmoid function had the highest accuracy and the smallest loss value under the given conditions; therefore, this study used the sigmoid function as the activation function.

Figure 11. Comparison of different activation functions

The above experiments determined the parameters to be tuned, and subsequent model training was based on these established conditions. Of the collected sample data, 80% were selected as the training set and 20% as the test set, and these were input into the LSTM model. The final experimental results are shown in Figure 12, where loss represents the loss value on the training data, acc the accuracy on the training data, val_loss the loss value on the test data, and val_acc the accuracy on the test data. The abscissa shows the number of iterations, and the ordinate shows percentages.

Figure 12. Accuracy and loss values

As Figure 12 shows, the accuracy on the training set stabilizes when the number of iterations reaches 10. When the number of iterations reaches 15, the rate of loss reduction also becomes very slow. The accuracy and loss values in Figure 12 and Table 4 show that the accuracy rate is 98.64%. In the following experiments, we compare the detection results of traditional processing algorithms and various deep learning models. The experimental results of each model highlight the advantages of the deep learning models and prove the effectiveness of the behavior chains. Table 5 summarizes the evaluation measures for each experiment: False Positive Rate (FPR), False Negative Rate (FNR), Precision (P), Area Under the ROC Curve (AUC), Recall (R) and F1-measure (F1). The experimental results are presented and discussed in the next paragraphs.
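
For reference, these measures can be computed with scikit-learn as sketched below; y_true and y_score are placeholders for the test labels and the model's sigmoid outputs (NumPy arrays).

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_pred = (y_score > 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)                  # false positive rate
fnr = fn / (fn + tp)                  # false negative rate
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)
```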

Table 5 Comparison of experimental results of each model

The following observations can be made from Table 5 and Figure 13. First, the deep learning models substantially outperform the traditional algorithms overall: the CNN (3.86% and 5.99%), the DNN (4.73% and 6.90%), the GRU (3.17% and 1.71%) and the LSTM (1.05% and 3.42%) have much smaller FPR and FNR values than the traditional algorithms. Among the latter, the FPR and FNR of the KNN are also relatively small, likely because the distributions of the malware and benign data are not uniform, which may be due to sample imbalance. Second, the traditional NB and LR algorithms achieve low FPRs (4.76% and 1.59%, respectively) only at the cost of high FNRs (58.12% and 88.03%, respectively), which leads to lower F1-measures. Similarly, the P values of the SVM and MLP algorithms are not high, for the following reasons: traditional algorithms usually compute frequencies statistically, do not consider the context of the data, and cannot process time series data. Therefore, their accuracy is lower on data with context and hidden attributes, which explains why traditional algorithms incur high FNRs.

Figure 13. Comparison of the experimental results

In contrast, deep learning algorithms can effectively extract feature values thanks to their powerful nonlinear feature representations, and they can process time series data using RNNs with feedback and time parameters. Therefore, the results of the deep learning models are better overall than those of the traditional algorithm models. Among them, the LSTM produced better results than the CNN, DNN and GRU, with a much higher F1-measure (98.01% vs. 97.46% for the GRU, 97.18% for the DNN and 96.96% for the CNN) because of its much lower FNR; we also note that the LSTM model had an FPR of 1.05% and a precision of 98.64%.

Considering previous studies that applied the traditional machine learning NB algorithm [27] and the deep learning RNN and CNN algorithms [18] to malware detection, Table 5 and Figure 13 show that the NB accuracy rate is 79.4% and that the DNN and CNN accuracy rates are similar. However, these algorithms have accuracies lower than that of the LSTM algorithm and FPR and FNR values higher than those of the LSTM algorithm. The LSTM method used in this paper achieved an accuracy of 98.6%, an FPR of 1.05% and an FNR of 3.42%. Nevertheless, the results of the entire experiment indicate that our research still has much room for improvement. Because the amount of raw data used in this study was relatively small, the behavior relationships obtained from the mining process may be limited. Our goal is to find combinations of abnormal behaviors from multiple associations; consequently, only with a larger dataset can we better explore these relationships. Nonetheless, the experiment demonstrates the effectiveness of our approach.

5 Conclusions and future work

In this paper, we propose a depth-detection method for malware based on behavior chains. The behavior points required for the experiments are extracted from application monitoring log files by monitoring API call sequences. Based on these API call sequences, behavior chains with temporal characteristics are constructed and then input into the LSTM network to train the MALDC model, which is finally used to classify samples as malware or benign software. In the final experimental results, the model's accuracy on the test data reached 98.64%. Because the construction of the MALDC model requires a large amount of training data, the accuracy of the experiment will likely improve substantially with more training data.

Moreover, this study analyzed only individual APIs; it did not attempt to consider the impact of the parameters or parameter values that were input to or output by these APIs on detecting malicious behavior. Therefore, in future work, we plan to continuously collect malicious data to expand the data set and to consider the API parameters and parameter values. Through future experiments, we will be able to analyze whether the parameters and parameter values have a significant impact on malicious behavior judgments and improve malware identification.