Keywords

1 Introduction

Malware or malicious software is any software which has the potential to damage the information stored in a computer. There are different categories of malware including viruses, worms, spyware and so on. There has been a profound increase in the number of malicious samples due to the deployment of morphing techniques, which prevents the anti-malware software from detecting them. Researchers and anti-malware vendors face the challenge of how to detect previously unseen malware (also called zero-day attack).

Malware detection involves identification of both instances and variants of the existing malware. For this, it is essential to observe and study the organization of the malicious code and its behavior. This gives an understanding about the nature of infection caused by the code and thus identifies similar ones.

1.1 Basics of Malware Analysis

Malware analysis can be considered as an art of dissecting the malware to find out the way it works, methods to identify it and how to trash and wipe it out. Malware analysis is critical area as it is a threat to security of computer systems.

1.1.1 Static Analysis

Static analysis consists of examining the executable file without viewing the actual instructions. It can be applied on different representations of a program like binary code, source code etc. It can be used to find memory corruption issues and prove the correctness of models if the source code is available.

Some of the techniques used for static malware analysis are file fingerprinting, extracting hard coded strings, file format inspection, disassembling of machine code etc.

Islam et al. presented an automatic malware classification method based on function length frequency and Printable String Information [1]. Their results demonstrated that a combined approach of string and function length features into a single test provides a superior result than they were used individually. Even though these features are easier to collect, the use of static features alone, fails when obfuscation and packing is performed.

Santos et al. proposed a malware detection system that uses semi-supervised learning to detect malicious programs [2]. This method is useful when a limited amount of labeled data exists for benign and malware classes. N-gram distribution is used to represent the executables. The main contribution of this paper is the reduction in the number of required labeled instances while maintaining a good precision of above 90%. However, due to the use of static nature of features, the system will be unable to counter packed malware.

Ye et al. proposed a detection system based on static analysis to generate signatures for clustering malwares [3]. Two features have been considered here, namely Instruction Frequency and Function-based sequences. A Hybrid hierarchical clustering algorithm which combines the advantages of hierarchical clustering and k-medoids algorithm was used to cluster the malwares.

Cesare et al. has proposed a static method of malware detection [4]. Initially the system unpacks the executables and disassembles the code. Malware signature is generated based on the set of control flow graphs (from high level source code) the malware contains. Feature vector is a decomposition of the set of graphs into either fixed size k-subgraphs, or q-gram strings. Similarity with the known malwares is computed using string distance matching methods like edit distance. This method is more efficient than signature based methods as the control flow overcomes the limitations of byte level and instruction level classification but cannot properly handle packing and obfuscation issues. The major drawback of the approach is that it considers the static control flow of the program and this can be easily bypassed by the code obfuscation techniques.

The key advantage of static malware analysis is that it permits a comprehensive analysis of a given binary i.e., it can cover all possible execution paths of a malware sample. Additionally, as the source code is not actually executed, static analysis is generally safer than the dynamic analysis. But the drawback is that static analysis is usually conducted manually and thus consumes time and requires expertise.

1.1.2 Dynamic Analysis

Executing a given malware sample within a controlled environment and monitoring its actions to analyze the malicious behavior is called dynamic malware analysis. Zhao et al. proposed a vector space model where API sequences were translated into features [5].

Another approach is a graph based model where API calls and OS resources are represented as graph nodes and edges representing reference between them [6]. The Graph Edit Distance algorithm was used to find the match between different graphs.

Anderson et al. proposed a malware detection algorithm that analyzes the graphs constructed from dynamically collected instruction traces of that executables [7]. The nodes of the graph represent instructions and the transition probabilities are estimated by the data in the trace. Using the concept of graph kernels, similarity matrices between instances in training set were constructed. Gaussian kernel and spectral kernel were the two measures used to construct kernel matrix.

Borojerdi and Abadi proposed MalHunter, which is a method for generating behavioral signature [8]. Sequence-based clustering algorithm is based on the Basic Sequential Algorithmic Scheme (BSAS). The steps involved in generating the signature were identifying semantic behaviors, clustering the behavior sequences, generalizing these sequences and generating multiple behavioral signatures.

As the analysis and detection is during runtime and malware unpacks itself, dynamic malware analysis avoids the restrictions of static analysis like unpacking and obfuscation issues. The main drawback is the issue of dormant code. Also if the analysis environment is not properly managed, the system itself may get damaged. Furthermore, malware samples may change their behavior or stop the execution when they detect that execution is taking place in a controlled analysis environment.

Observing the runtime-behavior of an application is currently the most promising approach. It is mostly conducted utilizing sandboxing. A sandbox refers to a controlled runtime environment which is partitioned from the rest of the system in order to isolate the malicious process. This partitioning is typically achieved using virtualization mechanisms on a certain level.

1.1.3 Integrated Approach

Integrated approaches are the combination of static and dynamic malware analysis techniques. Sharma proposed a combination of dynamic representation of program calling structures, with a static analysis applied to a region of that structure with observed performance problem [9]. Suspicious behavior showing portions are taken as signatures and searched for similar patterns. This was a theoretical work and does not contain any experimental results for evaluating the actual performance.

Nguyen et al. proposed a method to identify real target of an indirect jump [10]. This work was aimed at reducing the false target identification that could happen when processing such jumps and during CFG construction. The framework consists of two phases: Static and Dynamic. In the static phase, the program was divided into regions and sub-CFGs were generated. During the dynamic analysis Intermediate labeled transition system is generated and test cases are executed and then the CFG is updated, followed by first step and this continues.

2 Proposed Architecture

Figure 1 shows the proposed architecture of the malware detection system. It consists of a static, dynamic and integrated approach for fast and accurate detection of malware.

Fig. 1.
figure 1

Proposed architecture of malware detection method

2.1 Static Feature Extraction

2.1.1 Disassemble the Executable

A disassembler is a computer program that translates machine language into assembly language. Objdump command in Linux is used to disassemble the codes.

2.1.2 Feature Vector Formation

The disassembled code contains the opcode sequence of the program. Opcodes from the disassemble input has been separated using a python program and 3-gram patterns were generated using the Linux utility text2ngram.

Initially all the opcodes from the disassembled code were extracted and frequency of every possible 3-gram sequence were computed. The 3-grams with frequency above a threshold were filtered from both malware and benign samples. To get a feature vector that represent malware family more precisely, we extracted all frequent 3-grams and filtered out that are common in both classes. Various length feature vectors (size 10, 20, 50, 80 and 100) were considered. The feature vector consists of strings of 1 s and 0 s depending on the presence of 3-gram patterns. This was done for each pattern and feature vectors corresponding to each sample was generated. Algorithm 1 describes the steps in Static feature extraction.

figure a

2.2 Dynamic Feature Extraction

The core idea is to make malware detection system more resilient to byte- and instruction-level modifications and obfuscations. A call graph derived from the binary code provides a reasonable approximation of the program’s run time behavior. So we translate the problem of finding similar graphs into that of extraction of different features of the graph and analyze the similarities in them.

The executable files were analyzed using Callgrind, a profiling tool that records the call history among functions in a program’s run as a call-graph. By default, the collected data consists of the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions and the numbers of such calls.

We used Callgrind to analyze the executables. This is a profiling tool which records the call history of functions in a program during execution as a call-graph. The output generated by Callgrind was visualized using KCachegrind. Figure 2 shows a section of a sample call graph. The nodes of the graph contain the function name, its memory location, number of instructions and cost and the edges represent time consumed. The graph needs to be exported to Dot format for further analysis. The following features were extracted from graph corresponding to each of the sample. We ranked the features based on the information gain and Table 1 shows the features in descending order of ranks.

Fig. 2.
figure 2

A section of a sample call graph

Table 1. Ranking of features extracted from graph

2.3 Classifier Model

Dataset consisted of 787 malicious and 425 benign samples. Malware samples were collected from malware repositories Virusshare [13] and AVCaesar [14] and benign samples were obtained from Windows system directory. BayesNet and RandomForest classifiers were chosen to implement the classification model. BayesNet has the capability to find interdependencies between different attributes. Unlike decision tree, in Random forest the best parameter at each node in the tree is selected from a randomly chosen set of features. This helps Random Forest to perform well and makes it less vulnerable to noise in the data. An open source machine learning tool, Weka was used to implement the models [12].

3 Experimental Results and Discussion

Performance of the classification models were evaluated using measures like True Positive Rate (TP Rate), False Positive Rate (FP Rate), Precision, Recall, F-Measure (Harmonic mean of precision and recall), ROC Area (ROC curve is the curve created by plotting TPR against FPR at various threshold settings) and Accuracy.

3.1 Static Approach

Tables 2 and 3 show the classification results obtained using RandomForest and BayesNet classifiers respectively. Experiments have been repeated with varying number of static features from 10 to 100. Classification results of static analysis using RF classifier indicate that the features do not show much variation in accuracy between lengths 20 and 50. 95.87% is the best accuracy observed from RF classifier when feature length was 20.

Table 2. Weighted average of classification results with RandomForest classifier
Table 3. Weighted average of classification results with BayesNet classifier

3.2 Dynamic Approach

We conducted experiments with complete set of dynamic features and Tables 4 shows the classification results obtained using RF classifier with dynamic approach. Dynamic feature analysis shows an improvement of 1.65% over static feature analysis with RF classifier.

Table 4. Weighted average of classification results with RFclassifier

3.3 Integrated Approach

We conducted experiments by taking complete set of dynamic features and varying number of static feature lengths. Tables 5 and 6 shows the classification results obtained using RF and BayesNet classifiers using integrated approach. Figure 3 depicts the variation in ROC curves obtained from RF classifier with change in static feature vector length. Classification results obtained after integrating the features obtained from static and dynamic analysis shows an improvement of 2.89% than static and 0.82% than dynamic approach.

Table 5. Weighted average of classification results with RFclassifier
Table 6. Weighted average of classification results with BayesNet classifier
Fig. 3.
figure 3

Variation in ROC curves obtained from RF classifier with change in static feature vector length (Integrated approach)

On analyzing the classification results, we observed that the accuracy shows a declining trend on increasing the static feature vector length beyond 50. On an average both classifiers gives best accuracy rates for vector length between 30 and 50. The average time taken to build the model was 0.07 s. This is faster than the Control flow based malware variant detection [4] which took 0.7 s and Hybrid concentration based feature extraction approach for malware detection [11] which took 0.99 s on an average.

4 Conclusion

Dynamic malware analysis is done by watching and logging the behavior of the malware while running on the host. As most of the malwares pack themselves or morph their code during execution, static analysis alone is not efficient in malware detection. So here we have proposed a method that extracts features by analyzing the dynamic behavior of the malware and also static properties of the binaries and identifying whether the new sample possess similar features. The experimental results show that the method is more efficient than the static and dynamic analysis methods alone and also will be able to identify the polymorphic variants of the malware. The detection accuracy of integrated approach is comparatively better than several existing approaches. Integrated approach shows an improvement of 2.89% than static and 0.82% than dynamic approach. A practical extension to this work is to improve the static analysis part by incorporating the concept of function length analysis and also to consider files other than executables while constructing the model.